#' Get ids of duplicate documents that have a cosine similarity score higher than [threshold]
#' @param row Row of grid to parse
#' @param grid A cross-table of all possible combinations of doctypes and dates
#' @param cutoff Cutoff value for cosine similarity above which documents are considered duplicates
#' @param cutoff_lower Cutoff value for minimum cosine similarity above which documents are considered duplicates (inclusive)
#' @param cutoff_upper Cutoff value for maximum cosine similarity, above which documents are not considered duplicates (for debugging and manual parameter tuning, inclusive)
#' @param es_pwd Password for Elasticsearch read access
#' @return dupe_objects.json and data frame containing each id and all its duplicates. remove_ids.txt and character vector with list of ids to be removed. Files are in current working directory
\item{grid}{A cross-table of all possible combinations of doctypes and dates}
\item{cutoff}{Cutoff value for cosine similarity above which documents are considered duplicates}
\item{cutoff_lower}{Cutoff value for minimum cosine similarity above which documents are considered duplicates (inclusive)}
\item{cutoff_upper}{Cutoff value for maximum cosine similarity, above which documents are not considered duplicates (for debugging and manual parameter tuning, inclusive)}
\item{es_pwd}{Password for Elasticsearch read access}
}
}
@@ -22,5 +24,5 @@ dupe_objects.json and data frame containing each id and all its duplicates. remo
Get ids of duplicate documents that have a cosine similarity score higher than [threshold]