mamlr/man/dupe_detect.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dupe_detect.R
\name{dupe_detect}
\alias{dupe_detect}
\title{Get ids of duplicate documents whose cosine similarity score lies between cutoff_lower and cutoff_upper}
\usage{
dupe_detect(
  row,
  grid,
  cutoff_lower,
  cutoff_upper = 1,
  es_pwd,
  es_super,
  words,
  localhost = T,
  ver
)
}
\arguments{
\item{row}{Row of grid to parse}

\item{grid}{A cross-table of all possible combinations of doctypes and dates}

\item{cutoff_lower}{Minimum cosine similarity at or above which documents are considered duplicates (inclusive)}

\item{cutoff_upper}{Maximum cosine similarity; documents scoring above this value are not considered duplicates (inclusive; intended for debugging and manual parameter tuning)}

\item{es_pwd}{Password for Elasticsearch read access}

\item{es_super}{Password for Elasticsearch write access}

\item{words}{Document cutoff point in number of words. Documents are cut off at the last sentence-ending character ([.?!]) before the cutoff, so the resulting document will be slightly shorter than [words]}

\item{localhost}{Defaults to TRUE. When TRUE, connect to a local Elasticsearch instance on the default port (9200)}

\item{ver}{Short string (preferably a single word/sequence) indicating the version of the updated document (e.g. for a udpipe update this string might be 'udV2')}
}
\value{
dupe_objects.json and a data frame containing each id and all its duplicates; remove_ids.txt and a character vector listing the ids to be removed. Both files are written to the current working directory.
}
\description{
Get ids of duplicate documents whose cosine similarity score lies between cutoff_lower and cutoff_upper (both inclusive).
}
\examples{
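\dontrun{
# Requires access to an Elasticsearch instance; all variables are placeholders
dupe_detect(row = 1, grid = grid, cutoff_lower = cutoff_lower, cutoff_upper = 1,
            es_pwd = es_pwd, es_super = es_super, words = words,
            localhost = TRUE, ver = ver)
}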
}