Erik de Vries
7218f6b8d0
dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries
b9be372543
dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
d0e9bf565b
dupe_detect: Reset the _delete value to 1
out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .
6 years ago
Erik de Vries
ea8cfb071f
dupe_detect: updated _delete var to be 2 when delete is true
6 years ago
Erik de Vries
0a3bdb630b
actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
out_parser: function to parse raw text output into a single field, either from _source or highlight fields
dupe_detect: updated function to use 'ver' parameter for versioning
6 years ago
Erik de Vries
ef51ce60a9
Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries
0e8c127b86
bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
class_update/elastic_update: Moved response checking to elastic_update
dupe_detect: Finalized dupe_detect
6 years ago
Erik de Vries
755a58d84d
dupe_detect: fix to prevent errors when a query returns no results
6 years ago
Erik de Vries
887f1aa774
dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)
6 years ago
Erik de Vries
02b8a8c1da
dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point, so returned documents end at a sentence boundary and are shorter than the cutoff.
6 years ago
Erik de Vries
4adae2bbc6
Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper
6 years ago
Erik de Vries
4cd46d1a5e
dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries
65f8c26ec6
Renamed dupe_detect, and added return output
6 years ago