14 Commits (9e433ecf9e0a323b2596aa290d7e53cee3f2aadd)

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Erik de Vries | 7218f6b8d0 | dupe_detect: fixed error on no duplicates | 6 years ago |
| Erik de Vries | b9be372543 | dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric) | 6 years ago |
| Erik de Vries | 1955692346 | dfm_gen, out_parser: updated documentation | 6 years ago |
| Erik de Vries | d0e9bf565b | dupe_detect: Reset the _delete value to 1 | 6 years ago |
| Erik de Vries | ea8cfb071f | dupe_detect: updated _delete var to be 2 when delete is true | 6 years ago |
| Erik de Vries | 0a3bdb630b | actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser) | 6 years ago |
| Erik de Vries | ef51ce60a9 | Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen | 6 years ago |
| Erik de Vries | 0e8c127b86 | bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname | 6 years ago |
| Erik de Vries | 755a58d84d | dupe_detect: fix to prevent errors when a query returns no results | 6 years ago |
| Erik de Vries | 887f1aa774 | dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper) | 6 years ago |
| Erik de Vries | 02b8a8c1da | dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff). | 6 years ago |
| Erik de Vries | 4adae2bbc6 | Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper | 6 years ago |
| Erik de Vries | 4cd46d1a5e | dupe_detect: added support for both lower and upper cutoff point | 6 years ago |
| Erik de Vries | 65f8c26ec6 | Renamed dupe_detect, and added return output | 6 years ago |