Erik de Vries
90fdbcc982
out_parser: parallelized when not on Windows
6 years ago
Erik de Vries
522c872dba
out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries
e70b6ccf7a
actorizer: fixed sentence_count and out_parser calls
...
out_parser: Added comment with old regex
6 years ago
Erik de Vries
ce5f812252
dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
...
merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas)
merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
...
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
34531b0da8
out_parser: added option to clean output using regex to remove numbers and non-words
...
dfm_gen, ud_update: updated functions to make use of out_parser cleaning option
merger: updated regex for cleaning lemmatized output
6 years ago
Erik de Vries
d0e9bf565b
dupe_detect: Reset the _delete value to 1
...
out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .
6 years ago
Erik de Vries
0a3bdb630b
actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
...
out_parser: function to parse raw text output into a single field, either from _source or highlight fields
dupe_detect: updated function to use 'ver' parameter for versioning
6 years ago