Erik de Vries
28989f2bc4
dfm_gen: yet another fix for codes
6 years ago
Erik de Vries
0757b6bf8b
dfm_gen: re-added codes variable
6 years ago
Erik de Vries
2fc48cc2f7
dfm_gen: fixed absence of out$codes field
6 years ago
Erik de Vries
b249ff22de
dfm_gen.R: fixed junk mutation
6 years ago
Erik de Vries
0d05765ca7
dfm_gen: removed last remains of summer sample exceptions
6 years ago
Erik de Vries
e199b23227
dfm_gen: removed exceptions for NO summer codes
modelizer: created exception for outer_folds = 1
query_string: added parameter for default_operator
6 years ago
Erik de Vries
8051a81b66
actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)
6 years ago
Erik de Vries
88fc4ec53c
dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries
ce5f812252
dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas)
merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
34531b0da8
out_parser: added option to clean output using regex to remove numbers and non-words
dfm_gen, ud_update: updated functions to make use of out_parser cleaning option
merger: updated regex for cleaning lemmatized output
6 years ago
Erik de Vries
0a3bdb630b
actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
out_parser: function to parse raw text output into a single field, either from _source or highlight fields
dupe_detect: updated function to use 'ver' parameter for versioning
6 years ago
Erik de Vries
9f3418ef37
class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries
993f39957a
dfm_gen: word cutoff is now the final step in the script; the earlier placement caused bugs when mutating code variables
6 years ago
Erik de Vries
02b8a8c1da
dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returned documents end at a sentence boundary and are shorter than the cutoff).
6 years ago
Erik de Vries
3e66c7e1cd
Updated dfm_gen to have all topic vectors as numeric variables
6 years ago
Erik de Vries
413ad02a87
Set default to "lemmas" for dfm_gen
6 years ago
Erik de Vries
311838b34b
Updated dfm_gen to only create derivative codes if majorTopic actually exists, and set docvars to NULL when no majorTopic codes
6 years ago
Erik de Vries
dc4daf9de4
Added line to replace multiple whitespace characters in full text by a single regular whitespace
6 years ago
Erik de Vries
0e45c0f2d1
Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries
4bbe84ab83
First release of mamlr package
6 years ago