30 Commits (593d2de6e26eeae3c9b62acb5a6a68e37457d083)

Author SHA1 Message Date
Erik de Vries a1b6c6a7cb actorizer, query_gen_actors: revamped actor searches entirely
6 years ago
Erik de Vries 5b9793cd8c actorizer: removed nested mclapply
6 years ago
Erik de Vries 9b0ac775af class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries 85306007f4 class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries ce5f812252 dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
6 years ago
Erik de Vries 386ac42aee lemma_writer: new function to write raw lemmas (without punctuation) to a text file. It is structured as an elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries 1955692346 dfm_gen, out_parser: updated documentation
6 years ago
Erik de Vries 34531b0da8 out_parser: added option to clean output using regex to remove numbers and non-words
6 years ago
Erik de Vries 4f8b1f2024 elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
6 years ago
Erik de Vries 0a3bdb630b actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
6 years ago
Erik de Vries 8ffbddc073 actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries ae23456736 actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
6 years ago
Erik de Vries 9f3418ef37 class_update, dfm_gen, merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries 39005c7518 elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
6 years ago
Erik de Vries 061da17c2a ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries ef51ce60a9 Fixed dupe_detect error on documents with one sentence or fewer, and a maximum number of words in dfm_gen
6 years ago
Erik de Vries 0e8c127b86 bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
6 years ago
Erik de Vries 085252abda documentation: updated dupe_detect and merger
6 years ago
Erik de Vries f543d658bd Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries 4cd46d1a5e dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries 11d8b31c60 Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries adc4b3c639 Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries 65f8c26ec6 Renamed dupe_detect, and added return output
6 years ago
Erik de Vries db418d7396 Add query_string function for generating query_string queries
6 years ago
Erik de Vries d203de0b2a Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries c815dc7f2b Duplicate detection first commit
6 years ago
Erik de Vries 217ee76568 V 0.1 for elasticizer function with updater support
6 years ago
Erik de Vries a273524105 Added support for custom update function to elasticizer
6 years ago
Erik de Vries 0e45c0f2d1 Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries 4bbe84ab83 First release of mamlr package
6 years ago