207 Commits (523d86799c34e8b5e42e9591ef9c7018edf63f76)
Author SHA1 Message Date
Erik de Vries 0a3bdb630b actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
6 years ago
Erik de Vries 9e5a1e3354 ud_update: removed mc.preschedule = F
6 years ago
Erik de Vries c7560d7e32 ud_update: Removed . at end of text, and added mc.preschedule = F for testing
6 years ago
Erik de Vries 37df81b8ff ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document
6 years ago
Erik de Vries c32c9e5ad3 ud_update: fix to deal with non-existing column names
6 years ago
Erik de Vries 8ffbddc073 actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries ae23456736 actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
6 years ago
Erik de Vries 9f3418ef37 class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries 85aab558e0 bulk_writer: added clause to varname==ud update to also remove the tokens variable from source
6 years ago
Erik de Vries 581e7b2929 DESCRIPTION: added SparseM as required package
6 years ago
Erik de Vries 54dfb6a8ca actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them
6 years ago
Erik de Vries b042fdb1e3 Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr
6 years ago
Erik de Vries 8caf53b90a actorizer: switched to single core processing for debugging
6 years ago
Erik de Vries e5c87cf69d actorizer: more debug prints
6 years ago
Erik de Vries c63409238b actorizer: print row numbers for debugging
6 years ago
Erik de Vries 39005c7518 elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
6 years ago
Erik de Vries a3c3651c79 elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)
6 years ago
Erik de Vries 4ad5357e15 elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)
6 years ago
Erik de Vries a5ba00146f modelizer: fixed error when only one class is predicted for junk classification (borderline case)
6 years ago
Erik de Vries a13d86b92d modelizer: added some more debug output
6 years ago
Erik de Vries 23658ce51a test
6 years ago
Erik de Vries 17cf6d04e9 modelizer: debug update
6 years ago
Erik de Vries 7544e5323f modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)
6 years ago
Erik de Vries 5f5e4a03c8 modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks
6 years ago
Erik de Vries 34a6adf64e changed udpipe output variable from tokens to ud
6 years ago
Erik de Vries 061da17c2a ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries ef51ce60a9 Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries 0e8c127b86 bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
6 years ago
Erik de Vries 755a58d84d dupe_detect: fix to prevent errors when a query returns no results
6 years ago
Erik de Vries 887f1aa774 dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)
6 years ago
Erik de Vries 993f39957a dfm_gen: word cutoff now as final step in script, caused bugs with mutating code variables
6 years ago
Erik de Vries 085252abda documentation: updated dupe_detect and merger
6 years ago
Erik de Vries 02b8a8c1da dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff).
6 years ago
Erik de Vries 4a713ddc23 bulk_writer: setting names(x) <- NULL when there is only one value (list or otherwise) to be updated.
6 years ago
Erik de Vries 6bb8f9b635 class_update: added explicit httr::: references
6 years ago
Erik de Vries f543d658bd Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries 4adae2bbc6 Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper
6 years ago
Erik de Vries 4cd46d1a5e dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries 11d8b31c60 Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries 3e66c7e1cd Updated dfm_gen to have all topic vectors as numeric variables
6 years ago
Erik de Vries 20d7510a89 Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr
6 years ago
Erik de Vries adc4b3c639 Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries 919e71ac68 Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries 65f8c26ec6 Renamed dupe_detect, and added return output
6 years ago
Erik de Vries db418d7396 Add query_string function for generating query_string queries
6 years ago
Erik de Vries d203de0b2a Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries c815dc7f2b Duplicate detection first commit
6 years ago
Erik de Vries 1f06b0b716 Lowered R version req to 3.3.1
6 years ago
Erik de Vries 015411feaf Added refresh=wait_for to bulk update url. This should make update scripts less demanding on the server side, because the server only replies after refreshing (happens every second)
6 years ago
Erik de Vries 413ad02a87 Set default to "lemmas" for dfm_gen
6 years ago