136 Commits (bf3d11ffe059c59055f5babd028fd942ae630c0e)

Author SHA1 Message Date
Erik de Vries f6006eb9ba actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand
6 years ago
Erik de Vries 298099a4e6 actorizer: fix to deal with empty updates (i.e. don't do an update)
6 years ago
Erik de Vries 6961c0b866 query_gen_actors: updated actorid filter to use the keyword subfield
6 years ago
Erik de Vries 703b5e59a4 actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after)
6 years ago
Erik de Vries 593d2de6e2 actorizer: add pre_tags and post_tags to argument list
6 years ago
Erik de Vries a1b6c6a7cb actorizer, query_gen_actors: revamped actor searches entirely
6 years ago
Erik de Vries 88fc4ec53c dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries 90fdbcc982 out_parser: parallelized when not on Windows
6 years ago
Erik de Vries 6414f759bd actorizer: parallelized calculation of marker positions
6 years ago
Erik de Vries 522c872dba out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries 5b9793cd8c actorizer: removed nested mclapply
6 years ago
Erik de Vries 1a4ba19546 actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists
6 years ago
Erik de Vries 3abc3056e0 actorizer: fix to columns selected for actors variable, removed udmodel requirement
6 years ago
Erik de Vries 41c86ea116 actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing
6 years ago
Erik de Vries eae1a22609 actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly
6 years ago
Erik de Vries 5665b6d622 actorizer: more fixes to punctuation
6 years ago
Erik de Vries cd05733648 actorizer: Additional fix for missing punctuation (see previous commit)
6 years ago
Erik de Vries 09732a1b5a actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)
6 years ago
Erik de Vries 835d2332bc actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails
6 years ago
Erik de Vries e70b6ccf7a actorizer: fixed sentence_count and out_parser calls
6 years ago
Erik de Vries 9b0ac775af class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries 85306007f4 class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries e110780ad5 merger: idiotic fix for a non-problem, see comment on line 32
6 years ago
Erik de Vries ce5f812252 dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
6 years ago
Erik de Vries 386ac42aee lemma_writer: new function to write raw lemmas (without punctuation) to text file. Is structured as elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries 4407a99774 actorizer: fix to get actual number of sentence occurrences of actor
6 years ago
Erik de Vries 96e869fa6b actorizer: previous commit was wrong, only add is an option, removed type variable
6 years ago
Erik de Vries 98219c807c actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)
6 years ago
Erik de Vries e3b57ed9e3 actorizer: added clean = F to have the exact same behavior in ud_update and actorizer
6 years ago
Erik de Vries 7218f6b8d0 dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries b9be372543 dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries 1955692346 dfm_gen, out_parser: updated documentation
6 years ago
Erik de Vries 34531b0da8 out_parser: added option to clean output using regex to remove numbers and non-words
6 years ago
Erik de Vries 5851c56369 query_string: updated check for fields value
6 years ago
Erik de Vries 4f8b1f2024 elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
6 years ago
Erik de Vries d0e9bf565b dupe_detect: Reset the _delete value to 1
6 years ago
Erik de Vries ea8cfb071f dupe_detect: updated _delete var to be 2 when delete is true
6 years ago
Erik de Vries 0a3bdb630b actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
6 years ago
Erik de Vries 9e5a1e3354 ud_update: removed mc.preschedule = F
6 years ago
Erik de Vries c7560d7e32 ud_update: Removed . at end of text, and added mc.preschedule = F for testing
6 years ago
Erik de Vries 37df81b8ff ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document
6 years ago
Erik de Vries c32c9e5ad3 ud_update: fix to deal with non-existing column names
6 years ago
Erik de Vries 8ffbddc073 actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries ae23456736 actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
6 years ago
Erik de Vries 9f3418ef37 class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries 85aab558e0 bulk_writer: added clause to varname==ud update to also remove the tokens variable from source
6 years ago
Erik de Vries 54dfb6a8ca actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them
6 years ago
Erik de Vries 8caf53b90a actorizer: switched to single core processing for debugging
6 years ago
Erik de Vries c63409238b actorizer: print row numbers for debugging
6 years ago
Erik de Vries 39005c7518 elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
6 years ago
Erik de Vries a3c3651c79 elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)
6 years ago
Erik de Vries 4ad5357e15 elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)
6 years ago
Erik de Vries a5ba00146f modelizer: fixed error when only one class is predicted for junk classification (borderline case)
6 years ago
Erik de Vries a13d86b92d modelizer: added some more debug output
6 years ago
Erik de Vries 23658ce51a test
6 years ago
Erik de Vries 17cf6d04e9 modelizer: debug update
6 years ago
Erik de Vries 7544e5323f modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)
6 years ago
Erik de Vries 5f5e4a03c8 modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks
6 years ago
Erik de Vries 34a6adf64e changed udpipe output variable from tokens to ud
6 years ago
Erik de Vries 061da17c2a ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries ef51ce60a9 Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries 0e8c127b86 bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
6 years ago
Erik de Vries 755a58d84d dupe_detect: fix to prevent errors when a query returns no results
6 years ago
Erik de Vries 887f1aa774 dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)
6 years ago
Erik de Vries 993f39957a dfm_gen: word cutoff now as final step in script, caused bugs with mutating code variables
6 years ago
Erik de Vries 02b8a8c1da dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff).
6 years ago
Erik de Vries 4a713ddc23 bulk_writer: setting names(x) <- NULL when there is only one value (list or otherwise) to be updated.
6 years ago
Erik de Vries 6bb8f9b635 class_update: added explicit httr::: references
6 years ago
Erik de Vries f543d658bd Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries 4adae2bbc6 Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper
6 years ago
Erik de Vries 4cd46d1a5e dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries 11d8b31c60 Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries 3e66c7e1cd Updated dfm_gen to have all topic vectors as numeric variables
6 years ago
Erik de Vries adc4b3c639 Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries 65f8c26ec6 Renamed dupe_detect, and added return output
6 years ago
Erik de Vries db418d7396 Add query_string function for generating query_string queries
6 years ago
Erik de Vries d203de0b2a Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries c815dc7f2b Duplicate detection first commit
6 years ago
Erik de Vries 015411feaf Added refresh=wait_for to bulk update url. This should make update scripts less demanding on the server side, because the server only replies after refreshing (happens every second)
6 years ago
Erik de Vries 413ad02a87 Set default to "lemmas" for dfm_gen
6 years ago
Erik de Vries 217ee76568 V 0.1 for elasticizer function with updater support
6 years ago
Erik de Vries a273524105 Added support for custom update function to elasticizer
6 years ago
Erik de Vries 311838b34b Updated dfm_gen to only create derivative codes if majorTopic actually exists, and set docvars to NULL when no majorTopic codes
6 years ago
Erik de Vries dc4daf9de4 Added line to replace multiple whitespace characters in full text by a single regular whitespace
6 years ago
Erik de Vries 0e45c0f2d1 Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries 4bbe84ab83 First release of mamlr package
6 years ago