d3d4045f1cactor_aggregation: added sentence count to output, and changed occurrences to count instead of mean, changed prom and rel_first to prom_art and rel_first_art, changed output filename to include functionErik de Vries2019-06-17 17:23:09 +0200
176a8f6de4elasticizer: added additional verbosity on errorsErik de Vries2019-06-03 16:22:07 +0200
d420b02c20elasticizer: Added more verbosity to investigate error handlingErik de Vries2019-06-03 16:12:19 +0200
7a01a7f18dquery_gen_actors: temporary update for fixing broken shitErik de Vries2019-05-29 20:31:24 +0200
45da9dd929aggregator_elastic: revert to single-core lapply, due to sendMaster errorsErik de Vries2019-05-29 19:39:50 +0200
f8e4111e70aggregator_elastic: correct partyid implementationErik de Vries2019-05-29 19:33:08 +0200
c047a4a1dbaggregator_elastic: explicit reference to aggregator functionErik de Vries2019-05-29 19:25:56 +0200
0d81d6fc7aadded aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregationsErik de Vries2019-05-29 19:16:13 +0200
2281d11a68actor_aggregation: fixed filenaming of .Rds filesErik de Vries2019-05-25 18:36:27 +0200
a29d04dacdactorizer: fixed handling of empty results due to regex filteringErik de Vries2019-05-20 21:55:07 +0200
8e920f5f37elasticizer: removed idiotic 15min sleep time after 500 batchesErik de Vries2019-05-16 11:58:34 +0200
a11d7728eaactor_aggregation: only aggregate scores on non-junk articlesErik de Vries2019-05-13 11:43:15 +0200
54a70c47a0actor_aggregation: removed timeout for parallel processing, requires fix in elasticizer (cannot recycle the same connection)Erik de Vries2019-05-12 13:13:21 +0200
58fce4d560actor_aggregation: added randomized short sleep, to allow for parallel executionErik de Vries2019-05-12 13:08:09 +0200
e3b26c0be3actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level; query_string: Added default_operator parameter, to define whether whitespaces should be interpreted as AND or OR, defaults to ANDErik de Vries2019-05-11 17:49:53 +0200
0d05765ca7dfm_gen: removed last remains of summer sample exceptionsErik de Vries2019-05-02 12:46:37 +0200
e199b23227dfm_gen: removed exceptions for NO summer codes; modelizer: created exception for outer_folds = 1; query_string: added parameter for default_operatorErik de Vries2019-05-02 12:36:47 +0200
fbd525dc2emodelizer: updated outer cross validation procedure to output raw prediction and true values, instead of processed and aggregated confusion matrix resultsErik de Vries2019-04-30 12:41:38 +0200
6a94bc3ed8query_gen_actors: removed quotation marks from Minister search partErik de Vries2019-04-29 12:21:24 +0200
8d19333e59query_gen_actors: changed script order for belgium exceptionsErik de Vries2019-04-29 11:58:28 +0200
3bfe61e425query_gen_actors: fixed implementation of Belgian exceptionsErik de Vries2019-04-29 11:50:20 +0200
9ca952ca89elastic_update: removed wait_for from urlErik de Vries2019-04-25 19:45:44 +0200
8051a81b66actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)Erik de Vries2019-04-25 17:25:51 +0200
ac37d836f5elasticizer: added scroll_clear to null hits as wellErik de Vries2019-04-25 17:02:26 +0200
75623856f7elasticizer: updated scroll_clear to use conn objectErik de Vries2019-04-25 16:57:57 +0200
e34460bf0felasticizer: clear scroll context when finishing queryErik de Vries2019-04-25 16:51:03 +0200
9bd526fee0elasticizer: fixed compatibility issues with elastic v1.0.0Erik de Vries2019-04-25 15:03:29 +0200
f2312f65d5elasticizer: update to account for syntax change in newer package versionsErik de Vries2019-04-25 12:33:13 +0200
f6006eb9baactorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehandErik de Vries2019-04-25 11:10:25 +0200
298099a4e6actorizer: fix to deal with empty updates (i.e. don't do an update)Erik de Vries2019-04-24 17:05:20 +0200
6961c0b866query_gen_actors: updated actorid filter to use the keyword subfieldErik de Vries2019-04-24 16:56:56 +0200
703b5e59a4actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after)Erik de Vries2019-04-24 15:49:34 +0200
593d2de6e2actorizer: add pre_tags and post_tags to argument list; bulk_writer: updated to use _doc doctype; query_gen_actors: added NA for all searches that don't have pre- or postfixesErik de Vries2019-04-24 11:57:03 +0200
a1b6c6a7cbactorizer, query_gen_actors: revamped actor searches entirely; elasticizer: updated script for use with ES 7.xErik de Vries2019-04-23 16:43:11 +0200
88fc4ec53cdfm_gen: changed out_parser call to mamlr:::out_parserErik de Vries2019-03-20 14:41:53 +0100
90fdbcc982out_parser: parallelized when not in windozeErik de Vries2019-03-04 15:02:09 +0100
6414f759bdactorizer: parallelized calculation of marker positionsErik de Vries2019-03-04 14:27:26 +0100
522c872dbaout_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaningErik de Vries2019-03-04 14:21:04 +0100
1a4ba19546actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists; bulk_writer: changed handling of single-row dataframe parsing to JSON; elastic_update: changed function to return instead of print appData on error; ud_update: Changed nested lists to flat lists, and added start and end character positionsErik de Vries2019-02-26 11:22:14 +0100
3abc3056e0actorizer: fix to columns selected for actors variable, removed udmodel requirementErik de Vries2019-02-20 13:58:51 +0100
41c86ea116actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testingErik de Vries2019-02-20 13:42:10 +0100
eae1a22609actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordinglyErik de Vries2019-02-11 16:43:16 +0100
cd05733648actorizer: Additional fix for missing punctuation (see previous commit)Erik de Vries2019-02-05 14:26:28 +0100
09732a1b5aactorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)Erik de Vries2019-02-05 14:10:27 +0100
835d2332bcactorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetailsErik de Vries2019-02-05 13:26:24 +0100
e70b6ccf7aactorizer: fixed sentence_count and out_parser calls; out_parser: Added comment with old regexErik de Vries2019-02-04 14:16:04 +0100
9b0ac775afclass_update: add ver variable to set version for class updated articlesErik de Vries2019-01-16 19:36:03 +0100
85306007f4class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained modelErik de Vries2019-01-16 19:34:37 +0100
e110780ad5merger: idiotic fix for a non-problem, see comment on line 32Erik de Vries2019-01-16 19:21:20 +0100
ce5f812252dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field; merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas); merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)Erik de Vries2019-01-16 18:29:30 +0100
386ac42aeelemma_writer: new function to write raw lemmas (without punctuation) to text file. Is structured as elasticizer update function (despite not updating anything on the server)Erik de Vries2019-01-15 11:36:51 +0100
4407a99774actorizer: fix to get actual number of sentence occurrences of actorErik de Vries2019-01-14 17:25:45 +0100
96e869fa6bactorizer: previous commit was wrong, only add is an option, removed type variableErik de Vries2019-01-14 17:06:43 +0100
98219c807cactorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)Erik de Vries2019-01-14 17:05:30 +0100
e3b57ed9e3actorizer: added clean = F to have the exact same behavior in ud_update and actorizerErik de Vries2019-01-14 14:48:04 +0100
b9be372543dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)Erik de Vries2019-01-11 15:23:18 +0100
1955692346dfm_gen, out_parser: updated documentation; dupe_detect: major fix to function, no longer using rownames for article idsErik de Vries2019-01-11 14:45:35 +0100
34531b0da8out_parser: added option to clean output using regex to remove numbers and non-words; dfm_gen, ud_update: updated functions to make use of out_parser cleaning option; merger: updated regex for cleaning lemmatized outputErik de Vries2019-01-11 13:59:19 +0100
5851c56369query_string: updated check for fields valueErik de Vries2019-01-09 14:04:38 +0100
4f8b1f2024elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned; query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomizedErik de Vries2019-01-09 13:52:51 +0100
d0e9bf565bdupe_detect: Reset the _delete value to 1; out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .Erik de Vries2019-01-08 15:07:40 +0100
ea8cfb071fdupe_detect: updated _delete var to be 2 when delete is trueErik de Vries2019-01-03 23:35:28 +0100
0a3bdb630bactorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser); out_parser: function to parse raw text output into a single field, either from _source or highlight fields; dupe_detect: updated function to use 'ver' parameter for versioningErik de Vries2019-01-02 18:11:34 +0100
c7560d7e32ud_update: Removed . at end of text, and added mc.preschedule = F for testingErik de Vries2018-12-30 20:34:42 +0100
37df81b8ffud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the documentErik de Vries2018-12-30 20:20:38 +0100
c32c9e5ad3ud_update: fix to deal with non-existing column namesErik de Vries2018-12-30 19:34:48 +0100
8ffbddc073actorizer, ud_update: implemented 'ver' variable for keeping track of updatesErik de Vries2018-12-30 19:10:59 +0100
ae23456736actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)Erik de Vries2018-12-30 19:02:17 +0100
9f3418ef37class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'Erik de Vries2018-12-15 13:02:22 +0100
85aab558e0bulk_writer: added clause to varname==ud update to also remove the tokens variable from sourceErik de Vries2018-12-13 21:24:29 +0100
581e7b2929DESCRIPTION: added SparseM as required packageErik de Vries2018-12-13 14:05:09 +0100
54dfb6a8caactorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; elastic_update: set the minimum break between retries from 10 to 30 seconds; elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in betweenErik de Vries2018-12-13 13:40:03 +0100
39005c7518elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates; query_gen_actors: Added an additional generator for the "Institution" type (for EU support); actorizer: Created an updater function to search for actors and use UDPipe to parse the resultsErik de Vries2018-12-12 19:01:10 +0100
a3c3651c79elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)Erik de Vries2018-12-11 11:55:14 +0100
4ad5357e15elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space) (DKJunk)Erik de Vries2018-12-11 11:02:19 +0100
a5ba00146fmodelizer: fixed error when only one class is predicted for junk classification (borderline case)Erik de Vries2018-12-09 13:41:34 +0100
7544e5323fmodelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)Erik de Vries2018-12-08 17:44:05 +0100
5f5e4a03c8modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networksErik de Vries2018-12-07 14:35:50 +0100
34a6adf64echanged udpipe output variable from tokens to udErik de Vries2018-12-05 16:51:59 +0100
061da17c2aud_update: Added function to lemmatize documentsErik de Vries2018-12-05 15:02:41 +0100
ef51ce60a9Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen (DupeDetect)Erik de Vries2018-12-04 17:44:43 +0100