Commit Graph

  • d3d4045f1c actor_aggregation: added sentence count to output, and changed occurences to count instead of mean, changed prom and rel_first to prom_art and rel_first_art, changed output filename to include function Erik de Vries 2019-06-17 17:23:09 +0200
  • 176a8f6de4 elasticizer: added additional verbosity on errors Erik de Vries 2019-06-03 16:22:07 +0200
  • d420b02c20 elasticizer: Added more verbosity to investigate error handling Erik de Vries 2019-06-03 16:12:19 +0200
  • 48b589dda0 query_gen_actors: reset to original state Erik de Vries 2019-05-29 21:12:22 +0200
  • 7a01a7f18d query_gen_actors: temporary update for fixing broken shit Erik de Vries 2019-05-29 20:31:24 +0200
  • 45da9dd929 aggregator_elastic: revert to single-core lapply, due to sendMaster errors Erik de Vries 2019-05-29 19:39:50 +0200
  • f8e4111e70 aggregator_elastic: correct partyid implementation Erik de Vries 2019-05-29 19:33:08 +0200
  • c047a4a1db aggregator_elastic: explicit reference to aggregator function Erik de Vries 2019-05-29 19:25:56 +0200
  • 0d81d6fc7a added aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregations Erik de Vries 2019-05-29 19:16:13 +0200
  • 2281d11a68 actor_aggregation: fixed filenaming of .Rds files Erik de Vries 2019-05-25 18:36:27 +0200
  • d9f28a46d8 actor_aggregation: small fixes to code Erik de Vries 2019-05-25 14:05:49 +0200
  • a29d04dacd actorizer: fixed handling of empty results due to regex filtering Erik de Vries 2019-05-20 21:55:07 +0200
  • 8e920f5f37 elasticizer: removed idiotic 15min sleep time after 500 batches Erik de Vries 2019-05-16 11:58:34 +0200
  • a11d7728ea actor_aggregation: only aggregate scores on non-junk articles Erik de Vries 2019-05-13 11:43:15 +0200
  • 54a70c47a0 actor_aggregation: removed timeout for parallel processing, requires fix in elasticizer (cannot recycle the same connection) Erik de Vries 2019-05-12 13:13:21 +0200
  • 58fce4d560 actor_aggregation: added randomized short sleep, to allow for parallel execution Erik de Vries 2019-05-12 13:08:09 +0200
  • e3b26c0be3 actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level; query_string: Added default_operator parameter, to define whether whitespaces should be interpreted as AND or OR, defaults to AND Erik de Vries 2019-05-11 17:49:53 +0200
  • 28989f2bc4 dfm_gen: yet another fix for codes Erik de Vries 2019-05-02 13:21:22 +0200
  • 0757b6bf8b dfm_gen: re-added codes variable Erik de Vries 2019-05-02 13:15:56 +0200
  • 2fc48cc2f7 dfm_gen: fixed absence of out$codes field Erik de Vries 2019-05-02 13:10:32 +0200
  • b249ff22de dfm_gen.R: fixed junk mutation Erik de Vries 2019-05-02 13:04:13 +0200
  • 0d05765ca7 dfm_gen: removed last remains of summer sample exceptions Erik de Vries 2019-05-02 12:46:37 +0200
  • e199b23227 dfm_gen: removed exceptions for NO summer codes; modelizer: created exception for outer_folds = 1; query_string: added parameter for default_operator Erik de Vries 2019-05-02 12:36:47 +0200
  • fbd525dc2e modelizer: updated outer cross validation procedure to output raw prediction and true values, instead of processed and aggregated confusion matrix results Erik de Vries 2019-04-30 12:41:38 +0200
  • 6a94bc3ed8 query_gen_actors: removed quotation marks from Minister search part Erik de Vries 2019-04-29 12:21:24 +0200
  • 8d19333e59 query_gen_actors: changed script order for belgium exceptions Erik de Vries 2019-04-29 11:58:28 +0200
  • 3bfe61e425 query_gen_actors: fixed implementation of Belgian exceptions Erik de Vries 2019-04-29 11:50:20 +0200
  • 81697345cb modelizer: removed breaking code Erik de Vries 2019-04-26 12:34:09 +0200
  • 9ca952ca89 elastic_update: removed wait_for from url Erik de Vries 2019-04-25 19:45:44 +0200
  • 8051a81b66 actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores) Erik de Vries 2019-04-25 17:25:51 +0200
  • ac37d836f5 elasticizer: added scroll_clear to null hits as well Erik de Vries 2019-04-25 17:02:26 +0200
  • 75623856f7 elasticizer: updated scroll_clear to use conn object Erik de Vries 2019-04-25 16:57:57 +0200
  • c2d666c81d bogus commit Erik de Vries 2019-04-25 16:54:01 +0200
  • e34460bf0f elasticizer: clear scroll context when finishing query Erik de Vries 2019-04-25 16:51:03 +0200
  • 9bd526fee0 elasticizer: fixed compatibility issues with elastic v1.0.0 Erik de Vries 2019-04-25 15:03:29 +0200
  • f2312f65d5 elasticizer: update to account for syntax change in newer package versions Erik de Vries 2019-04-25 12:33:13 +0200
  • f6006eb9ba actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand Erik de Vries 2019-04-25 11:10:25 +0200
  • 298099a4e6 actorizer: fix to deal with empty updates (i.e. don't do an update) Erik de Vries 2019-04-24 17:05:20 +0200
  • 6961c0b866 query_gen_actors: updated actorid filter to use the keyword subfield Erik de Vries 2019-04-24 16:56:56 +0200
  • 703b5e59a4 actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after) Erik de Vries 2019-04-24 15:49:34 +0200
  • 593d2de6e2 actorizer: add pre_tags and post_tags to argument list; bulk_writer: updated to use _doc doctype; query_gen_actors: added NA for all searches that don't have pre- or postfixes Erik de Vries 2019-04-24 11:57:03 +0200
  • a1b6c6a7cb actorizer, query_gen_actors: revamped actor searches entirely; elasticizer: updated script for use with ES 7.x Erik de Vries 2019-04-23 16:43:11 +0200
  • 88fc4ec53c dfm_gen: changed out_parser call to mamlr:::out_parser Erik de Vries 2019-03-20 14:41:53 +0100
  • 90fdbcc982 out_parser: parallelized when not in windoze Erik de Vries 2019-03-04 15:02:09 +0100
  • 6414f759bd actorizer: parallelized calculation of marker positions Erik de Vries 2019-03-04 14:27:26 +0100
  • 522c872dba out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning Erik de Vries 2019-03-04 14:21:04 +0100
  • 5b9793cd8c actorizer: removed nested mclapply Erik de Vries 2019-03-04 12:08:53 +0100
  • 1a4ba19546 actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists; bulk_writer: changed handling of single-row dataframe parsing to JSON; elastic_update: changed function to return instead of print appData on error; ud_update: Changed nested lists to flat lists, and added start and end character positions Erik de Vries 2019-02-26 11:22:14 +0100
  • 3abc3056e0 actorizer: fix to columns selected for actors variable, removed udmodel requirement Erik de Vries 2019-02-20 13:58:51 +0100
  • 41c86ea116 actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing Erik de Vries 2019-02-20 13:42:10 +0100
  • eae1a22609 actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly Erik de Vries 2019-02-11 16:43:16 +0100
  • 5665b6d622 actorizer: more fixes to punctuation Erik de Vries 2019-02-05 14:33:55 +0100
  • cd05733648 actorizer: Additional fix for missing punctuation (see previous commit) Erik de Vries 2019-02-05 14:26:28 +0100
  • 09732a1b5a actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing) Erik de Vries 2019-02-05 14:10:27 +0100
  • 835d2332bc actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails Erik de Vries 2019-02-05 13:26:24 +0100
  • e70b6ccf7a actorizer: fixed sentence_count and out_parser calls; out_parser: Added comment with old regex Erik de Vries 2019-02-04 14:16:04 +0100
  • 9b0ac775af class_update: add ver variable to set version for class updated articles Erik de Vries 2019-01-16 19:36:03 +0100
  • 85306007f4 class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model Erik de Vries 2019-01-16 19:34:37 +0100
  • e110780ad5 merger: idiotic fix for a non-problem, see comment on line 32 Erik de Vries 2019-01-16 19:21:20 +0100
  • ce5f812252 dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field; merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas); merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .) Erik de Vries 2019-01-16 18:29:30 +0100
  • 386ac42aee lemma_writer: new function to write raw lemmas (without punctuation) to text file. Is structured as elasticizer update function (despite not updating anything on the server) Erik de Vries 2019-01-15 11:36:51 +0100
  • 4407a99774 actorizer: fix to get actual number of sentence occurrences of actor Erik de Vries 2019-01-14 17:25:45 +0100
  • 96e869fa6b actorizer: previous commit was wrong, only add is an option, removed type variable Erik de Vries 2019-01-14 17:06:43 +0100
  • 98219c807c actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed) Erik de Vries 2019-01-14 17:05:30 +0100
  • e3b57ed9e3 actorizer: added clean = F to have the exact same behavior in ud_update and actorizer Erik de Vries 2019-01-14 14:48:04 +0100
  • 7218f6b8d0 dupe_detect: fixed error on no duplicates Erik de Vries 2019-01-11 15:38:19 +0100
  • b9be372543 dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric) Erik de Vries 2019-01-11 15:23:18 +0100
  • 1955692346 dfm_gen, out_parser: updated documentation; dupe_detect: major fix to function, no longer using rownames for article ids Erik de Vries 2019-01-11 14:45:35 +0100
  • 34531b0da8 out_parser: added option to clean output using regex to remove numbers and non-words; dfm_gen, ud_update: updated functions to make use of out_parser cleaning option; merger: updated regex for cleaning lemmatized output Erik de Vries 2019-01-11 13:59:19 +0100
  • 5851c56369 query_string: updated check for fields value Erik de Vries 2019-01-09 14:04:38 +0100
  • 4f8b1f2024 elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned; query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized Erik de Vries 2019-01-09 13:52:51 +0100
  • d0e9bf565b dupe_detect: Reset the _delete value to 1; out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on . Erik de Vries 2019-01-08 15:07:40 +0100
  • ea8cfb071f dupe_detect: updated _delete var to be 2 when delete is true Erik de Vries 2019-01-03 23:35:28 +0100
  • 0a3bdb630b actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser); out_parser: function to parse raw text output into a single field, either from _source or highlight fields; dupe_detect: updated function to use 'ver' parameter for versioning Erik de Vries 2019-01-02 18:11:34 +0100
  • 9e5a1e3354 ud_update: removed mc.preschedule = F Erik de Vries 2018-12-30 20:37:42 +0100
  • c7560d7e32 ud_update: Removed . at end of text, and added mc.preschedule = F for testing Erik de Vries 2018-12-30 20:34:42 +0100
  • 37df81b8ff ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document Erik de Vries 2018-12-30 20:20:38 +0100
  • c32c9e5ad3 ud_update: fix to deal with non-existing column names Erik de Vries 2018-12-30 19:34:48 +0100
  • 8ffbddc073 actorizer, ud_update: implemented 'ver' variable for keeping track of updates Erik de Vries 2018-12-30 19:10:59 +0100
  • ae23456736 actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string) Erik de Vries 2018-12-30 19:02:17 +0100
  • 9f3418ef37 class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud' Erik de Vries 2018-12-15 13:02:22 +0100
  • 85aab558e0 bulk_writer: added clause to varname==ud update to also remove the tokens variable from source Erik de Vries 2018-12-13 21:24:29 +0100
  • 581e7b2929 DESCRIPTION: added SparseM as required package Erik de Vries 2018-12-13 14:05:09 +0100
  • 54dfb6a8ca actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; elastic_update: set the minimum break between retries from 10 to 30 seconds; elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between Erik de Vries 2018-12-13 13:40:03 +0100
  • b042fdb1e3 Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr Erik de Vries 2018-12-13 12:10:22 +0100
  • 8caf53b90a actorizer: switched to single core processing for debugging Erik de Vries 2018-12-13 12:03:23 +0100
  • e5c87cf69d actorizer: more debug prints Erik de Vries 2018-12-13 12:03:23 +0100
  • c63409238b actorizer: print row numbers for debugging Erik de Vries 2018-12-13 11:36:30 +0100
  • 39005c7518 elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates; query_gen_actors: Added an additional generator for the "Institution" type (for EU support); actorizer: Created an updater function to search for actors and use UDPipe to parse the results Erik de Vries 2018-12-12 19:01:10 +0100
  • a3c3651c79 elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout) Erik de Vries 2018-12-11 11:55:14 +0100
  • 4ad5357e15 elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space) DKJunk Erik de Vries 2018-12-11 11:02:19 +0100
  • a5ba00146f modelizer: fixed error when only one class is predicted for junk classification (borderline case) Erik de Vries 2018-12-09 13:41:34 +0100
  • a13d86b92d modelizer: added some more debug output Erik de Vries 2018-12-09 13:04:03 +0100
  • 23658ce51a test Erik de Vries 2018-12-08 19:36:10 +0100
  • 17cf6d04e9 modelizer: debug update Erik de Vries 2018-12-08 19:29:30 +0100
  • 7544e5323f modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms) Erik de Vries 2018-12-08 17:44:05 +0100
  • 5f5e4a03c8 modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks Erik de Vries 2018-12-07 14:35:50 +0100
  • 34a6adf64e changed udpipe output variable from tokens to ud Erik de Vries 2018-12-05 16:51:59 +0100
  • 061da17c2a ud_update: Added function to lemmatize documents Erik de Vries 2018-12-05 15:02:41 +0100
  • ef51ce60a9 Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen DupeDetect Erik de Vries 2018-12-04 17:44:43 +0100