mamlr

edevries

mamlr

Archived

Commit Graph

master

DKJunk

DupeDetect

NOJunk

fc16cc5833 mamlr final commit [master] Erik de Vries 2025-05-07 11:49:56 +0200
bbec8f5547 fix in package version check Erik de Vries 2024-09-12 17:20:09 +0200
e3c8d04984 update Erik de Vries 2023-06-10 18:25:57 +0200
0f7b1ee537 Add single_party param Fix actor.first to use min() instead of first() Erik de Vries 2023-03-27 17:28:24 +0200
5c80d82828 reintroduced certificate checks, linux01 certs work again Erik de Vries 2022-11-24 13:30:21 +0100
fcdffb6f58 removed default_field, so that all text fields are queried by default (this also includes any coder comments!) Erik de Vries 2022-11-22 16:45:29 +0100
9ae2866c41 remove default user Erik de Vries 2022-09-12 18:14:01 +0200
b130f9c313 added es_user parameter Erik de Vries 2022-09-12 18:13:09 +0200
3f268bbf06 Temporarily disable SSL verification Erik de Vries 2022-09-07 17:03:12 +0200
2944039f73 test Erik de Vries 2022-01-25 18:57:12 +0100
0b17555d99 sent_merger: Correctly add party metadata for _mfsa aggregations Erik de Vries 2022-01-25 18:39:27 +0100
108372452c sent_merger: Correctly add party metadata for _mfsa aggregations Erik de Vries 2022-01-25 18:39:27 +0100
16d02a055d sent_merger: Updated sentiment aggregation procedure. Now a dedicated actors_final.csv file is used as source of partyIds for individual actors, instead of the (deprecated) [partyId]_a ids that were previously provided as a result of the actor searches, or the (also deprecated) actor metadata provided in the ES actors database. Erik de Vries 2022-01-25 17:57:53 +0100
8875630235 fixed actor metadata generation as well, because the same actorId might occur multiple times in a sentence, if that actor has multiple functions during the same period. Erik de Vries 2021-05-08 11:20:20 +0200
9419d6dc08 Fixed incorrect mfs and mfsa aggregations. Previously multiple party/actor mentions in the same sentence (e.g. both a *_f and *_s mention) would all be taken into account separately, while the sentence should only be considered once Erik de Vries 2021-05-07 15:34:59 +0200
7703a8cd5b query_gen_actors: removed country argument, now reading country directly from actor data Erik de Vries 2021-01-22 19:35:17 +0100
64a48e5977 sent_merger: fixed bug with publication_date and grouper() Erik de Vries 2021-01-20 18:17:53 +0100
f6dfc6711b minor fix Erik de Vries 2020-10-21 15:38:10 +0200
09fd8d0cb2 removed some unused aggregations Erik de Vries 2020-10-21 15:35:04 +0200
17d49f07c0 updated namespace and docs Erik de Vries 2020-10-21 13:58:27 +0200
8ff4097304 renamed actor_merger to sent_merger and implemented fixes to work with sentiment data frames without actor ids Erik de Vries 2020-10-21 13:50:15 +0200
a37fc0410d removed sent_sum_pos/neg Erik de Vries 2020-10-16 17:03:51 +0200
153c54b376 reintroduced arousal, but should be warned that arousal performance is not directly evaluated Erik de Vries 2020-10-16 16:33:10 +0200
cdc78039ed removing text-level output from sentencizer, and optimizing storage by using factors Erik de Vries 2020-10-16 15:11:13 +0200
523d86799c removed arousal measures Erik de Vries 2020-10-16 14:09:38 +0200
4a0f2206fd removed multicore support, added parameters for dfm_gen Erik de Vries 2020-10-15 16:53:49 +0200
274c9179cb remove meta_file argument Your Name 2020-08-24 16:10:52 +0200
6e0e693d4e lemma_writer: removed meta csv code Your Name 2020-08-24 16:08:51 +0200
4fd9222a2d lemma_writer: updated to write metadata csv when dumping documents in ud format out_parser: fix for generating empty columns Your Name 2020-08-24 15:50:10 +0200
955f034e6a actor_merger: changed computation of arousal, and removed uninformative variables Your Name 2020-07-24 16:09:20 +0200
3cdb68b196 out_parser: updated fncols function Your Name 2020-07-23 13:14:31 +0200
dc40fbbb19 elasticizer: update rbindlist implementation Your Name 2020-07-23 13:04:29 +0200
18d47762d2 actor_merger: overhaul to include cutoffs at sentence level as intended, also included options to generate sentiment for text only (don't provide actors_meta or actor_groups) Your Name 2020-07-22 11:36:12 +0200
74909ca3a0 sentencizer: removed text sentiment computation from script, because of incorrect implementation Your Name 2020-07-22 10:12:01 +0200
c99ac23bb5 actor_merger: fixed absence of publication_date in some cases Your Name 2020-07-21 16:19:28 +0200
cc7fa5bffa actor_merger: added aggregations of all individual actors and all party mentions in an article Your Name 2020-07-20 15:27:32 +0200
d9d578c06a actor_merger: mult fix Your Name 2020-07-19 19:11:55 +0200
771145faf7 actor_merger: added mult='first' to metadata join for parties_actors to deal with duplicate partyIds (see 50Plus, Conservatives and Labour) Your Name 2020-07-19 19:08:16 +0200
1c14646e8f actor_merger: dont deselect sent_words and sent_sum columns Your Name 2020-07-19 18:42:47 +0200
9bd382f955 actor_merger: fix to generate bogus sentiment columns Your Name 2020-07-19 18:40:10 +0200
b7f1afddd1 actor_merger: total rewrite based on data.table for performance reasons. Added some exceptions due to non-existing partyIds that some individual actors have in the actor database Your Name 2020-07-19 18:22:35 +0200
2c8a88f9a0 elasticizer: switched from bind_rows to rbindlist for composing result actor_merger: added noactor.* sentiment columns, and switched to data.table for matching actor metadata with articles Your Name 2020-07-17 13:46:31 +0200
559199bb97 sentencizer: totally removed sent_lemmas field Your Name 2020-07-08 16:13:07 +0200
36f2b341a8 sentencizer: removed derived output from function Your Name 2020-07-08 16:09:04 +0200
80ec0be1f8 actorizer: updated to account for token start offset in udpipe output. Sometimes, the first token in an article doesn't start at character position 1 (or 2 if the article starts with a whitespace), but at position 16 and possibly other positions. Your Name 2020-07-06 17:50:04 +0200
336567732c elastic_update: added more debug output Your Name 2020-07-06 11:17:53 +0200
df7631b9f1 sentencizer: Changed output, removed lemma list and added separate positive and negative sentiment sums Your Name 2020-07-05 13:15:02 +0200
ecdb5be3b4 actorizer: moved some code Your Name 2020-07-03 14:06:18 +0200
50f33e78d7 DESCRIPTION: updated Your Name 2020-07-03 14:03:52 +0200
69d4b6f5b0 actorizer: updated to data.table for conditional joins DESCRIPTION: added data.table dependency Your Name 2020-07-03 14:00:43 +0200
085855908c query_gen_actors: switched from Minister to Min Your Name 2020-07-02 10:07:58 +0200
b406304c80 actorizer: Removed nested parallelization function query_gen_actors: Integrated startDate and endDate for parties, changed party exception method from abbreviation only to both full names and abbreviations for NL and BE Your Name 2020-07-01 19:25:50 +0200
5de4e1488c estimator, modelizer, preproc: Removed experimental we-vector support, and disabled inefficiently implemented preproc.R Your Name 2020-06-22 15:07:46 +0200
77eb51a1bf actorizer: totally revamped way of finding actors elasticizer: updated dump handling to create a dump for every batch, instead of one big file at the end out_parser: streamlined code query_gen_actors: only include relevant fields ud_update: changed function parameters to work with elasticizer dump function Your Name 2020-06-19 11:34:18 +0200
0e593075ee query_gen_actors: only retrieve ud field from source Your Name 2020-06-15 19:04:26 +0200
6eb405f8bd merger: selecting only relevant columns Your Name 2020-06-15 18:30:03 +0200
38ff4dcbf0 ud_update: small fix to file naming Your Name 2020-06-15 18:26:26 +0200
4b4d860235 class_update: remove dfm_gen multicore option dfm_gen: remove multicore, update merger() code elasticizer: changed filenaming scheme for dump option merger: Fixed bug where an NA lemma would cause the entire document to become NA. Now the NA lemmas are filtered out before merging ud_update: removed parallel processing, changed script to save bulk updates in .Rds files instead of sending them straight away Your Name 2020-06-15 18:25:16 +0200
5d99ec9509 elasticizer: added option to dump data frames to rds files out_parser: changed to single core, due to performance increase sentencizer: corrected documentation for sent_dict parameter Your Name 2020-06-10 17:58:12 +0200
aa6587b204 dupe_detect: fix for quotation marks Your Name 2020-06-10 15:22:41 +0200
2a220ded5d dupe_detect: fix to query string for multi-word doctype names Your Name 2020-06-10 15:06:35 +0200
5bd36dcb44 dupe_detect: Changed query from json to query_string style, and added filter for already detected duplicates cv_generator: Changed code to use a generic vector of true values to draw the conditional random sample, instead of dfm/docvars specifically Your Name 2020-06-09 12:13:37 +0200
e499d70671 actor_merger: added ungroup() calls at the start and end of function, to speed up processing sentencizer: added ungroup() call at the end of the function to speed up processing Your Name 2020-05-27 13:13:21 +0200
8634d549a3 sentencizer: updates to collect sentence word counts and number of sentences also when no sent_dict is provided Your Name 2020-05-26 18:37:26 +0200
61e0581595 actor_merger: removed debug line Your Name 2020-05-26 17:48:10 +0200
11bf71c7dd fixes for removal of actor_fetcher function Your Name 2020-05-26 17:15:14 +0200
f022312485 actor_merger: added function for generating actor-document data frames actor_fetcher: removed from package other: major update to documentation Your Name 2020-05-26 17:12:22 +0200
4e867214dd sentencizer: commented code Your Name 2020-05-26 15:33:28 +0200
ec8afc4990 sentencizer: fixed actorsDetail coding error Your Name 2020-05-25 16:16:42 +0200
9ccfd2952e sentencizer: minor updates Your Name 2020-05-25 15:48:46 +0200
98325bde8f sentencizer: added new function for sentiment coding and actor collection Your Name 2020-05-22 21:43:27 +0200
7f958bbc11 actor_fetcher: small fixes Your Name 2020-05-20 13:56:42 +0200
8eedec8bb5 actor_fetcher: added option for using dictionaries with just lemmas, besides the option of using lemma_upos dictionaries Your Name 2020-05-20 12:44:09 +0200
057d225a7a actor_fetcher: Allow generation of actor df containing only specified actor ids and aggregations Your Name 2020-05-20 12:29:26 +0200
9eae486a80 separated data preprocessing routines class_update: check if there are idf values associated with model, before applying weights estimator: make use of preproc() function for data preprocessing preproc: function containing all logic with regards to text data preprocessing and weighting Your Name 2020-04-09 15:32:07 +0200
a3b6e19646 revised modeling pipeline: cv_generator: generate folds for nested cv dfm_gen: added optional lowercasing parameter estimator: estimate model and performance based on parameters feat_select: select features based on textstat_keyness metric_gen: convert output from estimator to model performance metrics modelizer: updated for new pipeline modelizer_old: old model pipeline out_parser: now correctly exported Your Name 2020-04-09 14:02:50 +0200
e76a914dd2 actor_fetcher: Updated to tidyr 1.0.0, no longer using preserve, slightly different approach to keeping ids_list, and not removing actorsDetail anymore because it does not exist Your Name 2020-03-18 14:10:01 +0100
a01a53f105 class_update: added cores parameter for multicore processing of sources when using lemmas Your Name 2020-03-11 15:44:52 +0100
d9f936c566 modelizer: tf-idf application updated, final model now also includes idf values from training set, explicitly setting positive category in binary classification for confusion matrices, minor code fixes dfm_gen: added old junk codes for recoding, and removed deprecated ngrams parameter from dfm function class_update: removed dfm_words parameter, which is replaced by the force = T parameter in predict(), training/model idf is now applied to unseen data DESCRIPTION: added quanteda.textmodels as new dependency, since these have been separated from base quanteda 2.0.0 onwards Your Name 2020-03-11 15:35:04 +0100
06bfec71bc lemma_writer: unlist lemmas before writing Erik de Vries 2019-09-01 13:23:24 +0200
a83ee5dfd0 lemma_writer: update to write lemma instead of full document text Erik de Vries 2019-09-01 13:13:08 +0200
e594185719 dfm_gen: set default cores to 1 Erik de Vries 2019-08-30 13:51:59 +0200
889e7e92af lemma_writer: updated to provide support for writing raw documents to individual files using utf-8 encoding Erik de Vries 2019-08-28 15:52:52 +0200
115297f597 actor_aggregation,aggregator,aggregator_elastic: moved out of package directory to Old actor_fetcher: moved sentiment validation code block Erik de Vries 2019-08-12 13:50:31 +0200
3fcbbd1f1f actor_fetch: fixed error where source.ud would not exist Erik de Vries 2019-07-06 18:34:25 +0200
674ef09e10 query_gen_actors: added junior minister check to if statement Erik de Vries 2019-07-06 14:47:58 +0200
853c117daf actor_fetcher: change in code to keep original actorid lists in output query_gen_actors: added code for junior ministers in BE and NL Erik de Vries 2019-07-05 14:43:15 +0200
bf3d11ffe0 query_gen_actors: various bugfixes and changes Erik de Vries 2019-07-04 17:11:58 +0200
99af1427f0 query_gen_actors: fixed scandinavian query generation Erik de Vries 2019-07-03 11:48:04 +0200
e49a4ae93e query_gen_actors: fixed problem with too many brackets in query Erik de Vries 2019-07-03 11:24:33 +0200
060751237b actorizer, out_parser: switched from mclapply to future_lapply and removed windows-specific code from out_parser query_gen_actors: rewritten minister queries to only use proximity queries Erik de Vries 2019-07-02 15:29:31 +0200
d0601d2aa7 actor_fetcher: added minimum verbosity to identify cases in which an actor is present without a party mention Erik de Vries 2019-06-25 19:43:35 +0200
82ef165c5f actor_fetcher: quick fix Erik de Vries 2019-06-25 19:13:51 +0200
9e433ecf9e actor_fetcher: added handling of exception where all actorsids related to a party are individual actors Erik de Vries 2019-06-25 19:08:12 +0200
526270900c actor_fetcher: integrated party merging into actor_fetcher in what hopefully is the most efficient way Erik de Vries 2019-06-25 18:53:26 +0200
84df9658ff actor_fetcher: added lemma output when validating, to detect most problematic lemmas Erik de Vries 2019-06-25 15:28:23 +0200
499ee74f0d actor_fetcher: fixed code error Erik de Vries 2019-06-24 15:03:39 +0200
a3e8dcf96e actor_fetcher: switched from binary word sentiment scores to proximity scores (cosine similarity) Erik de Vries 2019-06-21 16:23:28 +0200
6f5ace8c52 actor_fetcher: elasticizer batch function to fetch actorsDetail fields from all relevant documents Erik de Vries 2019-06-21 15:35:04 +0200
edd4b785a5 actor_aggregation: updated to use future package for parallel processing as beta test for switching all parallel processing to future. Also disabled some of the aggregator output to save computation time Erik de Vries 2019-06-20 12:54:14 +0200
