47 Commits (e3c8d04984ceca6f653d838c6e5e56a61042f144)

Author SHA1 Message Date
Erik de Vries 7703a8cd5b query_gen_actors: removed country argument, now reading country directly from actor data
4 years ago
Erik de Vries 17d49f07c0 updated namespace and docs
4 years ago
Erik de Vries 4a0f2206fd removed multicore support, added parameters for dfm_gen
4 years ago
Your Name b7f1afddd1 actor_merger: total rewrite based on data.table for performance reasons. Added some exceptions due to non-existing partyIds that some individual actors have in the actor database
4 years ago
Your Name 77eb51a1bf actorizer: totally revamped way of finding actors
4 years ago
Your Name 4b4d860235 class_update: remove dfm_gen multicore option
5 years ago
Your Name 5d99ec9509 elasticizer: added option to dump data frames to rds files
5 years ago
Your Name 11bf71c7dd fixes for removal of actor_fetcher function
5 years ago
Your Name f022312485 actor_merger: added function for generating actor-document data frames
5 years ago
Your Name 98325bde8f sentencizer: added new function for sentiment coding and actor collection
5 years ago
Your Name 9eae486a80 separated data preprocessing routines
5 years ago
Your Name a3b6e19646 revised modeling pipeline:
5 years ago
Erik de Vries 889e7e92af lemma_writer: updated to provide support for writing raw documents to individual files using utf-8 encoding
5 years ago
Erik de Vries 6f5ace8c52 actor_fetcher: elasticizer batch function to fetch actorsDetail fields from all relevant documents
6 years ago
Erik de Vries 0d81d6fc7a added aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregations
6 years ago
Erik de Vries e3b26c0be3 actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level
6 years ago
Erik de Vries 8051a81b66 actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)
6 years ago
Erik de Vries a1b6c6a7cb actorizer, query_gen_actors: revamped actor searches entirely
6 years ago
Erik de Vries 5b9793cd8c actorizer: removed nested mclapply
6 years ago
Erik de Vries 9b0ac775af class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries 85306007f4 class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries ce5f812252 dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
6 years ago
Erik de Vries 386ac42aee lemma_writer: new function to write raw lemma's (without interpunction) to text file. Is structured as elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries 1955692346 dfm_gen, out_parser: updated documentation
6 years ago
Erik de Vries 34531b0da8 out_parser: added option to clean output using regex to remove numbers and non-words
6 years ago
Erik de Vries 4f8b1f2024 elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
6 years ago
Erik de Vries 0a3bdb630b actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
6 years ago
Erik de Vries 8ffbddc073 actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries ae23456736 actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
6 years ago
Erik de Vries 9f3418ef37 class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries 39005c7518 elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
6 years ago
Erik de Vries 061da17c2a ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries ef51ce60a9 Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries 0e8c127b86 bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
6 years ago
Erik de Vries 085252abda documentation: updated dupe_detect and merger
6 years ago
Erik de Vries f543d658bd Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries 4cd46d1a5e dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries 11d8b31c60 Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries adc4b3c639 Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries 65f8c26ec6 Renamed dupe_detect, and added return output
6 years ago
Erik de Vries db418d7396 Add query_string function for generating query_string queries
6 years ago
Erik de Vries d203de0b2a Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries c815dc7f2b Duplicate detection first commit
6 years ago
Erik de Vries 217ee76568 V 0.1 for elasticizer function with updater support
6 years ago
Erik de Vries a273524105 Added support for custom update function to elasticizer
6 years ago
Erik de Vries 0e45c0f2d1 Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries 4bbe84ab83 First release of mamlr package
6 years ago