201 Commits (17d49f07c0915debe6dced7786037a9076e9e608)

Author SHA1 Message Date
Your Name 8eedec8bb5 actor_fetcher: added option for using dictionaries with just lemmas, besides the option of using lemma_upos dictionaries
5 years ago
Your Name 057d225a7a actor_fetcher: Allow generation of actor df containing only specified actor ids and aggregations
5 years ago
Your Name 9eae486a80 separated data preprocessing routines
5 years ago
Your Name a3b6e19646 revised modeling pipeline:
5 years ago
Your Name e76a914dd2 actor_fetcher: Updated to tidyr 1.0.0, no longer using preserve, slightly different approach to keeping ids_list, and not removing actorsDetail anymore because it does not exist
5 years ago
Your Name a01a53f105 class_update: added cores parameter for multicore processing of sources when using lemmas
5 years ago
Your Name d9f936c566 modelizer: tf-idf application updated, final model now also includes idf values from training set, explicitly setting positive category in binary classification for confusion matrices, minor code fixes
5 years ago
Erik de Vries 06bfec71bc lemma_writer: unlist lemmas before writing
5 years ago
Erik de Vries a83ee5dfd0 lemma_writer: update to write lemma instead of full document text
5 years ago
Erik de Vries e594185719 dfm_gen: set default cores to 1
5 years ago
Erik de Vries 889e7e92af lemma_writer: updated to provide support for writing raw documents to individual files using utf-8 encoding
5 years ago
Erik de Vries 115297f597 actor_aggregation,aggregator,aggregator_elastic: moved out of package directory to Old
5 years ago
Erik de Vries 3fcbbd1f1f actor_fetch: fixed error where source.ud would not exist
5 years ago
Erik de Vries 674ef09e10 query_gen_actors: added junior minister check to if statement
5 years ago
Erik de Vries 853c117daf actor_fetcher: change in code to keep original actorid lists in output
5 years ago
Erik de Vries bf3d11ffe0 query_gen_actors: various bugfixes and changes
5 years ago
Erik de Vries 99af1427f0 query_gen_actors: fixed scandinavian query generation
5 years ago
Erik de Vries e49a4ae93e query_gen_actors: fixed problem with too many brackets in query
5 years ago
Erik de Vries 060751237b actorizer, out_parser: switched from mclapply to future_lapply and removed windows-specific code from out_parser
5 years ago
Erik de Vries d0601d2aa7 actor_fetcher: added minimum verbosity to identify cases in which an actor is present without a party mention
5 years ago
Erik de Vries 82ef165c5f actor_fetcher: quick fix
5 years ago
Erik de Vries 9e433ecf9e actor_fetcher: added handling of exception where all actorsids related to a party are individual actors
5 years ago
Erik de Vries 526270900c actor_fetcher: integrated party merging into actor_fetcher in what hopefully is the most efficient way
5 years ago
Erik de Vries 84df9658ff actor_fetcher: added lemma output when validating, to detect most problematic lemmas
5 years ago
Erik de Vries 499ee74f0d actor_fetcher: fixed code error
5 years ago
Erik de Vries a3e8dcf96e actor_fetcher: switched from binary word sentiment scores to proximity scores (cosine similarity)
6 years ago
Erik de Vries 6f5ace8c52 actor_fetcher: elasticizer batch function to fetch actorsDetail fields from all relevant documents
6 years ago
Erik de Vries edd4b785a5 actor_aggregation: updated to use future package for parallel processing as beta test for switching all parallel processing to future. Also disabled some of the aggregator output to save computation time
6 years ago
Erik de Vries f8bc53006d actor_aggregation: added sentiment analysis support for generating aggregations
6 years ago
Erik de Vries d3d4045f1c actor_aggregation: added sentence count to output, and changed occurences to count instead of mean, changed prom and rel_first to prom_art and rel_first_art, changed output filename to include function
6 years ago
Erik de Vries 176a8f6de4 elasticizer: added additional verbosity on errors
6 years ago
Erik de Vries d420b02c20 elasticizer: Added more verbosity to investigate error handling
6 years ago
Erik de Vries 48b589dda0 query_gen_actors: reset to original state
6 years ago
Erik de Vries 7a01a7f18d query_gen_actors: temporary update for fixing broken shit
6 years ago
Erik de Vries 45da9dd929 aggregator_elastic: revert to single-core lapply, due to sendMaster errors
6 years ago
Erik de Vries f8e4111e70 aggregator_elastic: correct partyid implementation
6 years ago
Erik de Vries c047a4a1db aggregator_elastic: explicit reference to aggregator function
6 years ago
Erik de Vries 0d81d6fc7a added aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregations
6 years ago
Erik de Vries 2281d11a68 actor_aggregation: fixed filenaming of .Rds files
6 years ago
Erik de Vries d9f28a46d8 actor_aggregation: small fixes to code
6 years ago
Erik de Vries a29d04dacd actorizer: fixed handling of empty results due to regex filtering
6 years ago
Erik de Vries 8e920f5f37 elasticizer: removed idiotic 15min sleep time after 500 batches
6 years ago
Erik de Vries a11d7728ea actor_aggregation: only aggregate scores on non-junk articles
6 years ago
Erik de Vries 54a70c47a0 actor_aggregation: removed timeout for parallel processing, requires fix in elasticizer (cannot recycle the same connection)
6 years ago
Erik de Vries 58fce4d560 actor_aggregation: added randomized short sleep, to allow for parallel execution
6 years ago
Erik de Vries e3b26c0be3 actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level
6 years ago
Erik de Vries 28989f2bc4 dfm_gen: yet another fix for codes
6 years ago
Erik de Vries 0757b6bf8b dfm_gen: re-added codes variable
6 years ago
Erik de Vries 2fc48cc2f7 dfm_gen: fixed absence of out$codes field
6 years ago
Erik de Vries b249ff22de dfm_gen.R: fixed junk mutation
6 years ago
Erik de Vries 0d05765ca7 dfm_gen: removed last remains of summer sample exceptions
6 years ago
Erik de Vries e199b23227 dfm_gen: removed exceptions for NO summer codes
6 years ago
Erik de Vries fbd525dc2e modelizer: updated outer cross validation procedure to output raw prediction and true values, instead of processed and aggregated confusion matrix results
6 years ago
Erik de Vries 6a94bc3ed8 query_gen_actors: removed quotation marks from Minister search part
6 years ago
Erik de Vries 8d19333e59 query_gen_actors: changed script order for belgium exceptions
6 years ago
Erik de Vries 3bfe61e425 query_gen_actors: fixed implementation of Belgian exceptions
6 years ago
Erik de Vries 81697345cb modelizer: removed breaking code
6 years ago
Erik de Vries 9ca952ca89 elastic_update: removed wait_for from url
6 years ago
Erik de Vries 8051a81b66 actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)
6 years ago
Erik de Vries ac37d836f5 elasticizer: added scroll_clear to null hits as well
6 years ago
Erik de Vries 75623856f7 elasticizer: updated scroll_clear to use conn object
6 years ago
Erik de Vries c2d666c81d bogus commit
6 years ago
Erik de Vries e34460bf0f elasticizer: clear scroll context when finishing query
6 years ago
Erik de Vries 9bd526fee0 elasticizer: fixed compatibility issues with elastic v1.0.0
6 years ago
Erik de Vries f2312f65d5 elasticizer: update to account for syntax change in newer package versions
6 years ago
Erik de Vries f6006eb9ba actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand
6 years ago
Erik de Vries 298099a4e6 actorizer: fix to deal with empty updates (i.e. don't do an update)
6 years ago
Erik de Vries 6961c0b866 query_gen_actors: updated actorid filter to use the keyword subfield
6 years ago
Erik de Vries 703b5e59a4 actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after)
6 years ago
Erik de Vries 593d2de6e2 actorizer: add pre_tags and post_tags to argument list
6 years ago
Erik de Vries a1b6c6a7cb actorizer, query_gen_actors: revamped actor searches entirely
6 years ago
Erik de Vries 88fc4ec53c dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries 90fdbcc982 out_parser: parallelized when not in windoze
6 years ago
Erik de Vries 6414f759bd actorizer: parallelized calculation of marker positions
6 years ago
Erik de Vries 522c872dba out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries 5b9793cd8c actorizer: removed nested mclapply
6 years ago
Erik de Vries 1a4ba19546 actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists
6 years ago
Erik de Vries 3abc3056e0 actorizer: fix to columns selected for actors variable, removed udmodel requirement
6 years ago
Erik de Vries 41c86ea116 actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing
6 years ago
Erik de Vries eae1a22609 actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly
6 years ago
Erik de Vries 5665b6d622 actorizer: more fixes to punctuation
6 years ago
Erik de Vries cd05733648 actorizer: Additional fix for missing punctuation (see previous commit)
6 years ago
Erik de Vries 09732a1b5a actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)
6 years ago
Erik de Vries 835d2332bc actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails
6 years ago
Erik de Vries e70b6ccf7a actorizer: fixed sentence_count and out_parser calls
6 years ago
Erik de Vries 9b0ac775af class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries 85306007f4 class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries e110780ad5 merger: idiotic fix for a non-problem, see comment on line 32
6 years ago
Erik de Vries ce5f812252 dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
6 years ago
Erik de Vries 386ac42aee lemma_writer: new function to write raw lemmas (without punctuation) to text file. Is structured as elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries 4407a99774 actorizer: fix to get actual number of sentence occurrences of actor
6 years ago
Erik de Vries 96e869fa6b actorizer: previous commit was wrong, only add is an option, removed type variable
6 years ago
Erik de Vries 98219c807c actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)
6 years ago
Erik de Vries e3b57ed9e3 actorizer: added clean = F to have the exact same behavior in ud_update and actorizer
6 years ago
Erik de Vries 7218f6b8d0 dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries b9be372543 dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries 1955692346 dfm_gen, out_parser: updated documentation
6 years ago
Erik de Vries 34531b0da8 out_parser: added option to clean output using regex to remove numbers and non-words
6 years ago
Erik de Vries 5851c56369 query_string: updated check for fields value
6 years ago
Erik de Vries 4f8b1f2024 elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
6 years ago