Your Name
0e593075ee
query_gen_actors: only retrieve ud field from source
5 years ago
Your Name
6eb405f8bd
merger: selecting only relevant columns
5 years ago
Your Name
38ff4dcbf0
ud_update: small fix to file naming
5 years ago
Your Name
4b4d860235
class_update: remove dfm_gen multicore option
...
dfm_gen: remove multicore, update merger() code
elasticizer: changed filenaming scheme for dump option
merger: Fixed bug where an NA lemma would cause the entire document to become NA. Now the NA lemmas are filtered out before merging
ud_update: removed parallel processing, changed script to save bulk updates in .Rds files instead of sending them straight away
5 years ago
Your Name
5d99ec9509
elasticizer: added option to dump data frames to rds files
...
out_parser: changed to single core, due to performance increase
sentencizer: corrected documentation for sent_dict parameter
5 years ago
Your Name
aa6587b204
dupe_detect: fix for quotation marks
5 years ago
Your Name
2a220ded5d
dupe_detect: fix to query string for multi-word doctype names
5 years ago
Your Name
5bd36dcb44
dupe_detect: Changed query from json to query_string style, and added filter for already detected duplicates
...
cv_generator: Changed code to use a generic vector of true values to draw the conditional random sample, instead of dfm/docvars specifically
5 years ago
Your Name
e499d70671
actor_merger: added ungroup() calls at the start and end of function, to speed up processing
...
sentencizer: added ungroup() call at the end of the function to speed up processing
5 years ago
Your Name
8634d549a3
sentencizer: updates to collect sentence word counts and number of sentences also when no sent_dict is provided
5 years ago
Your Name
61e0581595
actor_merger: removed debug line
5 years ago
Your Name
f022312485
actor_merger: added function for generating actor-document data frames
...
actor_fetcher: removed from package
other: major update to documentation
5 years ago
Your Name
4e867214dd
sentencizer: commented code
5 years ago
Your Name
ec8afc4990
sentencizer: fixed actorsDetail coding error
5 years ago
Your Name
9ccfd2952e
sentencizer: minor updates
5 years ago
Your Name
98325bde8f
sentencizer: added new function for sentiment coding and actor collection
5 years ago
Your Name
7f958bbc11
actor_fetcher: small fixes
5 years ago
Your Name
8eedec8bb5
actor_fetcher: added option for using dictionaries with just lemmas, besides the option of using lemma_upos dictionaries
5 years ago
Your Name
057d225a7a
actor_fetcher: Allow generation of actor df containing only specified actor ids and aggregations
5 years ago
Your Name
9eae486a80
separated data preprocessing routines
...
class_update: check if there are idf values associated with model, before applying weights
estimator: make use of preproc() function for data preprocessing
preproc: function containing all logic with regards to text data preprocessing and weighting
5 years ago
Your Name
a3b6e19646
revised modeling pipeline:
...
cv_generator: generate folds for nested cv
dfm_gen: added optional lowercasing parameter
estimator: estimate model and performance based on parameters
feat_select: select features based on textstat_keyness
metric_gen: convert output from estimator to model performance metrics
modelizer: updated for new pipeline
modelizer_old: old model pipeline
out_parser: now correctly exported
5 years ago
Your Name
e76a914dd2
actor_fetcher: Updated to tidyr 1.0.0, no longer using preserve, slightly different approach to keeping ids_list, and not removing actorsDetail anymore because it does not exist
5 years ago
Your Name
a01a53f105
class_update: added cores parameter for multicore processing of sources when using lemmas
5 years ago
Your Name
d9f936c566
modelizer: tf-idf application updated, final model now also includes idf values from training set, explicitly setting positive category in binary classification for confusion matrices, minor code fixes
...
dfm_gen: added old junk codes for recoding, and removed deprecated ngrams parameter from dfm function
class_update: removed dfm_words parameter, which is replaced by the force = T parameter in predict(), training/model idf is now applied to unseen data
DESCRIPTION: added quanteda.textmodels as new dependency, since these have been separated from base quanteda 2.0.0 onwards
5 years ago
Erik de Vries
06bfec71bc
lemma_writer: unlist lemmas before writing
5 years ago
Erik de Vries
a83ee5dfd0
lemma_writer: update to write lemma instead of full document text
5 years ago
Erik de Vries
e594185719
dfm_gen: set default cores to 1
5 years ago
Erik de Vries
889e7e92af
lemma_writer: updated to provide support for writing raw documents to individual files using utf-8 encoding
5 years ago
Erik de Vries
115297f597
actor_aggregation,aggregator,aggregator_elastic: moved out of package directory to Old
...
actor_fetcher: moved sentiment validation code block
5 years ago
Erik de Vries
3fcbbd1f1f
actor_fetcher: fixed error where source.ud would not exist
5 years ago
Erik de Vries
674ef09e10
query_gen_actors: added junior minister check to if statement
5 years ago
Erik de Vries
853c117daf
actor_fetcher: change in code to keep original actorid lists in output
...
query_gen_actors: added code for junior ministers in BE and NL
5 years ago
Erik de Vries
bf3d11ffe0
query_gen_actors: various bugfixes and changes
5 years ago
Erik de Vries
99af1427f0
query_gen_actors: fixed Scandinavian query generation
5 years ago
Erik de Vries
e49a4ae93e
query_gen_actors: fixed problem with too many brackets in query
5 years ago
Erik de Vries
060751237b
actorizer, out_parser: switched from mclapply to future_lapply and removed windows-specific code from out_parser
...
query_gen_actors: rewritten minister queries to only use proximity queries
5 years ago
Erik de Vries
d0601d2aa7
actor_fetcher: added minimum verbosity to identify cases in which an actor is present without a party mention
5 years ago
Erik de Vries
82ef165c5f
actor_fetcher: quick fix
5 years ago
Erik de Vries
9e433ecf9e
actor_fetcher: added handling of exception where all actorsids related to a party are individual actors
5 years ago
Erik de Vries
526270900c
actor_fetcher: integrated party merging into actor_fetcher in what hopefully is the most efficient way
5 years ago
Erik de Vries
84df9658ff
actor_fetcher: added lemma output when validating, to detect most problematic lemmas
5 years ago
Erik de Vries
499ee74f0d
actor_fetcher: fixed code error
5 years ago
Erik de Vries
a3e8dcf96e
actor_fetcher: switched from binary word sentiment scores to proximity scores (cosine similarity)
6 years ago
Erik de Vries
6f5ace8c52
actor_fetcher: elasticizer batch function to fetch actorsDetail fields from all relevant documents
6 years ago
Erik de Vries
edd4b785a5
actor_aggregation: updated to use future package for parallel processing as beta test for switching all parallel processing to future. Also disabled some of the aggregator output to save computation time
6 years ago
Erik de Vries
f8bc53006d
actor_aggregation: added sentiment analysis support for generating aggregations
6 years ago
Erik de Vries
d3d4045f1c
actor_aggregation: added sentence count to output, and changed occurrences to count instead of mean, changed prom and rel_first to prom_art and rel_first_art, changed output filename to include function
6 years ago
Erik de Vries
176a8f6de4
elasticizer: added additional verbosity on errors
6 years ago
Erik de Vries
d420b02c20
elasticizer: Added more verbosity to investigate error handling
6 years ago
Erik de Vries
48b589dda0
query_gen_actors: reset to original state
6 years ago
Erik de Vries
7a01a7f18d
query_gen_actors: temporary update for fixing broken shit
6 years ago
Erik de Vries
45da9dd929
aggregator_elastic: revert to single-core lapply, due to sendMaster errors
6 years ago
Erik de Vries
f8e4111e70
aggregator_elastic: correct partyid implementation
6 years ago
Erik de Vries
c047a4a1db
aggregator_elastic: explicit reference to aggregator function
6 years ago
Erik de Vries
0d81d6fc7a
added aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregations
6 years ago
Erik de Vries
2281d11a68
actor_aggregation: fixed filenaming of .Rds files
6 years ago
Erik de Vries
d9f28a46d8
actor_aggregation: small fixes to code
6 years ago
Erik de Vries
a29d04dacd
actorizer: fixed handling of empty results due to regex filtering
6 years ago
Erik de Vries
8e920f5f37
elasticizer: removed idiotic 15min sleep time after 500 batches
6 years ago
Erik de Vries
a11d7728ea
actor_aggregation: only aggregate scores on non-junk articles
6 years ago
Erik de Vries
54a70c47a0
actor_aggregation: removed timeout for parallel processing, requires fix in elasticizer (cannot recycle the same connection)
6 years ago
Erik de Vries
58fce4d560
actor_aggregation: added randomized short sleep, to allow for parallel execution
6 years ago
Erik de Vries
e3b26c0be3
actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level
...
query_string: Added default_operator parameter, to define whether whitespaces should be interpreted as AND or OR, defaults to AND
6 years ago
Erik de Vries
28989f2bc4
dfm_gen: yet another fix for codes
6 years ago
Erik de Vries
0757b6bf8b
dfm_gen: re-added codes variable
6 years ago
Erik de Vries
2fc48cc2f7
dfm_gen: fixed absence of out$codes field
6 years ago
Erik de Vries
b249ff22de
dfm_gen.R: fixed junk mutation
6 years ago
Erik de Vries
0d05765ca7
dfm_gen: removed last remains of summer sample exceptions
6 years ago
Erik de Vries
e199b23227
dfm_gen: removed exceptions for NO summer codes
...
modelizer: created exception for outer_folds = 1
query_string: added parameter for default_operator
6 years ago
Erik de Vries
fbd525dc2e
modelizer: updated outer cross validation procedure to output raw prediction and true values, instead of processed and aggregated confusion matrix results
6 years ago
Erik de Vries
6a94bc3ed8
query_gen_actors: removed quotation marks from Minister search part
6 years ago
Erik de Vries
8d19333e59
query_gen_actors: changed script order for Belgian exceptions
6 years ago
Erik de Vries
3bfe61e425
query_gen_actors: fixed implementation of Belgian exceptions
6 years ago
Erik de Vries
81697345cb
modelizer: removed breaking code
6 years ago
Erik de Vries
9ca952ca89
elastic_update: removed wait_for from url
6 years ago
Erik de Vries
8051a81b66
actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)
6 years ago
Erik de Vries
ac37d836f5
elasticizer: added scroll_clear to null hits as well
6 years ago
Erik de Vries
75623856f7
elasticizer: updated scroll_clear to use conn object
6 years ago
Erik de Vries
c2d666c81d
bogus commit
6 years ago
Erik de Vries
e34460bf0f
elasticizer: clear scroll context when finishing query
6 years ago
Erik de Vries
9bd526fee0
elasticizer: fixed compatibility issues with elastic v1.0.0
6 years ago
Erik de Vries
f2312f65d5
elasticizer: update to account for syntax change in newer package versions
6 years ago
Erik de Vries
f6006eb9ba
actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand
6 years ago
Erik de Vries
298099a4e6
actorizer: fix to deal with empty updates (i.e. don't do an update)
6 years ago
Erik de Vries
6961c0b866
query_gen_actors: updated actorid filter to use the keyword subfield
6 years ago
Erik de Vries
703b5e59a4
actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after)
6 years ago
Erik de Vries
593d2de6e2
actorizer: add pre_tags and post_tags to argument list
...
bulk_writer: updated to use _doc doctype
query_gen_actors: added NA for all searches that don't have pre- or postfixes
6 years ago
Erik de Vries
a1b6c6a7cb
actorizer, query_gen_actors: revamped actor searches entirely
...
elasticizer: updated script for use with ES 7.x
6 years ago
Erik de Vries
88fc4ec53c
dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries
90fdbcc982
out_parser: parallelized when not in windoze
6 years ago
Erik de Vries
6414f759bd
actorizer: parallelized calculation of marker positions
6 years ago
Erik de Vries
522c872dba
out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries
5b9793cd8c
actorizer: removed nested mclapply
6 years ago
Erik de Vries
1a4ba19546
actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists
...
bulk_writer: changed handling of single-row dataframe parsing to JSON
elastic_update: changed function to return instead of print appData on error
ud_update: Changed nested lists to flat lists, and added start and end character positions
6 years ago
Erik de Vries
3abc3056e0
actorizer: fix to columns selected for actors variable, removed udmodel requirement
6 years ago
Erik de Vries
41c86ea116
actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing
6 years ago
Erik de Vries
eae1a22609
actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly
6 years ago
Erik de Vries
5665b6d622
actorizer: more fixes to punctuation
6 years ago
Erik de Vries
cd05733648
actorizer: Additional fix for missing punctuation (see previous commit)
6 years ago
Erik de Vries
09732a1b5a
actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)
6 years ago