Your Name
8eedec8bb5
actor_fetcher: added option for using dictionaries with just lemmas, besides the option of using lemma_upos dictionaries
5 years ago
Your Name
057d225a7a
actor_fetcher: Allow generation of actor df containing only specified actor ids and aggregations
5 years ago
Your Name
9eae486a80
separated data preprocessing routines
...
class_update: check if there are idf values associated with model, before applying weights
estimator: make use of preproc() function for data preprocessing
preproc: function containing all logic with regards to text data preprocessing and weighting
5 years ago
Your Name
a3b6e19646
revised modeling pipeline:
...
cv_generator: generate folds for nested cv
dfm_gen: added optional lowercasing parameter
estimator: estimate model and performance based on parameters
feat_select: select features based on textstat_keyness
metric_gen: convert output from estimator to model performance metrics
modelizer: updated for new pipeline
modelizer_old: old model pipeline
out_parser: now correctly exported
5 years ago
Your Name
e76a914dd2
actor_fetcher: Updated to tidyr 1.0.0, no longer using preserve, slightly different approach to keeping ids_list, and not removing actorsDetail anymore because it does not exist
5 years ago
Your Name
a01a53f105
class_update: added cores parameter for multicore processing of sources when using lemmas
5 years ago
Your Name
d9f936c566
modelizer: tf-idf application updated, final model now also includes idf values from training set, explicitly setting positive category in binary classification for confusion matrices, minor code fixes
...
dfm_gen: added old junk codes for recoding, and removed deprecated ngrams parameter from dfm function
class_update: removed dfm_words parameter, which is replaced by the force = T parameter in predict(), training/model idf is now applied to unseen data
DESCRIPTION: added quanteda.textmodels as new dependency, since these have been separated from base quanteda 2.0.0 onwards
5 years ago
Erik de Vries
06bfec71bc
lemma_writer: unlist lemmas before writing
5 years ago
Erik de Vries
a83ee5dfd0
lemma_writer: update to write lemma instead of full document text
5 years ago
Erik de Vries
e594185719
dfm_gen: set default cores to 1
5 years ago
Erik de Vries
889e7e92af
lemma_writer: updated to provide support for writing raw documents to individual files using utf-8 encoding
5 years ago
Erik de Vries
115297f597
actor_aggregation,aggregator,aggregator_elastic: moved out of package directory to Old
...
actor_fetcher: moved sentiment validation code block
5 years ago
Erik de Vries
3fcbbd1f1f
actor_fetch: fixed error where source.ud would not exist
5 years ago
Erik de Vries
674ef09e10
query_gen_actors: added junior minister check to if statement
5 years ago
Erik de Vries
853c117daf
actor_fetcher: change in code to keep original actorid lists in output
...
query_gen_actors: added code for junior ministers in BE and NL
5 years ago
Erik de Vries
bf3d11ffe0
query_gen_actors: various bugfixes and changes
5 years ago
Erik de Vries
99af1427f0
query_gen_actors: fixed Scandinavian query generation
5 years ago
Erik de Vries
e49a4ae93e
query_gen_actors: fixed problem with too many brackets in query
5 years ago
Erik de Vries
060751237b
actorizer, out_parser: switched from mclapply to future_lapply and removed windows-specific code from out_parser
...
query_gen_actors: rewritten minister queries to only use proximity queries
5 years ago
Erik de Vries
d0601d2aa7
actor_fetcher: added minimum verbosity to identify cases in which an actor is present without a party mention
5 years ago
Erik de Vries
82ef165c5f
actor_fetcher: quick fix
5 years ago
Erik de Vries
9e433ecf9e
actor_fetcher: added handling of exception where all actorsids related to a party are individual actors
5 years ago
Erik de Vries
526270900c
actor_fetcher: integrated party merging into actor_fetcher in what hopefully is the most efficient way
5 years ago
Erik de Vries
84df9658ff
actor_fetcher: added lemma output when validating, to detect most problematic lemmas
5 years ago
Erik de Vries
499ee74f0d
actor_fetcher: fixed code error
5 years ago
Erik de Vries
a3e8dcf96e
actor_fetcher: switched from binary word sentiment scores to proximity scores (cosine similarity)
6 years ago
Erik de Vries
6f5ace8c52
actor_fetcher: elasticizer batch function to fetch actorsDetail fields from all relevant documents
6 years ago
Erik de Vries
edd4b785a5
actor_aggregation: updated to use future package for parallel processing as beta test for switching all parallel processing to future. Also disabled some of the aggregator output to save computation time
6 years ago
Erik de Vries
f8bc53006d
actor_aggregation: added sentiment analysis support for generating aggregations
6 years ago
Erik de Vries
d3d4045f1c
actor_aggregation: added sentence count to output, and changed occurences to count instead of mean, changed prom and rel_first to prom_art and rel_first_art, changed output filename to include function
6 years ago
Erik de Vries
176a8f6de4
elasticizer: added additional verbosity on errors
6 years ago
Erik de Vries
d420b02c20
elasticizer: Added more verbosity to investigate error handling
6 years ago
Erik de Vries
48b589dda0
query_gen_actors: reset to original state
6 years ago
Erik de Vries
7a01a7f18d
query_gen_actors: temporary update for fixing broken shit
6 years ago
Erik de Vries
45da9dd929
aggregator_elastic: revert to single-core lapply, due to sendMaster errors
6 years ago
Erik de Vries
f8e4111e70
aggregator_elastic: correct partyid implementation
6 years ago
Erik de Vries
c047a4a1db
aggregator_elastic: explicit reference to aggregator function
6 years ago
Erik de Vries
0d81d6fc7a
added aggregator and aggregator_elastic functions for aggregating and storing article level actor aggregations
6 years ago
Erik de Vries
2281d11a68
actor_aggregation: fixed filenaming of .Rds files
6 years ago
Erik de Vries
d9f28a46d8
actor_aggregation: small fixes to code
6 years ago
Erik de Vries
a29d04dacd
actorizer: fixed handling of empty results due to regex filtering
6 years ago
Erik de Vries
8e920f5f37
elasticizer: removed idiotic 15min sleep time after 500 batches
6 years ago
Erik de Vries
a11d7728ea
actor_aggregation: only aggregate scores on non-junk articles
6 years ago
Erik de Vries
54a70c47a0
actor_aggregation: removed timeout for parallel processing, requires fix in elasticizer (cannot recycle the same connection)
6 years ago
Erik de Vries
58fce4d560
actor_aggregation: added randomized short sleep, to allow for parallel execution
6 years ago
Erik de Vries
e3b26c0be3
actor_aggregation: Added function to generate aggregate actor measures at daily, weekly, monthly and yearly level
...
query_string: Added default_operator parameter, to define whether whitespaces should be interpreted as AND or OR, defaults to AND
6 years ago
Erik de Vries
28989f2bc4
dfm_gen: yet another fix for codes
6 years ago
Erik de Vries
0757b6bf8b
dfm_gen: re-added codes variable
6 years ago
Erik de Vries
2fc48cc2f7
dfm_gen: fixed absence of out$codes field
6 years ago
Erik de Vries
b249ff22de
dfm_gen.R: fixed junk mutation
6 years ago
Erik de Vries
0d05765ca7
dfm_gen: removed last remains of summer sample exceptions
6 years ago
Erik de Vries
e199b23227
dfm_gen: removed exceptions for NO summer codes
...
modelizer: created exception for outer_folds = 1
query_string: added parameter for default_operator
6 years ago
Erik de Vries
fbd525dc2e
modelizer: updated outer cross validation procedure to output raw prediction and true values, instead of processed and aggregated confusion matrix results
6 years ago
Erik de Vries
6a94bc3ed8
query_gen_actors: removed quotation marks from Minister search part
6 years ago
Erik de Vries
8d19333e59
query_gen_actors: changed script order for Belgian exceptions
6 years ago
Erik de Vries
3bfe61e425
query_gen_actors: fixed implementation of Belgian exceptions
6 years ago
Erik de Vries
81697345cb
modelizer: removed breaking code
6 years ago
Erik de Vries
9ca952ca89
elastic_update: removed wait_for from url
6 years ago
Erik de Vries
8051a81b66
actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)
6 years ago
Erik de Vries
ac37d836f5
elasticizer: added scroll_clear to null hits as well
6 years ago
Erik de Vries
75623856f7
elasticizer: updated scroll_clear to use conn object
6 years ago
Erik de Vries
c2d666c81d
bogus commit
6 years ago
Erik de Vries
e34460bf0f
elasticizer: clear scroll context when finishing query
6 years ago
Erik de Vries
9bd526fee0
elasticizer: fixed compatibility issues with elastic v1.0.0
6 years ago
Erik de Vries
f2312f65d5
elasticizer: update to account for syntax change in newer package versions
6 years ago
Erik de Vries
f6006eb9ba
actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand
6 years ago
Erik de Vries
298099a4e6
actorizer: fix to deal with empty updates (i.e. don't do an update)
6 years ago
Erik de Vries
6961c0b866
query_gen_actors: updated actorid filter to use the keyword subfield
6 years ago
Erik de Vries
703b5e59a4
actorizer: fixed exceptionizer by adding whitespace before and after sentence, which is necessary because of negative regex (match anything before or after the highlight string that is NOT x actually requires something to be in front or after)
6 years ago
Erik de Vries
593d2de6e2
actorizer: add pre_tags and post_tags to argument list
...
bulk_writer: updated to use _doc doctype
query_gen_actors: added NA for all searches that don't have pre- or postfixes
6 years ago
Erik de Vries
a1b6c6a7cb
actorizer, query_gen_actors: revamped actor searches entirely
...
elasticizer: updated script for use with ES 7.x
6 years ago
Erik de Vries
88fc4ec53c
dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries
90fdbcc982
out_parser: parallelized when not on Windows
6 years ago
Erik de Vries
6414f759bd
actorizer: parallelized calculation of marker positions
6 years ago
Erik de Vries
522c872dba
out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries
5b9793cd8c
actorizer: removed nested mclapply
6 years ago
Erik de Vries
1a4ba19546
actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists
...
bulk_writer: changed handling of single-row dataframe parsing to JSON
elastic_update: changed function to return instead of print appData on error
ud_update: Changed nested lists to flat lists, and added start and end character positions
6 years ago
Erik de Vries
3abc3056e0
actorizer: fix to columns selected for actors variable, removed udmodel requirement
6 years ago
Erik de Vries
41c86ea116
actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing
6 years ago
Erik de Vries
eae1a22609
actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly
6 years ago
Erik de Vries
5665b6d622
actorizer: more fixes to punctuation
6 years ago
Erik de Vries
cd05733648
actorizer: Additional fix for missing punctuation (see previous commit)
6 years ago
Erik de Vries
09732a1b5a
actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)
6 years ago
Erik de Vries
835d2332bc
actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails
6 years ago
Erik de Vries
e70b6ccf7a
actorizer: fixed sentence_count and out_parser calls
...
out_parser: Added comment with old regex
6 years ago
Erik de Vries
9b0ac775af
class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries
85306007f4
class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries
e110780ad5
merger: idiotic fix for a non-problem, see comment on line 32
6 years ago
Erik de Vries
ce5f812252
dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
...
merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas)
merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)
6 years ago
Erik de Vries
386ac42aee
lemma_writer: new function to write raw lemmas (without punctuation) to text file. It is structured as an elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries
4407a99774
actorizer: fix to get actual number of sentence occurrences of actor
6 years ago
Erik de Vries
96e869fa6b
actorizer: previous commit was wrong, only add is an option, removed type variable
6 years ago
Erik de Vries
98219c807c
actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)
6 years ago
Erik de Vries
e3b57ed9e3
actorizer: added clean = F to have the exact same behavior in ud_update and actorizer
6 years ago
Erik de Vries
7218f6b8d0
dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries
b9be372543
dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
...
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
34531b0da8
out_parser: added option to clean output using regex to remove numbers and non-words
...
dfm_gen, ud_update: updated functions to make use of out_parser cleaning option
merger: updated regex for cleaning lemmatized output
6 years ago
Erik de Vries
5851c56369
query_string: updated check for fields value
6 years ago
Erik de Vries
4f8b1f2024
elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
...
query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized
6 years ago