Erik de Vries
703b5e59a4
actorizer: fixed exceptionizer by adding whitespace before and after the sentence. This is necessary because of the negated regex: matching anything before or after the highlight string that is NOT x requires at least one character to be present before or after it
6 years ago
Erik de Vries
593d2de6e2
actorizer: add pre_tags and post_tags to argument list
...
bulk_writer: updated to use _doc doctype
query_gen_actors: added NA for all searches that don't have pre- or postfixes
6 years ago
Erik de Vries
a1b6c6a7cb
actorizer, query_gen_actors: revamped actor searches entirely
...
elasticizer: updated script for use with ES 7.x
6 years ago
Erik de Vries
88fc4ec53c
dfm_gen: changed out_parser call to mamlr:::out_parser
6 years ago
Erik de Vries
90fdbcc982
out_parser: parallelized when not running on Windows
6 years ago
Erik de Vries
6414f759bd
actorizer: parallelized calculation of marker positions
6 years ago
Erik de Vries
522c872dba
out_parser: moved cleaning regex to end of pipeline, to prevent collisions with other (mandatory) regex cleaning
6 years ago
Erik de Vries
5b9793cd8c
actorizer: removed nested mclapply
6 years ago
Erik de Vries
1a4ba19546
actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists
...
bulk_writer: changed handling of single-row dataframe parsing to JSON
elastic_update: changed function to return instead of print appData on error
ud_update: Changed nested lists to flat lists, and added start and end character positions
6 years ago
Erik de Vries
3abc3056e0
actorizer: fix to columns selected for actors variable, removed udmodel requirement
6 years ago
Erik de Vries
41c86ea116
actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing
6 years ago
Erik de Vries
eae1a22609
actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly
6 years ago
Erik de Vries
5665b6d622
actorizer: more fixes to punctuation
6 years ago
Erik de Vries
cd05733648
actorizer: Additional fix for missing punctuation (see previous commit)
6 years ago
Erik de Vries
09732a1b5a
actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)
6 years ago
Erik de Vries
835d2332bc
actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails
6 years ago
Erik de Vries
e70b6ccf7a
actorizer: fixed sentence_count and out_parser calls
...
out_parser: Added comment with old regex
6 years ago
Erik de Vries
9b0ac775af
class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries
85306007f4
class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries
e110780ad5
merger: idiotic fix for a non-problem, see comment on line 32
6 years ago
Erik de Vries
ce5f812252
dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
...
merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas)
merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)
6 years ago
Erik de Vries
386ac42aee
lemma_writer: new function to write raw lemmas (without punctuation) to a text file. It is structured as an elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries
4407a99774
actorizer: fix to get the actual number of sentence occurrences of an actor
6 years ago
Erik de Vries
96e869fa6b
actorizer: previous commit was wrong; adding is the only valid option, so the type variable was removed
6 years ago
Erik de Vries
98219c807c
actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)
6 years ago
Erik de Vries
e3b57ed9e3
actorizer: added clean = F to have the exact same behavior in ud_update and actorizer
6 years ago
Erik de Vries
7218f6b8d0
dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries
b9be372543
dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
...
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
34531b0da8
out_parser: added option to clean output using regex to remove numbers and non-words
...
dfm_gen, ud_update: updated functions to make use of out_parser cleaning option
merger: updated regex for cleaning lemmatized output
6 years ago
Erik de Vries
5851c56369
query_string: updated check for fields value
6 years ago
Erik de Vries
4f8b1f2024
elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
...
query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized
6 years ago
Erik de Vries
d0e9bf565b
dupe_detect: Reset the _delete value to 1
...
out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .
6 years ago
Erik de Vries
ea8cfb071f
dupe_detect: updated _delete var to be 2 when delete is true
6 years ago
Erik de Vries
0a3bdb630b
actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
...
out_parser: function to parse raw text output into a single field, either from _source or highlight fields
dupe_detect: updated function to use 'ver' parameter for versioning
6 years ago
Erik de Vries
9e5a1e3354
ud_update: removed mc.preschedule = F
6 years ago
Erik de Vries
c7560d7e32
ud_update: Removed . at end of text, and added mc.preschedule = F for testing
6 years ago
Erik de Vries
37df81b8ff
ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document
6 years ago
Erik de Vries
c32c9e5ad3
ud_update: fix to deal with non-existing column names
6 years ago
Erik de Vries
8ffbddc073
actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries
ae23456736
actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
...
modelizer: Minor update to feature keyness, now using absolute values to determine the most informative features for a class (so features that are either strongly positively or strongly negatively related to the class)
bulk_writer: Added the 'ver' parameter to include a short version string with each update. Mostly to deal with updates that do not complete successfully on all data
6 years ago
Erik de Vries
9f3418ef37
class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries
85aab558e0
bulk_writer: added clause to varname==ud update to also remove the tokens variable from source
6 years ago
Erik de Vries
54dfb6a8ca
actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them
...
ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them
elastic_update: set the minimum break between retries from 10 to 30 seconds
elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between
6 years ago
Erik de Vries
8caf53b90a
actorizer: switched to single core processing for debugging
6 years ago
Erik de Vries
c63409238b
actorizer: print row numbers for debugging
6 years ago
Erik de Vries
39005c7518
elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
...
query_gen_actors: Added an additional generator for the "Institution" type (for EU support)
actorizer: Created an updater function to search for actors and use UDPipe to parse the results
6 years ago
Erik de Vries
a3c3651c79
elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)
6 years ago
Erik de Vries
4ad5357e15
elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)
6 years ago
Erik de Vries
a5ba00146f
modelizer: fixed error when only one class is predicted for junk classification (borderline case)
6 years ago
Erik de Vries
a13d86b92d
modelizer: added some more debug output
6 years ago
Erik de Vries
23658ce51a
test
6 years ago
Erik de Vries
17cf6d04e9
modelizer: debug update
6 years ago
Erik de Vries
7544e5323f
modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)
6 years ago
Erik de Vries
5f5e4a03c8
modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks
6 years ago
Erik de Vries
34a6adf64e
changed udpipe output variable from tokens to ud
6 years ago
Erik de Vries
061da17c2a
ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries
ef51ce60a9
Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries
0e8c127b86
bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
...
class_update/elastic_update: Moved response checking to elastic_update
dupe_detect: Finalized dupe_detect
6 years ago
Erik de Vries
755a58d84d
dupe_detect: fix to prevent errors when a query returns no results
6 years ago
Erik de Vries
887f1aa774
dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)
6 years ago
Erik de Vries
993f39957a
dfm_gen: word cutoff is now the final step in the script; the earlier placement caused bugs when mutating code variables
6 years ago
Erik de Vries
02b8a8c1da
dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returned documents end on a sentence boundary and are shorter than the cutoff).
6 years ago
Erik de Vries
4a713ddc23
bulk_writer: setting names(x) <- NULL when there is only one value (list or otherwise) to be updated.
...
This is because R apply treats rows of single values as a matrix, while it treats rows containing lists as a (named) list. This has the nasty result of producing nested subvalues when converting to JSON, i.e. computerCodes.actors = [list, of, ids] becomes computerCodes.actors.ids = [list, of, ids].
6 years ago
Erik de Vries
6bb8f9b635
class_update: added explicit httr::: references
6 years ago
Erik de Vries
f543d658bd
Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries
4adae2bbc6
Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper
6 years ago
Erik de Vries
4cd46d1a5e
dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries
11d8b31c60
Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries
3e66c7e1cd
Updated dfm_gen to have all topic vectors as numeric variables
6 years ago
Erik de Vries
adc4b3c639
Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries
65f8c26ec6
Renamed dupe_detect, and added return output
6 years ago
Erik de Vries
db418d7396
Add query_string function for generating query_string queries
6 years ago
Erik de Vries
d203de0b2a
Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries
c815dc7f2b
Duplicate detection first commit
6 years ago
Erik de Vries
015411feaf
Added refresh=wait_for to bulk update url. This should make update scripts less demanding on the server side, because the server only replies after refreshing (happens every second)
6 years ago
Erik de Vries
413ad02a87
Set default to "lemmas" for dfm_gen
6 years ago
Erik de Vries
217ee76568
V 0.1 for elasticizer function with updater support
6 years ago
Erik de Vries
a273524105
Added support for custom update function to elasticizer
6 years ago
Erik de Vries
311838b34b
Updated dfm_gen to only create derivative codes if majorTopic actually exists, and set docvars to NULL when no majorTopic codes
6 years ago
Erik de Vries
dc4daf9de4
Added line to replace multiple whitespace characters in full text by a single regular whitespace
6 years ago
Erik de Vries
0e45c0f2d1
Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries
4bbe84ab83
First release of mamlr package
6 years ago