Erik de Vries
e70b6ccf7a
actorizer: fixed sentence_count and out_parser calls
...
out_parser: Added comment with old regex
6 years ago
Erik de Vries
9b0ac775af
class_update: add ver variable to set version for class updated articles
6 years ago
Erik de Vries
85306007f4
class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model
6 years ago
Erik de Vries
e110780ad5
merger: idiotic fix for a non-problem, see comment on line 32
6 years ago
Erik de Vries
ce5f812252
dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field
...
merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas)
merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)
6 years ago
Erik de Vries
386ac42aee
lemma_writer: new function to write raw lemmas (without punctuation) to a text file. Structured as an elasticizer update function (despite not updating anything on the server)
6 years ago
Erik de Vries
4407a99774
actorizer: fix to get actual number of sentence occurrences of actor
6 years ago
Erik de Vries
96e869fa6b
actorizer: previous commit was wrong, only add is an option, removed type variable
6 years ago
Erik de Vries
98219c807c
actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)
6 years ago
Erik de Vries
e3b57ed9e3
actorizer: added clean = F to have the exact same behavior in ud_update and actorizer
6 years ago
Erik de Vries
7218f6b8d0
dupe_detect: fixed error on no duplicates
6 years ago
Erik de Vries
b9be372543
dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)
6 years ago
Erik de Vries
1955692346
dfm_gen, out_parser: updated documentation
...
dupe_detect: major fix to function, no longer using rownames for article ids
6 years ago
Erik de Vries
34531b0da8
out_parser: added option to clean output using regex to remove numbers and non-words
...
dfm_gen, ud_update: updated functions to make use of out_parser cleaning option
merger: updated regex for cleaning lemmatized output
6 years ago
Erik de Vries
5851c56369
query_string: updated check for fields value
6 years ago
Erik de Vries
4f8b1f2024
elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned
...
query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized
6 years ago
Erik de Vries
d0e9bf565b
dupe_detect: Reset the _delete value to 1
...
out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .
6 years ago
Erik de Vries
ea8cfb071f
dupe_detect: updated _delete var to be 2 when delete is true
6 years ago
Erik de Vries
0a3bdb630b
actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser)
...
out_parser: function to parse raw text output into a single field, either from _source or highlight fields
dupe_detect: updated function to use 'ver' parameter for versioning
6 years ago
Erik de Vries
9e5a1e3354
ud_update: removed mc.preschedule = F
6 years ago
Erik de Vries
c7560d7e32
ud_update: Removed . at end of text, and added mc.preschedule = F for testing
6 years ago
Erik de Vries
37df81b8ff
ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document
6 years ago
Erik de Vries
c32c9e5ad3
ud_update: fix to deal with non-existing column names
6 years ago
Erik de Vries
8ffbddc073
actorizer, ud_update: implemented 'ver' variable for keeping track of updates
6 years ago
Erik de Vries
ae23456736
actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string)
...
modelizer: Minor update to feature keyness, now using absolute values to determine the most informative features for a class (so features that are either strongly positively or strongly negatively related to the class)
bulk_writer: Added the 'ver' parameter to include a short version string with each update. Mostly to deal with updates that do not complete successfully on all data
6 years ago
Erik de Vries
9f3418ef37
class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'
6 years ago
Erik de Vries
85aab558e0
bulk_writer: added clause to varname==ud update to also remove the tokens variable from source
6 years ago
Erik de Vries
54dfb6a8ca
actorizer: major fix to ud parsing, changed the regex that removes HTML tags to match only tags with at most 20 characters in them
...
ud_update: major fix to ud parsing, changed the regex that removes HTML tags to match only tags with at most 20 characters in them
elastic_update: set the minimum break between retries from 10 to 30 seconds
elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between
6 years ago
Erik de Vries
8caf53b90a
actorizer: switched to single core processing for debugging
6 years ago
Erik de Vries
c63409238b
actorizer: print row numbers for debugging
6 years ago
Erik de Vries
39005c7518
elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates
...
query_gen_actors: Added an additional generator for the "Institution" type (for EU support)
actorizer: Created an updater function to search for actors and use UDPipe to parse the results
6 years ago
Erik de Vries
a3c3651c79
elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)
6 years ago
Erik de Vries
4ad5357e15
elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)
6 years ago
Erik de Vries
a5ba00146f
modelizer: fixed error when only one class is predicted for junk classification (borderline case)
6 years ago
Erik de Vries
a13d86b92d
modelizer: added some more debug output
6 years ago
Erik de Vries
23658ce51a
test
6 years ago
Erik de Vries
17cf6d04e9
modelizer: debug update
6 years ago
Erik de Vries
7544e5323f
modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)
6 years ago
Erik de Vries
5f5e4a03c8
modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks
6 years ago
Erik de Vries
34a6adf64e
changed udpipe output variable from tokens to ud
6 years ago
Erik de Vries
061da17c2a
ud_update: Added function to lemmatize documents
6 years ago
Erik de Vries
ef51ce60a9
Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen
6 years ago
Erik de Vries
0e8c127b86
bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname
...
class_update/elastic_update: Moved response checking to elastic_update
dupe_detect: Finalized dupe_detect
6 years ago
Erik de Vries
755a58d84d
dupe_detect: fix to prevent errors when a query returns no results
6 years ago
Erik de Vries
887f1aa774
dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)
6 years ago
Erik de Vries
993f39957a
dfm_gen: word cutoff is now the final step in the script, as it caused bugs with mutating code variables
6 years ago
Erik de Vries
02b8a8c1da
dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returned documents end at a sentence boundary and are shorter than the cutoff).
6 years ago
Erik de Vries
4a713ddc23
bulk_writer: setting names(x) <- NULL when there is only one value (list or otherwise) to be updated.
...
This is because R's apply treats rows of single values as a matrix, while it treats rows containing lists as a (named) list. This has the nasty result of creating subvalues when converting to JSON, i.e. computerCodes.actors = [list, of, ids] becomes computerCodes.actors.ids = [list, of, ids].
6 years ago
Erik de Vries
6bb8f9b635
class_update: added explicit httr::: references
6 years ago
Erik de Vries
f543d658bd
Major overhaul to ES bulk update integration. Added support for both setting and appending to variables
6 years ago
Erik de Vries
4adae2bbc6
Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper
6 years ago
Erik de Vries
4cd46d1a5e
dupe_detect: added support for both lower and upper cutoff point
6 years ago
Erik de Vries
11d8b31c60
Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance
6 years ago
Erik de Vries
3e66c7e1cd
Updated dfm_gen to have all topic vectors as numeric variables
6 years ago
Erik de Vries
adc4b3c639
Updated feature selection in modelizer function (see comment on lines 166/167)
6 years ago
Erik de Vries
65f8c26ec6
Renamed dupe_detect, and added return output
6 years ago
Erik de Vries
db418d7396
Add query_string function for generating query_string queries
6 years ago
Erik de Vries
d203de0b2a
Updated elasticizer docs, created modelizer and class_update functions
6 years ago
Erik de Vries
c815dc7f2b
Duplicate detection first commit
6 years ago
Erik de Vries
015411feaf
Added refresh=wait_for to bulk update url. This should make update scripts less demanding on the server side, because the server only replies after refreshing (happens every second)
6 years ago
Erik de Vries
413ad02a87
Set default to "lemmas" for dfm_gen
6 years ago
Erik de Vries
217ee76568
V 0.1 for elasticizer function with updater support
6 years ago
Erik de Vries
a273524105
Added support for custom update function to elasticizer
6 years ago
Erik de Vries
311838b34b
Updated dfm_gen to only create derivative codes if majorTopic actually exists, and set docvars to NULL when no majorTopic codes
6 years ago
Erik de Vries
dc4daf9de4
Added line to replace multiple whitespace characters in full text by a single regular whitespace
6 years ago
Erik de Vries
0e45c0f2d1
Added option for fulltext vs lemmas merged field
6 years ago
Erik de Vries
4bbe84ab83
First release of mamlr package
6 years ago