edevries/mamlr

Archived

Author	SHA1	Message	Date
Erik de Vries	cd05733648	actorizer: Additional fix for missing punctuation (see previous commit)	7 years ago
Erik de Vries	09732a1b5a	actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)	7 years ago
Erik de Vries	835d2332bc	actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails	7 years ago
Erik de Vries	e70b6ccf7a	actorizer: fixed sentence_count and out_parser calls; out_parser: Added comment with old regex	7 years ago
Erik de Vries	9b0ac775af	class_update: add ver variable to set version for class updated articles	7 years ago
Erik de Vries	85306007f4	class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model	7 years ago
Erik de Vries	e110780ad5	merger: idiotic fix for a non-problem, see comment on line 32	7 years ago
Erik de Vries	ce5f812252	dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field; merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas); merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)	7 years ago
Erik de Vries	386ac42aee	lemma_writer: new function to write raw lemmas (without punctuation) to text file. Structured as an elasticizer update function (despite not updating anything on the server)	7 years ago
Erik de Vries	4407a99774	actorizer: fix to get actual number of sentence occurrences of actor	7 years ago
Erik de Vries	96e869fa6b	actorizer: previous commit was wrong, only add is an option, removed type variable	7 years ago
Erik de Vries	98219c807c	actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)	7 years ago
Erik de Vries	e3b57ed9e3	actorizer: added clean = F to have the exact same behavior in ud_update and actorizer	7 years ago
Erik de Vries	7218f6b8d0	dupe_detect: fixed error on no duplicates	7 years ago
Erik de Vries	b9be372543	dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)	7 years ago
Erik de Vries	1955692346	dfm_gen, out_parser: updated documentation; dupe_detect: major fix to function, no longer using rownames for article ids	7 years ago
Erik de Vries	34531b0da8	out_parser: added option to clean output using regex to remove numbers and non-words; dfm_gen, ud_update: updated functions to make use of out_parser cleaning option; merger: updated regex for cleaning lemmatized output	7 years ago
Erik de Vries	5851c56369	query_string: updated check for fields value	7 years ago
Erik de Vries	4f8b1f2024	elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned; query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized	7 years ago
Erik de Vries	d0e9bf565b	dupe_detect: Reset the _delete value to 1; out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .	7 years ago
Erik de Vries	ea8cfb071f	dupe_detect: updated _delete var to be 2 when delete is true	7 years ago
Erik de Vries	0a3bdb630b	actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser); out_parser: function to parse raw text output into a single field, either from _source or highlight fields; dupe_detect: updated function to use 'ver' parameter for versioning	7 years ago
Erik de Vries	9e5a1e3354	ud_update: removed mc.preschedule = F	7 years ago
Erik de Vries	c7560d7e32	ud_update: Removed . at end of text, and added mc.preschedule = F for testing	7 years ago
Erik de Vries	37df81b8ff	ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document	7 years ago
Erik de Vries	c32c9e5ad3	ud_update: fix to deal with non-existing column names	7 years ago
Erik de Vries	8ffbddc073	actorizer, ud_update: implemented 'ver' variable for keeping track of updates	7 years ago
Erik de Vries	ae23456736	actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string); modelizer: Minor update to feature keyness, using absolute values now to determine the most informative features for a class (so features that are either strongly positively or negatively related to the class); bulk_writer: Added the 'ver' parameter to include a short version string with each update. Mostly to deal with updates that do not complete successfully on all data	7 years ago
Erik de Vries	9f3418ef37	class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'	7 years ago
Erik de Vries	85aab558e0	bulk_writer: added clause to varname==ud update to also remove the tokens variable from source	7 years ago
Erik de Vries	54dfb6a8ca	actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; elastic_update: set the minimum break between retries from 10 to 30 seconds; elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between	7 years ago
Erik de Vries	8caf53b90a	actorizer: switched to single core processing for debugging	7 years ago
Erik de Vries	c63409238b	actorizer: print row numbers for debugging	7 years ago
Erik de Vries	39005c7518	elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates; query_gen_actors: Added an additional generator for the "Institution" type (for EU support); actorizer: Created an updater function to search for actors and use UDPipe to parse the results	7 years ago
Erik de Vries	a3c3651c79	elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)	7 years ago
Erik de Vries	4ad5357e15	elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)	7 years ago
Erik de Vries	a5ba00146f	modelizer: fixed error when only one class is predicted for junk classification (borderline case)	7 years ago
Erik de Vries	a13d86b92d	modelizer: added some more debug output	7 years ago
Erik de Vries	23658ce51a	test	7 years ago
Erik de Vries	17cf6d04e9	modelizer: debug update	7 years ago
Erik de Vries	7544e5323f	modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)	7 years ago
Erik de Vries	5f5e4a03c8	modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks	7 years ago
Erik de Vries	34a6adf64e	changed udpipe output variable from tokens to ud	7 years ago
Erik de Vries	061da17c2a	ud_update: Added function to lemmatize documents	7 years ago
Erik de Vries	ef51ce60a9	Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen	7 years ago
Erik de Vries	0e8c127b86	bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname; class_update/elastic_update: Moved response checking to elastic_update; dupe_detect: Finalized dupe_detect	7 years ago
Erik de Vries	755a58d84d	dupe_detect: fix to prevent errors when a query returns no results	7 years ago
Erik de Vries	887f1aa774	dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)	7 years ago
Erik de Vries	993f39957a	dfm_gen: word cutoff now as final step in script, caused bugs with mutating code variables	7 years ago
Erik de Vries	02b8a8c1da	dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff).	7 years ago

70 Commits (cd05733648eef19b441823b6725d8ce92ef3460e)