edevries/mamlr

Archived

Author	SHA1	Message	Date
Erik de Vries	9b0ac775af	class_update: add ver variable to set version for class updated articles	6 years ago
Erik de Vries	85306007f4	class_update: added words and clean parameters, in addition to text parameter, to be able to set data preprocessing exactly the same as in the trained model	6 years ago
Erik de Vries	e110780ad5	merger: idiotic fix for a non-problem, see comment on line 32	6 years ago
Erik de Vries	ce5f812252	dfm_gen, merger: Added option for generating lemma_upos hybrids for merged field; merger: Added custom clean option (sometimes not cleaning is preferred, even with lemmas); merger, out_parser: Updated regex for filtering out non-words to also include email addresses (containing both @ and .)	6 years ago
Erik de Vries	386ac42aee	lemma_writer: new function to write raw lemmas (without punctuation) to a text file. It is structured as an elasticizer update function (despite not updating anything on the server)	6 years ago
Erik de Vries	4407a99774	actorizer: fix to get actual number of sentence occurrences of actor	6 years ago
Erik de Vries	96e869fa6b	actorizer: previous commit was wrong, only add is an option, removed type variable	6 years ago
Erik de Vries	98219c807c	actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)	6 years ago
Erik de Vries	e3b57ed9e3	actorizer: added clean = F to have the exact same behavior in ud_update and actorizer	6 years ago
Erik de Vries	7218f6b8d0	dupe_detect: fixed error on no duplicates	6 years ago
Erik de Vries	b9be372543	dupe_detect: fix to get correct colnames from simil (disable stringsAsFactors and convert col values to numeric)	6 years ago
Erik de Vries	1955692346	dfm_gen, out_parser: updated documentation; dupe_detect: major fix to function, no longer using rownames for article ids	6 years ago
Erik de Vries	34531b0da8	out_parser: added option to clean output using regex to remove numbers and non-words; dfm_gen, ud_update: updated functions to make use of out_parser cleaning option; merger: updated regex for cleaning lemmatized output	6 years ago
Erik de Vries	5851c56369	query_string: updated check for fields value	7 years ago
Erik de Vries	4f8b1f2024	elasticizer: renamed size parameter to batch_size, created max_batch parameter to limit the number of results returned; query_string: renamed x parameter to query, added fields parameter to select what fields to return and random boolean parameter to define whether the returned results should be randomized	7 years ago
Erik de Vries	d0e9bf565b	dupe_detect: Reset the _delete value to 1; out_parser: fix to sentence parsing, add additional (empty) string at end of merged field, to make merged field end on .	7 years ago
Erik de Vries	ea8cfb071f	dupe_detect: updated _delete var to be 2 when delete is true	7 years ago
Erik de Vries	0a3bdb630b	actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser); out_parser: function to parse raw text output into a single field, either from _source or highlight fields; dupe_detect: updated function to use 'ver' parameter for versioning	7 years ago
Erik de Vries	9e5a1e3354	ud_update: removed mc.preschedule = F	7 years ago
Erik de Vries	c7560d7e32	ud_update: Removed . at end of text, and added mc.preschedule = F for testing	7 years ago
Erik de Vries	37df81b8ff	ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document	7 years ago
Erik de Vries	c32c9e5ad3	ud_update: fix to deal with non-existing column names	7 years ago
Erik de Vries	8ffbddc073	actorizer, ud_update: implemented 'ver' variable for keeping track of updates	7 years ago
Erik de Vries	ae23456736	actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string); modelizer: Minor update to feature keyness, using absolute values now to determine the most informative features for a class (so features that are either strongly positively or negatively related to the class); bulk_writer: Added the 'ver' parameter to include a short version string with each update. Mostly to deal with updates that do not complete successfully on all data	7 years ago
Erik de Vries	9f3418ef37	class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'	7 years ago
Erik de Vries	85aab558e0	bulk_writer: added clause to varname==ud update to also remove the tokens variable from source	7 years ago
Erik de Vries	581e7b2929	DESCRIPTION: added SparseM as required package	7 years ago
Erik de Vries	54dfb6a8ca	actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; elastic_update: set the minimum break between retries from 10 to 30 seconds; elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between	7 years ago
Erik de Vries	b042fdb1e3	Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr	7 years ago
Erik de Vries	8caf53b90a	actorizer: switched to single core processing for debugging	7 years ago
Erik de Vries	e5c87cf69d	actorizer: more debug prints	7 years ago
Erik de Vries	c63409238b	actorizer: print row numbers for debugging	7 years ago
Erik de Vries	39005c7518	elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates; query_gen_actors: Added an additional generator for the "Institution" type (for EU support); actorizer: Created an updater function to search for actors and use UDPipe to parse the results	7 years ago
Erik de Vries	a3c3651c79	elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)	7 years ago
Erik de Vries	4ad5357e15	elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)	7 years ago
Erik de Vries	a5ba00146f	modelizer: fixed error when only one class is predicted for junk classification (borderline case)	7 years ago
Erik de Vries	a13d86b92d	modelizer: added some more debug output	7 years ago
Erik de Vries	23658ce51a	test	7 years ago
Erik de Vries	17cf6d04e9	modelizer: debug update	7 years ago
Erik de Vries	7544e5323f	modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)	7 years ago
Erik de Vries	5f5e4a03c8	modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks	7 years ago
Erik de Vries	34a6adf64e	changed udpipe output variable from tokens to ud	7 years ago
Erik de Vries	061da17c2a	ud_update: Added function to lemmatize documents	7 years ago
Erik de Vries	ef51ce60a9	Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen	7 years ago
Erik de Vries	0e8c127b86	bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname; class_update/elastic_update: Moved response checking to elastic_update; dupe_detect: Finalized dupe_detect	7 years ago
Erik de Vries	755a58d84d	dupe_detect: fix to prevent errors when a query returns no results	7 years ago
Erik de Vries	887f1aa774	dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)	7 years ago
Erik de Vries	993f39957a	dfm_gen: word cutoff now as final step in script, caused bugs with mutating code variables	7 years ago
Erik de Vries	085252abda	documentation: updated dupe_detect and merger	7 years ago
Erik de Vries	02b8a8c1da	dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff).	7 years ago
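
The sentence-boundary word cutoff described in commit 02b8a8c1da can be sketched roughly as follows. This is a Python illustration only, not the package's R code: the function name `cut_at_sentence` and the `max_words` parameter are hypothetical, and falling back to a hard word cutoff when no sentence boundary is found is an assumption rather than documented behavior.

```python
import re


def cut_at_sentence(text: str, max_words: int) -> str:
    """Truncate text to at most max_words words, ending on the last [.?!].

    Hypothetical helper: a sketch of the cutoff idea, not mamlr's implementation.
    """
    words = text.split()
    if len(words) <= max_words:
        # Short documents are returned unchanged.
        return text

    # Hard cutoff at the word limit (whitespace is normalized to single spaces).
    truncated = " ".join(words[:max_words])

    # Locate the last sentence-ending punctuation mark before the cutoff point.
    matches = list(re.finditer(r"[.?!]", truncated))
    if not matches:
        # Assumption: with no sentence boundary, keep the hard word cutoff.
        return truncated

    # Cut just after that punctuation mark, so the result ends on a sentence.
    return truncated[: matches[-1].end()]


if __name__ == "__main__":
    doc = "One two. Three four five. Six seven eight."
    print(cut_at_sentence(doc, 6))
```

The result is always at most `max_words` words long and, where possible, shorter, because the text after the last complete sentence is dropped.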


174 Commits (4b4d8602355734452383391dd3c33462d85a28d4)