edevries/mamlr (Archived)

Author	SHA1	Message	Date
Erik de Vries	0a3bdb630b	actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser) out_parser: function to parse raw text output into a single field, either from _source or highlight fields dupe_detect: updated function to use 'ver' parameter for versioning	6 years ago
Erik de Vries	9e5a1e3354	ud_update: removed mc.preschedule = F	6 years ago
Erik de Vries	c7560d7e32	ud_update: Removed . at end of text, and added mc.preschedule = F for testing	6 years ago
Erik de Vries	37df81b8ff	ud_update: fixed merged output field to always contain an (extra) dot (period) at the end of the document	6 years ago
Erik de Vries	c32c9e5ad3	ud_update: fix to deal with non-existing column names	6 years ago
Erik de Vries	8ffbddc073	actorizer, ud_update: implemented 'ver' variable for keeping track of updates	6 years ago
Erik de Vries	ae23456736	actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string) modelizer: Minor update to feature keyness, using absolute values now to determine the most informative features for a class (so features that are either strongly positively or negatively related to the class) bulk_writer: Added the 'ver' parameter to include a short version string with each update. Mostly to deal with updates that do not complete successfully on all data	6 years ago
Erik de Vries	9f3418ef37	class_update; dfm_gen; merger: updated functions to accept text parameter for both old style 'lemmas' and new style 'ud'	7 years ago
Erik de Vries	85aab558e0	bulk_writer: added clause to varname==ud update to also remove the tokens variable from source	7 years ago
Erik de Vries	581e7b2929	DESCRIPTION: added SparseM as required package	7 years ago
Erik de Vries	54dfb6a8ca	actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them elastic_update: set the minimum break between retries from 10 to 30 seconds elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between	7 years ago
Erik de Vries	b042fdb1e3	Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr	7 years ago
Erik de Vries	8caf53b90a	actorizer: switched to single core processing for debugging	7 years ago
Erik de Vries	e5c87cf69d	actorizer: more debug prints	7 years ago
Erik de Vries	c63409238b	actorizer: print row numbers for debugging	7 years ago
Erik de Vries	39005c7518	elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates query_gen_actors: Added an additional generator for the "Institution" type (for EU support) actorizer: Created an updater function to search for actors and use UDPipe to parse the results	7 years ago
Erik de Vries	a3c3651c79	elasticizer: updated scroll time to be longer than the timeouts every 200000 articles (so 20m scroll time, 900s (15m) timeout)	7 years ago
Erik de Vries	4ad5357e15	elasticizer: Added 900s timeout after every batch of 200000 articles when updating, to allow ES to do some segment merges (and clean up disk space)	7 years ago
Erik de Vries	a5ba00146f	modelizer: fixed error when only one class is predicted for junk classification (borderline case)	7 years ago
Erik de Vries	a13d86b92d	modelizer: added some more debug output	7 years ago
Erik de Vries	23658ce51a	test	7 years ago
Erik de Vries	17cf6d04e9	modelizer: debug update	7 years ago
Erik de Vries	7544e5323f	modelizer: update to allow tf both as count (for naive bayes), and as proportion (for other machine learning algorithms)	7 years ago
Erik de Vries	5f5e4a03c8	modelizer: Changed tf-idf weighting from absolute tf count to proportional (normalized) tf! Also added initial support for neural networks	7 years ago
Erik de Vries	34a6adf64e	changed udpipe output variable from tokens to ud	7 years ago
Erik de Vries	061da17c2a	ud_update: Added function to lemmatize documents	7 years ago
Erik de Vries	ef51ce60a9	Fixed dupe_detect error on documents with one sentence or less, and a maximum # of words in dfm_gen	7 years ago
Erik de Vries	0e8c127b86	bulk_writer: fixes for JSON generation and added exception for use of 'tokens' varname class_update/elastic_update: Moved response checking to elastic_update dupe_detect: Finalized dupe_detect	7 years ago
Erik de Vries	755a58d84d	dupe_detect: fix to prevent errors when a query returns no results	7 years ago
Erik de Vries	887f1aa774	dupe_detect: fix for empty results dataframe (no duplicates for given date and newspaper)	7 years ago
Erik de Vries	993f39957a	dfm_gen: word cutoff now as final step in script, caused bugs with mutating code variables	7 years ago
Erik de Vries	085252abda	documentation: updated dupe_detect and merger	7 years ago
Erik de Vries	02b8a8c1da	dfm_gen & merger: Changed word cutoff point to be a general setting in dfm_gen. Cuts off at the last [.?!] before the cutoff point (so returns documents at a sentence, shorter than cutoff).	7 years ago
Erik de Vries	4a713ddc23	bulk_writer: setting names(x) <- NULL when there is only one value (list or otherwise) to be updated. This is because R apply treats rows of single values as a matrix, while it treats rows containing lists as (named) list. This has the nasty result of getting subvalues when using to JSON. i.e. computerCodes.actors = [list, of, ids] becomes computerCodes.actors.ids = [list, of, ids].	7 years ago
Erik de Vries	6bb8f9b635	class_update: added explicit httr::: references	7 years ago
Erik de Vries	f543d658bd	Major overhaul to ES bulk update integration. Added support for both setting and appending to variables	7 years ago
Erik de Vries	4adae2bbc6	Fixed bug in dupe_detect caused by switch from cutoff to cutoff_lower/upper	7 years ago
Erik de Vries	4cd46d1a5e	dupe_detect: added support for both lower and upper cutoff point	7 years ago
Erik de Vries	11d8b31c60	Added generic actor search query generator. Updated elasticizer and elastic_update to connect either to the remote server, or a local ES instance	7 years ago
Erik de Vries	3e66c7e1cd	Updated dfm_gen to have all topic vectors as numeric variables	7 years ago
Erik de Vries	20d7510a89	Merge branch 'master' of https://git.thijsdevries.net/edevries/mamlr	7 years ago
Erik de Vries	adc4b3c639	Updated feature selection in modelizer function (see comment on lines 166/167)	7 years ago
Erik de Vries	919e71ac68	Updated feature selection in modelizer function (see comment on lines 166/167)	7 years ago
Erik de Vries	65f8c26ec6	Renamed dupe_detect, and added return output	7 years ago
Erik de Vries	db418d7396	Add query_string function for generating query_string queries	7 years ago
Erik de Vries	d203de0b2a	Updated elasticizer docs, created modelizer and class_update functions	7 years ago
Erik de Vries	c815dc7f2b	Duplicate detection first commit	7 years ago
Erik de Vries	1f06b0b716	Lowered R version req to 3.3.1	7 years ago
Erik de Vries	015411feaf	Added refresh=wait_for to bulk update url. This should make update scripts less demanding on the server side, because the server only replies after refreshing (happens every second)	7 years ago
Erik de Vries	413ad02a87	Set default to "lemmas" for dfm_gen	7 years ago

207 Commits (523d86799c34e8b5e42e9591ef9c7018edf63f76)
