mamlr

Commit Graph

Author	SHA1	Message	Date
Erik de Vries	a29d04dacd	actorizer: fixed handling of empty results due to regex filtering	6 years ago
Erik de Vries	8051a81b66	actorizer, dfm_gen, modelizer, out_parser: replaced all instances of detectCores by cores parameter (which defaults to detectCores)	6 years ago
Erik de Vries	f6006eb9ba	actorizer: simplified pre/postfix check, only for NA, replace empty strings by NA beforehand	6 years ago
Erik de Vries	298099a4e6	actorizer: fix to deal with empty updates (i.e. don't do an update)	6 years ago
Erik de Vries	703b5e59a4	actorizer: fixed exceptionizer by adding whitespace before and after the sentence; this is necessary because a negative regex (match anything before or after the highlight string that is NOT x) requires some character to be present before or after the string	6 years ago
Erik de Vries	593d2de6e2	actorizer: add pre_tags and post_tags to argument list; bulk_writer: updated to use _doc doctype; query_gen_actors: added NA for all searches that don't have pre- or postfixes	6 years ago
Erik de Vries	a1b6c6a7cb	actorizer, query_gen_actors: revamped actor searches entirely; elasticizer: updated script for use with ES 7.x	6 years ago
Erik de Vries	6414f759bd	actorizer: parallelized calculation of marker positions	6 years ago
Erik de Vries	5b9793cd8c	actorizer: removed nested mclapply	6 years ago
Erik de Vries	1a4ba19546	actorizer: Removed udmodel dependencies, commented code, changed nested lists to flat lists; bulk_writer: changed handling of single-row dataframe parsing to JSON; elastic_update: changed function to return instead of print appData on error; ud_update: Changed nested lists to flat lists, and added start and end character positions	6 years ago
Erik de Vries	3abc3056e0	actorizer: fix to columns selected for actors variable, removed udmodel requirement	6 years ago
Erik de Vries	41c86ea116	actorizer, ud_update: Updated ud parsing and actorizer to work based on character positions. This code is used for local testing	6 years ago
Erik de Vries	eae1a22609	actorizer: update to use '|||' as highlight indicator, and set up ud output merging accordingly	6 years ago
Erik de Vries	5665b6d622	actorizer: more fixes to punctuation	6 years ago
Erik de Vries	cd05733648	actorizer: Additional fix for missing punctuation (see previous commit)	6 years ago
Erik de Vries	09732a1b5a	actorizer: quick fix for problem where original UK UD output does not have a dot at the end of the document, but the actor output does (old vs new parsing)	6 years ago
Erik de Vries	835d2332bc	actorizer: now uses the original udpipe output for sentence and token ids. When the actorized and original udpipe output do not have the same number of rows, it prints an error and sets err to TRUE in actorDetails	6 years ago
Erik de Vries	e70b6ccf7a	actorizer: fixed sentence_count and out_parser calls; out_parser: Added comment with old regex	6 years ago
Erik de Vries	4407a99774	actorizer: fix to get actual number of sentence occurrences of actor	6 years ago
Erik de Vries	96e869fa6b	actorizer: previous commit was wrong, only add is an option, removed type variable	6 years ago
Erik de Vries	98219c807c	actorizer: Added type option, to choose between setting or adding to the actor variables, defaults to add (should normally not be changed)	6 years ago
Erik de Vries	e3b57ed9e3	actorizer: added clean = F to have the exact same behavior in ud_update and actorizer	6 years ago
Erik de Vries	0a3bdb630b	actorizer, dfm_gen, ud_update: unified output parsing from _source and highlight fields into a single function (out_parser); out_parser: function to parse raw text output into a single field, either from _source or highlight fields; dupe_detect: updated function to use 'ver' parameter for versioning	6 years ago
Erik de Vries	8ffbddc073	actorizer, ud_update: implemented 'ver' variable for keeping track of updates	6 years ago
Erik de Vries	ae23456736	actorizer, ud_update: Updated merging of document fields to properly deal with missing punctuation at the end of fields (e.g. a title without punctuation at the end of the string); modelizer: Minor update to feature keyness, using absolute values now to determine the most informative features for a class (so features that are either strongly positively or negatively related to the class); bulk_writer: Added the 'ver' parameter to include a short version string with each update, mostly to deal with updates that do not complete successfully on all data	6 years ago
Erik de Vries	54dfb6a8ca	actorizer: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; ud_update: major fix to ud parsing, changed regex to remove html tags to only include tags with a maximum of 20 characters in them; elastic_update: set the minimum break between retries from 10 to 30 seconds; elasticizer: implementation of retries for elasticizer function, 10 retries with a break of 30 seconds in between	6 years ago
Erik de Vries	8caf53b90a	actorizer: switched to single core processing for debugging	6 years ago
Erik de Vries	c63409238b	actorizer: print row numbers for debugging	6 years ago
Erik de Vries	39005c7518	elasticizer: Updated bulk size to 1024 (a power of 2) and set a timeout of 900s every 500000 updates; query_gen_actors: Added an additional generator for the "Institution" type (for EU support); actorizer: Created an updater function to search for actors and use UDPipe to parse the results	6 years ago

29 Commits (9e433ecf9e0a323b2596aa290d7e53cee3f2aadd)