actorizer: account for the token start offset in udpipe output. The first token in an article sometimes does not start at character position 1 (or 2 if the article starts with a whitespace character) but at position 16, and possibly at other positions.
## Computing offset for first token position (some articles have a minimum token start position of 16, instead of 1 or 2)
mutate(
  # Check whether the merged field starts with a whitespace character
  space = case_when(
    str_starts(merged, '\\s') ~ 1,
    TRUE ~ 0
  )
) %>%
  unnest(cols = '_source.ud') %>%
  rowwise() %>%
  # Create the offset variable: subtract 1 for the default token start position of 1,
  # and subtract 1 more if the merged field starts with a whitespace character
  mutate(ud_min = min(unlist(start)) - 1 - space)
print(str_c('Number of articles with a first-token start offset greater than 2: ', sum(out$ud_min > 2)))
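The offset computation above can be checked on toy data. The following is a minimal sketch, not the actual actorizer input: the tibble `toy`, its `id` column, and the example token positions are hypothetical, and it assumes that `_source.ud` is a list column holding one-row data frames whose `start` field lists the token start positions of one article.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy input (assumption): three articles with udpipe-style token starts.
# "a" starts at 1, "b" starts with a whitespace so its first token is at 2,
# and "c" shows the unexpected first-token position of 16.
toy <- tibble(
  id = c("a", "b", "c"),
  merged = c("First article.", " Second article.", "Third article."),
  `_source.ud` = list(
    tibble(start = list(c(1, 7, 14))),
    tibble(start = list(c(2, 9, 16))),
    tibble(start = list(c(16, 22, 29)))
  )
)

out <- toy %>%
  mutate(
    # 1 if the merged field starts with a whitespace character, 0 otherwise
    space = case_when(
      str_starts(merged, '\\s') ~ 1,
      TRUE ~ 0
    )
  ) %>%
  unnest(cols = '_source.ud') %>%
  rowwise() %>%
  # Offset = smallest token start, minus the default start of 1,
  # minus 1 more when the text begins with a whitespace character
  mutate(ud_min = min(unlist(start)) - 1 - space) %>%
  ungroup()

# Articles "a" and "b" get an offset of 0, article "c" gets 15
print(out %>% select(id, space, ud_min))
```

In this sketch an offset of 0 means the first token sits where it is expected, so only article "c" would need its token positions shifted.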