If document 23 of x contains three sentences in its 'text', then jl_split(x, "sentences") returns three new rows, with the other variables duplicated, new 'tokens' values, and doc_ids 23.1, 23.2, and 23.3.

Usage

jl_split(x, what = c("paragraphs", "sentences", "regex"), ...)

Arguments

x

a tibble

what

the unit into which each document is split: "paragraphs" (the default), "sentences", or "regex"

...

extra arguments passed on to the matching tokenizers::tokenize_* function

Value

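a tibble with one row per unit, the other variables duplicated from the source document, and a new doc_id built from the original doc_id and the unit's position (e.g. 23.1, 23.2, 23.3)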
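
The row expansion described above can be illustrated with a minimal sketch. It does not call jl_split and is not the package's implementation. It only assumes the tibble and tokenizers packages are installed, and it reproduces the sentence case by hand. The input tibble, its 'author' column, and the example text are hypothetical, and the 'tokens' column is not modelled here.

```r
library(tibble)
library(tokenizers)

# Hypothetical input: one document (doc_id 23) with three sentences
# and one extra variable that should be duplicated across the new rows.
x <- tibble(
  doc_id = "23",
  author = "A. Writer",
  text   = "First sentence here. A second one follows. Then a third."
)

# tokenize_sentences() returns a list with one character vector per document.
units <- tokenize_sentences(x$text)
n <- lengths(units)

# Repeat each original row once per sentence so the other variables are duplicated.
out <- x[rep(seq_len(nrow(x)), times = n), ]

# Replace the text with the individual sentences.
out$text <- unlist(units)

# Suffix each doc_id with the sentence's position within its document.
out$doc_id <- paste(out$doc_id, sequence(n), sep = ".")

out
```

Splitting by paragraph would follow the same pattern with tokenizers::tokenize_paragraphs in place of tokenize_sentences.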