If document 23 of x contains three sentences in its 'text', then jl_split(x, "sentences") returns three new rows with the other variables duplicated, new 'tokens' values, and doc_ids 23.1, 23.2, and 23.3.
jl_split(x, what = c("paragraphs", "sentences", "regex"), ...)
Argument | Description |
---|---|
x | a tibble |
what | the unit to split each document into (default: paragraphs) |
... | extra arguments passed on to tokenizers::tokenize_* |
a tibble with one row per resulting unit and new doc_id values
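
The row-splitting behaviour described above can be sketched in base R. This is only an illustration of the expected result, not the package's implementation: the helpers split_sentences() and split_rows() are hypothetical, the sentence splitter is a naive stand-in for tokenizers::tokenize_*, the input uses a plain data frame instead of a tibble, and recomputing 'tokens' is left out.

```r
# Hypothetical input: one document with id 23 and one extra variable.
x <- data.frame(
  doc_id = "23",
  text   = "First sentence. Second one here. And a third.",
  author = "A. Writer",
  stringsAsFactors = FALSE
)

# Naive sentence splitter (assumption for illustration only):
# put a newline after ., ! or ? followed by whitespace, then split on it.
split_sentences <- function(txt) {
  marked <- gsub("([.!?])\\s+", "\\1\n", txt)
  strsplit(marked, "\n", fixed = TRUE)[[1]]
}

# Hypothetical helper mimicking the documented result shape.
split_rows <- function(x) {
  pieces <- lapply(x$text, split_sentences)
  n <- lengths(pieces)                      # number of units per document
  # Repeat each row once per unit, so the other variables are duplicated.
  out <- x[rep(seq_len(nrow(x)), n), , drop = FALSE]
  out$text <- unlist(pieces)
  # Append a running index within each document: 23 -> 23.1, 23.2, 23.3
  out$doc_id <- paste(out$doc_id, sequence(n), sep = ".")
  rownames(out) <- NULL
  out
}

res <- split_rows(x)
res$doc_id
# "23.1" "23.2" "23.3"
```

Each new row keeps the 'author' value of document 23, and only 'text' and 'doc_id' change.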