# tokenize

## Tokenize
Bases: NumpyOp
Split the sequences into tokens.
Tokenize splits the document/sequence into tokens and can simultaneously perform additional operations on those tokens if they are defined in the passed function object. By default, Tokenize only splits the sequences into tokens.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `inputs` | `Union[str, Iterable[str]]` | Key(s) of sequences to be tokenized. | *required* |
| `outputs` | `Union[str, Iterable[str]]` | Key(s) under which to write the tokenized sequences. | *required* |
| `mode` | `Union[None, str, Iterable[str]]` | What mode(s) to execute this Op in. For example, "train", "eval", "test", or "infer". To execute regardless of mode, pass None. To execute in all modes except for a particular one, you can pass an argument like "!infer" or "!train". | `None` |
| `ds_id` | `Union[None, str, Iterable[str]]` | What dataset id(s) to execute this Op in. To execute regardless of ds_id, pass None. To execute in all ds_ids except for a particular one, you can pass an argument like "!ds1". | `None` |
| `tokenize_fn` | `Union[None, Callable[[str], List[str]]]` | Tokenization function object; if None, the default splitting behavior is used. | `None` |
| `to_lower_case` | `bool` | Whether to convert tokens to lowercase. | `False` |
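
A minimal usage sketch is shown below. It assumes the standard FastEstimator import path `fastestimator.op.numpyop.univariate.Tokenize`; the `strip_and_split` helper is a hypothetical custom tokenizer, not part of the library:

```python
from fastestimator.op.numpyop.univariate import Tokenize

# Default behavior: only split each sequence into tokens.
simple = Tokenize(inputs="text", outputs="tokens")

# Hypothetical custom tokenizer: strip basic punctuation, then split.
def strip_and_split(seq: str):
    for ch in ",.!?":
        seq = seq.replace(ch, " ")
    return seq.split()

# Custom tokenization, lowercasing every token, executed only in "train" mode.
custom = Tokenize(inputs="text", outputs="tokens",
                  tokenize_fn=strip_and_split,
                  to_lower_case=True,
                  mode="train")
```

Like other NumpyOps, these instances would typically be passed to a `fe.Pipeline` via its `ops` argument rather than invoked directly.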