Bases: NumpyOp

Split the sequences into tokens.

Tokenize splits the document/sequence into tokens and, at the same time, performs any additional operations on the tokens that are defined in the passed function object. By default (when no function object is passed), Tokenize only splits the sequences into tokens on whitespace.
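The behavior described above can be illustrated without FastEstimator installed. The following is a minimal sketch, not the library's implementation: `simple_tokenize` is a hypothetical standalone helper, and it assumes whitespace splitting when no function object is passed.

```python
from typing import Callable, List, Optional


def simple_tokenize(sequence: str,
                    tokenize_fn: Optional[Callable[[str], List[str]]] = None,
                    to_lower_case: bool = False) -> List[str]:
    """Hypothetical stand-in that mirrors the behavior described above."""
    # Use the custom tokenizer if one is given, otherwise split on whitespace.
    tokens = tokenize_fn(sequence) if tokenize_fn else sequence.split()
    if to_lower_case:
        tokens = [token.lower() for token in tokens]
    return tokens


print(simple_tokenize("FastEstimator is FUN"))
# ['FastEstimator', 'is', 'FUN']
print(simple_tokenize("FastEstimator is FUN", to_lower_case=True))
# ['fastestimator', 'is', 'fun']
```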


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| inputs | Union[str, Iterable[str]] | Key(s) of sequences to be tokenized. | required |
| outputs | Union[str, Iterable[str]] | Key(s) of sequences that are tokenized. | required |
| mode | Union[None, str, Iterable[str]] | What mode(s) to execute this Op in. For example, "train", "eval", "test", or "infer". To execute regardless of mode, pass None. To execute in all modes except for a particular one, you can pass an argument like "!infer" or "!train". | None |
| ds_id | Union[None, str, Iterable[str]] | What dataset id(s) to execute this Op in. To execute regardless of ds_id, pass None. To execute in all ds_ids except for a particular one, you can pass an argument like "!ds1". | None |
| tokenize_fn | Union[None, Callable[[str], List[str]]] | Tokenization function object. | None |
| to_lower_case | bool | Whether to convert tokens to lowercase. | False |
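Any callable matching `Callable[[str], List[str]]` can serve as `tokenize_fn`. As an illustration only, the hypothetical `regex_tokenize` below uses the standard-library `re` module to separate punctuation from words, which a plain whitespace split would not do. The function name and the regular expression are assumptions chosen for this example, not part of FastEstimator.

```python
import re
from typing import List


def regex_tokenize(sequence: str) -> List[str]:
    """Hypothetical tokenizer: words and single punctuation marks become separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", sequence)


print(regex_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```

Such a function could then be passed to the Op, for example `Tokenize(inputs="x", outputs="x", tokenize_fn=regex_tokenize, to_lower_case=True)`, where the key name `"x"` is a placeholder.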

Source code in fastestimator/fastestimator/op/numpyop/univariate/
from typing import Any, Callable, Dict, Iterable, List, Union

from fastestimator.op.numpyop.numpyop import NumpyOp


class Tokenize(NumpyOp):
    """Split the sequences into tokens.

    Tokenize splits the document/sequence into tokens and, at the same time, performs any additional operations on the
    tokens that are defined in the passed function object. By default, Tokenize only splits the sequences into tokens.

    Args:
        inputs: Key(s) of sequences to be tokenized.
        outputs: Key(s) of sequences that are tokenized.
        mode: What mode(s) to execute this Op in. For example, "train", "eval", "test", or "infer". To execute
            regardless of mode, pass None. To execute in all modes except for a particular one, you can pass an argument
            like "!infer" or "!train".
        ds_id: What dataset id(s) to execute this Op in. To execute regardless of ds_id, pass None. To execute in all
            ds_ids except for a particular one, you can pass an argument like "!ds1".
        tokenize_fn: Tokenization function object.
        to_lower_case: Whether to convert tokens to lowercase.
    """
    def __init__(self,
                 inputs: Union[str, Iterable[str]],
                 outputs: Union[str, Iterable[str]],
                 mode: Union[None, str, Iterable[str]] = None,
                 ds_id: Union[None, str, Iterable[str]] = None,
                 tokenize_fn: Union[None, Callable[[str], List[str]]] = None,
                 to_lower_case: bool = False) -> None:
        super().__init__(inputs=inputs, outputs=outputs, mode=mode, ds_id=ds_id)
        self.in_list, self.out_list = True, True
        self.tokenize_fn = tokenize_fn
        self.to_lower_case = to_lower_case

    def forward(self, data: List[str], state: Dict[str, Any]) -> List[List[str]]:
        return [self._apply_tokenization(seq) for seq in data]

    def _apply_tokenization(self, data: str) -> List[str]:
        """Split the sequence into tokens and apply lowercase if `to_lower_case` is set.

        Args:
            data: Input sequence.

        Returns:
            A list of tokens.
        """
        if self.tokenize_fn:
            # Use the user-supplied tokenizer when one was provided.
            data = self.tokenize_fn(data)
        else:
            # Otherwise fall back to splitting on whitespace.
            data = data.split()
        if self.to_lower_case:
            data = [token.lower() for token in data]
        return data
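
Because `in_list` and `out_list` are both set to `True`, `forward` receives a list of sequences and returns a list of token lists in the same order. The sketch below imitates that list-in, list-out shape in plain Python. `forward_sketch` is a hypothetical helper written for this illustration; it assumes the default whitespace split and does not call FastEstimator.

```python
from typing import List


def forward_sketch(data: List[str], to_lower_case: bool = False) -> List[List[str]]:
    """Hypothetical stand-in for `forward`, using the default whitespace split."""
    results = []
    for seq in data:
        tokens = seq.split()
        if to_lower_case:
            tokens = [token.lower() for token in tokens]
        results.append(tokens)
    return results


# Two sequences in, two token lists out, order preserved.
print(forward_sketch(["The quick fox", "Jumps OVER"], to_lower_case=True))
# [['the', 'quick', 'fox'], ['jumps', 'over']]
```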