mitmovie_ner
get_sentences_and_labels
Combines tokens into sentences and creates vocabulary sets for the training data and labels.
For simplicity, tokens with the 'O' entity are omitted.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the downloaded dataset file. | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[List[str]], Set[str], Set[str]] | (sentences, labels, train_vocab, label_vocab) |
Source code in fastestimator/fastestimator/dataset/data/mitmovie_ner.py
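The grouping step described above can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: `parse_bio_lines` is a hypothetical helper, the column order (label first, then token) is an assumption, the sample lines are made up for illustration, and the omission of 'O' is read here as applying to the label vocabulary only.

```python
from typing import Iterable, List, Set, Tuple


def parse_bio_lines(
    lines: Iterable[str],
) -> Tuple[List[str], List[List[str]], Set[str], Set[str]]:
    """Hypothetical helper: group tab-separated BIO lines into sentences."""
    sentences: List[str] = []
    labels: List[List[str]] = []
    token_vocab: Set[str] = set()
    label_vocab: Set[str] = set()
    words: List[str] = []
    tags: List[str] = []

    def flush() -> None:
        # Close the sentence being built, if it has any tokens.
        if words:
            sentences.append(" ".join(words))
            labels.append(list(tags))
            words.clear()
            tags.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line marks a sentence boundary.
            flush()
            continue
        # Assumed column order: label, then token.
        tag, word = line.split("\t", 1)
        words.append(word)
        tags.append(tag)
        token_vocab.add(word)
        if tag != "O":
            # Assumption: 'O' is left out of the label vocabulary only.
            label_vocab.add(tag)
    flush()  # Handle input that does not end with a blank line.
    return sentences, labels, token_vocab, label_vocab


# Made-up sample in the assumed layout, two sentences separated by a blank line.
sample = [
    "O\twhat",
    "O\tmovies",
    "O\tstar",
    "B-ACTOR\tbruce",
    "I-ACTOR\twillis",
    "",
    "O\tshow",
    "O\tme",
    "B-GENRE\tcomedy",
    "O\tfilms",
]
sentences, labels, token_vocab, label_vocab = parse_bio_lines(sample)
```

The four returned values mirror the documented return tuple: one joined string per sentence, one list of tags per sentence, and two sets.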
load_data
Load and return the MIT Movie dataset.
The MIT Movie dataset is a semantically tagged training and test corpus in BIO format. Each sentence is encoded as one token per line, with information provided in tab-separated columns. Sourced from https://groups.csail.mit.edu/sls/downloads/movie/
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | Optional[str] | The path to store the downloaded data. When | None |
Returns:
Type | Description |
---|---|
Tuple[NumpyDataset, NumpyDataset, Set[str], Set[str]] | (train_data, eval_data, train_vocab, label_vocab) |
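A short usage sketch may help show how the returned tuple is unpacked. It assumes FastEstimator is installed and that the module is importable as `fastestimator.dataset.data.mitmovie_ner` (inferred from the source path above); `summarize_mitmovie` is a hypothetical wrapper, and calling it triggers the dataset download.

```python
from typing import Dict, Optional


def summarize_mitmovie(root_dir: Optional[str] = None) -> Dict[str, int]:
    """Hypothetical wrapper: load the dataset and report basic sizes."""
    # Imported lazily so this sketch can be defined without FastEstimator installed.
    from fastestimator.dataset.data import mitmovie_ner

    # Unpack in the documented order of the return tuple.
    train_data, eval_data, train_vocab, label_vocab = mitmovie_ner.load_data(root_dir=root_dir)
    return {
        "train_size": len(train_data),
        "eval_size": len(eval_data),
        "train_vocab_size": len(train_vocab),
        "label_vocab_size": len(label_vocab),
    }
```

Calling `summarize_mitmovie()` with no argument uses the default `root_dir` of `None`.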