Download Wikitext-103 data and return the downloaded file paths.
The training data contains 28,475 wiki articles (103 million tokens); the evaluation data contains 60 wiki articles (240k tokens). Since the original WikiText dataset URL is no longer available, the data is fetched from the Hugging Face datasets mirror. The training split is distributed as two parquet files, while the test and validation splits are each provided as a single parquet file. For simplicity, only the first half of the training split (900k rows) is downloaded.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root_dir` | `Optional[str]` | Download parent path. Defaults to None. | `None` |
Returns:

| Type | Description |
| --- | --- |
| `Tuple[str, str, str]` | The file paths for the train, eval, and test splits. |
Source code in fastestimator/fastestimator/dataset/data/wikitext_103.py
```python
def load_data(root_dir: Optional[str] = None) -> Tuple[str, str, str]:
    """Download Wikitext-103 data and return the downloaded file paths.

    The training data contains 28,475 wiki articles (103 million tokens); the evaluation data contains 60 wiki
    articles (240k tokens). Since the original WikiText dataset URL is no longer available, the data is fetched
    from the Hugging Face datasets mirror. The training split is distributed as two parquet files, while the
    test and validation splits are each provided as a single parquet file. For simplicity, only the first half
    of the training split (900k rows) is downloaded.

    Args:
        root_dir: Download parent path. Defaults to None.

    Returns:
        Tuple[str, str, str]: The file paths for the train, eval, and test splits.
    """
    # Set up path
    home = str(Path.home())
    if root_dir is None:
        root_dir = os.path.join(home, 'fastestimator_data', 'wiki_text_103')
    else:
        root_dir = os.path.join(os.path.abspath(root_dir), 'wiki_text_103')
    os.makedirs(root_dir, exist_ok=True)
    test_file = download_file(
        'https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/test-00000-of-00001.parquet',
        root_dir)
    eval_file = download_file(
        'https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/validation-00000-of-00001.parquet',
        root_dir)
    train_file = download_file(
        "https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/train-00000-of-00002.parquet",
        root_dir)
    return train_file, eval_file, test_file
```
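The directory logic above can be exercised without triggering any downloads. The sketch below (stdlib only; `resolve_root` is a hypothetical helper extracted for illustration, not part of the FastEstimator API) mirrors how `root_dir=None` falls back to `~/fastestimator_data/wiki_text_103`, while an explicit path gets `wiki_text_103` nested inside it:

```python
import os
from pathlib import Path
from typing import Optional


def resolve_root(root_dir: Optional[str] = None) -> str:
    # Mirrors the path set-up in load_data: default under the user's home
    # directory, otherwise nest 'wiki_text_103' inside the given directory.
    home = str(Path.home())
    if root_dir is None:
        return os.path.join(home, 'fastestimator_data', 'wiki_text_103')
    return os.path.join(os.path.abspath(root_dir), 'wiki_text_103')


print(resolve_root('/tmp/data'))  # e.g. '/tmp/data/wiki_text_103' on POSIX
print(resolve_root())             # ends with 'fastestimator_data/wiki_text_103'
```

Because `os.makedirs(..., exist_ok=True)` is used in the real function, calling `load_data` repeatedly with the same `root_dir` is safe; already-downloaded parquet files are reused by `download_file` rather than re-fetched.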