wikitext_103

load_data

Download Wikitext-103 data and return the downloaded file paths.

The training data contains 28,475 Wikipedia articles (about 103 million tokens), and the evaluation data contains 60 articles (about 240k tokens). Since the original WikiText dataset URL is no longer available, this loader uses the copy hosted by Hugging Face Datasets. The training split is distributed as two parquet files, while the test and validation splits are each a single parquet file. For simplicity, only the first of the two training parquet files (roughly the first half of the training set, about 900k rows) is downloaded.

Parameters:

    root_dir (Optional[str]): Download parent path. Defaults to None.

Returns:

    Tuple[str, str, str]: The file paths for the train, eval, and test splits.
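
Example usage (a minimal sketch; the import path fastestimator.dataset.data.wikitext_103 is inferred from the source file location shown below):

from fastestimator.dataset.data.wikitext_103 import load_data

# Default download location is ~/fastestimator_data/wiki_text_103; pass root_dir to
# place the files under <root_dir>/wiki_text_103 instead.
train_file, eval_file, test_file = load_data()
print(train_file, eval_file, test_file)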

Source code in fastestimator/fastestimator/dataset/data/wikitext_103.py
def load_data(root_dir: Optional[str] = None) -> Tuple[str, str, str]:
    """Download Wikitext-103 data and return its downloaded file path.

    The training data contains 28475 wiki articles, 103 million tokens. The evaluation contains 60 wiki articles and
    240k tokens. Since the original wikitext dataset url is no longer available, we are using dataset provided by
    huggingface datasets. The training dataset is provided as to parquet files and test and validation datasets are
    provided as single parquet file each. For simplicity we are providing only the first half of the training dataset with 900k rows.

    Args:
        root_dir: Download parent path. Defaults to None.

    Returns:
        Tuple[str, str, str]: the file path for train, eval and test split.
    """
    # Set up path
    home = str(Path.home())
    if root_dir is None:
        root_dir = os.path.join(home, 'fastestimator_data', 'wiki_text_103')
    else:
        root_dir = os.path.join(os.path.abspath(root_dir), 'wiki_text_103')
    os.makedirs(root_dir, exist_ok=True)

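    # Fetch each split's parquet file from the Hugging Face mirror of WikiText-103 (raw).
    # Only the first of the two training shards is downloaded.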
    test_file = download_file(
        'https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/test-00000-of-00001.parquet',
        root_dir)
    eval_file = download_file(
        'https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/validation-00000-of-00001.parquet',
        root_dir)
    train_file = download_file(
        "https://huggingface.co/datasets/wikitext/resolve/main/wikitext-103-raw-v1/train-00000-of-00002.parquet",
        root_dir)
    return train_file, eval_file, test_file
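
The returned paths point at parquet files, so they can be read with any parquet-capable library. Below is a sketch using pandas (with a parquet engine such as pyarrow installed); the "text" column name is an assumption based on the Hugging Face wikitext schema and is not guaranteed by this loader:

import pandas as pd

from fastestimator.dataset.data.wikitext_103 import load_data

train_file, eval_file, test_file = load_data()

# Read one split; each parquet file holds the raw WikiText lines.
eval_df = pd.read_parquet(eval_file)

# Assumption: the Hugging Face wikitext parquet schema stores each line in a "text" column.
lines = [line for line in eval_df["text"].tolist() if line.strip()]
print(f"{len(lines)} non-empty lines, first one: {lines[0][:80]!r}")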