imdb_review

load_data

Load and return the IMDB Movie review dataset.

This dataset contains 25,000 reviews, each labeled by sentiment (positive or negative).

Parameters:

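| Name | Type | Description | Default |
|---|---|---|---|
| max_len | int | Maximum desired length of an input sequence; shorter sequences are padded with zeros up to this length. | required |
| vocab_size | int | Vocabulary size used for learning word embeddings; passed to Keras as num_words. | required |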

Returns:

| Type | Description |
|---|---|
| Tuple[NumpyDataset, NumpyDataset] | (train_data, eval_data), each holding the padded sequences under "x" and the labels under "y". |

Source code in fastestimator/fastestimator/dataset/data/imdb_review.py
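from typing import List, Tuple

import numpy as np
import tensorflow as tf

from fastestimator.dataset.numpy_dataset import NumpyDataset


def load_data(max_len: int, vocab_size: int) -> Tuple[NumpyDataset, NumpyDataset]:
    """Load and return the IMDB movie review dataset.

    This dataset contains 25,000 reviews, each labeled by sentiment (positive or negative).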

    Args:
        max_len: Maximum desired length of an input sequence.
        vocab_size: Vocabulary size to learn word embeddings.

    Returns:
        (train_data, eval_data)
    """
    (x_train, y_train), (x_eval, y_eval) = tf.keras.datasets.imdb.load_data(maxlen=max_len, num_words=vocab_size)
    # pad the sequences to max length
    x_train = np.array([pad(x, max_len, 0) for x in x_train])
    x_eval = np.array([pad(x, max_len, 0) for x in x_eval])

    train_data = NumpyDataset({"x": x_train, "y": y_train})
    eval_data = NumpyDataset({"x": x_eval, "y": y_eval})
    return train_data, eval_data
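
The padding step in load_data can be illustrated without downloading anything. The following is a minimal sketch on toy data, not the library code: pad_to_length and build_split are hypothetical helper names, the toy reviews are made up, and plain Python lists stand in for the NumPy arrays and NumpyDataset objects. It assumes every sequence is already no longer than max_len.

```python
from typing import Dict, List


def pad_to_length(seq: List[int], length: int, value: int = 0) -> List[int]:
    """Right-pad seq with value; sequences already at or over length are unchanged."""
    return seq + [value] * max(0, length - len(seq))


def build_split(sequences: List[List[int]], labels: List[int], max_len: int) -> Dict[str, list]:
    """Pad every sequence to max_len and pair the result with its labels."""
    padded = [pad_to_length(seq, max_len) for seq in sequences]
    return {"x": padded, "y": labels}


# Toy stand-ins for encoded reviews (word indices) and sentiment labels.
toy_reviews = [[1, 14, 22], [1, 7], [1, 9, 5, 3]]
toy_labels = [1, 0, 1]

split = build_split(toy_reviews, toy_labels, max_len=5)
row_lengths = {len(row) for row in split["x"]}

print(split["x"])   # [[1, 14, 22, 0, 0], [1, 7, 0, 0, 0], [1, 9, 5, 3, 0]]
print(row_lengths)  # {5}
```

Because every row ends up with the same length, the padded list can be converted into a rectangular 2-D array, which is what the np.array call in load_data relies on.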

pad

Pad an input_list to a given size.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_list | List[int] | The list to be padded. | required |
| padding_size | int | The desired length of the returned list. | required |
| padding_value | int | The value to be inserted for padding. | required |

Returns:

| Type | Description |
|---|---|
| List[int] | input_list with padding_value appended until its length reaches padding_size. |

Source code in fastestimator/fastestimator/dataset/data/imdb_review.py
def pad(input_list: List[int], padding_size: int, padding_value: int) -> List[int]:
    """Pad an input_list to a given size.

    Args:
        input_list: The list to be padded.
        padding_size: The desired length of the returned list.
        padding_value: The value to be inserted for padding.

    Returns:
        `input_list` with `padding_value`s appended until the `padding_size` is reached.
    """
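    # Clamp the pad count at zero so that a list longer than `padding_size` is returned unchanged
    # rather than being lengthened further.
    return input_list + [padding_value] * max(0, padding_size - len(input_list))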
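
How the number of padding elements is computed matters once an input is longer than the target size. The sketch below compares two hypothetical helpers, pad_with_abs and pad_with_clamp (neither is the library function): one uses the absolute difference of the lengths, the other clamps the difference at zero. Both agree for short inputs and differ only for inputs that exceed the target.

```python
from typing import List


def pad_with_abs(items: List[int], size: int, value: int) -> List[int]:
    """Pad count is the absolute length difference, so over-long lists still grow."""
    return items + [value] * abs(len(items) - size)


def pad_with_clamp(items: List[int], size: int, value: int) -> List[int]:
    """Pad count is clamped at zero, so over-long lists are returned unchanged."""
    return items + [value] * max(0, size - len(items))


short = [4, 8, 15]
too_long = [4, 8, 15, 16, 23]

print(pad_with_abs(short, 5, 0))       # [4, 8, 15, 0, 0]
print(pad_with_clamp(short, 5, 0))     # [4, 8, 15, 0, 0]
print(pad_with_abs(too_long, 3, 0))    # [4, 8, 15, 16, 23, 0, 0]
print(pad_with_clamp(too_long, 3, 0))  # [4, 8, 15, 16, 23]
```

Neither variant truncates; a caller that needs a strict upper bound on length would have to slice the list separately.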