Advanced Tutorial 1: Dataset¶
Overview¶
In this tutorial, we will talk about the following topics: dataset summary, dataset splitting, global dataset editing, and BatchDataset.
Before going through this tutorial, it is recommended to check Beginner Tutorial 2 for a basic understanding of datasets from PyTorch and FastEstimator. In this tutorial we will cover the fe.dataset API in more detail.
Dataset summary¶
As we mentioned in the previous tutorial, users can import our inherited dataset classes for easy use in a Pipeline. But how do we know what keys are available in a dataset? One easy way is simply to call dataset[0] and check the keys of the returned dictionary (a quick example of this is shown after the summary below). However, there is a more elegant way to inspect a dataset: dataset.summary().
from fastestimator.dataset.data.mnist import load_data
train_data, eval_data = load_data()
train_data.summary()
<DatasetSummary {'num_instances': 60000, 'keys': {'x': <KeySummary {'shape': [28, 28], 'dtype': 'uint8'}>, 'y': <KeySummary {'num_unique_values': 10, 'shape': [], 'dtype': 'uint8'}>}}>
Or even more simply, by invoking the print function:
print(train_data)
{"keys": {"x": {"dtype": "uint8", "shape": [28, 28]}, "y": {"dtype": "uint8", "num_unique_values": 10, "shape": []}}, "num_instances": 60000}
Dataset Splitting¶
Dataset splitting is nothing new in machine learning. In FastEstimator, users can easily split their data in different ways.
Random Fraction Split¶
Let's say we want to randomly split 50% of the evaluation data into test data. This is easily accomplished:
test_data = eval_data.split(0.5)
Or if we want to split the evaluation data into two test datasets, each containing 20% of the evaluation data:
test_data1, test_data2 = eval_data.split(0.2, 0.2)
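Note that splitting is destructive: whatever samples end up in the new datasets are removed from the original eval_data. A quick sanity check with len() (a sketch; the exact counts will depend on any splits you have already performed):
print(len(eval_data), len(test_data1), len(test_data2))  # the two new datasets' samples come out of eval_data's previous size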
Random Count Split¶
Sometimes instead of fractions, we want an actual number of examples to split; for example, randomly splitting 100 samples from the evaluation dataset:
test_data3 = eval_data.split(100)
And of course, we can generate multiple datasets by providing multiple inputs:
test_data4, test_data5 = eval_data.split(100, 100)
Index Split¶
There are times when we need to split the dataset in a specific way. For that, you can provide a list of indices. For example, if we want to split the 0th, 1st, and 100th elements of the evaluation dataset into a new test set:
test_data6 = eval_data.split([0,1,100])
If you just want a continuous range of indices, an easy way is to pass a range:
test_data7 = eval_data.split(range(100))
Needless to say, you can provide multiple inputs too:
test_data8, test_data9 = eval_data.split([0, 1, 2], [3, 4, 5])
Global Dataset Editing¶
In deep learning, we usually process the dataset batch by batch. However, when handling tabular data, we might need to apply some transformations globally before training. For example, we may want to standardize the tabular data using sklearn:
from fastestimator.dataset.data.breast_cancer import load_data
from sklearn.preprocessing import StandardScaler
train_data, eval_data = load_data()
scaler = StandardScaler()
train_data["x"] = scaler.fit_transform(train_data["x"])
eval_data["x"] = scaler.transform(eval_data["x"])
BatchDataset¶
There might be scenarios where we need to combine multiple datasets together into one dataset in a specific way. Let's consider three such use-cases now:
Deterministic Batching¶
Let's say we have the mnist and cifar datasets and want to combine them with a total batch size of 8. If we always want 4 examples from mnist and the rest from cifar:
from fastestimator.dataset.data import mnist, cifar10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x", label_key="y")
cifar_data, _ = cifar10.load_data(image_key="x", label_key="y")
dataset_deterministic = BatchDataset(datasets=[mnist_data, cifar_data], num_samples=[4,4])
# dataset_deterministic is now ready to use in a Pipeline (you may need to resize the images so they have consistent shapes)
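As a minimal sketch of how this dataset might then be used (assumption: since the BatchDataset already defines complete batches, no batch_size is passed to the Pipeline, and any resizing ops are omitted here):
import fastestimator as fe
pipeline = fe.Pipeline(train_data=dataset_deterministic)  # batch_size is left unset because the dataset already batches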
Distribution Batching¶
Some people prefer randomness in a batch. For example, given a total batch size of 8, let's say we want each sample to be drawn from mnist with probability 0.5 and from cifar with probability 0.5:
from fastestimator.dataset.data import mnist, cifar10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x", label_key="y")
cifar_data, _ = cifar10.load_data(image_key="x", label_key="y")
dataset_distribution = BatchDataset(datasets=[mnist_data, cifar_data], num_samples=8, probability=[0.5, 0.5])
# dataset_distribution is now ready to use in a Pipeline (you may need to resize the images so they have consistent shapes)
Unpaired Dataset¶
Some deep learning tasks require random unpaired datasets. For example, in image-to-image translation (like CycleGAN), the system needs to randomly sample one horse image and one zebra image for every batch. In FastEstimator, BatchDataset can also handle unpaired datasets. The only restriction is that the keys from the different datasets must be unique when they are unpaired. For example, let's sample one image from mnist and one image from cifar for every batch:
from fastestimator.dataset.data import mnist, cifar10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x_mnist", label_key="y_mnist")
cifar_data, _ = cifar10.load_data(image_key="x_cifar", label_key="y_cifar")
dataset_unpaired = BatchDataset(datasets=[mnist_data, cifar_data], num_samples=[1,1])
# ready to use dataset_unpaired in Pipeline