Advanced Tutorial 1: Dataset¶
Overview¶
In this tutorial, we will talk about the following topics:
- Dataset Summary
- Dataset Splitting
- Global Dataset Editing
- BatchDataset
- InterleaveDataset
- Related Apphub Examples
Before going through this tutorial, it is recommended to check Beginner Tutorial 2 for a basic understanding of datasets in PyTorch and FastEstimator. In this tutorial we will go into more detail about the fe.dataset API.
Dataset Summary¶
As mentioned in the previous tutorial, users can import our inherited dataset classes for easy use in Pipeline. But how do we know which keys are available in a dataset? One easy way is to simply call dataset[0] and check the keys. However, there's a more elegant way to inspect a dataset: dataset.summary().
from fastestimator.dataset.data.mnist import load_data
train_data, eval_data = load_data()
train_data.summary()
<DatasetSummary {'num_instances': 60000, 'keys': {'x': <KeySummary {'shape': [28, 28], 'dtype': 'uint8'}>, 'y': <KeySummary {'num_unique_values': 10, 'shape': [], 'dtype': 'uint8'}>}}>
Or even more simply, by invoking the print function:
print(train_data)
{"num_instances": 60000, "keys": {"x": {"shape": [28, 28], "dtype": "uint8"}, "y": {"num_unique_values": 10, "shape": [], "dtype": "uint8"}}}
Dataset Splitting¶
Dataset splitting is nothing new in machine learning. In FastEstimator, users can easily split their data in different ways.
Random Fraction Split¶
Let's say we want to randomly split 50% of the evaluation data into test data. This is easily accomplished as follows. As a result of the split, the data placed in test_data is removed from the eval_data instance.
test_data = eval_data.split(0.5)
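As a quick sanity check (a sketch, not part of the original tutorial), we can confirm that the split removes samples from eval_data: the two lengths should add up to the original evaluation-set size (10,000 for MNIST).
print("eval remaining: {}, test: {}".format(len(eval_data), len(test_data)))
# expected: roughly 5000 each, since 50% of the evaluation data was split off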
In addition, if you want to split evaluation data into two test datasets with 20% of the evaluation data each:
test_data1, test_data2 = eval_data.split(0.2, 0.2)
Random Count Split¶
Sometimes instead of fractions, we want an actual number of examples to split; for example, randomly splitting 100 samples from the evaluation dataset:
test_data3 = eval_data.split(100)
And of course, we can generate multiple datasets by providing multiple inputs:
test_data4, test_data5 = eval_data.split(100, 100)
Index Split¶
There are times when we need to split the dataset in a specific way. For that, you can provide a list of indices. For example, if we want to split the 0th, 1st, and 100th elements of the evaluation dataset into a new test set:
test_data6 = eval_data.split([0,1,100])
If you just want a continuous range of indices, here's an easy way to provide them:
test_data7 = eval_data.split(range(100))
Needless to say, you can provide multiple inputs too:
test_data7, test_data8 = eval_data.split([0, 1, 2], [3, 4, 5])
Global Dataset Editing¶
In deep learning, we usually process the dataset batch by batch. However, when handling tabular data, we might need to apply some transformation globally before training. For example, we may want to standardize the tabular data using sklearn:
from fastestimator.dataset.data.breast_cancer import load_data
from sklearn.preprocessing import StandardScaler
train_data, eval_data = load_data()
scaler = StandardScaler()
train_data["x"] = scaler.fit_transform(train_data["x"])
eval_data["x"] = scaler.transform(eval_data["x"])
Another popular use case of global dataset editing is adding a new feature globally to all samples of a dataset. For example, each sample of the above dataset currently has two keys: x and y:
print(train_data[0])
{'x': array([-1.4407529 , -0.43531948, -1.3620849 , -1.139118 , 0.7805734 , 0.7189211 , 2.8231344 , -0.11914958, 1.0926621 , 2.458172 , -0.2638004 , -0.01605244, -0.4704136 , -0.4747609 , 0.8383651 , 3.251027 , 8.438936 , 3.3919873 , 2.6211658 , 2.0612078 , -1.2328612 , -0.47630954, -1.2479202 , -0.9739676 , 0.7228946 , 1.1867324 , 4.672828 , 0.9320124 , 2.0972424 , 1.8864503 ], dtype=float32), 'y': 1}
Let's add a new key named data_name and apply it globally to all samples of the dataset:
train_data["data_name"] = ["breast_cancer" for _ in range(len(train_data))]
print(train_data[0])
{'x': array([-1.4407529 , -0.43531948, -1.3620849 , -1.139118 , 0.7805734 , 0.7189211 , 2.8231344 , -0.11914958, 1.0926621 , 2.458172 , -0.2638004 , -0.01605244, -0.4704136 , -0.4747609 , 0.8383651 , 3.251027 , 8.438936 , 3.3919873 , 2.6211658 , 2.0612078 , -1.2328612 , -0.47630954, -1.2479202 , -0.9739676 , 0.7228946 , 1.1867324 , 4.672828 , 0.9320124 , 2.0972424 , 1.8864503 ], dtype=float32), 'y': 1, 'data_name': 'breast_cancer'}
Now every sample has an additional key and value. This can be used later in an Operator or Trace to perform dataset-conditioned operations.
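For illustration, here is a minimal sketch (the Op name and scaling logic are hypothetical, not from the original tutorial) of a NumpyOp that conditions its behavior on the new data_name key:
from fastestimator.op.numpyop import NumpyOp

class ScaleIfBreastCancer(NumpyOp):  # hypothetical Op for illustration
    def forward(self, data, state):
        x, data_name = data  # inputs=["x", "data_name"] arrive as a list
        if data_name == "breast_cancer":
            x = x * 2.0  # apply a dataset-specific transformation
        return x

# usage sketch: ScaleIfBreastCancer(inputs=["x", "data_name"], outputs="x")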
BatchDataset¶
There might be scenarios where we need to combine multiple datasets together into one dataset in a specific way. Let's consider three such use-cases now:
Deterministic Batching¶
Let's say we have the mnist and cifair datasets, and want to combine them with a total batch size of 8. If we always want 4 examples from mnist and the rest from cifair:
from fastestimator.dataset.data import mnist, cifair10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x", label_key="y")
cifair_data, _ = cifair10.load_data(image_key="x", label_key="y")
dataset_deterministic = BatchDataset(datasets=[mnist_data, cifair_data], num_samples=[4,4])
# dataset_deterministic is ready to use in a Pipeline; you might need to resize the images to a consistent shape
Distribution Batching¶
Some people prefer randomness in a batch. For example, given a total batch size of 8, let's say we want each sample drawn from mnist with probability 0.5 and from cifair with probability 0.5:
from fastestimator.dataset.data import mnist, cifair10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x", label_key="y")
cifair_data, _ = cifair10.load_data(image_key="x", label_key="y")
dataset_distribution = BatchDataset(datasets=[mnist_data, cifair_data], num_samples=8, probability=[0.5, 0.5])
# dataset_distribution is ready to use in a Pipeline; you might need to resize the images to a consistent shape
Unpaired Dataset¶
Some deep learning tasks require random unpaired datasets. For example, in image-to-image translation (like Cycle-GAN), the system needs to randomly sample one horse image and one zebra image for every batch. In FastEstimator, BatchDataset can also handle unpaired datasets. The only restriction is that keys from the two different datasets must be unique for unpaired datasets.
For example, let's sample one image from mnist and one image from cifair for every batch:
from fastestimator.dataset.data import mnist, cifair10
from fastestimator.dataset import BatchDataset
mnist_data, _ = mnist.load_data(image_key="x_mnist", label_key="y_mnist")
cifair_data, _ = cifair10.load_data(image_key="x_cifair", label_key="y_cifair")
dataset_unpaired = BatchDataset(datasets=[mnist_data, cifair_data], num_samples=[1,1])
# ready to use dataset_unpaired in Pipeline
InterleaveDataset¶
When you train a network that performs different tasks using multiple datasets, it is generally good practice to mix the different datasets within a batch to ease convergence. Unfortunately, it is sometimes not possible to merge multiple datasets into one batch, for reasons like:
- Not enough GPU memory: When you have dozens of datasets & tasks, or each sample's data dimension is too large, you may not even fit a batch size of 1 for every dataset.
- Inconsistent data dimension: For a batch to form, each sample must have the same spatial dimensions. However, this may not be feasible in certain situations. For example, you might have one dataset with resolution [128, 128, 9] and another with resolution [384, 384, 256]. Resizing both datasets to a single size would inevitably introduce artifacts or a loss of information.
To overcome this challenge, one solution is to distribute multiple datasets across different training steps in a particular pattern. For example, one such pattern could be:
- step 1: train on dataset1
- step 2: train on dataset2
- step 3: train on dataset1
- step 4: train on dataset2
- ...
We define this general way of dataset distribution as Dataset Interleaving.
Dataset Interleaving¶
One can simply achieve Dataset Interleaving by using the InterleaveDataset API:
from fastestimator.dataset.numpy_dataset import NumpyDataset
from fastestimator.dataset.interleave_dataset import InterleaveDataset
data1 = NumpyDataset(data={"x": [x for x in range(10)], "ds_id": [0 for _ in range(10)]})
data2 = NumpyDataset(data={"x": [x for x in range(10)], "ds_id": [1 for _ in range(10)]})
interleave_data = InterleaveDataset(datasets=[data1, data2])
for idx in range(5):
    print("step: {}, using dataset id: {}".format(idx, interleave_data[idx][0]['ds_id']))
step: 0, using dataset id: 0
step: 1, using dataset id: 1
step: 2, using dataset id: 0
step: 3, using dataset id: 1
step: 4, using dataset id: 0
Custom-Pattern Interleaving¶
By default, InterleaveDataset switches between datasets in a simple rotation. Sometimes a specific rotation pattern is preferred. For example, 2 steps of dataset1 followed by 3 steps of dataset2:
interleave_data = InterleaveDataset(datasets=[data1, data2], pattern=[0, 0, 1, 1, 1])
for idx in range(6):
    print("step: {}, using dataset id: {}".format(idx, interleave_data[idx][0]['ds_id']))
step: 0, using dataset id: 0
step: 1, using dataset id: 0
step: 2, using dataset id: 1
step: 3, using dataset id: 1
step: 4, using dataset id: 1
step: 5, using dataset id: 0
Note that when an interleaving pattern is defined, the length of the interleaved dataset might shrink in order to guarantee full cycles.
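To see this effect (a quick sketch, not part of the original tutorial), you can compare the lengths of the source datasets with the length of the interleaved dataset; the exact value depends on the pattern and the dataset sizes:
print("len(data1): {}, len(data2): {}, len(interleave_data): {}".format(
    len(data1), len(data2), len(interleave_data)))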
Operator Control¶
Since InterleaveDataset involves multiple data sources, later in the pipeline we might want to condition some operators on a specific data source. For example, dataset1 might be a grayscale image dataset while dataset2 is a color image dataset, so we may need one ReadImage Op for dataset1 and another ReadImage Op for dataset2.
For such use cases, InterleaveDataset supports another input syntax:
interleave_data = InterleaveDataset(datasets={"a": data1, "b": data2}, pattern=["a", "a", "b", "b", "b"])
for idx in range(5):
    print("step: {}, using dataset id: {}".format(idx, interleave_data[idx][0]['ds_id']))
step: 0, using dataset id: 0
step: 1, using dataset id: 0
step: 2, using dataset id: 1
step: 3, using dataset id: 1
step: 4, using dataset id: 1
Once the datasets are defined with the dictionary syntax, users can plug the corresponding key name into a Pipeline Operator's ds_id argument to condition that operator on a specific data source:
import fastestimator as fe
from fastestimator.op.numpyop import NumpyOp
class PlusHalf(NumpyOp):
    def forward(self, data, state):
        return data + 0.5
pipeline = fe.Pipeline(train_data=interleave_data, ops=[PlusHalf(inputs="x", outputs="x", ds_id="a")])
batches = pipeline.get_results(mode="train", num_steps=5)
for idx, batch in enumerate(batches):
    print("step: {}, dataset_id: {}, x: {}".format(idx, batch['ds_id'].item(), batch['x'].item()))
step: 0, dataset_id: 0, x: 7.5
step: 1, dataset_id: 0, x: 8.5
step: 2, dataset_id: 1, x: 6
step: 3, dataset_id: 1, x: 8
step: 4, dataset_id: 1, x: 9
As we can see, PlusHalf is only applied to data1 (corresponding to key a), while data2 (corresponding to key b) is left as is. Note that, for InterleaveDataset, operator conditioning is currently only available among Pipeline Operators.
Batch Control¶
Now that we know Pipeline Operators can be conditioned on a particular data source for InterleaveDataset, we can use a different batch size for each data source so that every data source fits in GPU memory. This is done through the Batch Operator.
from fastestimator.op.numpyop import Batch
interleave_data = InterleaveDataset(datasets={"a": data1, "b": data2})
pipeline = fe.Pipeline(train_data=interleave_data,
                       ops=[Batch(batch_size=2, ds_id="a"),
                            Batch(batch_size=3, ds_id="b")])
batches = pipeline.get_results(mode="train", num_steps=4)
for idx, batch in enumerate(batches):
    print("step: {}, dataset_id: {}, batch_size: {}".format(idx, batch['ds_id'].numpy(), batch['ds_id'].size(0)))
step: 0, dataset_id: [0 0], batch_size: 2
step: 1, dataset_id: [1 1 1], batch_size: 3
step: 2, dataset_id: [0 0], batch_size: 2
step: 3, dataset_id: [1 1 1], batch_size: 3