Tutorial 11: Debugging¶
Overview¶
In this tutorial we are going to cover:
- Pipeline Debugging
- Network Debugging
- Trace Debugging
Pipeline Debugging¶
In Tutorial 4 we demonstrated what the Pipeline is and how it is created to handle different preprocessing tasks using NumpyOps. Since the pipeline consists of a series of NumpyOps, it's vital to know how to debug them.
It is also good practice to inspect the results of the pipeline and ensure that the output is what you expected.
There are two ways we can debug the Pipeline:
- Debug a single NumpyOp
- Debug and verify the results of the Pipeline
Debugging a Single NumpyOp¶
We will first create a simple Pipeline with a few NumpyOps that add random noise and rotate the image.
Now, if we want to debug the variable values in the AddNoise op, we will do the following (a small debugger sketch follows this list):
- Set num_process=0 to disable multiprocessing
- Add your choice of debugger, such as the Python debugger (PDB), an IDE-specific debugger, or print statements, in the NumpyOp
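For example, here is a minimal sketch of pausing inside a NumpyOp with PDB (the DebugOp name is hypothetical, used only for illustration):
import pdb
from fastestimator.op.numpyop import NumpyOp

class DebugOp(NumpyOp):  # hypothetical op, for illustration only
    def forward(self, data, state):
        pdb.set_trace()  # execution pauses here; inspect `data` and `state` interactively
        return data
Note that the interactive prompt will only appear when num_process=0, since worker processes are not attached to your terminal.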
import numpy as np
import fastestimator as fe
from fastestimator.dataset.data import mnist
from fastestimator.op.numpyop import NumpyOp
from fastestimator.op.numpyop.multivariate import Rotate
from fastestimator.architecture.tensorflow import LeNet
train_data, eval_data = mnist.load_data()
test_data = eval_data.split(0.5)
model = fe.build(model_fn=LeNet, optimizer_fn="adam")
A single NumpyOp can be debugged in two ways:
- Using Pipeline.transform
- Running the training loop
Using Pipeline.transform¶
class AddNoiseDebug(NumpyOp):
    def __init__(self, inputs, outputs, mode=None):
        super().__init__(inputs, outputs, mode)

    def forward(self, data, state):
        noise = np.random.normal(0, 1, size=data.shape)
        print('Noise shape ', noise.shape)
        print('data shape ', data.shape)  # add print statements to check the data and noise
        data = data + noise
        return data
debug_pipeline = fe.Pipeline(train_data=train_data,
                             eval_data=eval_data,
                             test_data=test_data,
                             batch_size=3,
                             ops=[AddNoiseDebug(inputs='x', outputs='x_out'),
                                  Rotate(image_in="x_out", image_out="x_out", limit=60)],
                             num_process=0)
results = debug_pipeline.transform(train_data[0], mode='train')
Noise shape  (28, 28)
data shape  (28, 28)
Running the Training Loop¶
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp

network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce", mode="!infer"),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
NOTE: The training logs will print extra debug messages if warmup is not set to False in Estimator.fit. When warmup is enabled, it performs a test run of both the Pipeline and the Network to make sure that training will not fail at a later stage.
estimator = fe.Estimator(pipeline=debug_pipeline,
                         network=network,
                         epochs=1,
                         train_steps_per_epoch=1,
                         eval_steps_per_epoch=1)
estimator.fit(warmup=False)
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
FastEstimator-Start: step: 1; logging_interval: 100; num_device: 0;
Noise shape  (28, 28)
data shape  (28, 28)
FastEstimator-Warn: The following key(s) are being pruned since they are unused outside of the Pipeline. To prevent this, you can declare the key(s) as inputs to Traces or TensorOps: x
FastEstimator-Train: step: 1; ce: 27.005692;
FastEstimator-Train: step: 1; epoch: 1; epoch_time(sec): 2.64;
Noise shape  (28, 28)
data shape  (28, 28)
Eval Progress: 1/1;
FastEstimator-Eval: step: 1; epoch: 1; ce: 19.016918;
FastEstimator-Finish: step: 1; model_lr: 0.001; total_time(sec): 2.89;
Debugging and Verifying the Pipeline Results¶
In order to debug and verify the pipeline results, we will use pipeline.get_results(). You can also visualize the results using the utility functions.
from fastestimator.util import BatchDisplay, GridDisplay, ImageDisplay
data = debug_pipeline.get_results()
img = GridDisplay([BatchDisplay(image=data["x"], title="original image"),
                   BatchDisplay(image=data["x_out"], title="pipeline output")])
img.show()
Noise shape  (28, 28)
data shape  (28, 28)
Noise shape  (28, 28)
data shape  (28, 28)
Noise shape  (28, 28)
data shape  (28, 28)
Network Debugging¶
Network defines the model and the operations that need to be performed on it. It is composed of a series of TensorOps and can be debugged in a similar fashion to the Pipeline:
- Debugging a single TensorOp
- Verifying and debugging the network results
Debugging a Single TensorOp¶
A TensorOp can be debugged in two ways:
- Using network.transform
- Running the training loop
Using Network.transform¶
We will add a custom TensorOp that prints the prediction values, and run the forward step through the network using network.transform.
import tensorflow as tf
from fastestimator.op.tensorop import TensorOp
class CustomTensorOp(TensorOp):
    def forward(self, data, state):
        pred = data[0]
        labels = data[1]
        print('Predictions:\n ', pred)
pipeline = fe.Pipeline(train_data=train_data,
                       eval_data=eval_data,
                       test_data=test_data,
                       batch_size=3,
                       ops=[Rotate(image_in="x", image_out="x_out", limit=60)])
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce"),
                          CustomTensorOp(inputs=("y_pred", "y")),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
test_data = pipeline.get_results(mode="test")
test_data = network.transform(test_data, mode="test")
Predictions:
 tf.Tensor(
[[1.84605952e-13 1.35236891e-21 8.96912971e-26 7.53040075e-10 2.92707147e-09
  4.52285354e-24 2.93159675e-13 1.13306120e-09 1.00000000e+00 4.61551404e-14]
 [8.96088013e-15 1.07297582e-11 6.47558979e-22 4.55104646e-06 1.12154931e-01
  8.58035724e-24 2.32068499e-16 8.77452672e-01 1.03877764e-02 4.05513090e-09]
 [1.15011451e-12 1.38708302e-24 1.12393150e-11 1.17335359e-08 7.94216305e-07
  1.21484445e-23 3.15572667e-20 1.02369202e-09 2.45853816e-03 9.97540593e-01]], shape=(3, 10), dtype=float32)
As you can see, the predictions are printed by the print statement in the CustomTensorOp.
Running the Training Loop¶
class CustomTensorOp(TensorOp):
    def forward(self, data, state):
        pred = data[0]
        labels = data[1]
        tf.print(pred.shape)  # tf.print works in both graph mode and eager mode
        print(labels.shape)  # print only works in eager mode for TensorFlow
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce"),
                          CustomTensorOp(inputs=("y_pred", "y")),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
If you are using the TensorFlow backend, it's important to either set eager=True in Estimator.fit() or use the tf.print function to print in graph mode. For the PyTorch backend, you can use any of your favorite debuggers, and no graph-mode settings are required.
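For the PyTorch case, a minimal sketch (assuming the model in ModelOp was built from a torch model_fn; the TorchDebugOp name is hypothetical) would be to pause inside a custom TensorOp with the standard Python debugger:
import pdb
from fastestimator.op.tensorop import TensorOp

class TorchDebugOp(TensorOp):  # hypothetical op, for illustration only
    def forward(self, data, state):
        pdb.set_trace()  # pauses training; with the PyTorch backend `data` holds regular torch.Tensors
        return data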
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=2,
                         train_steps_per_epoch=2,
                         eval_steps_per_epoch=1,
                         log_steps=None)
estimator.fit(eager=True)
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
Debugging and Verifying the Network Results¶
Now let's look at how to verify the output of the network using network.transform.
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce", mode="!infer"),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
We will take the output of the pipeline and feed it into the network to verify that the network produces the expected output.
test_data = pipeline.get_results(mode="test")
test_data = network.transform(test_data, mode="test")
print('Labels: ', test_data['y'])
print('Predictions: ', np.argmax(test_data['y_pred'].numpy(), axis=1))
Labels:  tf.Tensor([3 3 2], shape=(3,), dtype=uint8)
Predictions:  [3 9 3]
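As a quick numeric sanity check on top of this comparison, you could (for example) compute the batch accuracy directly from these arrays:
# fraction of correct predictions in this batch (sketch; uses the test_data dict from above)
batch_accuracy = np.mean(np.argmax(test_data['y_pred'].numpy(), axis=1) == test_data['y'].numpy())
print('Batch accuracy: ', batch_accuracy)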
Trace Debugging¶
Conditional Debugging During Training¶
What if our training gives strange results for specific samples after a certain number of training epochs? How do you debug such sample data based on specific conditions?
In Tutorial 7 we introduced the Trace and its various use cases during training. Since a trace allows us to control the training loop, it is possible to add conditions that suit your needs to debug the training code.
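For instance, here is a minimal sketch (the BreakOnStep name is hypothetical) of a trace that drops into the Python debugger once training reaches a chosen global step:
import pdb
from fastestimator.trace import Trace

class BreakOnStep(Trace):  # hypothetical trace, for illustration only
    def __init__(self, step, mode="train"):
        super().__init__(mode=mode)
        self.step = step

    def on_batch_end(self, data):
        if self.system.global_step == self.step:
            pdb.set_trace()  # inspect `data` and `self.system` interactively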
Let's write a trace that prints the predictions and label values for the second batch of each epoch during training. To access the training information inside the Trace, you can use the System instance as explained in Advanced Tutorial 4.
from fastestimator.trace import Trace
from fastestimator.trace.metric import Accuracy
class MonitorResult(Trace):
    def on_batch_end(self, data):
        if self.system.batch_idx == 2:
            predictions = np.argmax(data[self.inputs[1]].numpy(), axis=1)
            print('Current global step: ', self.system.global_step)
            print("Batch true labels: ", data[self.inputs[0]])
            print("Batch predictions: ", predictions)
traces = [
    Accuracy(true_key="y", pred_key="y_pred"),
    MonitorResult(inputs=("y", "y_pred"), mode='train')
]
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=3,
                         traces=traces,
                         train_steps_per_epoch=15,
                         log_steps=None)
estimator.fit()
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
Current global step:  2
Batch true labels:  tf.Tensor([6 2 6], shape=(3,), dtype=uint8)
Batch predictions:  [4 3 4]
Current global step:  17
Batch true labels:  tf.Tensor([4 1 7], shape=(3,), dtype=uint8)
Batch predictions:  [6 1 1]
Current global step:  32
Batch true labels:  tf.Tensor([7 5 1], shape=(3,), dtype=uint8)
Batch predictions:  [4 3 2]
Debugging Pipeline and Network From a Trace¶
A Trace can also be used to debug the results of the Pipeline and the Network. In the previous example, we printed the predictions and labels for the second batch of each epoch. But what if we want to debug the actual pipeline data used in training?
We can use the same conditional debugging as before, only this time printing the results of the Pipeline. Let's write a trace that prints the pipeline output whenever the loss value exceeds 3 from the second epoch onward.
model = fe.build(model_fn=LeNet, optimizer_fn="adam")
class MonitorPipelineResults(Trace):
    def __init__(self, true_key, pred_key, mode="train"):
        super().__init__(inputs=(true_key, pred_key), mode=mode)

    def on_batch_end(self, data):
        if data['ce'] > 3 and self.system.epoch_idx >= 2:
            print('\nLoss is above 3. Check the pipeline results!')
            print(data['x_out'])
traces = [
    Accuracy(true_key="y", pred_key="y_pred"),
    MonitorPipelineResults(true_key="y", pred_key="y_pred")
]
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=3,
                         traces=traces,
                         train_steps_per_epoch=2)
estimator.fit()
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
FastEstimator-Start: step: 1; logging_interval: 100; num_device: 0;
FastEstimator-Train: step: 1; ce: 2.977018;
FastEstimator-Train: step: 2; epoch: 1; epoch_time(sec): 1.65;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 586.37;
Eval Progress: 1110/1666; steps/sec: 628.18;
Eval Progress: 1666/1666; steps/sec: 633.28;
FastEstimator-Eval: step: 2; epoch: 1; accuracy: 0.3306; ce: 1.9816521;
FastEstimator-Train: step: 4; epoch: 2; epoch_time(sec): 1.71;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 548.44;
Eval Progress: 1110/1666; steps/sec: 622.79;
Eval Progress: 1666/1666; steps/sec: 632.5;
FastEstimator-Eval: step: 4; epoch: 2; accuracy: 0.3474; ce: 1.949829;
FastEstimator-Train: step: 6; epoch: 3; epoch_time(sec): 1.51;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 476.68;
Eval Progress: 1110/1666; steps/sec: 548.92;
Eval Progress: 1666/1666; steps/sec: 537.1;
FastEstimator-Eval: step: 6; epoch: 3; accuracy: 0.3552; ce: 1.9489236;
FastEstimator-Finish: step: 6; model_lr: 0.001; total_time(sec): 18.76;