Tutorial 11: Debugging¶
Overview¶
In this tutorial we are going to cover:
- Pipeline Debugging
- Network Debugging
- Trace Debugging
Pipeline Debugging¶
In Tutorial 4 we demonstrated what the Pipeline is and how it is created to handle different preprocessing tasks using NumpyOps. Since the pipeline consists of a series of NumpyOps, it's vital to know how to debug them.
It is also good practice to inspect the results of the pipeline and ensure that the output is what you expected.
There are two ways we can debug the Pipeline:
- Debug a single NumpyOp
- Debug and verify the results of the Pipeline
Debugging a Single NumpyOp¶
We will first create a simple Pipeline with a few NumpyOps that add random noise and rotate the image.
Now, if we want to debug the variable values in the AddNoise op, we will do the following (a small debugger sketch follows this list):
- Set num_process=0 to disable multiprocessing
- Add your choice of debugger, such as the Python debugger (PDB), an IDE-specific debugger, or print statements, in the NumpyOp
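For example, here is a minimal sketch of pausing inside a NumpyOp with PDB (the DebugOp name is hypothetical, used only for illustration):
import pdb
from fastestimator.op.numpyop import NumpyOp

class DebugOp(NumpyOp):  # hypothetical op, for illustration only
    def forward(self, data, state):
        pdb.set_trace()  # execution pauses here; inspect `data` and `state` interactively
        return data
Note that the interactive prompt will only appear when num_process=0, since worker processes are not attached to your terminal.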
import numpy as np
import fastestimator as fe
from fastestimator.dataset.data import mnist
from fastestimator.op.numpyop import NumpyOp
from fastestimator.op.numpyop.multivariate import Rotate
from fastestimator.architecture.tensorflow import LeNet
train_data, eval_data = mnist.load_data()
test_data = eval_data.split(0.5)
model = fe.build(model_fn=LeNet, optimizer_fn="adam")
A single NumpyOp can be debugged in two ways:
- Using Pipeline.transform
- Running the training loop
Using Pipeline.transform¶
class AddNoiseDebug(NumpyOp):
    def __init__(self, inputs, outputs, mode=None):
        super().__init__(inputs, outputs, mode)

    def forward(self, data, state):
        noise = np.random.normal(0, 1, size=data.shape)
        print('Noise shape ', noise.shape)
        print('data shape ', data.shape)  # add print statements to check the data and noise
        data = data + noise
        return data
debug_pipeline = fe.Pipeline(train_data=train_data,
                             eval_data=eval_data,
                             test_data=test_data,
                             batch_size=3,
                             ops=[AddNoiseDebug(inputs='x', outputs='x_out'),
                                  Rotate(image_in="x_out", image_out="x_out", limit=60)],
                             num_process=0)
results = debug_pipeline.transform(train_data[0], mode='train')
Noise shape  (28, 28)
data shape  (28, 28)
Running the Training Loop¶
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp

network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce", mode="!infer"),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
NOTE: The training logs will print extra debug messages if warmup is not set to False in Estimator.fit. When warmup is enabled, it performs a test run of both the Pipeline and the Network to make sure that training will not fail at a later stage.
estimator = fe.Estimator(pipeline=debug_pipeline,
                         network=network,
                         epochs=1,
                         train_steps_per_epoch=1,
                         eval_steps_per_epoch=1)
estimator.fit(warmup=False)
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
FastEstimator-Start: step: 1; logging_interval: 100; num_device: 0;
Noise shape  (28, 28)
data shape  (28, 28)
FastEstimator-Warn: The following key(s) are being pruned since they are unused outside of the Pipeline. To prevent this, you can declare the key(s) as inputs to Traces or TensorOps: x
FastEstimator-Train: step: 1; ce: 27.005692;
FastEstimator-Train: step: 1; epoch: 1; epoch_time(sec): 2.64;
Noise shape  (28, 28)
data shape  (28, 28)
Eval Progress: 1/1;
FastEstimator-Eval: step: 1; epoch: 1; ce: 19.016918;
FastEstimator-Finish: step: 1; model_lr: 0.001; total_time(sec): 2.89;
Debugging and Verifying the Pipeline Results¶
In order to debug and verify the pipeline results, we will use pipeline.get_results(). You can also visualize the results using the utility functions.
from fastestimator.util import BatchDisplay, GridDisplay, ImageDisplay
data = debug_pipeline.get_results()
img = GridDisplay([BatchDisplay(image=data["x"], title="original image"),
                   BatchDisplay(image=data["x_out"], title="pipeline output")])
img.show()
Noise shape  (28, 28)
data shape  (28, 28)
Noise shape  (28, 28)
data shape  (28, 28)
Noise shape  (28, 28)
data shape  (28, 28)
Network Debugging¶
Network defines the model and the operations that need to be performed on it. It is composed of a series of TensorOps and can be debugged in a similar fashion to the Pipeline:
- Debugging a single TensorOp
- Verifying and debugging the network results
Debugging a Single TensorOp¶
A TensorOp can be debugged in two ways:
- Using network.transform
- Running the training loop
Using Network.transform¶
We will add a custom TensorOp that prints the prediction values, and run the forward step through the network using network.transform.
import tensorflow as tf
from fastestimator.op.tensorop import TensorOp
class CustomTensorOp(TensorOp):
    def forward(self, data, state):
        pred = data[0]
        labels = data[1]
        print('Predictions:\n ', pred)
pipeline = fe.Pipeline(train_data=train_data,
                       eval_data=eval_data,
                       test_data=test_data,
                       batch_size=3,
                       ops=[Rotate(image_in="x", image_out="x_out", limit=60)])
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce"),
                          CustomTensorOp(inputs=("y_pred", "y")),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
test_data = pipeline.get_results(mode="test")
test_data = network.transform(test_data, mode="test")
Predictions:
 tf.Tensor(
[[1.84605952e-13 1.35236891e-21 8.96912971e-26 7.53040075e-10 2.92707147e-09
  4.52285354e-24 2.93159675e-13 1.13306120e-09 1.00000000e+00 4.61551404e-14]
 [8.96088013e-15 1.07297582e-11 6.47558979e-22 4.55104646e-06 1.12154931e-01
  8.58035724e-24 2.32068499e-16 8.77452672e-01 1.03877764e-02 4.05513090e-09]
 [1.15011451e-12 1.38708302e-24 1.12393150e-11 1.17335359e-08 7.94216305e-07
  1.21484445e-23 3.15572667e-20 1.02369202e-09 2.45853816e-03 9.97540593e-01]], shape=(3, 10), dtype=float32)
As you can see, the predictions are printed by the print statement in the CustomTensorOp.
Running the Training Loop¶
class CustomTensorOp(TensorOp):
    def forward(self, data, state):
        pred = data[0]
        labels = data[1]
        tf.print(pred.shape)  # tf.print works in both graph mode and eager mode
        print(labels.shape)  # print only works in eager mode for TensorFlow
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce"),
                          CustomTensorOp(inputs=("y_pred", "y")),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
If you are using the TensorFlow backend, it's important to either set eager=True in Estimator.fit() or use the tf.print function to print in graph mode. For the PyTorch backend, you can use any of your favorite debuggers, and no graph-mode settings are required.
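For the PyTorch case, a minimal sketch (assuming the model in ModelOp was built from a torch model_fn; the TorchDebugOp name is hypothetical) would be to pause inside a custom TensorOp with the standard Python debugger:
import pdb
from fastestimator.op.tensorop import TensorOp

class TorchDebugOp(TensorOp):  # hypothetical op, for illustration only
    def forward(self, data, state):
        pdb.set_trace()  # pauses training; with the PyTorch backend `data` holds regular torch.Tensors
        return data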
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=2,
                         train_steps_per_epoch=2,
                         eval_steps_per_epoch=1,
                         log_steps=None)
estimator.fit(eager=True)
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
TensorShape([3, 10])
(3,)
Debugging and Verifying the Network Results¶
Now let's look at how to verify the output of the network using network.transform.
network = fe.Network(ops=[ModelOp(model=model, inputs="x_out", outputs="y_pred"),  # default mode=None
                          CrossEntropy(inputs=("y_pred", "y"), outputs="ce", mode="!infer"),
                          UpdateOp(model=model, loss_name="ce", mode="train")])
We will take the output of the pipeline and feed it into the network to verify that the network produces the expected output.
test_data = pipeline.get_results(mode="test")
test_data = network.transform(test_data, mode="test")
print('Labels: ', test_data['y'])
print('Predictions: ', np.argmax(test_data['y_pred'].numpy(), axis=1))
Labels:  tf.Tensor([3 3 2], shape=(3,), dtype=uint8)
Predictions:  [3 9 3]
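As a quick numeric sanity check on top of this comparison, you could (for example) compute the batch accuracy directly from these arrays:
# fraction of correct predictions in this batch (sketch; uses the test_data dict from above)
batch_accuracy = np.mean(np.argmax(test_data['y_pred'].numpy(), axis=1) == test_data['y'].numpy())
print('Batch accuracy: ', batch_accuracy)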
Trace Debugging¶
Conditional Debugging During Training¶
What if our training gives strange results for specific samples after a certain number of training epochs? How do you debug such sample data based on specific conditions?
In Tutorial 7 we introduced the Trace and its various use cases during training. Since a trace allows us to control the training loop, it is possible to add conditions that suit your needs to debug the training code.
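For instance, here is a minimal sketch (the BreakOnStep name is hypothetical) of a trace that drops into the Python debugger once training reaches a chosen global step:
import pdb
from fastestimator.trace import Trace

class BreakOnStep(Trace):  # hypothetical trace, for illustration only
    def __init__(self, step, mode="train"):
        super().__init__(mode=mode)
        self.step = step

    def on_batch_end(self, data):
        if self.system.global_step == self.step:
            pdb.set_trace()  # inspect `data` and `self.system` interactively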
Let's write a trace that prints the predictions and label values for the second batch of each epoch during training. To access the training information inside the Trace, you can use the System instance as explained in Advanced Tutorial 4.
from fastestimator.trace import Trace
from fastestimator.trace.metric import Accuracy
class MonitorResult(Trace):
    def on_batch_end(self, data):
        if self.system.batch_idx == 2:
            predictions = np.argmax(data[self.inputs[1]].numpy(), axis=1)
            print('Current global step: ', self.system.global_step)
            print("Batch true labels: ", data[self.inputs[0]])
            print("Batch predictions: ", predictions)
traces = [
    Accuracy(true_key="y", pred_key="y_pred"),
    MonitorResult(inputs=("y", "y_pred"), mode='train')
]
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=3,
                         traces=traces,
                         train_steps_per_epoch=15,
                         log_steps=None)
estimator.fit()
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
Current global step:  2
Batch true labels:  tf.Tensor([6 2 6], shape=(3,), dtype=uint8)
Batch predictions:  [4 3 4]
Current global step:  17
Batch true labels:  tf.Tensor([4 1 7], shape=(3,), dtype=uint8)
Batch predictions:  [6 1 1]
Current global step:  32
Batch true labels:  tf.Tensor([7 5 1], shape=(3,), dtype=uint8)
Batch predictions:  [4 3 2]
Debugging Pipeline and Network From a Trace¶
A Trace can also be used to debug the results of the Pipeline and the Network. In the previous example, we printed the predictions and labels for the second batch of each epoch. But what if we want to debug the actual pipeline data used in training?
We can use the same conditional debugging as before, only this time printing the results of the Pipeline. Let's write a trace that prints the pipeline output whenever the loss value exceeds 3 from the second epoch onward.
model = fe.build(model_fn=LeNet, optimizer_fn="adam")
class MonitorPipelineResults(Trace):
    def __init__(self, true_key, pred_key, mode="train"):
        super().__init__(inputs=(true_key, pred_key), mode=mode)

    def on_batch_end(self, data):
        if data['ce'] > 3 and self.system.epoch_idx >= 2:
            print('\nLoss is above 3. Check the pipeline results!')
            print(data['x_out'])
traces = [
    Accuracy(true_key="y", pred_key="y_pred"),
    MonitorPipelineResults(true_key="y", pred_key="y_pred")
]
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=3,
                         traces=traces,
                         train_steps_per_epoch=2)
estimator.fit()
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
FastEstimator-Start: step: 1; logging_interval: 100; num_device: 0;
FastEstimator-Train: step: 1; ce: 2.977018;
FastEstimator-Train: step: 2; epoch: 1; epoch_time(sec): 1.65;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 586.37;
Eval Progress: 1110/1666; steps/sec: 628.18;
Eval Progress: 1666/1666; steps/sec: 633.28;
FastEstimator-Eval: step: 2; epoch: 1; accuracy: 0.3306; ce: 1.9816521;
FastEstimator-Train: step: 4; epoch: 2; epoch_time(sec): 1.71;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 548.44;
Eval Progress: 1110/1666; steps/sec: 622.79;
Eval Progress: 1666/1666; steps/sec: 632.5;
FastEstimator-Eval: step: 4; epoch: 2; accuracy: 0.3474; ce: 1.949829;
FastEstimator-Train: step: 6; epoch: 3; epoch_time(sec): 1.51;
Eval Progress: 1/1666;
Eval Progress: 555/1666; steps/sec: 476.68;
Eval Progress: 1110/1666; steps/sec: 548.92;
Eval Progress: 1666/1666; steps/sec: 537.1;
FastEstimator-Eval: step: 6; epoch: 3; accuracy: 0.3552; ce: 1.9489236;
FastEstimator-Finish: step: 6; model_lr: 0.001; total_time(sec): 18.76;