Model Quantization Tips

This tutorial provides several tips for how to better quantize your model.

Model quantization uses the TfliteConverter to convert a model with float32 tensors into a model with int8 tensors. A model with int8 tensors typically reduces the memory requirements by 4x and also allows for better hardware acceleration.

While model quantization can greatly reduce memory and computational overhead, it can also reduce model accuracy. This is because converting from float32 (32 bits) to int8 (8 bits) loses information. So, for quantization to be effective, the original 32-bit data must be represented in 8 bits as accurately as possible. While the TfliteConverter does most of the work for you, the following sections provide some tips on how you can help the model quantize even better.
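To see what this loss of precision looks like in practice, the following is a minimal sketch (plain NumPy, not the TfliteConverter's actual algorithm) of the affine int8 mapping real_value ≈ scale * (int8_value - zero_point) and the round-trip error it introduces:

import numpy as np

def quantize_dequantize(x):
    # Per-tensor affine quantization of a float32 tensor to int8 and back
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0                # int8 spans 256 values: -128..127
    zero_point = int(round(-128 - x_min / scale))  # maps x_min to -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    x_hat = (q.astype(np.float32) - zero_point) * scale
    return x_hat, scale

x = np.random.uniform(-1.0, 1.0, size=1024).astype(np.float32)
x_hat, scale = quantize_dequantize(x)
rmse = np.sqrt(np.mean((x - x_hat) ** 2))
print(f'scale={scale:.5f}  rmse={rmse:.5f}  rmse/scale={rmse / scale:.3f}')

The wider the range of the original float values, the larger the quantization scale becomes, and the more information each 8-bit value must absorb. The rmse/scale ratio printed here reappears as an error metric in the quantization report described below.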

Contents

This tutorial is divided into the following sections:

  1. Quantization Report - Describes how to use the MLTK to generate a quantization report to aid with debugging quantization errors

  2. Input Data Normalization - Describes how to normalize the input data so that it can be better quantized

  3. Normalization Layers - Describes some of the ML layers that allow for better model quantization

Quantization Report

TensorFlow-Lite comes with an experimental Quantization Debugger:

Although full-integer quantization provides improved model size and latency, the quantized model won’t always work as expected. It’s usually expected for the model quality (e.g. accuracy, mAP, WER) to be slightly lower than the original float model. However, there are cases where the model quality can go below your expectation or generate completely wrong results.

When this problem happens, it’s tricky and painful to spot the root cause of the quantization error, and it’s even more difficult to fix. To assist with this model inspection process, the quantization debugger can be used to identify problematic layers, and selective quantization can leave those problematic layers in float so that the model accuracy can be recovered at the cost of reduced benefit from quantization.

The quantization debugger makes it possible to perform quantization quality metric analysis on an existing model. It can automate the process of running the model with a debug dataset and collecting quantization quality metrics for each tensor.
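The MLTK wraps this functionality for you (see the next section), but for reference, using TensorFlow's debugger directly looks roughly like the following sketch. Here, keras_model and representative_dataset are placeholders for your own trained model and calibration-data generator:

import tensorflow as tf

# Assumptions: `keras_model` is your trained model and `representative_dataset`
# is a generator function yielding calibration samples
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

debugger = tf.lite.experimental.QuantizationDebugger(
    converter=converter,
    debug_dataset=representative_dataset,
)
debugger.run()

# Dump the per-layer quantization statistics to a CSV file
with open('debugger_results.csv', 'w') as f:
    debugger.layer_statistics_dump(f)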

Enabling the Quantization Report

The MLTK will automatically generate a quantization report during model quantization when the following tflite_converter setting is enabled:

my_model.tflite_converter['generate_quantization_report'] = True

The quantization report is a standard .csv file and is added to the Model Archive file (it is also generated in the same directory as the .tflite, typically ~/.mltk/models/<model name>/quantization_report.csv).

Analyzing the Report

For each row in the report, the op name and index come first, followed by the quantization parameters and error metrics.

Additionally, per the Data Analysis section, two more metrics are calculated:

  • Range - scale * 255.0. This provides the range of input values that the given layer is expected to accept. Ideally, this value should be less than 255; larger values could cause quantization problems at runtime (recall that the quantized values must fit within an int8 data type).

  • RMSE / scale - sqrt(mean_squared_error) / scale. This value is close to 1 / sqrt(12) (~0.289) when the quantized distribution is similar to the original float distribution, indicating a well-quantized model. The larger the value, the more likely the layer is not being quantized well.

So, when viewing a quantization report, if a layer has large values in the range and/or rmse/scale columns, then that layer may not be quantizing well. Refer to the next sections for how to fix these issues.
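As a quick way to triage a report, you can load the .csv into pandas and sort by these metrics. The column names used below ('range' and 'rmse/scale') are assumptions based on the metric descriptions above, and the 0.5 threshold is only an illustrative cutoff; adjust both to match your generated report:

import pandas as pd

# Substitute the path to your generated quantization_report.csv
report = pd.read_csv('quantization_report.csv')

# Flag layers that violate the rules of thumb described above:
#   range > 255         -> the layer's values may not fit well into int8
#   rmse/scale >> 0.289 -> the quantized distribution deviates from the float distribution
#   (0.5 is an arbitrary example threshold)
suspect = report[(report['range'] > 255) | (report['rmse/scale'] > 0.5)]
print(suspect.sort_values('rmse/scale', ascending=False))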

Input Data Normalization

For the best quantization, the input values should be evenly distributed within the range -1.0 to 1.0. While this is not a hard rule, in practice it was found that rmse/scale is lower when the input values fall within this range.

The following are methods to normalize the input data:

Scale by a constant

Scaling by a constant is a computationally efficient method for data normalization:

normalized_input_sample = input_sample / <scaler>

In the training Python scripts, the following may be used:

import numpy as np

# -----------------------------------------
# Define the input scaling value
# This value should be near the upper limit of the input data
input_scaling_value = 255.0

# -----------------------------------------
# Add the scaling value to the model parameters.
# This allows the embedded device to access the scaling value at runtime
# Note, we save the scaler *reciprocal*, as multiplication is a more efficient op than division on embedded
my_model.model_parameters['samplewise_norm.rescale'] = 1.0 / input_scaling_value

# -----------------------------------------
# Ensure the input/output data types of the quantized model are float32
my_model.tflite_converter['inference_input_type'] = np.float32
my_model.tflite_converter['inference_output_type'] = np.float32
# Generate a quantization report to help with debugging quantization errors
my_model.tflite_converter['generate_quantization_report'] = True


# -----------------------------------------
# Later, in the training data pipeline, convert the data to float32 and scale the input data
x = x.astype(np.float32)
x /= input_scaling_value

At runtime on the embedded device, the input data must also be converted to float32 and scaled. The following may be used:

#include "tflite_micro_model/tflite_micro_model.hpp"

using namespace mltk;

// Assume the source input data is in uint16 format
extern uint16_t *source_input_data;
// This is defined by the build scripts
// which converts the specified .tflite to a C array
extern "C" const uint8_t sl_tflite_model_array[];


void main()
{
    TfliteMicroModel model;

    // Load the quantized .tflite model
    model.load(sl_tflite_model_array);

    // Retrieve the input scaler from the .tflite
    float input_scaler;
    model.parameters.get("samplewise_norm.rescale", input_scaler);

    // Obtain a pointer to the input tensor which is in float32 format
    TfliteTensorView *input = model.inputs();

    // Scale the input data
    for(int i = 0; i < input->shape().flat_size(); ++i)
    {
        input->data.f[i] = (float)source_input_data[i] * input_scaler;
    }

    // Run inference on the scaled input data
    model.invoke();

    // Do something with the results
    TfliteTensorView *results = model.output();
}

Center about mean and scale by STD

A more robust normalization method is to center the data about the mean and scale by the standard deviation:

normalized_input_sample = (input_sample - mean(input_sample)) / std(input_sample)

In the training Python scripts, the following may be used:

import numpy as np

# -----------------------------------------
# This tells the embedded device to normalize by the mean and STD
my_model.model_parameters['samplewise_norm.mean_and_std'] = True

# -----------------------------------------
# Ensure the input/output data types of the quantized model are float32
my_model.tflite_converter['inference_input_type'] = np.float32
my_model.tflite_converter['inference_output_type'] = np.float32
# Generate a quantization report to help with debugging quantization errors
my_model.tflite_converter['generate_quantization_report'] = True


# -----------------------------------------
# Later, in the training data pipeline, convert the data to float32 and normalize
x = x.astype(np.float32)
x -= np.mean(x, dtype=np.float32, keepdims=True)
x /= (np.std(x, dtype=np.float32, keepdims=True) + 1e-6)

At runtime on the embedded device, the input data must also be converted to float32 and normalized. The following may be used:

#include "tflite_micro_model/tflite_micro_model.hpp"
#include "tflite_micro_model/tflite_micro_utils.hpp"

using namespace mltk;

// Assume the source input data is in uint16 format
extern uint16_t *source_input_data;
// This is defined by the build scripts
// which converts the specified .tflite to a C array
extern "C" const uint8_t sl_tflite_model_array[];


void main()
{
    TfliteMicroModel model;

    // Load the quantized .tflite model
    model.load(sl_tflite_model_array);

    // Retrieve the normalization setting from the .tflite
    bool mean_and_std_enabled;
    model.parameters.get("samplewise_norm.mean_and_std", mean_and_std_enabled);

    // Obtain a pointer to the input tensor which is in float32 format
    TfliteTensorView *input = model.inputs();

    // Use the helper function to normalize the input buffer
    // (only when mean/STD normalization was enabled in the model parameters)
    if(mean_and_std_enabled)
    {
        samplewise_mean_std_tensor(source_input_data, input->data.f, input->shape().flat_size());
    }

    // Run inference on the scaled input data
    model.invoke();

    // Do something with the results
    TfliteTensorView *results = model.output();
}

Normalization Layers

For the best quantization, we want the input data to be evenly distributed around 0.0. The same is true for the inputs to each of the layers of the model.

To help with this, TensorFlow/Keras offers several layers that normalize the outputs of the preceding layer:

Batch Normalization

The Batch Normalization layer applies a transformation that maintains the mean output close to 0 and the output standard deviation close to 1.

This layer is commonly used in many model architectures. Additionally, if used properly, it can be fused with other layers so that there is minimal runtime overhead.

The following are some examples of how to use the Batch Normalization layer so that it is fused with other layers. In this way:

  1. The input to the following layer is centered about 0

  2. The quantized layer is fused and introduces minimal runtime overhead on the embedded device

NOTE: The key for BatchNorm fusion is to invoke the activation after the BatchNorm layer.

Conv2D + Batch Normalization

The following provides an example of how to use batch normalization with a Conv2D layer:


# Define the Conv2D layer *without* an activation
x = tf.keras.layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    strides=1,
)(x)

# Apply batch normalization
x = tf.keras.layers.BatchNormalization()(x)

# Apply the ReLU activation *after* the batch norm
x = tf.keras.layers.ReLU()(x)

Fully Connected + Batch Normalization

The following provides an example of how to use batch normalization with a Dense (aka Fully Connected) layer:


# Define the Dense layer *without* an activation
x = tf.keras.layers.Dense(10)(x)

# Apply batch normalization
x = tf.keras.layers.BatchNormalization()(x)

# Apply the ReLU activation *after* the batch norm
x = tf.keras.layers.ReLU()(x)

LayerNormalization

LayerNormalization normalizes the activations of the previous layer for each example in a batch independently, rather than across the batch like Batch Normalization, i.e. it applies a transformation that maintains the mean activation within each example close to 0 and the activation standard deviation close to 1.

While this layer introduces more overhead than Batch Normalization, it is applied on a per-sample basis (as opposed to per-batch, like BatchNorm), which is useful for models that maintain a memory (e.g. recurrent networks).

This layer should be applied before the activation, e.g.:

x = tf.keras.layers.Dense(model.n_classes)(x)
x = tf.keras.layers.LayerNormalization()(x)
x = tf.keras.layers.Softmax()(x)