Keyword Spotting - On/Off

This tutorial describes how to use the MLTK to develop a machine learning model to detect the keywords:

  • On

  • Off

Overview

Objectives

After completing this tutorial, you will have:

  1. A better understanding of how keyword-spotting (KWS) machine learning models work

  2. All of the tools needed to develop your own KWS machine learning model

  3. A working demo to turn an LED on/off based on the voice commands of your choice

Content

This tutorial is divided into the following sections:

  1. Overview of machine learning and keyword-spotting

  2. Dataset selection and preprocessing parameters

  3. Creating the model specification

  4. Visualizing the audio dataset

  5. Note about model parameters

  6. Summarizing the model

  7. Visualizing the model graph

  8. Profiling the model

  9. Training the model

  10. Evaluating the model

  11. Testing the model

  12. Deploying the model to an embedded device

Running this tutorial from a notebook

For documentation purposes, this tutorial was designed to run within a Jupyter Notebook. The notebook can either run locally on your PC or on a remote server like Google Colab.

  • Refer to the Notebook Examples Guide for more details

  • Click here: Open In Colab to run this tutorial interactively in your browser

NOTE: Some of the following sections require this tutorial to be running locally with a supported embedded platform connected.

Running this tutorial from the command-line

While this tutorial is presented as a Jupyter Notebook, the recommended approach is to use your favorite text editor and a standard command terminal; no Jupyter Notebook is required.

See the Standard Python Package Installation guide for more details on how to enable the mltk command in your local terminal.

In this mode, whenever you encounter a !mltk command in this tutorial, run the command in your local terminal without the leading !
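
For example, the summarize command used later in this tutorial maps between the two environments as follows:

# In a Jupyter Notebook cell:
!mltk summarize keyword_spotting_on_off --build

# In a standard terminal, drop the leading "!":
mltk summarize keyword_spotting_on_off --build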

Install MLTK Python Package

Before using the MLTK, it must first be installed.
See the Installation Guide for more details.

!pip install --upgrade silabs-mltk

All MLTK modeling operations are accessible via the mltk command.
Run the command mltk --help to ensure it is working.
NOTE: The exclamation point ! tells the Notebook to run a shell command; it is not required in a standard terminal

!mltk --help
Usage: mltk [OPTIONS] COMMAND [ARGS]...

  Silicon Labs Machine Learning Toolkit

  This is a Python package with command-line utilities and scripts to aid the
  development of machine learning models for Silicon Lab's embedded platforms.

Options:
  --version  Display the version of this mltk package and exit
  --help     Show this message and exit.

Commands:
  build           MLTK build commands
  classify_audio  Classify keywords/events detected in a microphone's...
  commander       Silab's Commander Utility
  custom          Custom Model Operations
  evaluate        Evaluate a trained ML model
  profile         Profile a model
  quantize        Quantize a model into a .tflite file
  summarize       Generate a summary of a model
  train           Train an ML model
  update_params   Update the parameters of a previously trained model
  utest           Run the all unit tests
  view            View an interactive graph of the given model in a...
  view_audio      View the spectrograms generated by the...

Machine Learning and Keyword-Spotting Overview

Before continuing with this tutorial, it is recommended to review the following presentations:

Dataset Selection and Preprocessing Parameters

Before starting the actual tutorial, let’s first discuss datasets.

TL;DR

  1. A representative dataset must be acquired for the trained model to perform well in the real-world

  2. The dataset should (typically) be transformed so that the model can efficiently learn the features of the dataset

  3. Whatever transformations are used must be identical at training-time on the PC and run-time on the embedded device

  4. The size of the dataset can be effectively increased by randomly augmenting it during training (changing the pitch, speed, adding background noise, etc.)

Acquire a Representative Dataset

The most critical aspect of any machine learning model is the dataset. A representative dataset is necessary to train a robust model. A model that is trained on a dataset that is too small and/or not representative of what would be seen in the real-world will likely not perform well.

In this tutorial, we want to create a keyword spotting classification machine learning model. This implies the following about the dataset:

  • The dataset must contain audio samples of the keywords we want to detect

  • The dataset must be labelled, i.e. each sample in the dataset must have an associated “class”, e.g. “on”, “off”

  • The dataset must be relatively large and representative to account for the variance in spoken language (accents, background noise, etc.)

For this tutorial, we’ll use the Google Speech Commands v2 dataset (NOTE: This dataset is automatically downloaded in a later step in this tutorial).
This dataset is effectively a directory of sub-directories, and each sub-directory contains thousands of 1-second audio clips. The name of each sub-directory corresponds to the word being spoken in the audio clip, e.g:

/dataset
/dataset/on
/dataset/on/sample1.wav
/dataset/on/sample2.wav
...
/dataset/off
/dataset/off/sample1.wav
/dataset/off/sample2.wav
...

So this dataset meets our requirements:

  • It contains audio samples of the keywords we want to detect (“on”, “off”)

  • The samples are labelled (all “on” samples are in the “on” sub-directory etc.)

  • The dataset is representative (the audio clips are taken from many different people saying the same words)
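
Given this directory layout, a short Python snippet can sanity-check how many samples exist per keyword once the dataset has been extracted locally. The /dataset path below is just the placeholder from the listing above; replace it with wherever the dataset lives on your PC.

# Count the .wav samples available for each keyword class
# NOTE: '/dataset' is a placeholder path; adjust it to your local copy
import os

dataset_dir = '/dataset'
for class_name in ('on', 'off'):
    class_dir = os.path.join(dataset_dir, class_name)
    wav_files = [f for f in os.listdir(class_dir) if f.endswith('.wav')]
    print(f'{class_name}: {len(wav_files)} samples')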

NOTE: For many machine learning applications, acquiring a dataset will not be so easy. The dataset will often suffer from one or more of the following:

  • The dataset does not exist - Need to manually collect samples

  • The raw samples exist but are not “labeled” - Need to manually group the samples

  • The dataset is “dirty” - Bad/corrupt samples, mislabeled samples

  • The dataset is not representative - Duplicate/similar samples, not diverse enough to cover the possible range seen in the real-world

NOTE: A clean, representative dataset is one of the best ways to train a robust model. It is highly recommended to invest the time/energy to create a good dataset!

Feature Engineering

Along with a representative dataset, we (usually) need to transform the individual samples of the dataset so that the machine learning model can efficiently learn the “features” of the dataset, and thus make accurate predictions. This process is frequently called “feature engineering”. One way of describing feature engineering is: Use human insight to amplify the signals of the dataset so that a machine can more efficiently learn the patterns in it.

The transform(s) used for feature engineering are highly application-specific.

For this tutorial, we use the common technique of converting the raw audio into a spectrogram (i.e. gray-scale image). The machine learning model then learns the patterns in the spectrogram images that correspond to the keywords in the audio samples.
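
To make this concrete, the following standalone sketch converts one second of audio into a log-magnitude spectrogram using scipy. This is only an illustration of the general technique, not the MLTK's AudioFeatureGenerator (which is configured later in the model specification); the random signal stands in for a real .wav sample.

# Illustration only: convert a 1-second, 8kHz audio clip into a log spectrogram
import numpy as np
from scipy.signal import spectrogram

sample_rate = 8000
audio = np.random.randn(sample_rate).astype(np.float32)  # stand-in for a real .wav clip

freqs, times, spec = spectrogram(
    audio,
    fs=sample_rate,
    nperseg=int(0.030 * sample_rate),   # 30ms window
    noverlap=int(0.010 * sample_rate),  # 20ms step -> 10ms overlap
)
log_spec = np.log(spec + 1e-6)  # compress the dynamic range
print(log_spec.shape)           # (frequency bins, time steps)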

Feature Engineering on the Edge

An important aspect to keep in mind about the transform(s) chosen for feature engineering is that whatever is done to the dataset samples during training must also be done on the embedded device at run-time. i.e. The exact algorithms used to generate the spectrogram on the PC during training must be used on the embedded device at run-time. Any divergence will cause the embedded model to “see” different samples and likely not perform well (if at all).

For this purpose, the MLTK offers an Audio Feature Generator component. This component generates spectrograms from raw audio, and the same spectrogram generation algorithms are accessible from both the MLTK Python package (used during training) and the Gecko SDK component that runs on the embedded device.

In this way, the exact spectrogram generation algorithms used during training may also be used at run-time on the embedded device.

Refer to the Audio Feature Generator documentation and Audio Visualization section for more details on how the various parameters used to generate the spectrogram may be determined.

Data Augmentation

A useful technique for expanding the size of a dataset (and hopefully making it more representative) is to apply random augmentations to the training samples. For instance, audio dataset augmentations might include:

  • Increase/decrease speed

  • Increase/decrease pitch

  • Add random background noises

In this way, the model never “sees” exactly the same sample twice during training, which should make it more robust since it effectively learns from a larger collection of samples.

For this purpose, the MLTK offers an Audio Data Generator Python component. The Audio Data Generator uses the audio dataset as an input and randomly augments the audio samples during training (see the Model Specification section below).

NOTE: Augmentations are only applied during training. They are not applied at run-time.

Refer to the Audio Visualization section for more details on how the various parameters used to augment the audio may be determined.
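
As a simple illustration of one such augmentation, background noise can be mixed into a sample at a randomly chosen level, similar in spirit to the bg_noise_range setting configured later in this tutorial. This is not the ParallelAudioDataGenerator's implementation; add_background_noise is just a hypothetical helper.

# Illustration only: mix background noise into an audio clip at a random level
import numpy as np

def add_background_noise(audio: np.ndarray, noise: np.ndarray,
                         level_range=(0.1, 0.4)) -> np.ndarray:
    level = np.random.uniform(*level_range)   # pick a random noise level
    noise = noise[:len(audio)]                # trim the noise to the clip length
    return audio + level * noise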

Model Specification

The model specification is a standard Python script containing everything needed to build, train, and evaluate a machine learning model in the MLTK.

Refer to the Model Specification Guide for more details about this file.

The completed model specification used for this tutorial may be found on Github: keyword_spotting_on_off.py.

The following sub-sections describe how to create this model specification from scratch.

Create the specification script

From your favorite text editor, create a model specification Python script file, e.g: my_keyword_spotting_on_off.py

The name of this file is the name given to the model. So all subsequent mltk commands will use the model name my_keyword_spotting_on_off, e.g:

mltk train my_keyword_spotting_on_off

You may use any name as long as it contains only alphanumeric and underscore characters.

When executing a command, the MLTK searches for the model specification script by model name.
The MLTK commands search the current working directory then any configured paths.
Refer to the Model Search Path Guide for more details.

NOTE: The commands below use the pre-defined model name: keyword_spotting_on_off, however, you should replace that with your model’s name, e.g.: my_keyword_spotting_on_off.

NOTE: A new, more robust pre-trained model is available: keyword_spotting_on_off_v2

Add necessary imports

Next, open the newly created Python script: my_keyword_spotting_on_off.py
in your favorite text editor and add the following to the top of the model specification script:

# Import the Tensorflow packages
# required to build the model layout
import numpy as np
import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Dense, 
    Activation, 
    Flatten, 
    BatchNormalization,
    Conv2D,
    MaxPooling2D,
    Dropout
)

# Import the MLTK model object 
# and necessary mixins
# Later in this script we configure the various properties
from mltk.core import (
    MltkModel,
    TrainMixin,
    AudioDatasetMixin,
    EvaluateClassifierMixin
)

# Import the Google speech_commands dataset package
# This manages downloading and extracting the dataset
from mltk.datasets.audio.speech_commands import speech_commands_v2

# Import the ParallelAudioDataGenerator
# This has two main jobs:
# 1. Process the Google speech_commands dataset and apply random augmentations during training
# 2. Generate a spectrogram using the AudioFeatureGenerator from each augmented audio sample 
#    and give the spectrogram to Tensorflow for model training
from mltk.core.preprocess.audio.parallel_generator import ParallelAudioDataGenerator
# Import the AudioFeatureGeneratorSettings which we'll configure 
# and give to the ParallelAudioDataGenerator
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings

These statements import the various Tensorflow and MLTK packages we’ll use throughout the script.
Refer to the comments above each import for more details.

Define Model Object

Next, add the following to the model specification script:

# Define a custom model object with the following 'mixins':
# - TrainMixin        - Provides classifier model training operations and settings
# - AudioDatasetMixin - Provides audio data generation operations and settings
# - EvaluateClassifierMixin     - Provides classifier evaluation operations and settings
# @mltk_model # NOTE: This tag is required for this model to be discoverable
class MyModel(
    MltkModel, 
    TrainMixin, 
    AudioDatasetMixin, 
    EvaluateClassifierMixin
):
    pass

# Instantiate our custom model object
# The rest of this script simply configures the properties
# of our custom model object
my_model = MyModel()

This defines and instantiates a custom MltkModel object with several model “mixins”.

The custom model object must inherit the MltkModel object.
Additionally, it inherits the TrainMixin, AudioDatasetMixin, and EvaluateClassifierMixin mixins, which provide the training, audio dataset, and classifier evaluation operations and settings, respectively.

The rest of the model specification script configures the various properties of our custom model object.

Configure the general model settings

# For better tracking, the version should be incremented any time a non-trivial change is made
# NOTE: The version is optional and not directly used by the MLTK
my_model.version = 1 
# Provide a brief description of what this model does
# This description goes in the "description" field of the .tflite model file
my_model.description = 'Keyword spotting classifier to detect: "on" and "off"'

Configure the basic training settings

Refer to the TrainMixin for more details about each property.

# This specifies the number of times we run the training
# samples through the model to update the model weights.
# Typically, a larger value leads to better accuracy at the expense of training time.
# Set to -1 to use the early_stopping callback and let the scripts
# determine how many epochs to train for (see below).
# Otherwise set this to a specific value (typically 40-200)
my_model.epochs = 80
# Specify how many samples to pass through the model
# before updating the training gradients.
# Typical values are 10-64
# NOTE: Larger values require more memory and may not fit on your GPU
my_model.batch_size = 10 
# This specifies the algorithm used to update the model gradients
# during training. Adam is very common
# See https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
my_model.optimizer = 'adam' 
# List of metrics to be evaluated by the model during training and testing
my_model.metrics = ['accuracy']
# The "loss" function used to update the weights
# This is a classification problem with more than two labels so we use categorical_crossentropy
# See https://www.tensorflow.org/api_docs/python/tf/keras/losses
my_model.loss = 'categorical_crossentropy'

Configure the training callbacks

Refer to the TrainMixin for more details about each property.

# Generate checkpoints every time the validation accuracy improves
# See https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
my_model.checkpoint['monitor'] =  'val_accuracy'

# If the training accuracy doesn't improve after 'patience' epochs 
# then decrease the learning rate by 'factor'
# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau
# NOTE: Alternatively, we could define our own learn rate schedule
#       using my_model.lr_schedule
# my_model.reduce_lr_on_plateau = dict(
#  monitor='accuracy',
#  factor = 0.25,
#  patience = 10
#)

# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler
# Update the learning rate each epoch based on the given callback
def lr_schedule(epoch):
    initial_learning_rate = 0.001
    decay_per_epoch = 0.95
    lrate = initial_learning_rate * (decay_per_epoch ** epoch)
    return lrate

my_model.lr_schedule = dict(
    schedule = lr_schedule,
    verbose = 1
)

Configure the TF-Lite Converter settings

The Tensorflow-Lite Converter is used to “quantize” the model.
The quantized model is what is eventually programmed to the embedded device.

Refer to the Model Quantization Guide for more details.

# These are the settings used to quantize the model
# We want all the internal ops as well as
# model input/output to be int8
my_model.tflite_converter['optimizations'] = [tf.lite.Optimize.DEFAULT]
my_model.tflite_converter['supported_ops'] = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# NOTE: A float32 model input/output is also possible
my_model.tflite_converter['inference_input_type'] = np.int8 
my_model.tflite_converter['inference_output_type'] = np.int8
# Automatically generate a representative dataset from the validation data
my_model.tflite_converter['representative_dataset'] = 'generate'
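
The quantized .tflite is normally generated automatically at the end of training. If these converter settings are changed later, the quantize command (listed in the mltk --help output above) can regenerate the .tflite from the trained model; see mltk quantize --help and the Model Quantization Guide for the exact options, e.g.:

# Regenerate the quantized .tflite from the trained model
!mltk quantize keyword_spotting_on_off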

Configure the dataset settings

Next, we specify the dataset. In this tutorial we use the Google Speech Commands v2 dataset which comes as an MLTK package.

NOTE: While the MLTK comes with pre-defined datasets, any external dataset may also be specified.
Refer to the AudioDatasetMixin.dataset property for more details.

NOTE: While a dataset path can be hard-coded, it is strongly recommended that the script dynamically download the dataset from the internet. This makes model training and evaluation reproducible. It also enables remote training on cloud services like Google Colab, which need to download the dataset any time a virtual instance is created.

# Specify the dataset 
# NOTE: This can also be an absolute path to a directory
#       or a Python function
# See: https://siliconlabs.github.io/mltk/docs/python_api/mltk_model/audio_dataset_mixin.html#mltk.core.AudioDatasetMixin.dataset
my_model.dataset = speech_commands_v2
# We're using a 'categorical_crossentropy' loss
# so must also use a `categorical` class mode for the data generation
my_model.class_mode = 'categorical'

Configure the keywords to detect

This is likely the most interesting part of the model specification script.
Here, we define which keywords we want our model to detect.
For this tutorial, we want to detect on and off, however, you may modify this to any keyword that is found in the Google Speech Commands dataset:

Yes, No, Up, Down, Left, Right, On, Off, Stop, Go, Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine

NOTE: See the Transfer Learning Tutorial, which describes how to use the MobileNetV2 model. While MobileNetV2 is larger, it tends to have better performance than the model used in this tutorial, which was specifically designed for the “yes/no” keywords.

# Specify the keywords we want to detect
# In this model, we detect "on" and "off",
# plus two pseudo classes: _unknown_ and _silence_
#
# Any number of classes may be added here as long as they're
# found in the dataset specified above.
# NOTE: You'll likely need a larger model for more classes
my_model.classes = ['on', 'off', '_unknown_', '_silence_']

Configure the AudioFeatureGenerator settings

Next, we specify the settings used to generate the spectrograms.
Spectrograms are generated by the AudioFeatureGenerator MLTK Python component.
See the Audio Feature Generator guide for more details.

Refer to the Model Parameters section below for more details on how these settings eventually make it onto the embedded device.

Also, refer to the section Audio Visualization for more details on how to determine which settings to use.

# These are the settings used by the AudioFeatureGenerator 
# to generate spectrograms from the audio samples
# These settings must be used during model training
# AND by the embedded device at runtime
#
# See the command: "mltk view_audio"
# to get a better idea of how to specify these settings
frontend_settings = AudioFeatureGeneratorSettings()

frontend_settings.sample_rate_hz = 8000  # This can also be 16k for slightly better performance at the cost of more RAM
frontend_settings.sample_length_ms = 1000
frontend_settings.window_size_ms = 30
frontend_settings.window_step_ms = 20
frontend_settings.filterbank_n_channels = 32
frontend_settings.filterbank_upper_band_limit = 4000.0-1 # Spoken language usually only goes up to 4k
frontend_settings.filterbank_lower_band_limit = 100.0
frontend_settings.noise_reduction_enable = True
frontend_settings.noise_reduction_smoothing_bits = 5
frontend_settings.noise_reduction_even_smoothing = 0.004
frontend_settings.noise_reduction_odd_smoothing = 0.004
frontend_settings.noise_reduction_min_signal_remaining = 0.05
frontend_settings.pcan_enable = False
frontend_settings.pcan_strength = 0.95
frontend_settings.pcan_offset = 80.0
frontend_settings.pcan_gain_bits = 21
frontend_settings.log_scale_enable = True
frontend_settings.log_scale_shift = 6
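
As a rough sanity check, the spectrogram (and thus model input) shape follows from these settings. The arithmetic below assumes the frontend produces one spectrogram column per full window step, which is an assumption about its windowing behavior; the result matches the 49x32x1 model input shape reported later by the summarize and profile commands.

# Back-of-the-envelope spectrogram shape from the settings above
window_size_ms = 30
window_step_ms = 20
sample_length_ms = 1000
n_channels = 32

n_frames = (sample_length_ms - window_size_ms) // window_step_ms + 1
print(f'{n_frames} x {n_channels}')  # -> 49 x 32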

Configure the data augmentation settings

Next, we configure how we want to augment the dataset during training.
See the ParallelAudioDataGenerator API doc for more details.

Refer to the section Audio Visualization for more details on how to determine which settings to use.

# Configure the data generator settings
# This specifies how to augment the training samples
# See the command: "mltk visualize_audio"
# to get a better idea of how these augmentations affect
# the samples
my_model.datagen = ParallelAudioDataGenerator(
    dtype=my_model.tflite_converter['inference_input_type'],
    frontend_settings=frontend_settings,
    cores=0.45, # Adjust this as necessary for your PC setup
    debug=False, # Set this to true to enable debugging of the generator
    max_batches_pending=16,  # Adjust this as necessary for your PC setup (smaller -> less RAM)
    validation_split= 0.10,
    validation_augmentation_enabled=True,
    samplewise_center=False,
    samplewise_std_normalization=False,
    rescale=None,
    unknown_class_percentage=2.0, # Increasing this may help model robustness at the expense of training time
    silence_class_percentage=0.3,
    offset_range=(0.0,1.0),
    trim_threshold_db=30,
    noise_colors=None,
    loudness_range=(0.2, 1.0),
    speed_range=(0.9,1.1),
    pitch_range=(0.9,1.1),
    #vtlp_range=(0.9,1.1),
    bg_noise_range=(0.1,0.4),
    bg_noise_dir='_background_noise_' # This is a directory provided by the google speech commands dataset, can also provide an absolute path
)

Define the model layout

This defines the actual structure of the model that runs on the embedded device using the Keras API. The details of how to create the model structure are out-of-scope for this tutorial.

Please note that many times you do not need to define your own model. Instead, you can use a pre-defined model such as MobileNetV1 or ResNetv1-10. The MLTK provides some common models as Python packages in the Shared Models section with example usage such as image_classification.py.

Note: The model used in this tutorial was developed by Silicon Labs and is covered by a standard Silicon Labs MSLA

# This defines the actual model layout using the Keras API.
# This particular model is a relatively standard
# sequential Convolution Neural Network (CNN).
#
# It is important to note the usage of the 
# "model" argument.
# Rather than hardcoding values, the model object's
# properties are used to build the layers, e.g.:
# Dense(model.n_classes)
#
# This way, the various model properties above can be modified
# without having to re-write this section.
#
#
def my_model_builder(model: MyModel):
    weight_decay = 1e-4
    regularizer = regularizers.l2(weight_decay)
    input_shape = model.input_shape
    filters = 8
 
    keras_model = Sequential(name=model.name, layers = [
        Conv2D(filters, (3,3), 
            padding='same', 
            kernel_regularizer=regularizer, 
            input_shape=input_shape, 
            strides=(2,2)
        ),
        BatchNormalization(),
        Activation('relu'),

        Conv2D(2*filters, (3,3), 
            padding='same', 
            kernel_regularizer=regularizer, 
            strides=(2,2)
        ),
        BatchNormalization(),
        Activation('relu'),
        Dropout(rate=0.1),

        Conv2D(4*filters, (3,3), 
            padding='same', 
            kernel_regularizer=regularizer, 
            strides=(2,2)
        ),
        BatchNormalization(),
        Activation('relu'),
        Dropout(rate=0.3),
        
        MaxPooling2D(pool_size=[7,1]),
        
        Flatten(),
        Dense(model.n_classes, activation='softmax')
    ])
 
    keras_model.compile(
        loss=model.loss, 
        optimizer=model.optimizer, 
        metrics=model.metrics
    )
    return keras_model

my_model.build_model_function = my_model_builder

At this point, the model specification script should have everything needed to train, evaluate, and generate a model file that can run on an embedded device.
The following sections describe how to use the MLTK to perform these tasks.

Audio Visualization

NOTE: This section is experimental and is optional for the rest of this tutorial. You may safely skip to the next section.

Before training the model, it is important that the generated spectrogram has enough detail from which the ML model can learn (i.e. “feature engineering”). The AudioFeatureGenerator has numerous settings to control how the spectrogram is generated.

For this purpose, the MLTK features an experimental command: view_audio which allows for visualizing a generated spectrogram in real-time as the various parameters are adjusted via GUI. It also allows for adjusting the various augmentation parameters and listening to the audio playback.

See the Audio Feature Generator guide for more details.

NOTE: Internally, this command uses wxPython and must run locally. It will not work on a remote server (e.g. Colab).

# Invoke the view_audio command from a LOCAL terminal
# NOTE: Change this command to use
#      "my_keyword_spotting_on_off" or whatever you called your model
!mltk view_audio keyword_spotting_on_off

After running this command and playing with the GUI, you should have a better idea of what settings to use for the AudioFeatureGenerator and data augmentation parameters.

NOTE: Care should be taken when selecting the spectrogram size, e.g. the dimensions shown in the upper-left of the GUI:

Visualizer Dimensions

A larger spectrogram means a larger model input which ultimately means more processing that is required by the embedded device at run-time.

See the Model Optimization Tutorial for more details.

Model Parameters

As stated in the Feature Engineering on the Edge section, it is extremely important that whatever transforms are done to the dataset during training are also done at run-time on the embedded device.

To help with this, the MLTK allows for embedding parameters into the generated .tflite model file.

Refer to the Model Parameters Guide for more details about how this works.

This is useful for this tutorial as the MLTK will automatically embed all of the AudioFeatureGeneratorSettings into the generated .tflite model file. Later, the Gecko SDK will read the settings from the .tflite model file when generating the project. In this way, the AudioFeatureGenerator that runs on the embedded device will use the exact same settings.

NOTE: The mltk summarize --tflite command prints all the parameters that are embedded into the .tflite model file, including the AudioFeatureGenerator settings.

Model Summary

With the model specification complete, it is sometimes useful to generate a summary of the model before we spend the time to train it.
This can be done using the summarize command.

If you’re using a local terminal, navigate to the same directory as your model specification script, e.g. my_keyword_spotting_on_off.py, and modify the commands to use my_keyword_spotting_on_off or whatever you called your model.

NOTE: Since we have not trained our model yet, we must add the --build option to the command.
Once the model is trained, this option is not required.

# Summarize the Keras Model
# This is the non-quantized model used for training
# NOTE: Running this the first time may take awhile since the audio dataset needs to be downloaded
!mltk summarize keyword_spotting_on_off --build 
Epoch 1: LearningRateScheduler setting learning rate to 0.001.

Epoch 2: LearningRateScheduler setting learning rate to 0.00095.

Epoch 3: LearningRateScheduler setting learning rate to 0.0009025.
Model: "keyword_spotting_on_off"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 25, 16, 8)         80        
                                                                 
 batch_normalization (BatchN  (None, 25, 16, 8)        32        
 ormalization)                                                   
                                                                 
 activation (Activation)     (None, 25, 16, 8)         0         
                                                                 
 conv2d_1 (Conv2D)           (None, 13, 8, 16)         1168      
                                                                 
 batch_normalization_1 (Batc  (None, 13, 8, 16)        64        
 hNormalization)                                                 
                                                                 
 activation_1 (Activation)   (None, 13, 8, 16)         0         
                                                                 
 dropout (Dropout)           (None, 13, 8, 16)         0         
                                                                 
 conv2d_2 (Conv2D)           (None, 7, 4, 32)          4640      
                                                                 
 batch_normalization_2 (Batc  (None, 7, 4, 32)         128       
 hNormalization)                                                 
                                                                 
 activation_2 (Activation)   (None, 7, 4, 32)          0         
                                                                 
 dropout_1 (Dropout)         (None, 7, 4, 32)          0         
                                                                 
 max_pooling2d (MaxPooling2D  (None, 1, 4, 32)         0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 4)                 516       
                                                                 
=================================================================
Total params: 6,628
Trainable params: 6,516
Non-trainable params: 112
_________________________________________________________________

Total MACs: 278.144 k
Total OPs: 574.468 k
Name: keyword_spotting_on_off
Version: 1
Description: Keyword spotting classifier to detect: "on" and "off"
Classes: on, off, _unknown_, _silence_
hash: 
date: 
runtime_memory_size: 0
average_window_duration_ms: 1000
detection_threshold: 160
suppression_ms: 750
minimum_count: 3
volume_gain: 2
latency_ms: 100
verbose_model_output_logs: False
# Summarize the TF-Lite Model
# This is the quantized model that eventually goes on the embedded device
!mltk summarize keyword_spotting_on_off --tflite --build
Epoch 1: LearningRateScheduler setting learning rate to 0.001.

Epoch 2: LearningRateScheduler setting learning rate to 0.00095.

Epoch 3: LearningRateScheduler setting learning rate to 0.0009025.
C:\Users\reed\workspace\silabs\mltk\.venv\lib\site-packages\tensorflow\lite\python\convert.py:746: UserWarning: Statistics for quantized inputs were expected, but not specified; continuing anyway.
  warnings.warn("Statistics for quantized inputs were expected, but not "
+-------+-----------------+----------------+----------------+-----------------------------------------------------+
| Index | OpCode          | Input(s)       | Output(s)      | Config                                              |
+-------+-----------------+----------------+----------------+-----------------------------------------------------+
| 0     | conv_2d         | 49x32x1 (int8) | 25x16x8 (int8) | Padding:same stride:2x2 activation:relu             |
|       |                 | 3x3x1 (int8)   |                |                                                     |
|       |                 | 8 (int32)      |                |                                                     |
| 1     | conv_2d         | 25x16x8 (int8) | 13x8x16 (int8) | Padding:same stride:2x2 activation:relu             |
|       |                 | 3x3x8 (int8)   |                |                                                     |
|       |                 | 16 (int32)     |                |                                                     |
| 2     | conv_2d         | 13x8x16 (int8) | 7x4x32 (int8)  | Padding:same stride:2x2 activation:relu             |
|       |                 | 3x3x16 (int8)  |                |                                                     |
|       |                 | 32 (int32)     |                |                                                     |
| 3     | max_pool_2d     | 7x4x32 (int8)  | 1x4x32 (int8)  | Padding:valid stride:1x7 filter:1x7 activation:none |
| 4     | reshape         | 1x4x32 (int8)  | 128 (int8)     | BuiltinOptionsType=0                                |
|       |                 | 2 (int32)      |                |                                                     |
| 5     | fully_connected | 128 (int8)     | 4 (int8)       | Activation:none                                     |
|       |                 | 128 (int8)     |                |                                                     |
|       |                 | 4 (int32)      |                |                                                     |
| 6     | softmax         | 4 (int8)       | 4 (int8)       | BuiltinOptionsType=9                                |
+-------+-----------------+----------------+----------------+-----------------------------------------------------+
Total MACs: 278.144 k
Total OPs: 563.084 k
Name: keyword_spotting_on_off
Version: 1
Description: Keyword spotting classifier to detect: "on" and "off"
Classes: on, off, _unknown_, _silence_
hash: a5c31da1954ca849eed61dd1007ddf58
date: 2022-04-25T18:23:56.543Z
runtime_memory_size: 7052
average_window_duration_ms: 1000
detection_threshold: 160
suppression_ms: 750
minimum_count: 3
volume_gain: 2
latency_ms: 100
verbose_model_output_logs: False
samplewise_norm.rescale: 0.0
samplewise_norm.mean_and_std: False
fe.sample_rate_hz: 8000
fe.sample_length_ms: 1000
fe.window_size_ms: 30
fe.window_step_ms: 20
fe.filterbank_n_channels: 32
fe.filterbank_upper_band_limit: 3999.0
fe.filterbank_lower_band_limit: 100.0
fe.noise_reduction_enable: True
fe.noise_reduction_smoothing_bits: 5
fe.noise_reduction_even_smoothing: 0.004000000189989805
fe.noise_reduction_odd_smoothing: 0.004000000189989805
fe.noise_reduction_min_signal_remaining: 0.05000000074505806
fe.pcan_enable: False
fe.pcan_strength: 0.949999988079071
fe.pcan_offset: 80.0
fe.pcan_gain_bits: 21
fe.log_scale_enable: True
fe.log_scale_shift: 6
fe.fft_length: 256
.tflite file size: 14.4kB

Model Visualization

The MLTK also allows for visualizing the model in an interactive webpage.

This is done using the view command. Refer to the Model Visualization Guide for more details on how this works.

NOTES:

  • This will open a new tab to your web-browser

  • You must click the opened webpage’s ‘Accept’ button the first time it runs (and possibly re-run the command)

  • Since we have not trained our model yet, we must add the --build option to the command. This is not required once the model is trained.

  • This command must run locally, it will not work from a remote terminal/notebook

Visualize Keras model

By default, the view command will visualize the KerasModel, the model used for training (file extension .h5).

# This will open a new tab in your web browser
# Be sure to click the 'Accept' button in the opened webpage
# (you may need to re-run this command after doing so)
!mltk view keyword_spotting_on_off --build
Serving 'E:/reed/mltk/tmp_models/model.h5' at http://localhost:8080
Stopping http://localhost:8080

Visualize TF-Lite model

Alternatively, the --tflite flag can be used to view the TfliteModel, the quantized model that is programmed to the embedded device (file extension .tflite).

Note that the structures of the Keras and TfLite models are similar, but the TfLite model is a bit simpler. This is because the TF-Lite Converter optimizes the model by merging/fusing as many layers as possible (for example, the BatchNormalization layers are folded into the preceding Conv2D layers, which is why they do not appear in the .tflite summary above).

# This will open a new tab in your web browser
# Be sure to click the 'Accept' button in the opened webpage
# (you may need to re-run this command after doing so)
!mltk view keyword_spotting_on_off --tflite --build
Epoch 00001: LearningRateScheduler setting learning rate to 0.001.

Epoch 00002: LearningRateScheduler setting learning rate to 0.00095.

Epoch 00003: LearningRateScheduler setting learning rate to 0.0009025.
Serving 'E:/reed/mltk/tmp_models/keyword_spotting_on_off.tflite' at http://localhost:8080
Stopping http://localhost:8080
fully_quantize: 0, inference_type: 6, input_inference_type: 9, output_inference_type: 9

Model Profiler

Before spending the time and energy to train the model, it may be useful to profile the model to determine how efficiently it may run on the embedded device. If it’s determined that the model does not fit within the time or memory constraints, then the model layout should be adjusted, the model input size should be reduced, and/or a different model should be selected.

For this reason, the MLTK features a model profiler. Refer to the Model Profiler Guide for more details.

NOTE: The following examples use the --build flag since the model has not been trained yet. Once the model is trained this flag is no longer needed.

Profile in simulator

The following command will profile our model in the MVP hardware simulator and return estimates about the time and energy the model might require on the embedded device.

NOTES:

  • An embedded device does not need to be locally connected to run this command.

  • Remove the --accelerator MVP option if you are targeting a device that does not have an MVP hardware accelerator.

!mltk profile keyword_spotting_on_off --build --accelerator MVP
Epoch 1: LearningRateScheduler setting learning rate to 0.001.

Epoch 2: LearningRateScheduler setting learning rate to 0.00095.

Epoch 3: LearningRateScheduler setting learning rate to 0.0009025.
C:\Users\reed\workspace\silabs\mltk\.venv\lib\site-packages\tensorflow\lite\python\convert.py:746: UserWarning: Statistics for quantized inputs were expected, but not specified; continuing anyway.
  warnings.warn("Statistics for quantized inputs were expected, but not "

Profiling Summary
Name: keyword_spotting_on_off
Accelerator: MVP
Input Shape: 1x49x32x1
Input Data Type: int8
Output Shape: 1x4
Output Data Type: int8
Flash, Model File Size (bytes): 14.4k
RAM, Runtime Memory Size (bytes): 6.8k
Operation Count: 574.5k
Multiply-Accumulate Count: 278.1k
Layer Count: 7
Unsupported Layer Count: 0
Accelerator Cycle Count: 427.8k
CPU Cycle Count: 76.8k
CPU Utilization (%): 16.7
Clock Rate (hz): 80.0M
Time (s): 5.7m
Energy (J): 137.6u
J/Op: 239.6p
J/MAC: 494.8p
Ops/s: 99.9M
MACs/s: 48.4M
Inference/s: 173.9

Model Layers
+-------+-----------------+--------+--------+------------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
| Index | OpCode          | # Ops  | # MACs | Acc Cycles | CPU Cycles | Energy (J) | Time (s) | Input Shape            | Output Shape | Options                                             |
+-------+-----------------+--------+--------+------------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
| 0     | conv_2d         | 67.2k  | 28.8k  | 92.0k      | 11.5k      | 45.7u      | 1.1m     | 1x49x32x1,8x3x3x1,8    | 1x25x16x8    | Padding:same stride:2x2 activation:relu             |
| 1     | conv_2d         | 244.6k | 119.8k | 170.2k     | 15.9k      | 45.7u      | 2.1m     | 1x25x16x8,16x3x3x8,16  | 1x13x8x16    | Padding:same stride:2x2 activation:relu             |
| 2     | conv_2d         | 260.7k | 129.0k | 164.3k     | 15.9k      | 45.7u      | 2.1m     | 1x13x8x16,32x3x3x16,32 | 1x7x4x32     | Padding:same stride:2x2 activation:relu             |
| 3     | max_pool_2d     | 896.0  | 0      | 576.0      | 27.3k      | 446.4n     | 341.3u   | 1x7x4x32               | 1x1x4x32     | Padding:valid stride:1x7 filter:1x7 activation:none |
| 4     | reshape         | 0      | 0      | 0          | 264.9      | 0.0p       | 3.3u     | 1x1x4x32,2             | 1x128        | BuiltinOptionsType=0                                |
| 5     | fully_connected | 1.0k   | 512.0  | 784.0      | 1.8k       | 50.5n      | 22.2u    | 1x128,4x128,4          | 1x4          | Activation:none                                     |
| 6     | softmax         | 20.0   | 0      | 0          | 4.1k       | 16.5n      | 51.8u    | 1x4                    | 1x4          | BuiltinOptionsType=9                                |
+-------+-----------------+--------+--------+------------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
Generating profiling report at C:/Users/reed/.mltk/models/keyword_spotting_on_off-test/profiling
Profiling time: 57.664001 seconds

Profile on physical device

Alternatively, if we have a device locally connected, we can profile directly on it instead. This is useful because the returned profiling numbers are “real”; they are not estimates as in the simulator case.

To profile on a physical device, simply add the --device command flag.

NOTES:

  • An embedded device must be locally connected to run this command.

  • Remove the --accelerator MVP option if you are targeting a device that does not have an MVP hardware accelerator.

!mltk profile keyword_spotting_on_off --build --device --accelerator MVP
Epoch 1: LearningRateScheduler setting learning rate to 0.001.

Epoch 2: LearningRateScheduler setting learning rate to 0.00095.

Epoch 3: LearningRateScheduler setting learning rate to 0.0009025.
C:\Users\reed\workspace\silabs\mltk\.venv\lib\site-packages\tensorflow\lite\python\convert.py:746: UserWarning: Statistics for quantized inputs were expected, but not specified; continuing anyway.
  warnings.warn("Statistics for quantized inputs were expected, but not "
Extracting: C:/Users/reed/.mltk/downloads/mltk_model_profiler-brd2601-mvp-616ee87e.zip
to: C:/Users/reed/.mltk/firmware/mltk_model_profiler-brd2601-mvp
(This may take awhile, please be patient ...)

Profiling Summary
Name: keyword_spotting_on_off
Accelerator: MVP
Input Shape: 1x49x32x1
Input Data Type: int8
Output Shape: 1x4
Output Data Type: int8
Flash, Model File Size (bytes): 14.4k
RAM, Runtime Memory Size (bytes): 6.8k
Operation Count: 574.5k
Multiply-Accumulate Count: 278.1k
Layer Count: 7
Unsupported Layer Count: 0
Accelerator Cycle Count: 439.8k
CPU Cycle Count: 92.3k
CPU Utilization (%): 18.9
Clock Rate (hz): 80.0M
Time (s): 6.1m
Ops/s: 93.9M
MACs/s: 45.4M
Inference/s: 163.4

Model Layers
+-------+-----------------+--------+--------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
| Index | OpCode          | # Ops  | # MACs | Acc Cycles | CPU Cycles | Time (s) | Input Shape            | Output Shape | Options                                             |
+-------+-----------------+--------+--------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
| 0     | conv_2d         | 67.2k  | 28.8k  | 98.5k      | 19.1k      | 1.3m     | 1x49x32x1,8x3x3x1,8    | 1x25x16x8    | Padding:same stride:2x2 activation:relu             |
| 1     | conv_2d         | 244.6k | 119.8k | 173.6k     | 18.7k      | 2.2m     | 1x25x16x8,16x3x3x8,16  | 1x13x8x16    | Padding:same stride:2x2 activation:relu             |
| 2     | conv_2d         | 260.7k | 129.0k | 166.2k     | 18.8k      | 2.1m     | 1x13x8x16,32x3x3x16,32 | 1x7x4x32     | Padding:same stride:2x2 activation:relu             |
| 3     | max_pool_2d     | 896.0  | 0      | 800.0      | 28.2k      | 360.0u   | 1x7x4x32               | 1x1x4x32     | Padding:valid stride:1x7 filter:1x7 activation:none |
| 4     | reshape         | 0      | 0      | 0          | 1.1k       | 0        | 1x1x4x32,2             | 1x128        | BuiltinOptionsType=0                                |
| 5     | fully_connected | 1.0k   | 512.0  | 809.0      | 2.1k       | 60.0u    | 1x128,4x128,4          | 1x4          | Activation:none                                     |
| 6     | softmax         | 20.0   | 0      | 0          | 4.3k       | 30.0u    | 1x4                    | 1x4          | BuiltinOptionsType=9                                |
+-------+-----------------+--------+--------+------------+------------+----------+------------------------+--------------+-----------------------------------------------------+
Generating profiling report at C:/Users/reed/.mltk/models/keyword_spotting_on_off-test/profiling
Profiling time: 91.943971 seconds

Note about CPU utilization

An important metric the model profiler provides when using the MVP hardware accelerator is CPU Utilization. This gives an indication of how much CPU is required to run the machine learning model.

If no hardware accelerator is used, then the CPU utilization is 100% as 100% of the machine learning model’s calculations are executed on the CPU. With the hardware accelerator, many of the model’s calculations can be offloaded to the accelerator freeing the CPU to do other tasks.

The additional CPU cycles the hardware accelerator provides can be a major benefit, especially when other tasks such as real-time audio processing are required.
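
As a rough sanity check against the simulator report above, and assuming the utilization is simply CPU cycles divided by the total cycles available during one inference, the numbers line up:

# CPU utilization back-of-the-envelope using the simulator numbers above
cpu_cycles = 76.8e3        # CPU Cycle Count
clock_rate_hz = 80e6       # Clock Rate (hz)
inference_time_s = 5.7e-3  # Time (s)

total_cycles = clock_rate_hz * inference_time_s   # ~456k cycles per inference
print(f'{100 * cpu_cycles / total_cycles:.1f}%')  # ~16.8%, close to the reported 16.7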

Model Training

Now that we have our model fully specified and it fits within the constraints of the embedded device, we can train the model.

The basic flow for model training is:

  1. Invoke the train command

  2. Tensorflow trains the model

  3. A Model Archive containing the trained model is generated in the same directory as the model specification script

Refer to the Model Training Guide for more details about this process.

Train as a “dry run”

Before fully training the model, sometimes it is useful to train the model as a “dry run” to ensure the end-to-end training process works. Here, the model is trained for a few epochs on a subset of the dataset.

To train as a dry run, append -test to the model name.
At the end of training, a Model Archive with -test appended to the archive name is generated in the same directory as the model specification script.

# Train as a dry run by appending "-test" to the model name
!mltk train keyword_spotting_on_off-test
Enabling test mode
Model: "keyword_spotting_on_off"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 25, 16, 8)         80        
                                                                 
 batch_normalization (BatchN  (None, 25, 16, 8)        32        
 ormalization)                                                   
                                                                 
 activation (Activation)     (None, 25, 16, 8)         0         
                                                                 
 conv2d_1 (Conv2D)           (None, 13, 8, 16)         1168      
                                                                 
 batch_normalization_1 (Batc  (None, 13, 8, 16)        64        
 hNormalization)                                                 
                                                                 
 activation_1 (Activation)   (None, 13, 8, 16)         0         
                                                                 
 dropout (Dropout)           (None, 13, 8, 16)         0         
                                                                 
 conv2d_2 (Conv2D)           (None, 7, 4, 32)          4640      
                                                                 
 batch_normalization_2 (Batc  (None, 7, 4, 32)         128       
 hNormalization)                                                 
                                                                 
 activation_2 (Activation)   (None, 7, 4, 32)          0         
                                                                 
 dropout_1 (Dropout)         (None, 7, 4, 32)          0         
                                                                 
 max_pooling2d (MaxPooling2D  (None, 1, 4, 32)         0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 4)                 516       
                                                                 
=================================================================
Total params: 6,628
Trainable params: 6,516
Non-trainable params: 112
_________________________________________________________________

Total MACs: 278.144 k
Total OPs: 574.468 k
Name: keyword_spotting_on_off
Version: 1
Description: Keyword spotting classifier to detect: "yes" and "no"
Classes: on, off, _unknown_, _silence_
hash: None
date: None
average_window_duration_ms: 1000
fully_quantize: 0, inference_type: 6, input_inference_type: 9, output_inference_type: 9
detection_threshold: 160
suppression_ms: 750
minimum_count: 3
volume_db: 5.0
latency_ms: 0
log_level: info
Test mode enabled, forcing max_samples_per_class=3, batch_size=3
NOTE: ProcessPoolManager using ThreadPool (instead of ProcessPool)
ProcessPoolManager using 1 of 24 CPU cores
NOTE: You may need to adjust the "cores" parameter of the data generator if you're experiencing performance issues
Training dataset: Found 13 samples belonging to 4 classes:
        on = 3
       off = 3
 _unknown_ = 6
 _silence_ = 1
Validation dataset: Found 13 samples belonging to 4 classes:
        on = 3
       off = 3
 _unknown_ = 6
 _silence_ = 1
Forcing epochs=3 since test=true
Class weights:
       on = 1.08
      off = 1.08
_unknown_ = 0.54
_silence_ = 3.25
Starting model training ...

Epoch 00001: LearningRateScheduler setting learning rate to 0.001.
Epoch 1/3

1/5 [=====>........................] - ETA: 20s - loss: 3.0739 - accuracy: 0.0000e+00
2/5 [===========>..................] - ETA: 1s - loss: 1.9445 - accuracy: 0.1667     
4/5 [=======================>......] - ETA: 0s - loss: 3.2955 - accuracy: 0.3000
5/5 [==============================] - 7s 507ms/step - loss: 2.6482 - accuracy: 0.3846 - val_loss: 2.3505 - val_accuracy: 0.2308 - lr: 0.0010

Epoch 00002: LearningRateScheduler setting learning rate to 0.00095.
Epoch 2/3

1/5 [=====>........................] - ETA: 0s - loss: 5.1078 - accuracy: 0.0000e+00
5/5 [==============================] - ETA: 0s - loss: 2.3484 - accuracy: 0.1538    
5/5 [==============================] - 1s 120ms/step - loss: 2.3484 - accuracy: 0.1538 - val_loss: 1.6892 - val_accuracy: 0.1538 - lr: 9.5000e-04

Epoch 00003: LearningRateScheduler setting learning rate to 0.0009025.
Epoch 3/3

1/5 [=====>........................] - ETA: 0s - loss: 0.8508 - accuracy: 0.6667
3/5 [=================>............] - ETA: 0s - loss: 1.6849 - accuracy: 0.3333
4/5 [=======================>......] - ETA: 0s - loss: 1.5885 - accuracy: 0.3000
5/5 [==============================] - ETA: 0s - loss: 1.5144 - accuracy: 0.3846
5/5 [==============================] - 1s 258ms/step - loss: 1.5144 - accuracy: 0.3846 - val_loss: 1.4552 - val_accuracy: 0.2308 - lr: 9.0250e-04
Generating C:/Users/reed/.mltk/models/keyword_spotting_on_off-test/keyword_spotting_on_off.test.h5


*** Best training val_accuracy = 0.231


Creating c:/users/reed/workspace/silabs/mltk/mltk/models/siliconlabs/keyword_spotting_on_off-test.mltk.zip
Test mode enabled, forcing max_samples_per_class=3, batch_size=3
NOTE: ProcessPoolManager using ThreadPool (instead of ProcessPool)
ProcessPoolManager using 1 of 24 CPU cores
NOTE: You may need to adjust the "cores" parameter of the data generator if you're experiencing performance issues
Generating C:/Users/reed/.mltk/models/keyword_spotting_on_off-test/keyword_spotting_on_off.test.tflite
Updating c:/users/reed/workspace/silabs/mltk/mltk/models/siliconlabs/keyword_spotting_on_off-test.mltk.zip
Training complete
Training logs here: C:/Users/reed/.mltk/models/keyword_spotting_on_off-test
Trained model files here: c:/users/reed/workspace/silabs/mltk/mltk/models/siliconlabs/keyword_spotting_on_off-test.mltk.zip
Evaluating the .h5 model ...
Name: keyword_spotting_on_off
Model Type: classification
Overall accuracy: 23.077%
Class accuracies:
- on = 100.000%
- off = 0.000%
- _unknown_ = 0.000%
- _silence_ = 0.000%
Average ROC AUC: 47.738%
Class ROC AUC:
- off = 71.667%
- _silence_ = 58.333%
- on = 38.333%
- _unknown_ = 22.619%

Evaluating the .tflite model ...
Name: keyword_spotting_on_off
Model Type: classification
Overall accuracy: 23.077%
Class accuracies:
- on = 100.000%
- off = 0.000%
- _unknown_ = 0.000%
- _silence_ = 0.000%
Average ROC AUC: 47.292%
Class ROC AUC:
- off = 71.667%
- _silence_ = 54.167%
- on = 38.333%
- _unknown_ = 25.000%

Training locally

One option for training your model is to run the train command in your local terminal.
Most of the models used by embedded devices are small enough that this is a feasible option.
Nevertheless, this is a very CPU-intensive operation. Many times it’s best to issue the train command and let it run overnight.

See the Note about training time section below for more details.

NOTE: Training a model from scratch can be very time-consuming. See the Transfer Learning Tutorial for how to speed this process up.

# Be sure to replace "keyword_spotting_on_off"
# with the name of your model
# WARNING: This command may take several hours
!mltk train keyword_spotting_on_off

Train in cloud

Alternatively, you can vastly improve the model training time by training this model in the “cloud”.
See the tutorial: Cloud Training with vast.ai for more details.

Note about training time

TL;DR: Improve training time by using a PC/cloud VM with lots of CPU cores (4-16)

Models intended for embedded devices are typically “small”. Thus, while one or more powerful GPUs will certainly improve training times, there is an upper limit to their benefit. Many times, the bottleneck during training comes from the data preprocessing which is usually done on the CPU. i.e. The GPU(s) train the model faster than the CPU(s) can generate the next round of training data.

For this reason, it is usually beneficial to use a PC/cloud VM with multiple cores. This way, multiple CPUs can generate training data in parallel while the GPU is always fed with more training data.

The MLTK comes with two packages to leverage a multi-core system: the ParallelAudioDataGenerator (used in this tutorial) and the ParallelImageDataGenerator.

These generate training data by spawning multiple processes. This allows for multiple cores to generate training data simultaneously.

Model Evaluation

With our model trained, we can now evaluate it to see how accurate it is.

The basic idea behind model evaluation is to send test samples (i.e. new, unknown samples the model was not trained with) through the model, and compare the model’s predictions versus the expected values. If all the model predictions match the expected values then the model is 100% accurate, and every wrong prediction decreases the model accuracy, e.g.:

Model Accuracy

Assuming the test samples are representative, the model accuracy should indicate how well the model will perform in the real world.
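
The accuracy metric itself is simple; as a minimal illustration:

# Minimal illustration of the accuracy metric described above
import numpy as np

y_true = np.array([0, 1, 2, 3, 1, 0])  # expected class indices
y_pred = np.array([0, 1, 2, 1, 1, 0])  # model predictions (one is wrong)
print(f'{np.mean(y_true == y_pred):.1%}')  # 83.3%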

Model evaluation is done using the evaluate MLTK command. Along with accuracy, the evaluate command generates other statistics such as ROC-AUC and Precision & Recall.
Refer to the Model Evaluation Guide for more details about using the MLTK for model evaluation.

Command

To evaluate the newly trained model, issue the following command:

NOTE: Be sure to replace keyword_spotting_on_off with the name of your model.

# Run the model evaluation command
!mltk evaluate keyword_spotting_on_off --tflite --show
# For documentation purposes, we use the evaluate_model Python API so
# the evaluation plots are generated inline with the docs
from mltk.core import evaluate_model 
evaluation_results = evaluate_model('keyword_spotting_on_off', tflite=True, show=True)
print(f'{evaluation_results}')
(The evaluation plots generated by this command are displayed here.)
Name: keyword_spotting_on_off
Model Type: classification
Overall accuracy: 89.174%
Class accuracies:
- _silence_ = 100.000%
- _unknown_ = 88.981%
- on = 88.281%
- off = 87.200%
Average ROC AUC: 98.475%
Class ROC AUC:
- _silence_ = 100.000%
- off = 98.254%
- on = 98.116%
- _unknown_ = 97.528%

So in this case, our model has an overall accuracy of about 89%.

Once again, please refer to the Model Evaluation Guide for more details about the various metrics generated by this command.

Note about model accuracy

While model accuracy is highly application-specific, 89% accuracy for a two-keyword classification model is considered good, but not great. Typically, model accuracy should be in the 92-97%+ range for the model to perform well in the field.

The following are things to keep in mind to improve the model accuracy:

  • Verify the dataset - Ensure all the samples are properly labeled and in a consistent format

  • Improve the feature engineering - Give the model the best chance to learn the patterns within the data by “amplifying” the signal (e.g. Improve the spectrogram quality)

  • Increase the model size - Add more or wider layers (e.g. add more Conv2D filters, as sketched below)
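
For example, the sketch below shows how the width of a small Keras convolutional model might be increased. The layer sizes and spectrogram shape are hypothetical; the tutorial model’s actual architecture is defined in its model specification file.

# Hypothetical sketch of "widening" a small Keras model by increasing the
# number of Conv2D filters; not the tutorial model's actual architecture.
from tensorflow.keras import layers, models

def build_model(num_classes: int = 4, width_multiplier: int = 2):
    return models.Sequential([
        layers.Input(shape=(49, 40, 1)),   # example spectrogram shape
        layers.Conv2D(8 * width_multiplier, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16 * width_multiplier, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation='softmax'),
    ])

build_model().summary()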

Model Testing

NOTE: This section is experimental and is optional for the rest of this tutorial. You may safely skip to the next section.

During model evaluation, static audio samples are sent through the model to make predictions and determine its accuracy. While this is useful for obtaining a consistent, baseline metric of model performance, this setup does not reflect the real-world application of the model.
In most real-world applications, real-time audio is constantly streaming into the model. In this case, the same audio sample will pass through the model multiple times, shifted in time on each pass (see the Keyword Spotting Overview section above for more details).

To help evaluate this scenario, the MLTK offers the command: classify_audio. With this command, the trained model can be used to classify keywords detected in streaming audio from a microphone. The classify_audio command features:

  • Support for executing a model on PC or embedded device

  • Support for dumping the spectrograms generated by the AudioFeatureGenerator

  • Support for recording audio

  • Support for adjusting the detection threshold

  • Support for viewing the model prediction results in real-time

NOTE: The classify_audio command must run locally. It will not work remotely (e.g. on Colab or remote SSH)

See the output of the command help for more details:

!mltk classify_audio --help
Usage: mltk classify_audio [OPTIONS] <model>

  Classify keywords/events detected in a microphone's streaming audio

  NOTE: This command is experimental. Use at your own risk!

  This command runs an audio classification application on either the local PC OR
  on an embedded target. The audio classification application loads the given 
  audio classification ML model (e.g. Keyword Spotting) and streams real-time audio
  from the local PC's/embedded target's microphone into the ML model.

  System Dataflow:
  Microphone -> AudioFeatureGenerator -> ML Model -> Command Recognizer -> Local Terminal  
 
  The audio classification application was adapted from TF-Lite Micro's "Micro Speech" 
  example:  
  https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/examples/micro_speech
 
  The TFLM app was modified so that settings can be dynamically loaded from the command-line or
  given ML model.
 
  Refer to the mltk.models.tflite_micro.tflite_micro_speech model for a reference on how to train
  an ML model that works with the audio classification application.
 
  ----------
   Examples
  ----------
 
  # Classify audio on local PC using tflite_micro_speech model   
  # Simulate the audio loop latency to be 200ms  
  # i.e. If the app was running on an embedded target, it would take 200ms per audio loop  
  # Also enable verbose logs  
  mltk classify_audio tflite_micro_speech --latency 200 --verbose 

  # Classify audio on an embedded target using model: ~/workspace/my_model.tflite   
  # and the following classifier settings:  
  # - Set the averaging window to 1200ms (i.e. drop samples older than <now> minus window)  
  # - Set the minimum sample count to 3 (i.e. must have at last 3 samples before classifying)  
  # - Set the threshold to 175 (i.e. the average of the inference results within the averaging window must be at least 175 of 255)  
  # - Set the suppression to 750ms (i.e. Once a keyword is detected, wait 750ms before detecting more keywords)  
  # i.e. If the app was running on an embedded target, it would take 200ms per audio loop  
  mltk classify_audio /home/john/my_model.tflite --device --window 1200ms --count 3 --threshold 175 --suppression 750  

  # Classify audio and also dump the captured raw audio and spectrograms  
  mltk classify_audio tflite_micro_speech --dump-audio --dump-spectrograms

Arguments:
  <model>  One of the following:
           - MLTK model name 
           - Path to .tflite file
           - Path to model archive file (.mltk.zip)
           NOTE: The model must have been previously trained for keyword spotting  [required]

Options:
  -a, --accelerator <name>        Name of accelerator to use while executing the audio classification ML model.
                                  If omitted, then use the reference kernels
                                  NOTE: It is recommended to NOT use an accelerator if running on the PC since the HW simulator can be slow.
  -d, --device                    If provided, then run the keyword spotting model on an embedded device, otherwise use the PC's local microphone.
                                  If this option is provided, then the device must be locally connected
  --port <port>                   Serial COM port of a locally connected embedded device.
                                  This is only used with the --device option.
                                  If omitted, then attempt to automatically determine the serial COM port
  -v, --verbose                   Enable verbose console logs
  -w, --window_duration <duration ms>
                                  Controls the smoothing. Drop all inference results that are older than <now> minus window_duration.
                                  Longer durations (in milliseconds) will give a higher confidence that the results are correct, but may miss some commands
  -c, --count <count>             The *minimum* number of inference results to
                                  average when calculating the detection value
  -t, --threshold <threshold>     Minimum averaged model output threshold for
                                  a class to be considered detected, 0-255.
                                  Higher values increase precision at the cost
                                  of recall
  -s, --suppression <suppression ms>
                                  Amount of milliseconds to wait after a
                                  keyword is detected before detecting new
                                  keywords
  -l, --latency <latency ms>      This is the amount of time in milliseconds
                                  between processing loops
  -m, --microphone <name>         For non-embedded, this specifies the name of
                                  the PC microphone to use
  -u, --volume <volume gain>      Set the volume gain scaler (i.e. amplitude)
                                  to apply to the microphone data. If 0 or
                                  omitted, no scaler is applied
  -x, --dump-audio                Dump the raw microphone and generate a
                                  corresponding .wav file
  -w, --dump-raw-spectrograms     Dump the raw (i.e. unquantized) generated
                                  spectrograms to .jpg images and .mp4 video
  -z, --dump-spectrograms         Dump the quantized generated spectrograms to
                                  .jpg images and .mp4 video
  -i, --sensitivity FLOAT         Sensitivity of the activity indicator LED.
                                  Much less than 1.0 has higher sensitivity
  --app <path>                    By default, the audio_classifier app is automatically downloaded. 
                                  This option allows for overriding with a custom built app.
                                  Alternatively, if using the --device option, set this option to "none" to NOT program the audio_classifier app to the device.
                                  In this case, ONLY the .tflite will be programmed and the existing audio_classifier app will be re-used.
  --test                          Run as a unit test
  --help                          Show this message and exit.
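
To make the --window_duration, --count, --threshold, and --suppression options more concrete, here is a minimal sketch of how such smoothing logic can work. This is an illustration only, not the MLTK’s actual command recognizer.

# Illustrative sketch (NOT the MLTK's actual implementation) of the smoothing
# scheme described by --window_duration, --count, --threshold and --suppression:
# average the recent inference results and only report a detection when the
# average exceeds the threshold.
from collections import deque

class SimpleCommandRecognizer:
    def __init__(self, window_ms=1000, min_count=3, threshold=175, suppression_ms=750):
        self.window_ms = window_ms
        self.min_count = min_count
        self.threshold = threshold
        self.suppression_ms = suppression_ms
        self.history = deque()              # (timestamp_ms, score 0-255)
        self.last_detection_ms = -10**9

    def process(self, timestamp_ms: int, score: int) -> bool:
        """Add one inference result and return True if the keyword is detected."""
        self.history.append((timestamp_ms, score))
        # Drop results older than <now> minus the averaging window
        while self.history and self.history[0][0] < timestamp_ms - self.window_ms:
            self.history.popleft()
        if len(self.history) < self.min_count:
            return False
        # Suppress detections that occur too soon after the previous one
        if timestamp_ms - self.last_detection_ms < self.suppression_ms:
            return False
        average = sum(s for _, s in self.history) / len(self.history)
        if average >= self.threshold:
            self.last_detection_ms = timestamp_ms
            return True
        return False

# Example: model outputs for the "on" class arriving every 200ms
recognizer = SimpleCommandRecognizer()
for t, score in [(0, 120), (200, 180), (400, 210), (600, 230), (800, 90)]:
    if recognizer.process(t, score):
        print(f'Keyword detected at t={t}ms')   # -> detected at t=600ms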

Classify audio on PC

Issue the following command to stream real-time audio from your local PC’s microphone into our trained model:

NOTE: Be sure to replace keyword_spotting_on_off with your trained model’s name.

# Run the audio classification application
# Saying the keywords "on" or "off" into your PC's microphone should cause the model to detect them.
# HINT: Add the --verbose flag to view more info from the classification app
# NOTE: This command must be run from a local terminal
!mltk classify_audio keyword_spotting_on_off

Classify audio on device

Alternatively, we can run the audio classification application + model on a supported embedded device.
In this case, we use the embedded device’s microphone and the trained model executes on the embedded device.

To run on an embedded device, add the --device flag. When a keyword is detected an LED will turn on and a log will be printed to the console.

NOTE: A supported embedded device must be locally connected to run the command.

Note about DSP

Currently, your mouth must be within ~2 inches of the board’s microphone for it to reliably detect keywords.
This is because the board’s microphones lack advanced Digital Signal Processing (DSP) features such as beamforming. A future release of the Gecko SDK will offer this feature to improve audio quality at longer distances.

# Run the audio classification application with MVP acceleration
# Remove the "--accelerator MVP" if your device does not support the MVP.
# Saying the keywords "on" or "off" into the device's microphone should cause the model to detect them
# which will cause the LED to turn on and a message to be printed to the console.
# HINT: Add the --verbose flag to view more info from the classification app
# NOTE: This command must run from a local terminal
# NOTE: Your mouth must be ~2 inches from the board's microphone
!mltk classify_audio keyword_spotting_on_off --device --accelerator MVP

Record audio and spectrograms from device

Another useful feature of the classify_audio command is the ability to record audio and spectrograms from the embedded device.
This is done by adding the --dump-audio or --dump-spectrograms flags to the command.
When the command completes, a log directory will contain the dumped audio sound file (.wav) or the dumped spectrogram images (.jpg).

# Dump audio from an embedded device's microphone
# The dumped audio will be found in the log directory:
# ~/.mltk/audio_classify_recordings/<platform>/audio
# NOTE: This command must run from a local terminal
!mltk classify_audio keyword_spotting_on_off --device --dump-audio
# Dump spectrograms from an embedded device's microphone
# The dumped spectrograms will be found in the log directory:
# ~/.mltk/audio_classify_recordings/<platform>/spectrograms
# NOTE: This command must run from a local terminal
!mltk classify_audio keyword_spotting_on_off --device --dump-spectrograms
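
The dumped recordings are ordinary .wav and .jpg files, so they can be inspected with standard tools. A minimal sketch for listing and inspecting them is shown below; the log directory path follows the comments above and may differ between MLTK versions and platforms.

# Minimal sketch for inspecting the dumped recordings.
# The directory layout follows the log paths shown in the comments above
# and may differ between MLTK versions/platforms.
from pathlib import Path
import wave

dump_dir = Path.home() / '.mltk' / 'audio_classify_recordings'

# List the dumped .wav and .jpg files
for path in sorted(dump_dir.rglob('*')):
    if path.suffix in ('.wav', '.jpg'):
        print(path)

# Print basic info about the first dumped .wav file (if any)
wav_files = sorted(dump_dir.rglob('*.wav'))
if wav_files:
    with wave.open(str(wav_files[0]), 'rb') as wav:
        print(f'{wav_files[0].name}: {wav.getframerate()} Hz, '
              f'{wav.getnchannels()} channel(s), {wav.getnframes()} frames')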

Deploying the Model

Now that we have a trained model, it is time to run it on an embedded device.

There are several different ways this can be done:

Using Simplicity Studio

The standard Gecko SDK also features an audio_classifier application.

The basic sequence for updating the app with a new model is:

  1. Using Simplicity Studio, create a new “Audio Classifier” project

  2. Extract the .tflite model file from the MLTK Model Archive

  3. Copy the .tflite model file to a Gecko SDK project in Simplicity Studio, more details here

  4. Build the Gecko SDK project via Simplicity Studio

  5. Program the built firmware image to the embedded device

  6. Run the firmware image with the trained model on the embedded device

When Simplicity Studio builds the project, it finds the .tflite model file and generates a C header file containing a uint8_t array of the .tflite's binary data. The firmware then references the C array and loads it into the TensorFlow-Lite Micro interpreter.
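
To illustrate what that generated header looks like, the sketch below converts a .tflite flatbuffer into a C header containing a uint8_t array. This is only an illustration of the concept; Simplicity Studio performs this conversion automatically during the build, and the array/file names used here are made up.

# Illustration only: convert a .tflite flatbuffer into a C header containing
# a uint8_t array, similar in spirit to what Simplicity Studio generates.
# The array/file names below are made up for this example.
from pathlib import Path

def tflite_to_c_header(tflite_path: str, array_name: str = 'model_flatbuffer') -> str:
    data = Path(tflite_path).read_bytes()
    hex_bytes = ', '.join(f'0x{b:02x}' for b in data)
    return (
        '#include <stdint.h>\n\n'
        f'const uint8_t {array_name}[] = {{ {hex_bytes} }};\n'
        f'const uint32_t {array_name}_len = {len(data)};\n'
    )

# Example usage (paths are hypothetical):
# header = tflite_to_c_header('keyword_spotting_on_off.tflite')
# Path('model_flatbuffer.h').write_text(header)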

See the Getting Started with Machine Learning Gecko SDK guide for more details.

Using the MLTK

The MLTK supports building C++ Applications.

It also features an audio_classifier C++ application, which can be built with the MLTK's build command.

Refer to the audio_classifier application’s documentation for details on how to include your model in the built application.