Audio Feature Generator

The AudioFeatureGenerator is a software library that converts streaming audio into spectrograms. The spectrograms are then fed to a classification machine learning model, which makes predictions about the contents of the streaming audio.

A common use case of this library is “keyword spotting”.
Refer to the Keyword Spotting Overview for more details on how spectrograms are used to detect keywords in streaming audio.

Refer to the Keyword Spotting Tutorial for a complete guide on how to use the MLTK to create an audio classification ML model.

Overview

There are three main parts to the AudioFeatureGenerator:

  • Gecko SDK Component - Software library provided by the Gecko SDK that runs on the embedded target

  • MLTK C++ Python Wrapper - Python package that wraps the Gecko SDK software library; this runs on the host PC

  • Audio Visualizer Utility - Graphical utility to view the spectrograms generated by the AudioFeatureGenerator in real-time

Note

See the Audio Utilities documentation for more details about the audio tools offered by the MLTK

These parts work together as follows:

  1. The AudioFeatureGenerator visualizer tool is used to select spectrogram settings

    • The mltk view_audio command is used to invoke the visualizer tool

  2. The spectrogram settings are saved to a Model Specification file

  3. The Model Specification file is used to train the model

    • The mltk train command is used to train the model

    • Internally, the AudioFeatureGenerator C++ Python wrapper is used to dynamically generate spectrograms from the audio dataset

  4. At the end of training, the MLTK embeds the spectrogram settings into the generated .tflite model file

  5. The generated .tflite model file is copied to a Gecko SDK project

  6. The Gecko SDK project generator parses the spectrogram settings embedded in the .tflite and generates the corresponding C header files with the settings

  7. The Gecko SDK project is built and the firmware image is loaded onto the embedded target. The firmware image contains:

    • The Gecko SDK AudioFeatureGenerator software library

    • Tensorflow-Lite Micro

    • The generated .tflite model file (with the embedded spectrogram settings)

  8. On the embedded target at runtime (a host-side sketch of this flow follows the list):
    a. Streaming audio is read from the microphone
    b. The microphone audio is sent to the AudioFeatureGenerator, where spectrograms are generated using the exact same settings and algorithms that were used during model training
    c. The generated spectrogram images are sent to Tensorflow-Lite Micro and are classified using the .tflite model
    d. The model predictions are used to notify the application of keyword detections
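
The embedded flow in step 8 can be approximated on the host PC by combining the MLTK Python frontend with the standard TensorFlow Lite interpreter, as in the sketch below. This is for illustration only: the model path and the silent 1 s audio buffer are placeholders, and a real application streams overlapping audio chunks from the microphone rather than classifying a single buffer.

import numpy as np
import tensorflow as tf
from mltk.core.preprocess.utils import audio as audio_utils
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings

# Spectrogram settings; in practice these should match the settings embedded in the .tflite
frontend_settings = AudioFeatureGeneratorSettings()
frontend_settings.sample_rate_hz = 16000
frontend_settings.sample_length_ms = 1000

# Placeholder 1s of 16-bit PCM audio; a real application reads this from the microphone
audio_chunk = np.zeros(16000, dtype=np.int16)

# Generate a spectrogram using the same algorithms the embedded target uses
spectrogram = audio_utils.apply_frontend(
    sample=audio_chunk,
    settings=frontend_settings,
    dtype=np.int8
)

# Classify the spectrogram with the trained model ('my_model.tflite' is a placeholder path)
interpreter = tf.lite.Interpreter(model_path='my_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
interpreter.set_tensor(input_details['index'], spectrogram.reshape(input_details['shape']))
interpreter.invoke()
predictions = interpreter.get_tensor(output_details['index'])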

Benefits

The benefits of using the AudioFeatureGenerator are:

  • The exact same algorithms and settings used to generate the spectrograms during model training are also used by the embedded target

    • This ensures the ML model “sees” the same type of spectrograms at runtime that it was trained to see which should allow for better performance

  • The spectrogram settings are automatically embedded into the .tflite model file

    • This ensures the settings are in lock-step with the trained model

    • The ML model designer only needs to distribute a single file

  • The Gecko SDK will automatically generate the necessary source code

    • The Gecko SDK will parse the spectrogram settings from the .tflite and generate the corresponding C headers

    • The Gecko SDK comes with the full source code to the AudioFeatureGenerator software library

Gecko SDK Component

The Gecko SDK AudioFeatureGenerator component is largely based on the Google Microfrontend library.

The Microfrontend is a feature generation library (also called a frontend) that receives raw audio input and produces filter banks (a vector of values).

The raw audio input is expected to be 16-bit PCM samples with a configurable sample rate. More specifically, the audio signal optionally passes through a pre-emphasis filter; it is then sliced into (potentially overlapping) frames and a window function is applied to each frame; afterwards, a Fourier transform (more specifically, a Short-Time Fourier Transform) is computed for each frame and the power spectrum is calculated; and finally the filter banks are computed.
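
For intuition, the following is a rough floating-point NumPy sketch of that pipeline (framing, windowing, STFT power spectrum, and mel-style filter banks). It omits the optional pre-emphasis stage and the noise reduction, gain control, and quantization blocks, and the actual Microfrontend implementation uses fixed-point arithmetic, so its output will not match this sketch exactly.

import numpy as np

def spectrogram_float_approx(audio_int16, sample_rate=16000, window_size_ms=30,
                             window_step_ms=10, n_channels=32,
                             lower_hz=125.0, upper_hz=7500.0):
    # Slice the audio into (potentially overlapping) frames and apply a Hann window
    win_len = int(sample_rate * window_size_ms // 1000)
    step = int(sample_rate * window_step_ms // 1000)
    n_fft = 1 << (win_len - 1).bit_length()  # FFT size: next power of two
    window = np.hanning(win_len)
    audio = audio_int16.astype(np.float32)
    n_frames = 1 + (len(audio) - win_len) // step
    frames = np.stack([audio[i * step : i * step + win_len] * window
                       for i in range(n_frames)])

    # Short-Time Fourier Transform -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # Triangular, mel-spaced filter bank spanning [lower_hz, upper_hz]
    def to_mel(hz): return 1127.0 * np.log1p(np.asarray(hz, dtype=np.float64) / 700.0)
    def to_hz(mel): return 700.0 * np.expm1(mel / 1127.0)
    mel_points = np.linspace(to_mel(lower_hz), to_mel(upper_hz), n_channels + 2)
    bins = np.floor((n_fft + 1) * to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_channels, power.shape[1]))
    for ch in range(n_channels):
        left, center, right = bins[ch], bins[ch + 1], bins[ch + 2]
        for k in range(left, center):
            fbank[ch, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[ch, k] = (right - k) / max(right - center, 1)

    # Log-compressed filter-bank energies: one row per frame, one column per channel
    return np.log(power @ fbank.T + 1e-6)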

Source Code

The Gecko SDK features an AudioFeatureGeneration component.
The MLTK also features the same component with slight modifications so that it can be built for Windows/Linux.

MLTK C++ Python Wrapper

The C++ Python wrapper allows the AudioFeatureGenerator component to be executed from a Python script, which means the same software library can be invoked during model training. This is useful because the exact spectrogram generation algorithms used by the embedded device at runtime may also be used during model training, which should (hopefully) lead to more accurate model predictions.

The MLTK uses pybind11 to wrap the AudioFeatureGenerator software library and generate a Windows/Linux binary that can be loaded into the Python runtime environment.

The AudioFeatureGenerator Python API docs may be found here: mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator.

Source Code

Note

When installing the MLTK for local development, the C++ wrapper is automatically built into a Windows/Linux shared library (.dll / .so) and copied to the Python directory. When the AudioFeatureGenerator Python library is invoked by your Python scripts, the C++ wrapper shared library is loaded into the Python runtime environment.

Usage

The recommended way of using the AudioFeatureGenerator C++ wrapper is by calling the mltk.core.preprocess.utils.audio.apply_frontend() API.

Refer to the keyword_spotting_on_off_v3.py model specification for an example of how this is used.

Basically,

1 ) In your model specification file, define a model class that inherits the DatasetMixin, e.g.:

from mltk.core import MltkModel, TrainMixin, DatasetMixin, EvaluateClassifierMixin

class MyModel(
    MltkModel, 
    TrainMixin, 
    DatasetMixin, 
    EvaluateClassifierMixin
):
    pass

# Instantiate the model object used throughout the rest of the specification
my_model = MyModel()

2 ) In your model specification file, configure the spectrogram settings, e.g.:


from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings

frontend_settings = AudioFeatureGeneratorSettings()

frontend_settings.sample_rate_hz = 16000
frontend_settings.sample_length_ms = 1000                       # A 1s buffer should be enough to capture the keywords
frontend_settings.window_size_ms = 30
frontend_settings.window_step_ms = 10
frontend_settings.filterbank_n_channels = 104                   # We want this value to be as large as possible
                                                                # while still allowing for the ML model to execute efficiently on the hardware
frontend_settings.filterbank_upper_band_limit = 7500.0
frontend_settings.filterbank_lower_band_limit = 125.0           # The dev board mic seems to have a lot of noise at lower frequencies

frontend_settings.noise_reduction_enable = True                 # Enable the noise reduction block to help ignore background noise in the field
frontend_settings.noise_reduction_smoothing_bits = 10
frontend_settings.noise_reduction_even_smoothing =  0.025
frontend_settings.noise_reduction_odd_smoothing = 0.06
frontend_settings.noise_reduction_min_signal_remaining = 0.40   # This value is fairly large (which makes the background noise reduction small)
                                                                # But it has been found to still give good results
                                                                # i.e. There is still some background noise reduction,
                                                                # but the actual signal is still (mostly) untouched

frontend_settings.dc_notch_filter_enable = True                 # Enable the DC notch filter, to help remove the DC signal from the dev board's mic
frontend_settings.dc_notch_filter_coefficient = 0.95

frontend_settings.quantize_dynamic_scale_enable = True          # Enable dynamic quantization, this dynamically converts the uint16 spectrogram to int8
frontend_settings.quantize_dynamic_scale_range_db = 40.0

# Add the Audio Feature generator settings to the model parameters
# This way, they are included in the generated .tflite model file
# See https://siliconlabs.github.io/mltk/docs/guides/model_parameters.html
my_model.model_parameters.update(frontend_settings)
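
As a quick sanity check, the spectrogram dimensions implied by these settings follow from standard framing arithmetic: one row per window step that fits in the sample buffer, and one column per filter-bank channel. This is plain arithmetic rather than an MLTK API, and the exact frame count may differ by one depending on how the final partial window is handled.

window_size = (frontend_settings.sample_rate_hz * frontend_settings.window_size_ms) // 1000    # 480 samples
window_step = (frontend_settings.sample_rate_hz * frontend_settings.window_step_ms) // 1000    # 160 samples
buffer_size = (frontend_settings.sample_rate_hz * frontend_settings.sample_length_ms) // 1000  # 16000 samples

n_rows = 1 + (buffer_size - window_size) // window_step   # 98 windows
n_cols = frontend_settings.filterbank_n_channels          # 104 channels
print(f'Expected spectrogram shape: {n_rows} x {n_cols}')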

3 ) Configure your data pipeline to call the frontend:

import numpy as np
from mltk.core.preprocess.utils import audio as audio_utils

# augmented_sample is the audio sample produced by your augmentation pipeline,
# as a 1-D numpy array at frontend_settings.sample_rate_hz
spectrogram = audio_utils.apply_frontend(
   sample=augmented_sample,
   settings=frontend_settings,
   dtype=np.int8
)

During model training, spectrograms will be dynamically generated from the dataset’s audio samples using the AudioFeatureGenerator via the C++ Python wrapper.
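
For example, a tf.data-based pipeline could invoke the frontend once per sample along the lines of the sketch below. This is a generic illustration rather than the actual pipeline used by keyword_spotting_on_off_v3.py; the _compute_features and _to_spectrogram helpers and the dataset object are hypothetical.

import numpy as np
import tensorflow as tf
from mltk.core.preprocess.utils import audio as audio_utils

def _compute_features(sample: np.ndarray) -> np.ndarray:
    # sample: 1-D audio buffer at frontend_settings.sample_rate_hz
    # frontend_settings is the AudioFeatureGeneratorSettings instance from step 2
    return audio_utils.apply_frontend(
        sample=sample,
        settings=frontend_settings,
        dtype=np.int8
    )

def _to_spectrogram(sample, label):
    # Wrap the NumPy-based frontend so it can run inside a tf.data graph
    spectrogram = tf.numpy_function(_compute_features, [sample], tf.int8)
    return spectrogram, label

# 'dataset' is a hypothetical tf.data.Dataset of (audio, label) pairs
# dataset = dataset.map(_to_spectrogram, num_parallel_calls=tf.data.AUTOTUNE)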

At the end of training, the spectrogram settings are automatically embedded into the generated .tflite model file.

Audio Visualizer Utility

The Audio Visualizer Utility provides a graphical interface to the C++ Python wrapper, and thus to the Gecko SDK AudioFeatureGenerator software library. It allows for adjusting the various spectrogram settings and seeing how the resulting spectrogram is affected in real-time.

To use the Audio Visualizer utility, issue the command:

mltk view_audio

NOTE: Internally, this will install the wxPython Python package.

(Screenshot: Audio Visualizer utility)