{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Keyword Spotting - On/Off\n", "\n", "This tutorial describes how to use the MLTK to develop a machine learning model to detect the keywords:\n", "\n", "- __On__\n", "- __Off__" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Links\n", "\n", "- [GitHub Source](https://github.com/SiliconLabs/mltk/blob/master/mltk/tutorials/keyword_spotting_on_off.ipynb) - View this tutorial on Github\n", "- [Train in the \"Cloud\"](../../mltk/tutorials/cloud_training_with_vast_ai.md) - _Vastly_ improve training times by training this model in the \"cloud\"\n", "- [C++ Example Application](../../docs/cpp_development/examples/audio_classifier.md) - View this tutorial's associated C++ example application\n", "- [Machine Learning Model](../../docs/python_api/models/siliconlabs/keyword_spotting_on_off_v3.md) - View this tutorial's associated machine learning model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "### Objectives\n", "\n", "After completing this tutorial, you will have:\n", "1. A better understanding of how keyword-spotting (KWS) machine learning models work\n", "2. All of the tools needed to develop your own KWS machine learning model\n", "3. A working demo to turn an LED on/off based on the voice commands of your choice \n", "\n", "### Content\n", "\n", "This tutorial is divided into the following sections:\n", "1. [Overview of machine learning and keyword-spotting](#machine-learning-and-keyword-spotting-overview)\n", "2. [Dataset selection and preprocessing parameters](#dataset-selection-and-preprocessing-parameters)\n", "3. [Creating the model specification](#model-specification)\n", "4. [Visualizing the audio dataset](#audio-visualization)\n", "5. [Note about model parameters](#model-parameters)\n", "6. [Summarizing the model](#model-visualization)\n", "7. [Visualizing the model graph](#model-visualization)\n", "8. [Profiling the model](#model-profiler)\n", "9. [Training the model](#model-training)\n", "10. [Evaluating the model](#model-evaluation)\n", "11. [Testing the model](#model-testing)\n", "12. [Deploying the model to an embedded device](#deploying-the-model)\n", "\n", "\n", "### Running this tutorial from the command-line\n", "\n", "While this tutorial uses a [Jupyter Notebook](https://jupyter.org), \n", "the recommended approach is to use your favorite text editor and standard command terminal, no Jupyter Notebook required. \n", "\n", "See the [Standard Python Package Installation](https://siliconlabs.github.io/mltk/docs/installation.html#standard-python-package) guide for more details on how to enable the `mltk` command in your local terminal.\n", "\n", "In this mode, when you encounter a `!mltk` command in this tutorial, the command should actually run in your local terminal (excluding the `!`)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Install MLTK Python Package\n", "\n", "Before using the MLTK, it must first be installed. \n", "See the [Installation Guide](../../docs/installation.md) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade silabs-mltk" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "All MLTK modeling operations are accessible via the `mltk` command. \n", "Run the command `mltk --help` to ensure it is working. 
\n", "__NOTE:__ The exclamation point `!` tells the Notebook to run a shell command, it is not required in a [standard terminal](https://siliconlabs.github.io/mltk/docs/installation.html#standard-python-package)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: mltk [OPTIONS] COMMAND [ARGS]...\n", "\n", " Silicon Labs Machine Learning Toolkit\n", "\n", " This is a Python package with command-line utilities and scripts to aid the\n", " development of machine learning models for Silicon Lab's embedded platforms.\n", "\n", "Options:\n", " --version Display the version of this mltk package and exit\n", " --help Show this message and exit.\n", "\n", "Commands:\n", " build MLTK build commands\n", " classify_audio Classify keywords/events detected in a microphone's...\n", " commander Silab's Commander Utility\n", " custom Custom Model Operations\n", " evaluate Evaluate a trained ML model\n", " profile Profile a model\n", " quantize Quantize a model into a .tflite file\n", " summarize Generate a summary of a model\n", " train Train an ML model\n", " update_params Update the parameters of a previously trained model\n", " utest Run the all unit tests\n", " view View an interactive graph of the given model in a...\n", " view_audio View the spectrograms generated by the...\n" ] } ], "source": [ "!mltk --help" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning and Keyword-Spotting Overview\n", "\n", "Before continuing with this tutorial, it is recommended to review the following presentations: \n", "- [MLTK Overview](../../docs/overview.md) - An overview of the core concepts used by the this tutorial\n", "- [Keyword Spotting Overview](../../docs/audio/keyword_spotting_overview.md) - An overview of how keyword spotting works" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Selection and Preprocessing Parameters\n", "\n", "Before starting the actual tutorial, let's first discuss datasets.\n", "\n", "### TL;DR \n", "\n", "1. A _representative_ dataset must be acquired for the trained model to perform well in the real-world\n", " - Having a representative \"unknown\" class is critical; detecting the \"known\" classes is easy; rejecting everything else is hard.\n", "2. The dataset should (typically) be transformed so that the model can efficiently learn the features of the dataset\n", "3. Whatever transformations are used must be identical at training-time on the PC and run-time on the embedded device\n", "4. The size of the dataset can be effectively increased by randomly augmenting it during training (changing the pitch, speed, adding background noise, etc.)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Acquire a Representative Dataset\n", "\n", "The most critical aspect of any machine learning model is the dataset. A _representative_ dataset is necessary to train a robust model.\n", "A model that is trained on a dataset that is too small and/or not representative of what would be seen in the real-world will likely not perform well.\n", "\n", "In this tutorial, we want to create a keyword spotting classification machine learning model. This implies the following about the dataset: \n", "- The dataset must contain audio samples of the keywords we want to detect\n", "- The dataset must be labelled, i.e. each sample in the dataset must have an associated \"class\", e.g. 
\"on\", \"off\"\n", "- The dataset must be relatively large and representative to account for the variance in spoken language (accents, background noise, etc.) \n", "\n", "For this tutorial, we'll use the [Google Speech Commands v2](https://www.tensorflow.org/datasets/catalog/speech_commands) dataset\n", "(__NOTE:__ This dataset is automatically downloaded in a later step in this tutorial). \n", "This dataset is effectively a directory of sub-directories, and each sub-directory contains thousands of 1s audio clips.\n", "The name of each sub-directory corresponds to the word being spoken in the audio clip, e.g:\n", "\n", "```console\n", "/dataset\n", "/dataset/on\n", "/dataset/on/sample1.wav\n", "/dataset/on/sample2.wav\n", "...\n", "/dataset/off\n", "/dataset/off/sample1.wav\n", "/dataset/off/sample2.wav\n", "...\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Synthetically Generated Dataset\n", "\n", "The [Google Speech Commands v2](https://www.tensorflow.org/datasets/catalog/speech_commands) is relatively small. The \"on\" and \"off\" classes only have about 3k samples.\n", "To create a robust model that works in the real-world, the dataset should have 10k+ samples (or even 100k+).\n", "\n", "However, creating a large dataset can be expensive. To help overcome this, we use the [AudioDatasetGenerator](../../docs/python_api/utils/audio_dataset_generator/index.md)\n", "utility that comes with the MLTK. This is a utility that automatically generates audio samples using the Google, Amazon, and Microsoft clouds.\n", "Refer to the [Synthetic Audio Dataset Generation](../../mltk/tutorials/synthetic_audio_dataset_generation.md) tutorial for more details.\n", "\n", "With this utility, we generate the [Synthetic On/off Dataset](../../docs/python_api/datasets/audio/on_off.md) which adds about 15k more samples to our training dataset." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Creating an \"Unknown\" Class\n", "\n", "Creating a model that can detect the \"known\" classes (i.e. \"on\" and \"off\") is relatively easy.\n", "Creating a model that can also reliably reject everything else is typically a much harder problem. \n", "For instance, consider that we are making a voice-controlled light switch that turns the lights on and off. In this case, the lights must only change with the keywords \"on\" and \"off\". \n", "The switch must ignore all other sounds. (It would make for a poor user experience if the lights changed while having a conversation next to the switch.) This is why the \"unknown\" class is critical.\n", "The model should predict the \"unknown\" class for every other sound that is not a \"known\" class. \n", "\n", "So to summarize:\n", "- __Known classes__ - The keywords we want to detection (i.e. 
\"on\" and \"off\")\n", "- __Unknown class__ - Every other possible sound that might be heard in the field (silence, other words, random household noises, etc.)\n", "\n", "To help create a _representative_ \"unknown\" class, we use several datasets:\n", "\n", "- [ML Commons Keywords](../../docs/python_api/datasets/audio/ml_commons/keywords.md) - Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages\n", "- [Environmental Sound Classification](../../docs/python_api/datasets/audio/background_noise/esc50.md) - Collection of 2k short clips comprising 50 classes of various common sound events" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset Summary\n", "\n", "This model was trained using several different datasets: \n", "\n", "- [mltk.datasets.audio.on_off](../../docs/python_api/datasets/audio/on_off.md) - Synthetically generated keywords: on, off\n", "- [mltk.datasets.audio.speech_commands_v2](../../docs/python_api/datasets/audio/speech_commands_v2.md) - Human generated keywords: on, off\n", "- [mltk.datasets.audio.mlcommons.ml_commons_keyword](../../docs/python_api/datasets/audio/ml_commons/keywords.md) - Large collection of keywords, random subset used for *unknown* class\n", "- [mltk.datasets.audio.background_noise.esc50](../../docs/python_api/datasets/audio/background_noise/esc50.md) - Collection of various noises, random subset used for *unknown* class\n", "- [mltk.datasets.audio.background_noise.ambient](../../docs/python_api/datasets/audio/background_noise/ambient.md) - Collection of various background noises, mixed into other samples for augmentation\n", "- [mltk.datasets.audio.background_noise.brd2601](../../docs/python_api/datasets/audio/background_noise/brd2601.md) - \"Silence\" recorded by BRD2601 microphone, mixed into other samples to make them \"sound\" like they \n", "- [mltk.datasets.audio.mit_ir_survey](../../docs/python_api/datasets/audio/mit_ir_survey.md) - Impulse responses that are randomly convolved with the samples. This makes the samples sound if they were recorded in different environments" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Final note about the dataset\n", "\n", "The combined datasets meet our requirements: \n", "- They contain audio samples of the keywords we want to detect (\"on\", \"off\")\n", "- The samples are labelled (all \"on\" samples are in the \"on\" sub-directory etc.)\n", "- The dataset is representative (the audio clips are taken from many different people saying the same words, _as well as_ randomly audio samples for the \"unknown\" or \"negative\" class)\n", "\n", "__NOTE:__ For many machine learning applications acquiring a dataset will not be so easy. \n", "Many times the dataset will suffer from one or more of the following:\n", "- The dataset does not exist - Need to manually collect samples\n", "- The raw samples exist but are not \"labeled\" - Need to manually group the samples\n", "- The dataset is \"dirty\" - Bad/corrupt samples, mislabeled samples\n", "- The dataset is not representative - Duplicate/similar samples, not diverse enough to cover the possible range seen in the real-world\n", "\n", "__NOTE:__ A clean, representative dataset is one of the best ways to train a robust model.\n", "It is _highly_ recommended to invest the time/energy to create a good dataset!" 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Engineering\n", "\n", "Along with a representative dataset, we (usually) need to transform the individual samples of the dataset\n", "so that the machine learning model can efficiently learn the \"features\" of the dataset, and thus make accurate predictions. \n", "This process is frequently called \"feature engineering\". One way of describing feature engineering is: \n", "Use human insight to amplify the signals of the dataset so that a machine can more efficiently learn the patterns in it.\n", "\n", "The transform(s) used for feature engineering are highly application-specific.\n", "\n", "For this tutorial, we use the common technique of converting the raw audio into a spectrogram (i.e. gray-scale image).\n", "The machine learning model then learns the patterns in the spectrogram images that correspond to the keywords in the audio samples." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Featuring Engineering on the Edge\n", "\n", "An important aspect to keep in mind about the transform(s) chosen for featuring engineering \n", "is that whatever is done to the dataset samples during training must also be\n", "done on the embedded device at run-time. i.e. The _exact_ algorithms used to generate the\n", "spectrogram on the PC during training _must_ be used on the embedded device at run-time.\n", "Any divergence will cause the embedded model to \"see\" different samples and likely not perform well (if at all).\n", "\n", "For this purpose, the MLTK offers an [Audio Feature Generator](../../docs/audio/audio_feature_generator.md) component. \n", "This component generates spectrograms from raw audio. The algorithms used in this component are accessible via:\n", "\n", "- MLTK [Python API](../../docs/python_api/data_preprocessing/audio_feature_generator.md)\n", "- Gecko SDK [firmware component](https://siliconlabs.github.io/mltk/docs/audio/audio_feature_generator.html#gecko-sdk-component)\n", "\n", "In this way, the _exact_ spectrogram generation algorithms used during training may also be used at\n", "run-time on the embedded device.\n", "\n", "Refer to the [Audio Feature Generator](../../docs/audio/audio_feature_generator.md) documentation\n", "and [Audio Visualization](#audio-visualization) section for more details on how the various parameters used to generate the spectrogram may be determined." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Data Augmentation\n", "\n", "A useful technique for expanding the size of a dataset (and hopefully making it more representative) is to apply random augmentations to the training samples.\n", "For instance, audio dataset augmentations might include:\n", "- Increase/decrease speed\n", "- Increase/decrease pitch\n", "- Add random background noises\n", "- Applying an impulse response\n", "- Cropping \"known\" samples and adding to the \"unknown\" class\n", "\n", "In this way, the model never \"sees\" the same sample during training which should hopefully make it robust\n", "as it has learned from a larger collection of samples." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Random Impulse Response\n", "\n", "Another way of making an audio sample sound different is to apply an \"impulse response\" to it. The impulse response can make the audio sound as if it was captured in a different environment (e.g. 
"Another way of making an audio sample sound different is to apply an \"impulse response\" to it. The impulse response can make the audio sound as if it was captured in a different environment (e.g. in a church, in a field, etc.).\n", "\n", "To do this, we randomly apply impulse responses from the [MIT Impulse Response Survey](https://mcdermottlab.mit.edu/Reverb/IR_Survey.html)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Random \"unknown\" samples by cropping \"known\" samples\n", "\n", "On the device, audio is constantly streaming from the microphone. As such, there may be cases where the audio sample is only partially buffered when it is classified by the model. To account for this, the \"known\" samples are randomly cropped and added to the \"unknown\" class. This way, the model considers partially buffered \"known\" samples to be \"unknown\"." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Model Specification\n", "\n", "The model specification is a standard Python script containing everything needed to build, train, and evaluate a machine learning model in the MLTK.\n", "\n", "Refer to the [Model Specification Guide](../../docs/guides/model_specification.md) for more details about this file.\n", "\n", "The completed model specification used for this tutorial may be found on Github: [keyword_spotting_on_off_v3.py](https://github.com/siliconlabs/mltk/blob/master/mltk/models/siliconlabs/keyword_spotting_on_off_v3.py). \n", "\n", "It is recommended to copy & paste [keyword_spotting_on_off_v3.py](https://github.com/siliconlabs/mltk/blob/master/mltk/models/siliconlabs/keyword_spotting_on_off_v3.py) into your local [MLTK Python environment](https://siliconlabs.github.io/mltk/docs/installation.html#standard-python-package).\n", "\n", "The following sub-sections provide _code snippets_ from the [keyword_spotting_on_off_v3.py](https://github.com/siliconlabs/mltk/blob/master/mltk/models/siliconlabs/keyword_spotting_on_off_v3.py) model specification script:" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Define Model Object\n", "\n", "Near the top of the model specification script are the lines:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# @mltk_model\n", "class MyModel(\n", "    mltk_core.MltkModel, # We must inherit the MltkModel class\n", "    mltk_core.TrainMixin, # We also inherit the TrainMixin since we want to train this model\n", "    mltk_core.DatasetMixin, # We also need the DatasetMixin mixin to provide the relevant dataset properties\n", "    mltk_core.EvaluateClassifierMixin, # While not required, also inherit EvaluateClassifierMixin to help with generating evaluation stats for our classification model\n", "):\n", "    pass\n", "# Instantiate our custom model object\n", "# The rest of this script simply configures the properties\n", "# of our custom model object\n", "my_model = MyModel()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This defines and instantiates a custom MltkModel object with several model \"mixins\". \n", "\n", "The custom model object must inherit the [MltkModel](../../docs/python_api/mltk_model/index.md) object.\n",
\n", "Additionally, it inherits:\n", "- [TrainMixin](../../docs/python_api/mltk_model/train_mixin.md) so that we can train the model\n", "- [DatasetMixin](../../docs/python_api/mltk_model/dataset_mixin.md) so that we get additional dataset properties\n", "- [EvaluateClassifierMixin](../../docs/python_api/mltk_model/evaluate_classifier_mixin.md) so that we can evaluate the trained model\n", "\n", "The rest of the model specification script configures the various properties of our custom model object." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the general model settings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For better tracking, the version should be incremented any time a non-trivial change is made\n", "# NOTE: The version is optional and not used directly used by the MLTK\n", "my_model.version = 1 \n", "# Provide a brief description about what this model models\n", "# This description goes in the \"description\" field of the .tflite model file\n", "my_model.description = 'Keyword spotting classifier to detect: \"on\" and \"off\"'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the basic training settings\n", "\n", "Refer to the [TrainMixin](../../docs/python_api/mltk_model/train_mixin.md) for more details about each property." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# This specifies the number of times we run the training.\n", "# We just set this to a large value since we're using SteppedLearnRateScheduler\n", "# to control when training completes\n", "my_model.epochs = 9999\n", "# Specify how many samples to pass through the model\n", "# before updating the training gradients.\n", "# Typical values are 10-64\n", "# NOTE: Larger values require more memory and may not fit on your GPU\n", "my_model.batch_size = 100" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the training callbacks\n", "\n", "Refer to the [TrainMixin](../../docs/python_api/mltk_model/train_mixin.md) for more details about each property." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The MLTK enables the tf.keras.callbacks.ModelCheckpoint by default.\n", "my_model.checkpoint['monitor'] = 'val_accuracy'\n", "\n", "\n", "# We use a custom learn rate schedule that is defined in:\n", "# https://github.com/google-research/google-research/tree/master/kws_streaming\n", "my_model.train_callbacks = [\n", " tf.keras.callbacks.TerminateOnNaN(),\n", " SteppedLearnRateScheduler([\n", " (100, .001),\n", " (100, .002),\n", " (100, .003),\n", " (100, .004),\n", " (10000, .005),\n", " (10000, .002),\n", " (5000, .0005),\n", " (5000, 1e-5),\n", " (5000, 1e-6),\n", " (5000, 1e-7),\n", " ] )\n", "]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the TF-Lite Converter settings\n", "\n", "The [Tensorflow-Lite Converter](https://www.tensorflow.org/lite/convert) is used to \"quantize\" the model. \n", "The quantized model is what is eventually programmed to the embedded device.\n", "\n", "Refer to the [Model Quantization Guide](../../docs/guides/model_quantization.md) for more details." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# These are the settings used to quantize the model\n", "# We want all the internal ops as well as\n", "# model input/output to be int8\n", "my_model.tflite_converter['optimizations'] = [tf.lite.Optimize.DEFAULT]\n", "my_model.tflite_converter['supported_ops'] = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]\n", "my_model.tflite_converter['inference_input_type'] = np.int8\n", "my_model.tflite_converter['inference_output_type'] = np.int8\n", "# Automatically generate a representative dataset from the validation data\n", "my_model.tflite_converter['representative_dataset'] = 'generate'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Define the model architecture\n", "\n", "The model is based on the [Temporal efficient neural network (TENet)](https://arxiv.org/pdf/2010.09960.pdf) model architecture. \n", "> A network for processing spectrogram data using temporal and depthwise convolutions. The network treats the [T, F] spectrogram as a timeseries shaped [T, 1, F].\n", "\n", "This model was chosen because it has good accuracy for audio datasets and executes efficiently on the EFR32xG24 MCU.\n", "\n", "More details at [mltk.models.shared.tenet.TENet](../../docs/python_api/models/common_models.md#tenet)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def my_model_builder(model: MyModel) -> tf.keras.Model:\n", " \"\"\"Build the Keras model\n", " \"\"\"\n", " input_shape = model.input_shape\n", " # NOTE: This model requires the input shape: \n", " # while the embedded device expects: \n", " # Since the