keyword_spotting_alexa¶
Source code: keyword_spotting_alexa.py
Pre-trained model: keyword_spotting_alexa.mltk.zip
This model is designed to detect the keyword: “Alexa”.
It is based on the Temporal Efficient Neural Network (TENet) model architecture:
This is a keyword spotting architecture with temporal and depthwise convolutions.
This model specification script is designed to work with the Keyword Spotting Alexa tutorial.
Dataset¶
This combines several different datasets:
A synthetically generated “Alexa” dataset - Different computer-generated audio clips of the keyword “alexa”
A synthetically generated “unknown” class - Different computer-generated audio clips that sound similar to “alexa”; used for the “unknown” class; helps avoid false-positives
A subset of the MLCommons Multilingual Spoken Words dataset - Used for the “unknown” class; helps to avoid false-positives
A subset of the Mozilla Common Voice dataset - Used for the “unknown” class; helps to avoid false-positives
Preprocessing¶
This uses the mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator
to generate spectrograms with the following settings:
sample_rate: 16kHz
sample_length: 1200ms
window size: 30ms
window step: 10ms
n_channels: 108
noise_reduction_enable: true
noise_reduction_min_signal_remaining: 0.40
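These settings determine the spectrogram, and hence model input, dimensions. As a rough sketch using the same AudioFeatureGeneratorSettings API that the model specification below uses:
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings
settings = AudioFeatureGeneratorSettings()
settings.sample_rate_hz = 16000
settings.sample_length_ms = 1200
settings.window_size_ms = 30
settings.window_step_ms = 10
settings.filterbank_n_channels = 108
# 1 + (1200ms - 30ms) / 10ms = 118 time steps x 108 channels
print(settings.spectrogram_shape)  # -> (118, 108)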
Commands¶
# Do a "dry run" test training of the model
mltk train keyword_spotting_alexa-test
# Train the model
mltk train keyword_spotting_alexa
# Evaluate the trained .tflite model
mltk evaluate keyword_spotting_alexa --tflite
# Profile the model in the MVP hardware accelerator simulator
mltk profile keyword_spotting_alexa --accelerator MVP
# Profile the model on a physical development board
mltk profile keyword_spotting_alexa --accelerator MVP --device
# Run the model in the audio classifier on the local PC
mltk classify_audio keyword_spotting_alexa --verbose
# Run the model in the audio classifier on a physical device featuring an MVP hardware accelerator
mltk classify_audio keyword_spotting_alexa --device --accelerator MVP --verbose
Model Summary¶
mltk summarize keyword_spotting_alexa --tflite
+-------+-------------------+------------------+-----------------+-------------------------------------------------------+
| Index | OpCode | Input(s) | Output(s) | Config |
+-------+-------------------+------------------+-----------------+-------------------------------------------------------+
| 0 | conv_2d | 118x1x108 (int8) | 118x1x32 (int8) | Padding:Same stride:1x1 activation:None |
| | | 3x1x108 (int8) | | |
| | | 32 (int32) | | |
| 1 | conv_2d | 118x1x32 (int8) | 118x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 2 | depthwise_conv_2d | 118x1x96 (int8) | 59x1x96 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 3 | conv_2d | 59x1x96 (int8) | 59x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 4 | conv_2d | 118x1x32 (int8) | 59x1x32 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 32 (int32) | | |
| 5 | add | 59x1x32 (int8) | 59x1x32 (int8) | Activation:Relu |
| | | 59x1x32 (int8) | | |
| 6 | conv_2d | 59x1x32 (int8) | 59x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 7 | depthwise_conv_2d | 59x1x96 (int8) | 59x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 8 | conv_2d | 59x1x96 (int8) | 59x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 9 | add | 59x1x32 (int8) | 59x1x32 (int8) | Activation:Relu |
| | | 59x1x32 (int8) | | |
| 10 | conv_2d | 59x1x32 (int8) | 59x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 11 | depthwise_conv_2d | 59x1x96 (int8) | 59x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 12 | conv_2d | 59x1x96 (int8) | 59x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 13 | add | 59x1x32 (int8) | 59x1x32 (int8) | Activation:Relu |
| | | 59x1x32 (int8) | | |
| 14 | conv_2d | 59x1x32 (int8) | 59x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 15 | depthwise_conv_2d | 59x1x96 (int8) | 59x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 16 | conv_2d | 59x1x96 (int8) | 59x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 17 | add | 59x1x32 (int8) | 59x1x32 (int8) | Activation:Relu |
| | | 59x1x32 (int8) | | |
| 18 | conv_2d | 59x1x32 (int8) | 59x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 19 | depthwise_conv_2d | 59x1x96 (int8) | 30x1x96 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 20 | conv_2d | 30x1x96 (int8) | 30x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 21 | conv_2d | 59x1x32 (int8) | 30x1x32 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 32 (int32) | | |
| 22 | add | 30x1x32 (int8) | 30x1x32 (int8) | Activation:Relu |
| | | 30x1x32 (int8) | | |
| 23 | conv_2d | 30x1x32 (int8) | 30x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 24 | depthwise_conv_2d | 30x1x96 (int8) | 30x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 25 | conv_2d | 30x1x96 (int8) | 30x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 26 | add | 30x1x32 (int8) | 30x1x32 (int8) | Activation:Relu |
| | | 30x1x32 (int8) | | |
| 27 | conv_2d | 30x1x32 (int8) | 30x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 28 | depthwise_conv_2d | 30x1x96 (int8) | 30x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 29 | conv_2d | 30x1x96 (int8) | 30x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 30 | add | 30x1x32 (int8) | 30x1x32 (int8) | Activation:Relu |
| | | 30x1x32 (int8) | | |
| 31 | conv_2d | 30x1x32 (int8) | 30x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 32 | depthwise_conv_2d | 30x1x96 (int8) | 30x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 33 | conv_2d | 30x1x96 (int8) | 30x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 34 | add | 30x1x32 (int8) | 30x1x32 (int8) | Activation:Relu |
| | | 30x1x32 (int8) | | |
| 35 | conv_2d | 30x1x32 (int8) | 30x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 36 | depthwise_conv_2d | 30x1x96 (int8) | 15x1x96 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 37 | conv_2d | 15x1x96 (int8) | 15x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 38 | conv_2d | 30x1x32 (int8) | 15x1x32 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 32 (int32) | | |
| 39 | add | 15x1x32 (int8) | 15x1x32 (int8) | Activation:Relu |
| | | 15x1x32 (int8) | | |
| 40 | conv_2d | 15x1x32 (int8) | 15x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 41 | depthwise_conv_2d | 15x1x96 (int8) | 15x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 42 | conv_2d | 15x1x96 (int8) | 15x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 43 | add | 15x1x32 (int8) | 15x1x32 (int8) | Activation:Relu |
| | | 15x1x32 (int8) | | |
| 44 | conv_2d | 15x1x32 (int8) | 15x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 45 | depthwise_conv_2d | 15x1x96 (int8) | 15x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 46 | conv_2d | 15x1x96 (int8) | 15x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 47 | add | 15x1x32 (int8) | 15x1x32 (int8) | Activation:Relu |
| | | 15x1x32 (int8) | | |
| 48 | conv_2d | 15x1x32 (int8) | 15x1x96 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x32 (int8) | | |
| | | 96 (int32) | | |
| 49 | depthwise_conv_2d | 15x1x96 (int8) | 15x1x96 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x96 (int8) | | |
| | | 96 (int32) | | |
| 50 | conv_2d | 15x1x96 (int8) | 15x1x32 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x96 (int8) | | |
| | | 32 (int32) | | |
| 51 | add | 15x1x32 (int8) | 15x1x32 (int8) | Activation:Relu |
| | | 15x1x32 (int8) | | |
| 52 | average_pool_2d | 15x1x32 (int8) | 1x1x32 (int8) | Padding:Valid stride:1x15 filter:1x15 activation:None |
| 53 | reshape | 1x1x32 (int8) | 32 (int8) | Type=none |
| | | 2 (int32) | | |
| 54 | fully_connected | 32 (int8) | 2 (int8) | Activation:None |
| | | 32 (int8) | | |
| | | 2 (int32) | | |
| 55 | softmax | 2 (int8) | 2 (int8) | Type=softmaxoptions |
+-------+-------------------+------------------+-----------------+-------------------------------------------------------+
Total MACs: 4.562 M
Total OPs: 9.247 M
Name: keyword_spotting_alexa_v2
Version: 2
Description: Keyword spotting classifier to detect: "alexa"
Classes: alexa, _unknown_
Runtime memory size (RAM): 54.344 k
hash: 026c2f86bf499c3a1386c348888021e5
date: 2022-12-10T00:29:35.325Z
fe.sample_rate_hz: 16000
fe.fft_length: 512
fe.sample_length_ms: 1200
fe.window_size_ms: 30
fe.window_step_ms: 10
fe.filterbank_n_channels: 108
fe.filterbank_upper_band_limit: 7500.0
fe.filterbank_lower_band_limit: 125.0
fe.noise_reduction_enable: True
fe.noise_reduction_smoothing_bits: 10
fe.noise_reduction_even_smoothing: 0.02500000037252903
fe.noise_reduction_odd_smoothing: 0.05999999865889549
fe.noise_reduction_min_signal_remaining: 0.4000000059604645
fe.pcan_enable: False
fe.pcan_strength: 0.949999988079071
fe.pcan_offset: 80.0
fe.pcan_gain_bits: 21
fe.log_scale_enable: True
fe.log_scale_shift: 6
fe.activity_detection_enable: False
fe.activity_detection_alpha_a: 0.5
fe.activity_detection_alpha_b: 0.800000011920929
fe.activity_detection_arm_threshold: 0.75
fe.activity_detection_trip_threshold: 0.800000011920929
fe.dc_notch_filter_enable: True
fe.dc_notch_filter_coefficient: 0.949999988079071
fe.quantize_dynamic_scale_enable: True
fe.quantize_dynamic_scale_range_db: 40.0
latency_ms: 200
minimum_count: 2
average_window_duration_ms: 440
detection_threshold: 216
suppression_ms: 900
volume_gain: 0
verbose_model_output_logs: True
.tflite file size: 208.1kB
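The same summary can also be generated from Python. A minimal sketch using the MLTK's summarize_model API (the mltk summarize command is a thin wrapper around it):
from mltk.core import summarize_model
# Summarize the trained, quantized .tflite model (rather than the Keras .h5 model)
print(summarize_model('keyword_spotting_alexa', tflite=True))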
Model Profiling Report¶
# Profile on physical EFR32xG24 using MVP accelerator
mltk profile keyword_spotting_alexa --device --accelerator MVP
Profiling Summary
Name: keyword_spotting_alexa
Accelerator: MVP
Input Shape: 1x118x1x108
Input Data Type: int8
Output Shape: 1x2
Output Data Type: int8
Flash, Model File Size (bytes): 207.4k
RAM, Runtime Memory Size (bytes): 65.1k
Operation Count: 9.4M
Multiply-Accumulate Count: 4.6M
Layer Count: 56
Unsupported Layer Count: 0
Accelerator Cycle Count: 4.1M
CPU Cycle Count: 825.0k
CPU Utilization (%): 18.6
Clock Rate (hz): 78.0M
Time (s): 57.0m
Ops/s: 165.5M
MACs/s: 80.0M
Inference/s: 17.5
Model Layers
+-------+-------------------+--------+--------+------------+------------+----------+---------------------------+--------------+-------------------------------------------------------+
| Index | OpCode | # Ops | # MACs | Acc Cycles | CPU Cycles | Time (s) | Input Shape | Output Shape | Options |
+-------+-------------------+--------+--------+------------+------------+----------+---------------------------+--------------+-------------------------------------------------------+
| 0 | conv_2d | 2.5M | 1.2M | 930.7k | 11.3k | 11.8m | 1x118x1x108,32x3x1x108,32 | 1x118x1x32 | Padding:Same stride:1x1 activation:None |
| 1 | conv_2d | 759.0k | 362.5k | 307.9k | 5.2k | 3.9m | 1x118x1x32,96x1x1x32,96 | 1x118x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 2 | depthwise_conv_2d | 118.9k | 51.0k | 91.9k | 88.9k | 1.6m | 1x118x1x96,1x9x1x96,96 | 1x59x1x96 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 3 | conv_2d | 364.4k | 181.2k | 145.7k | 5.3k | 1.9m | 1x59x1x96,32x1x1x96,32 | 1x59x1x32 | Padding:Valid stride:1x1 activation:None |
| 4 | conv_2d | 126.5k | 60.4k | 52.9k | 5.1k | 690.0u | 1x118x1x32,32x1x1x32,32 | 1x59x1x32 | Padding:Same stride:2x2 activation:Relu |
| 5 | add | 1.9k | 0 | 6.6k | 2.8k | 90.0u | 1x59x1x32,1x59x1x32 | 1x59x1x32 | Activation:Relu |
| 6 | conv_2d | 379.5k | 181.2k | 154.0k | 5.2k | 2.0m | 1x59x1x32,96x1x1x32,96 | 1x59x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 7 | depthwise_conv_2d | 118.9k | 51.0k | 90.4k | 88.7k | 1.6m | 1x59x1x96,1x9x1x96,96 | 1x59x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 8 | conv_2d | 364.4k | 181.2k | 145.7k | 5.3k | 1.9m | 1x59x1x96,32x1x1x96,32 | 1x59x1x32 | Padding:Valid stride:1x1 activation:None |
| 9 | add | 1.9k | 0 | 6.6k | 2.7k | 120.0u | 1x59x1x32,1x59x1x32 | 1x59x1x32 | Activation:Relu |
| 10 | conv_2d | 379.5k | 181.2k | 154.0k | 5.2k | 2.0m | 1x59x1x32,96x1x1x32,96 | 1x59x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 11 | depthwise_conv_2d | 118.9k | 51.0k | 90.4k | 88.7k | 1.6m | 1x59x1x96,1x9x1x96,96 | 1x59x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 12 | conv_2d | 364.4k | 181.2k | 145.7k | 5.3k | 1.9m | 1x59x1x96,32x1x1x96,32 | 1x59x1x32 | Padding:Valid stride:1x1 activation:None |
| 13 | add | 1.9k | 0 | 6.6k | 2.7k | 120.0u | 1x59x1x32,1x59x1x32 | 1x59x1x32 | Activation:Relu |
| 14 | conv_2d | 379.5k | 181.2k | 154.0k | 5.2k | 2.0m | 1x59x1x32,96x1x1x32,96 | 1x59x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 15 | depthwise_conv_2d | 118.9k | 51.0k | 90.4k | 88.7k | 1.6m | 1x59x1x96,1x9x1x96,96 | 1x59x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 16 | conv_2d | 364.4k | 181.2k | 145.7k | 5.3k | 1.9m | 1x59x1x96,32x1x1x96,32 | 1x59x1x32 | Padding:Valid stride:1x1 activation:None |
| 17 | add | 1.9k | 0 | 6.6k | 2.7k | 120.0u | 1x59x1x32,1x59x1x32 | 1x59x1x32 | Activation:Relu |
| 18 | conv_2d | 379.5k | 181.2k | 154.5k | 5.2k | 2.0m | 1x59x1x32,96x1x1x32,96 | 1x59x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 19 | depthwise_conv_2d | 60.5k | 25.9k | 45.7k | 45.6k | 840.0u | 1x59x1x96,1x9x1x96,96 | 1x30x1x96 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 20 | conv_2d | 185.3k | 92.2k | 74.2k | 5.3k | 960.0u | 1x30x1x96,32x1x1x96,32 | 1x30x1x32 | Padding:Valid stride:1x1 activation:None |
| 21 | conv_2d | 64.3k | 30.7k | 27.4k | 5.1k | 390.0u | 1x59x1x32,32x1x1x32,32 | 1x30x1x32 | Padding:Same stride:2x2 activation:Relu |
| 22 | add | 960.0 | 0 | 3.4k | 2.7k | 90.0u | 1x30x1x32,1x30x1x32 | 1x30x1x32 | Activation:Relu |
| 23 | conv_2d | 193.0k | 92.2k | 78.6k | 5.2k | 1.1m | 1x30x1x32,96x1x1x32,96 | 1x30x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 24 | depthwise_conv_2d | 60.5k | 25.9k | 44.6k | 45.7k | 840.0u | 1x30x1x96,1x9x1x96,96 | 1x30x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 25 | conv_2d | 185.3k | 92.2k | 74.2k | 5.3k | 960.0u | 1x30x1x96,32x1x1x96,32 | 1x30x1x32 | Padding:Valid stride:1x1 activation:None |
| 26 | add | 960.0 | 0 | 3.4k | 2.7k | 90.0u | 1x30x1x32,1x30x1x32 | 1x30x1x32 | Activation:Relu |
| 27 | conv_2d | 193.0k | 92.2k | 78.6k | 5.2k | 1.1m | 1x30x1x32,96x1x1x32,96 | 1x30x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 28 | depthwise_conv_2d | 60.5k | 25.9k | 44.6k | 45.7k | 810.0u | 1x30x1x96,1x9x1x96,96 | 1x30x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 29 | conv_2d | 185.3k | 92.2k | 74.2k | 5.3k | 960.0u | 1x30x1x96,32x1x1x96,32 | 1x30x1x32 | Padding:Valid stride:1x1 activation:None |
| 30 | add | 960.0 | 0 | 3.4k | 2.7k | 90.0u | 1x30x1x32,1x30x1x32 | 1x30x1x32 | Activation:Relu |
| 31 | conv_2d | 193.0k | 92.2k | 78.6k | 5.2k | 1.1m | 1x30x1x32,96x1x1x32,96 | 1x30x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 32 | depthwise_conv_2d | 60.5k | 25.9k | 44.6k | 45.7k | 810.0u | 1x30x1x96,1x9x1x96,96 | 1x30x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 33 | conv_2d | 185.3k | 92.2k | 74.2k | 5.3k | 990.0u | 1x30x1x96,32x1x1x96,32 | 1x30x1x32 | Padding:Valid stride:1x1 activation:None |
| 34 | add | 960.0 | 0 | 3.4k | 2.7k | 90.0u | 1x30x1x32,1x30x1x32 | 1x30x1x32 | Activation:Relu |
| 35 | conv_2d | 193.0k | 92.2k | 78.6k | 5.2k | 1.1m | 1x30x1x32,96x1x1x32,96 | 1x30x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 36 | depthwise_conv_2d | 30.2k | 13.0k | 22.3k | 23.4k | 420.0u | 1x30x1x96,1x9x1x96,96 | 1x15x1x96 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 37 | conv_2d | 92.6k | 46.1k | 37.2k | 5.2k | 510.0u | 1x15x1x96,32x1x1x96,32 | 1x15x1x32 | Padding:Valid stride:1x1 activation:None |
| 38 | conv_2d | 32.2k | 15.4k | 13.7k | 5.1k | 240.0u | 1x30x1x32,32x1x1x32,32 | 1x15x1x32 | Padding:Same stride:2x2 activation:Relu |
| 39 | add | 480.0 | 0 | 1.7k | 2.7k | 60.0u | 1x15x1x32,1x15x1x32 | 1x15x1x32 | Activation:Relu |
| 40 | conv_2d | 96.5k | 46.1k | 39.4k | 5.2k | 570.0u | 1x15x1x32,96x1x1x32,96 | 1x15x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 41 | depthwise_conv_2d | 30.2k | 13.0k | 20.8k | 23.4k | 390.0u | 1x15x1x96,1x9x1x96,96 | 1x15x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 42 | conv_2d | 92.6k | 46.1k | 37.2k | 5.2k | 540.0u | 1x15x1x96,32x1x1x96,32 | 1x15x1x32 | Padding:Valid stride:1x1 activation:None |
| 43 | add | 480.0 | 0 | 1.7k | 2.7k | 60.0u | 1x15x1x32,1x15x1x32 | 1x15x1x32 | Activation:Relu |
| 44 | conv_2d | 96.5k | 46.1k | 39.4k | 5.2k | 540.0u | 1x15x1x32,96x1x1x32,96 | 1x15x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 45 | depthwise_conv_2d | 30.2k | 13.0k | 20.8k | 23.4k | 420.0u | 1x15x1x96,1x9x1x96,96 | 1x15x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 46 | conv_2d | 92.6k | 46.1k | 37.2k | 5.2k | 510.0u | 1x15x1x96,32x1x1x96,32 | 1x15x1x32 | Padding:Valid stride:1x1 activation:None |
| 47 | add | 480.0 | 0 | 1.7k | 2.7k | 30.0u | 1x15x1x32,1x15x1x32 | 1x15x1x32 | Activation:Relu |
| 48 | conv_2d | 96.5k | 46.1k | 39.4k | 5.2k | 570.0u | 1x15x1x32,96x1x1x32,96 | 1x15x1x96 | Padding:Valid stride:1x1 activation:Relu |
| 49 | depthwise_conv_2d | 30.2k | 13.0k | 20.8k | 23.4k | 420.0u | 1x15x1x96,1x9x1x96,96 | 1x15x1x96 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 50 | conv_2d | 92.6k | 46.1k | 37.2k | 5.2k | 510.0u | 1x15x1x96,32x1x1x96,32 | 1x15x1x32 | Padding:Valid stride:1x1 activation:None |
| 51 | add | 480.0 | 0 | 1.7k | 2.7k | 60.0u | 1x15x1x32,1x15x1x32 | 1x15x1x32 | Activation:Relu |
| 52 | average_pool_2d | 512.0 | 0 | 309.0 | 3.9k | 60.0u | 1x15x1x32 | 1x1x1x32 | Padding:Valid stride:1x15 filter:1x15 activation:None |
| 53 | reshape | 0 | 0 | 0 | 595.0 | 0 | 1x1x1x32,2 | 1x32 | Type=none |
| 54 | fully_connected | 130.0 | 64.0 | 123.0 | 2.1k | 30.0u | 1x32,2x32,2 | 1x2 | Activation:None |
| 55 | softmax | 10.0 | 0 | 0 | 2.4k | 30.0u | 1x2 | 1x2 | Type=softmaxoptions |
+-------+-------------------+--------+--------+------------+------------+----------+---------------------------+--------------+-------------------------------------------------------+
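As a quick arithmetic sanity check of the profiling summary values above:
# 57 ms per inference -> ~17.5 inferences/s (matches "Inference/s: 17.5")
print(1 / 57e-3)
# 4.6M MACs per inference over 57 ms -> ~80M MACs/s (matches "MACs/s: 80.0M")
print(4.6e6 / 57e-3)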
Model Diagram¶
mltk view keyword_spotting_alexa --tflite
Model Specification¶
# Import the Tensorflow packages
# required to build the model layout
import os
import math
from typing import Tuple, Dict, List
import numpy as np
import tensorflow as tf
import mltk.core as mltk_core
# Import the AudioFeatureGeneratorSettings which we'll configure
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings
from mltk.core.preprocess.utils import tf_dataset as tf_dataset_utils
from mltk.core.preprocess.utils import audio as audio_utils
from mltk.core.preprocess.utils import image as image_utils
from mltk.core.keras.callbacks import SteppedLearnRateScheduler
from mltk.utils.path import create_user_dir
from mltk.core.preprocess.utils import (split_file_list, shuffle_file_list_by_group)
from mltk.utils.python import install_pip_package
from mltk.utils.archive_downloader import download_verify_extract, download_url
from mltk.models.shared import tenet
##########################################################################################
# Instantiate the MltkModel instance
#
# @mltk_model
class MyModel(
mltk_core.MltkModel, # We must inherit the MltkModel class
mltk_core.TrainMixin, # We also inherit the TrainMixin since we want to train this model
mltk_core.DatasetMixin, # We also need the DatasetMixin mixin to provide the relevant dataset properties
mltk_core.EvaluateClassifierMixin, # While not required, also inherit EvaluateClassifierMixin to help with generating evaluation stats for our classification model
mltk_core.SshMixin,
):
pass
my_model = MyModel()
##########################################################################################
# General Settings
# For better tracking, the version should be incremented any time a non-trivial change is made
# NOTE: The version is optional and not directly used by the MLTK
my_model.version = 1
# Provide a brief description of what this model does
# This description goes in the "description" field of the .tflite model file
my_model.description = 'Keyword spotting classifier to detect: "alexa"'
##########################################################################################
# Training Basic Settings
# This specifies the maximum number of training epochs.
# We just set this to a large value since we're using SteppedLearnRateScheduler
# to control when training completes
my_model.epochs = 9999
# Specify how many samples to pass through the model
# before updating the training gradients.
# Typical values are 10-64
# NOTE: Larger values require more memory and may not fit on your GPU
my_model.batch_size = 100
##########################################################################################
# Define the model architecture
#
def my_model_builder(model: MyModel) -> tf.keras.Model:
"""Build the "Teacher" Keras model
"""
input_shape = model.input_shape
# NOTE: This model requires the input shape: <time, 1, features>
# while the embedded device expects: <time, features, 1>
# Since the <time> axis is still row-major, we can swap the <features> with 1 without issue
time_size, feature_size, _ = input_shape
input_shape = (time_size, 1, feature_size)
keras_model = tenet.TENet12(
input_shape=input_shape,
classes=model.n_classes
)
keras_model.compile(
loss='categorical_crossentropy',
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-8),
metrics= ['accuracy']
)
return keras_model
my_model.build_model_function = my_model_builder
my_model.keras_custom_objects['MultiScaleTemporalConvolution'] = tenet.MultiScaleTemporalConvolution
##########################################################################################
# Training callback Settings
#
# The MLTK enables the tf.keras.callbacks.ModelCheckpoint by default.
my_model.checkpoint['monitor'] = 'val_accuracy'
# We use a custom learn rate schedule that is defined in:
# https://github.com/google-research/google-research/tree/master/kws_streaming
my_model.train_callbacks = [
tf.keras.callbacks.TerminateOnNaN(),
SteppedLearnRateScheduler([
(100, .001),
(100, .002),
(100, .003),
(100, .004),
(30000, .005),
(30000, .002),
(20000, .0005),
(10000, 1e-5),
(5000, 1e-6),
(5000, 1e-7),
] )
]
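# NOTE: Assuming each (step_count, learn_rate) pair above holds the given rate for
# that many training steps, the full schedule runs for
# 100+100+100+100+30000+30000+20000+10000+5000+5000 = 100,400 steps (batches)
# before training completes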
##########################################################################################
# Specify AudioFeatureGenerator Settings
# See https://siliconlabs.github.io/mltk/docs/audio/audio_feature_generator.html
#
frontend_settings = AudioFeatureGeneratorSettings()
frontend_settings.sample_rate_hz = 16000
frontend_settings.sample_length_ms = 1200 # Use 1.2s audio clips to ensure the full "alexa" keyword is captured
frontend_settings.window_size_ms = 30
frontend_settings.window_step_ms = 10
frontend_settings.filterbank_n_channels = 108 # We want this value to be as large as possible
# while still allowing for the ML model to execute efficiently on the hardware
frontend_settings.filterbank_upper_band_limit = 7500.0
frontend_settings.filterbank_lower_band_limit = 125.0 # The dev board mic seems to have a lot of noise at lower frequencies
frontend_settings.noise_reduction_enable = True # Enable the noise reduction block to help ignore background noise in the field
frontend_settings.noise_reduction_smoothing_bits = 10
frontend_settings.noise_reduction_even_smoothing = 0.025
frontend_settings.noise_reduction_odd_smoothing = 0.06
frontend_settings.noise_reduction_min_signal_remaining = 0.40 # This value is fairly large (which makes the background noise reduction small)
# But it has been found to still give good results
# i.e. There is still some background noise reduction,
# but the actual signal is still (mostly) untouched
frontend_settings.dc_notch_filter_enable = True # Enable the DC notch filter, to help remove the DC signal from the dev board's mic
frontend_settings.dc_notch_filter_coefficient = 0.95
frontend_settings.quantize_dynamic_scale_enable = True # Enable dynamic quantization, this dynamically converts the uint16 spectrogram to int8
frontend_settings.quantize_dynamic_scale_range_db = 40.0
# Add the Audio Feature generator settings to the model parameters
# This way, they are included in the generated .tflite model file
# See https://siliconlabs.github.io/mltk/docs/guides/model_parameters.html
my_model.model_parameters.update(frontend_settings)
##########################################################################################
# Specify the other dataset settings
#
my_model.input_shape = frontend_settings.spectrogram_shape + (1,)
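# e.g. with the frontend settings above, spectrogram_shape is (118, 108),
# so the model input shape is (118, 108, 1), i.e. <time, features, 1>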
# Add the "alexa" keyword plus a _unknown_ meta class
my_model.classes = ['alexa', '_unknown_']
unknown_class_id = my_model.classes.index('_unknown_')
# Ensure the class weights are balanced during training
# https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df
my_model.class_weights = 'balanced'
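# NOTE: With 'balanced', each class is weighted roughly by
#   n_samples / (n_classes * n_samples_in_class)
# (comparable to sklearn's compute_class_weight('balanced', ...))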
##########################################################################################
# TF-Lite converter settings
#
my_model.tflite_converter['optimizations'] = [tf.lite.Optimize.DEFAULT]
my_model.tflite_converter['supported_ops'] = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
my_model.tflite_converter['inference_input_type'] = np.int8
my_model.tflite_converter['inference_output_type'] = np.int8
# Automatically generate a representative dataset from the validation data
my_model.tflite_converter['representative_dataset'] = 'generate'
validation_split = 0.10
unknown_class_multiplier = 1.5 # This controls how many more "unknown" samples there are relative to the "known" samples
# Uncomment this to dump the augmented audio samples to the log directory
# DO NOT forget to disable this before training the model as it will generate A LOT of data
#data_dump_dir = my_model.create_log_dir('dataset_dump')
# This is the directory where the dataset will be extracted
dataset_dir = create_user_dir('datasets/alexa')
##########################################################################################
# Create the audio augmentation pipeline
#
# Install the other 3rd party packages required for preprocessing
install_pip_package('audiomentations')
import librosa
import audiomentations
def audio_pipeline_with_augmentations(
path_batch:np.ndarray,
label_batch:np.ndarray,
unknown_samples_batch:np.ndarray,
seed:np.ndarray
) -> np.ndarray:
"""Augment a batch of audio clips and generate spectrograms
This does the following, for each audio file path in the input batch:
1. Read audio file
2. Adjust its length to fit within the specified length
3. Apply random augmentations to the audio sample using audiomentations
4. Convert to the specified sample rate (if necessary)
5. Generate a spectrogram from the augmented audio sample
6. Dump the augmented audio and spectrogram (if necessary)
NOTE: This will be executed in parallel across *separate* subprocesses.
Arguments:
path_batch: Batch of audio file paths
label_batch: Batch of corresponding labels
unknown_samples_batch: Batch of randomly selected "unknown" sample file paths
seed: Batch of seeds to use for random number generation,
This ensures that the "random" augmentations are reproducible
Return:
Generated batch of spectrograms from augmented audio samples
"""
batch_length = path_batch.shape[0]
height, width = frontend_settings.spectrogram_shape
x_shape = (batch_length, height, 1, width)
x_batch = np.empty(x_shape, dtype=np.int8)
# This is the amount of padding we add to the beginning of the sample
# This allows for "warming up" the noise reduction block
padding_length_ms = 1000
padded_frontend_settings = frontend_settings.copy()
padded_frontend_settings.sample_length_ms += padding_length_ms
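# e.g. 1200ms + 1000ms = 2200ms of audio is processed per sample;
# only the spectrogram rows covering the final 1200ms are kept for training (see below)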
# For each audio sample path in the current batch
for i, (audio_path, labels, unknown_sample) in enumerate(zip(path_batch, label_batch, unknown_samples_batch)):
class_id = np.argmax(labels)
np.random.seed(seed[i])
rn = np.random.random()
use_cropped_sample_as_unknown = False
using_silence_as_unknown = False
# 15% of the time we want to replace this sample
# with either silence or a cropped "known" sample
if class_id == unknown_class_id and rn < 0.15:
# 8% of the time we want to replace an "unknown" sample with silence
if rn < .08:
using_silence_as_unknown = True
original_sample_rate = frontend_settings.sample_rate_hz
sample = np.zeros((original_sample_rate,), dtype=np.float32)
else:
# Otherwise, find a "known" sample in the current batch
# Later, we'll crop this sample and use it as an "unknown" sample
choices = list(range(batch_length))
np.random.shuffle(choices)
for choice_index in choices:
if np.argmax(label_batch[choice_index]) >= unknown_class_id:
continue
audio_path = path_batch[choice_index]
use_cropped_sample_as_unknown = True
break
if not using_silence_as_unknown:
if class_id == unknown_class_id:
audio_path = unknown_sample
# Read the audio file
try:
sample, original_sample_rate = audio_utils.read_audio_file(audio_path, return_numpy=True, return_sample_rate=True)
except Exception as e:
raise RuntimeError(f'Failed to read: {audio_path}, err: {e}')
# Create a buffer to hold the padded sample
padding_length = int((original_sample_rate * padding_length_ms) / 1000)
padded_sample_length = int((original_sample_rate * padded_frontend_settings.sample_length_ms) / 1000)
padded_sample = np.zeros((padded_sample_length,), dtype=np.float32)
# If we want to crop a "known" sample and use it as an unknown sample
if use_cropped_sample_as_unknown:
# Trim any silence from the sample
trimmed_sample, _ = librosa.effects.trim(sample, top_db=15)
# Randomly insert 20% to 50% of the trimmed sample into padded sample buffer
# Note that the entire trimmed sample is actually added to the padded sample buffer
# However, only the part of the sample that is after padding_length_ms will actually be used.
# Everything before will eventually be dropped
trimmed_sample_length = min(len(trimmed_sample), padded_sample_length)
cropped_sample_percent = np.random.uniform(.2, .5)
cropped_sample_length = int(trimmed_sample_length * cropped_sample_percent)
if cropped_sample_length > .100 * original_sample_rate:
# Add the beginning of the sample to the end of the padded sample buffer.
# This simulates the sample streaming into the audio buffer,
# but not being fully streamed in when an inference is invoked on the device.
# In this case, we want the partial sample to be considered "unknown".
padded_sample[-cropped_sample_length:] += trimmed_sample[:cropped_sample_length]
else:
# Otherwise, adjust the audio clip to the length defined in the frontend_settings
out_length = int((original_sample_rate * frontend_settings.sample_length_ms) / 1000)
sample = audio_utils.adjust_length(
sample,
out_length=out_length,
trim_threshold_db=30,
offset=np.random.uniform(0, 1)
)
padded_sample[padding_length:padding_length+len(sample)] += sample
# Initialize the global audio augmentations instance
# NOTE: We want this to be global so that we only initialize it once per subprocess
audio_augmentations = globals().get('audio_augmentations', None)
if audio_augmentations is None:
audio_augmentations = audiomentations.Compose(
p=1.0,
transforms=[
audiomentations.Gain(min_gain_in_db=0.95, max_gain_in_db=1.5, p=1.0),
audiomentations.AddBackgroundNoise(
f'{dataset_dir}/_background_noise_/brd2601',
min_absolute_rms_in_db=-75.0,
max_absolute_rms_in_db=-60.0,
noise_rms="absolute",
lru_cache_size=50,
p=1.0
),
audiomentations.AddBackgroundNoise(
f'{dataset_dir}/_background_noise_/ambient',
min_snr_in_db=-2, # The lower the SNR, the louder the background noise
max_snr_in_db=35,
noise_rms="relative",
lru_cache_size=50,
p=0.95
),
audiomentations.AddGaussianSNR(min_snr_in_db=30, max_snr_in_db=60, p=0.25),
])
globals()['audio_augmentations'] = audio_augmentations
# Apply random augmentations to the audio sample
augmented_sample = audio_augmentations(padded_sample, original_sample_rate)
# Convert the sample rate (if necessary)
if original_sample_rate != frontend_settings.sample_rate_hz:
augmented_sample = audio_utils.resample(
augmented_sample,
orig_sr=original_sample_rate,
target_sr=frontend_settings.sample_rate_hz
)
# Ensure the sample values are within (-1,1)
augmented_sample = np.clip(augmented_sample, -1.0, 1.0)
# Generate a spectrogram from the augmented audio sample
spectrogram = audio_utils.apply_frontend(
sample=augmented_sample,
settings=padded_frontend_settings,
dtype=np.int8
)
# The input audio sample was padded with padding_length_ms of background noise
# Drop the background noise from the final spectrogram used for training
spectrogram = spectrogram[-height:, :]
# The output spectrogram is 2D; add a dimension to make it 3D,
# converting the spectrogram from
# <time, features> to
# <time, 1, features>
spectrogram = np.expand_dims(spectrogram, axis=-2)
x_batch[i] = spectrogram
# Dump the augmented audio sample AND corresponding spectrogram (if necessary)
data_dump_dir = globals().get('data_dump_dir', None)
if data_dump_dir:
try:
from cv2 import cv2
except:
import cv2
fn = os.path.basename(audio_path.decode('utf-8'))
audio_dump_path = f'{data_dump_dir}/{class_id}-{fn[:-4]}-{seed[0]}.wav'
spectrogram_dumped = np.squeeze(spectrogram, axis=-2)
# Transpose to put the time on the x-axis
spectrogram_dumped = np.transpose(spectrogram_dumped)
# Convert from int8 to uint8
spectrogram_dumped = np.clip(spectrogram_dumped +128, 0, 255)
spectrogram_dumped = spectrogram_dumped.astype(np.uint8)
# Increase the size of the spectrogram to make it easier to see as a jpeg
spectrogram_dumped = cv2.resize(spectrogram_dumped, (height*3,width*3))
valid_sample_length = int((frontend_settings.sample_length_ms * frontend_settings.sample_rate_hz) / 1000)
valid_augmented_sample = augmented_sample[-valid_sample_length:]
audio_dump_path = audio_utils.write_audio_file(
audio_dump_path,
valid_augmented_sample,
sample_rate=frontend_settings.sample_rate_hz
)
image_dump_path = audio_dump_path.replace('.wav', '.jpg')
jpg_data = cv2.applyColorMap(spectrogram_dumped, cv2.COLORMAP_HOT)
cv2.imwrite(image_dump_path, jpg_data)
return x_batch
##########################################################################################
# Define the MltkDataset object
# NOTE: This class is optional but is useful for organizing the code
#
class MyDataset(mltk_core.MltkDataset):
def __init__(self):
super().__init__()
self.pools = []
self.all_unknown_samples = []
self.summary = ''
def summarize_dataset(self) -> str:
"""Return a string summary of the dataset"""
s = self.summary
s += mltk_core.MltkDataset.summarize_class_counts(my_model.class_counts)
return s
def load_dataset(
self,
subset: str,
test:bool = False,
**kwargs
) -> Tuple[tf.data.Dataset, None, tf.data.Dataset]:
"""Load the dataset subset
This is called automatically by the MLTK before training
or evaluation.
Args:
subset: The dataset subset to return: 'training' or 'evaluation'
test: This is optional, it is used when invoking a training "dryrun", e.g.: mltk train keyword_spotting_alexa-test
If this is true, then only return a small portion of the dataset for testing purposes
Return:
if subset == training:
A tuple, (train_dataset, None, validation_dataset)
else:
validation_dataset
"""
if subset == 'training':
x = self.load_subset('training', test=test)
validation_data = self.load_subset('validation', test=test)
return x, None, validation_data
else:
x = self.load_subset('validation', test=test)
return x
def unload_dataset(self):
"""Unload the dataset by shutting down the processing pools"""
for pool in self.pools:
pool.shutdown()
self.pools.clear()
def load_subset(self, subset:str, test:bool) -> tf.data.Dataset:
"""Load the subset"""
if subset in ('validation', 'evaluation'):
split = (0, validation_split)
elif subset == 'training':
split = (validation_split, 1)
data_dump_dir = globals().get('data_dump_dir', None)
if data_dump_dir:
print(f'\n\n*** Dumping augmented samples to: {data_dump_dir}\n\n')
else:
split = None
my_model.class_counts = {}
# Download the synthetic "alexa" dataset and extract into the dataset directory
download_verify_extract(
url='https://www.dropbox.com/s/b6nd8xr7zzwmd6d/sl_synthetic_alexa.7z?dl=1',
dest_dir=dataset_dir,
file_hash='e657e91d6ea55639ce2e9a4dd8994c112fda2de0',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the synthetic alexa "unknown" dataset and extract into the dataset sub-directory: '_unknown'
download_verify_extract(
url='https://www.dropbox.com/s/86wh4defrqj0n9r/sl_synthetic_alexa_unknown.7z?dl=1',
dest_dir=f'{dataset_dir}/_unknown',
file_hash='2693e5fc72c52f199de2a69ed720644c2c363591',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the synthetic generic "unknown" dataset and extract into the dataset sub-directory: '_unknown'
download_verify_extract(
url='https://www.dropbox.com/s/zwvztg39a340b5q/sl_synthetic_generic_unknown.7z?dl=1',
dest_dir=f'{dataset_dir}/_unknown',
file_hash='6729b4763a506e427beb0909069219767f3d0d6f',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the mlcommons subset and extract into the dataset sub-directory: '_unknown/mlcommons_keywords'
download_verify_extract(
url='https://www.dropbox.com/s/j4p9w4h92e8rruo/mlcommons_keywords_subset_part1.7z?dl=1',
dest_dir=f'{dataset_dir}/_unknown/mlcommons_keywords',
file_hash='6f515d8247e2fee70cd0941420918c8fe57a31e8',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the mlcommons subset and extract into the dataset sub-directory: '_unknown/mlcommons_keywords'
download_verify_extract(
url='https://www.dropbox.com/s/zacujsccjgk92b2/mlcommons_keywords_subset_part2.7z?dl=1',
dest_dir=f'{dataset_dir}/_unknown/mlcommons_keywords',
file_hash='7816f5ffa1deeafa9b5b3faae563f44198031796',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the mlcommons voice and extract into the dataset sub-directory: '_unknown/mlcommons_voice'
download_verify_extract(
url='https://www.dropbox.com/s/l9uxyr22w3jgenc/common_voice_subset.7z?dl=1',
dest_dir=f'{dataset_dir}/_unknown/mlcommons_voice',
file_hash='ce424afd5d9b754f3ea6b3a4f78304f48e865f93',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False
)
# Download the BRD2601 background microphone audio and add it to the _background_noise_/brd2601 of the dataset
download_verify_extract(
url='https://github.com/SiliconLabs/mltk_assets/raw/master/datasets/brd2601_background_audio.7z',
dest_dir=f'{dataset_dir}/_background_noise_/brd2601',
file_hash='3069A85002965A7830C660343C215EDD4FAE39C6',
show_progress=False,
remove_root_dir=False,
clean_dest_dir=False,
)
# Download other ambient background audio and add it to the _background_noise_/ambient of the dataset
# See https://mixkit.co/
URLS = [
'https://assets.mixkit.co/sfx/download/mixkit-very-crowded-pub-or-party-loop-360.wav',
'https://assets.mixkit.co/sfx/download/mixkit-big-crowd-talking-loop-364.wav',
'https://assets.mixkit.co/sfx/download/mixkit-restaurant-crowd-talking-ambience-444.wav',
'https://assets.mixkit.co/sfx/download/mixkit-keyboard-typing-1386.wav',
'https://assets.mixkit.co/sfx/download/mixkit-office-ambience-447.wav',
'https://assets.mixkit.co/sfx/download/mixkit-hotel-lobby-with-dining-area-ambience-453.wav'
]
for url in URLS:
fn = os.path.basename(url)
dst_path = f'{dataset_dir}/_background_noise_/ambient/{fn}'
os.makedirs(os.path.dirname(dst_path), exist_ok=True)
if not os.path.exists(dst_path):
download_url(url=url, dst_path=dst_path)
sample, original_sample_rate = audio_utils.read_audio_file(
dst_path,
return_sample_rate=True,
return_numpy=True
)
sample = audio_utils.resample(
sample,
orig_sr=original_sample_rate,
target_sr=frontend_settings.sample_rate_hz
)
audio_utils.write_audio_file(dst_path, sample, sample_rate=16000)
# Create a tf.data.Dataset from the extracted dataset directory
max_samples_per_class = my_model.batch_size if test else -1
class_counts = my_model.class_counts[subset] if subset else my_model.class_counts
features_ds, labels_ds = tf_dataset_utils.load_audio_directory(
directory=dataset_dir,
classes=my_model.classes,
onehot_encode=True, # We're using categorical cross-entropy so one-hot encode the labels
shuffle=True,
seed=42,
max_samples_per_class=max_samples_per_class,
unknown_class_percentage=0, # We manually populate the "_unknown_" class in the add_unknown_samples() callback
split=split,
return_audio_data=False, # We only want to return the file paths
class_counts=class_counts,
list_valid_filenames_in_directory_function=self.list_valid_filenames_in_directory,
process_samples_function=self.add_unknown_samples
)
# While training, the "unknown" class has a fixed number of samples.
# However, the actual number of available "unknown" samples is much larger than the class size.
# As such, we shuffle the unknown samples and randomly select from all of them while training.
unknown_samples_ds = tf.data.Dataset.from_tensor_slices(self.all_unknown_samples)
unknown_samples_ds = unknown_samples_ds.shuffle(max(len(self.all_unknown_samples), 10000), reshuffle_each_iteration=True)
self.summary += f'{subset} subset shuffling {len(self.all_unknown_samples)} "unknown" samples\n'
self.all_unknown_samples = []
if subset:
per_job_batch_multiplier = 1000
per_job_batch_size = my_model.batch_size * per_job_batch_multiplier
# We use an incrementing counter as the seed for the random augmentations
# This helps to keep the training reproducible
seed_counter = tf.data.experimental.Counter()
features_ds = features_ds.zip((features_ds, labels_ds, unknown_samples_ds, seed_counter))
# Usage of tf_dataset_utils.parallel_process()
# is optional, but can speed-up training as the data augmentations
# are spread across the available CPU cores.
# Each CPU core gets its own subprocess,
# and each subprocess executes audio_pipeline_with_augmentations() on batches of the dataset.
features_ds = features_ds.batch(per_job_batch_size // per_job_batch_multiplier, drop_remainder=True)
labels_ds = labels_ds.batch(per_job_batch_size // per_job_batch_multiplier, drop_remainder=True)
features_ds, pool = tf_dataset_utils.parallel_process(
features_ds,
audio_pipeline_with_augmentations,
dtype=np.int8,
#n_jobs=84 if subset == 'training' else 32, # These are the settings for a 256 CPU core cloud machine
n_jobs=72 if subset == 'training' else 32, # These are the settings for a 128 CPU core cloud machine
#n_jobs=44 if subset == 'training' else 16, # These are the settings for a 96 CPU core cloud machine
#n_jobs=50 if subset == 'training' else 25, # These are the settings for a 84 CPU core cloud machine
#n_jobs=36 if subset == 'training' else 12, # These are the settings for a 64 CPU core cloud machine
#n_jobs=28 if subset == 'training' else 16, # These are the settings for a 48 CPU core cloud machine
#n_jobs=.65 if subset == 'training' else .35,
#n_jobs=1,
name=subset,
)
self.pools.append(pool)
features_ds = features_ds.unbatch()
labels_ds = labels_ds.unbatch()
# Pre-fetching batches can help with throughput
features_ds = features_ds.prefetch(per_job_batch_size)
# Combine the augmented audio samples with their corresponding labels
ds = tf.data.Dataset.zip((features_ds, labels_ds))
# Shuffle the data for each sample
# A perfect shuffle would use n_samples but this can slow down training,
# so we just shuffle batches of the data
#ds = ds.shuffle(n_samples, reshuffle_each_iteration=True)
ds = ds.shuffle(per_job_batch_size, reshuffle_each_iteration=True)
# At this point we have a flat dataset of x,y tuples
# Batch the data as necessary for training
ds = ds.batch(my_model.batch_size)
# Pre-fetch a couple training batches to aid throughput
ds = ds.prefetch(2)
return ds
def list_valid_filenames_in_directory(
self,
base_directory:str,
search_class:str,
white_list_formats:List[str],
split:float,
follow_links:bool,
shuffle_index_directory:str
) -> Tuple[str, List[str]]:
"""Return a list of valid file names for the given class
This is called by the tf_dataset_utils.load_audio_directory() API.
# This uses shuffle_file_list_by_group() helper function so that the same "voices"
# are only present in a particular subset.
"""
assert shuffle_index_directory is None, 'Shuffling the index is not supported by this dataset'
file_list = []
index_path = f'{base_directory}/.index/{search_class}.txt'
# If the index file exists, then read it
if os.path.exists(index_path):
with open(index_path, 'r') as f:
for line in f:
file_list.append(line.strip())
else:
# Else find all files for the given class in the search directory
class_base_dir = f'{base_directory}/{search_class}/'
for root, _, files in os.walk(base_directory, followlinks=follow_links):
root = root.replace('\\', '/') + '/'
if not root.startswith(class_base_dir):
continue
for fname in files:
if not fname.lower().endswith(white_list_formats):
continue
abs_path = os.path.join(root, fname)
if os.path.getsize(abs_path) == 0:
continue
rel_path = os.path.relpath(abs_path, base_directory)
file_list.append(rel_path.replace('\\', '/'))
# Shuffle the voice groups
# then flatten into list
# This way, when the list is split into training and validation sets
# the same voice only appears in one subset
file_list = shuffle_file_list_by_group(file_list, get_sample_group_id_from_path)
# Write the file list file
mltk_core.get_mltk_logger().info(f'Generating index for "{search_class}" ({len(file_list)} samples): {index_path}')
os.makedirs(os.path.dirname(index_path), exist_ok=True)
with open(index_path, 'w') as f:
for p in file_list:
f.write(p + '\n')
if len(file_list) == 0:
raise RuntimeError(f'No samples found for class: {search_class}')
n_files = len(file_list)
if split[0] == 0:
start = 0
stop = math.ceil(split[1] * n_files)
# We want to ensure the same person isn't in both subsets
# So, ensure that the split point does NOT
# split with file names with the same hash
# recall: same hash = same person saying word
# Get the hash of the other subset
other_subset_hash = get_sample_group_id_from_path(file_list[stop])
# Keep moving the 'stop' index back while
# its hash matches the other subset's
while stop > 0 and get_sample_group_id_from_path(file_list[stop-1]) == other_subset_hash:
stop -= 1
else:
start = math.ceil(split[0] * n_files)
# Get the hash of this subset
this_subset_hash = get_sample_group_id_from_path(file_list[start])
# Keep moving the 'start' index back while
# its hash matches this subset's
while start > 0 and get_sample_group_id_from_path(file_list[start-1]) == this_subset_hash:
start -= 1
stop = n_files
filenames = file_list[start:stop]
return search_class, filenames
def add_unknown_samples(
self,
directory:str,
sample_paths:Dict[str,str], # A dictionary: <class name>, [<sample paths relative to directory>],
split:Tuple[float,float],
follow_links:bool,
white_list_formats:List[str],
shuffle:bool,
seed:int,
**kwargs
):
"""Generate a list of all possible "unknown" samples for this given subset.
Then populate the "_unknown_" class with an empty list of length: unknown_class_multiplier * len(<alexa class>)
The empty values will be dynamically populated from randomly chosen entries in the full "unknown" sample list.
"""
unknown_dir = f'{dataset_dir}/_unknown/unknown'
mlcommons_keywords_dir = f'{dataset_dir}/_unknown/mlcommons_keywords'
mlcommons_voice_dir = f'{dataset_dir}/_unknown/mlcommons_voice'
# Create a list of all possible "unknown" samples
file_list = list([f'_unknown/unknown/{x}' for x in os.listdir(unknown_dir) if x.endswith('.wav') and os.path.getsize(f'{unknown_dir}/{x}') > 0])
# Add all the mlcommons_keywords "unknown" samples that are not a "known" keyword
for kw in os.listdir(mlcommons_keywords_dir):
if kw in my_model.classes:
continue
d = f'{mlcommons_keywords_dir}/{kw}'
if not os.path.isdir(d):
continue
for fn in os.listdir(d):
if fn.endswith('.wav'):
file_list.append(f'_unknown/mlcommons_keywords/{kw}/{fn}')
# The MLCommons voice dataset contains samples of people speaking sentences.
# Determine how long each sample is and add it that many times to the file list.
# This way, we can randomly choose different parts of each sample
for fn in os.listdir(mlcommons_voice_dir):
if fn.endswith('.wav'):
p = f'{mlcommons_voice_dir}/{fn}'
multiplier = max(1, os.path.getsize(p) // (2 * 16000))
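# NOTE: Assuming 16-bit PCM at 16kHz, file_size // (2 * 16000) is roughly the
# clip length in seconds, so e.g. a ~10s clip is added ~10 times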
for _ in range(multiplier):
file_list.append(f'_unknown/mlcommons_voice/{fn}')
# Sort the unknown samples by "voice"
# This helps to ensure voices are only present in a given subset
file_list = sorted(file_list)
file_list = shuffle_file_list_by_group(file_list, get_sample_group_id_from_path)
# Split the file list for the current subset
file_list = split_file_list(file_list, split)
# Populate the "_unknown_" class with empty strings
# The number of "_unknown_" entries is: <# of known samples> * unknown_class_multiplier
# The empty strings are dynamically populated in audio_pipeline_with_augmentations()
# with randomly selected "unknown" samples
for key,value in sample_paths.items():
if key != '_unknown_':
sample_paths['_unknown_'] = [''] * int(len(value) * unknown_class_multiplier)
break
self.all_unknown_samples = [f'{directory}/{x}' for x in file_list]
def get_sample_group_id_from_path(p:str) -> str:
"""Extract the "voice hash" from the sample path.
"""
fn = os.path.basename(p)
fn = fn.replace('.wav', '').replace('.mp3', '')
# If this sample is from the Google speech commands dataset
# c53b335a_nohash_1.wav -> c53b335a
if '_nohash_' in fn:
toks = fn.split('_')
return toks[0]
# If this sample is from an mlcommons dataset
# common_voice_en_20127845.wav -> 20127845
if fn.startswith('common_voice_'):
toks = fn.split('_')
return toks[-1]
# If this sample is from a silabs synthetic dataset
# azure_af-ZA+AdriNeural+None+aww+medium+low+588b6ace.wav -> 588b6ace
if fn.startswith(('gcp_', 'azure_', 'aws_')):
toks = fn.split('+')
return toks[-1]
raise RuntimeError(f'Failed to get voice hash from {p}')
my_model.dataset = MyDataset()
#################################################
# Audio Classifier Settings
#
# These are additional parameters to include in
# the generated .tflite model file.
# The settings are used by the ble_audio_classifier app
# NOTE: Corresponding command-line options will override these values.
# This is the amount of time in milliseconds between audio processing loops
# Since we're using the audio detection block, we want this to be as short as possible
my_model.model_parameters['latency_ms'] = 200
# The minimum number of inference results to average when calculating the detection value
my_model.model_parameters['minimum_count'] = 2
# Controls the smoothing.
# Drop all inference results that are older than <now> minus window_duration
# Longer durations (in milliseconds) will give a higher confidence that the results are correct, but may miss some commands
my_model.model_parameters['average_window_duration_ms'] = int(my_model.model_parameters['latency_ms']*my_model.model_parameters['minimum_count']*1.1)
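# e.g. 200ms * 2 * 1.1 = 440ms, which appears as "average_window_duration_ms" in the model summary above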
# Define the detection threshold (on a 0-255 scale)
my_model.model_parameters['detection_threshold'] = int(.80*255)
# Amount of milliseconds to wait after a keyword is detected before detecting the SAME keyword again
# A different keyword may be detected immediately after
my_model.model_parameters['suppression_ms'] = 900
# Set the volume gain scaler (i.e. amplitude) to apply to the microphone data. If 0 or omitted, no scaler is applied
my_model.model_parameters['volume_gain'] = 0
# Enable verbose inference results
my_model.model_parameters['verbose_model_output_logs'] = False
# Uncomment this to increase the baud rate
# NOTE: You must use Simplicity Studio to increase the baud rate on the dev board as well
#my_model.model_parameters['baud_rate'] = 460800
##########################################################################################
# The following allows for running this model training script directly, e.g.:
# python keyword_spotting_alexa.py
#
# Note that this has the same functionality as:
# mltk train keyword_spotting_alexa
#
if __name__ == '__main__':
from mltk import cli
# Setup the CLI logger
cli.get_logger(verbose=True)
# If this is true then this will do a "dry run" of the model training
# If this is false, then the model will be fully trained
test_mode_enabled = True
# Train the model
# This does the same as issuing the command: mltk train keyword_spotting_alexa-test --clean
train_results = mltk_core.train_model(my_model, clean=True, test=test_mode_enabled)
print(train_results)
# Evaluate the model against the trained .h5 (i.e. float32) model
# This does the same as issuing the command: mltk evaluate keyword_spotting_alexa-test
tflite_eval_results = mltk_core.evaluate_model(my_model, verbose=True, test=test_mode_enabled)
print(tflite_eval_results)
# Profile the model in the simulator
# This does the same as issuing the command: mltk profile keyword_spotting_alexa-test
profiling_results = mltk_core.profile_model(my_model, test=test_mode_enabled)
print(profiling_results)