keyword_spotting_pacman_v3
This model is a CNN classifier to detect the keywords:
left
right
up
down
stop
go
It is specially trained to handle the background noise generated by the video game “Pac-Man”.
This model specification script is designed to work with the Keyword Spotting Pac-Man tutorial.
Source code: keyword_spotting_pacman_v3.py
Pre-trained model: keyword_spotting_pacman_v3.mltk.zip
Dataset
This model was trained using several different datasets:
mltk.datasets.audio.direction_commands - Synthetically generated keywords: left, right, up, down, stop, go
mltk.datasets.audio.speech_commands_v2 - Human generated keywords: left, right, up, down, stop, go
mltk.datasets.audio.mlcommons.ml_commons_keyword - Large collection of keywords, random subset used for unknown class
mltk.datasets.audio.background_noise.esc50 - Collection of various noises, random subset used for unknown class
mltk.datasets.audio.background_noise.ambient - Collection of various background noises, mixed into other samples for augmentation
mltk.datasets.audio.background_noise.brd2601 - “Silence” recorded by BRD2601 microphone, mixed into other samples to make them “sound” like they came from the BRD2601’s microphone
Pac-Man game noise - Recording from Pac-Man game play, mixed into other samples for augmentation
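At its simplest, mixing a background recording into a keyword sample is a scaled, element-wise addition. The following is a minimal sketch of that idea in plain Python. It is not the MLTK's actual augmentation code, and the function name `mix_background` and the `noise_gain` parameter are hypothetical, chosen only for illustration.

```python
from typing import List


def mix_background(sample: List[float], noise: List[float], noise_gain: float) -> List[float]:
    """Add a scaled background recording to an audio sample.

    Hypothetical helper for illustration only.
    Both inputs are float audio in the range [-1.0, 1.0].
    The noise is looped if it is shorter than the sample.
    """
    if not noise:
        return list(sample)
    mixed = []
    for i, value in enumerate(sample):
        noisy = value + noise_gain * noise[i % len(noise)]
        # Clip so the result stays within the valid audio range
        mixed.append(max(-1.0, min(1.0, noisy)))
    return mixed


keyword = [0.0, 0.5, -0.5, 0.9]
game_noise = [0.2, -0.2]
augmented = mix_background(keyword, game_noise, noise_gain=0.5)
```

A lower `noise_gain` keeps the keyword dominant, while a higher one produces harder training samples.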
Hint
Uncomment the line:
#data_dump_dir = my_model.create_log_dir('dataset_dump')
to dump the augmented audio samples and their corresponding spectrograms. This is useful for seeing how the augmentations affect the samples during training. WARNING: This generates a large number of files, so be sure to disable it during actual model training.
Dataset Summary
Dataset subset: training, found 62865 samples:
left: 8382
right: 8382
up: 8382
down: 8382
stop: 8382
go: 8382
_unknown_: 12573
Dataset subset: validation, found 6435 samples:
left: 858
right: 858
up: 858
down: 858
stop: 858
go: 858
_unknown_: 1287
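The counts above can be checked with a few lines of Python. This sketch only reuses the numbers printed in the summary; the helper name `class_fractions` is made up for illustration.

```python
from typing import Dict

KEYWORDS = ["left", "right", "up", "down", "stop", "go"]

# Per-class sample counts copied from the dataset summary
training = {name: 8382 for name in KEYWORDS}
training["_unknown_"] = 12573
validation = {name: 858 for name in KEYWORDS}
validation["_unknown_"] = 1287


def class_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's share of the subset (hypothetical helper)."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


train_total = sum(training.values())
validation_total = sum(validation.values())
train_fractions = class_fractions(training)
validation_share = validation_total / (train_total + validation_total)
```

Under these counts, each keyword makes up about 13.3% of a subset and `_unknown_` makes up 20%, i.e. 1.5 times the size of a single keyword class. The validation subset is roughly 9% of all samples.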
Preprocessing
The audio samples are converted to spectrograms using mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator.
The following settings are used:
sample_rate: 16kHz
sample_length: 1s
window size: 30ms
window step: 10ms
n_channels: 104
upper_band_limit: 7500.0
lower_band_limit: 125.0
noise_reduction_enable: True
noise_reduction_min_signal_remaining: 0.40
dc_notch_filter_enable: True
dc_notch_filter_coefficient: 0.95
quantize_dynamic_scale_enable: True
quantize_dynamic_scale_range_db: 40.0
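These settings determine the spectrogram dimensions the model consumes. The sketch below uses the standard sliding-window frame count, which is assumed here rather than taken from the AudioFeatureGenerator source. The function names are hypothetical. That the FFT length is the next power of two above the window size is also an assumption, although it agrees with the `fe.fft_length: 512` value reported in the model summary.

```python
from typing import Tuple


def spectrogram_shape(sample_length_ms: int, window_size_ms: int,
                      window_step_ms: int, n_channels: int) -> Tuple[int, int]:
    """Return (frames, channels) for a sliding-window spectrogram."""
    # Count the windows that fit entirely inside the sample
    n_frames = (sample_length_ms - window_size_ms) // window_step_ms + 1
    return n_frames, n_channels


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


sample_rate_hz = 16000
shape = spectrogram_shape(sample_length_ms=1000, window_size_ms=30,
                          window_step_ms=10, n_channels=104)
window_samples = sample_rate_hz * 30 // 1000   # audio samples per 30 ms window
fft_length = next_power_of_two(window_samples)
```

With the settings listed above this gives 98 frames of 104 channels, which matches the 98x1x104 input of the first layer in the model summary.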
Model Architecture
The model is based on the Temporal efficient neural network (TENet) model architecture.
TENet is a network for processing spectrogram data using temporal and depthwise convolutions. It treats the [T, F] spectrogram as a time series shaped [T, 1, F].
More details at mltk.models.shared.tenet.TENet
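The [T, F] to [T, 1, F] reinterpretation amounts to inserting a unit axis between the time and feature dimensions, so that each frequency bin becomes a channel and the convolutions slide along time only. The following plain-Python sketch illustrates the reshaping with nested lists. The helper name `add_unit_axis` is hypothetical, and in practice an array library call such as `numpy.expand_dims(x, axis=1)` would do the same job.

```python
from typing import List


def add_unit_axis(spectrogram: List[List[int]]) -> List[List[List[int]]]:
    """Reshape a [T, F] spectrogram into a [T, 1, F] time series."""
    # Wrap every time frame in a single-element list
    return [[frame] for frame in spectrogram]


n_frames, n_features = 98, 104
spectrogram = [[0] * n_features for _ in range(n_frames)]
timeseries = add_unit_axis(spectrogram)
timeseries_shape = (len(timeseries), len(timeseries[0]), len(timeseries[0][0]))
```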
Model Summary
mltk summarize keyword_spotting_pacman_v3 --tflite
+-------+-------------------+-----------------+-----------------+------------------------------------------------------+
| Index | OpCode | Input(s) | Output(s) | Config |
+-------+-------------------+-----------------+-----------------+------------------------------------------------------+
| 0 | conv_2d | 98x1x104 (int8) | 98x1x40 (int8) | Padding:Same stride:1x1 activation:None |
| | | 3x1x104 (int8) | | |
| | | 40 (int32) | | |
| 1 | conv_2d | 98x1x40 (int8) | 98x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 2 | depthwise_conv_2d | 98x1x120 (int8) | 49x1x120 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 3 | conv_2d | 49x1x120 (int8) | 49x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 4 | conv_2d | 98x1x40 (int8) | 49x1x40 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 40 (int32) | | |
| 5 | add | 49x1x40 (int8) | 49x1x40 (int8) | Activation:Relu |
| | | 49x1x40 (int8) | | |
| 6 | conv_2d | 49x1x40 (int8) | 49x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 7 | depthwise_conv_2d | 49x1x120 (int8) | 49x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 8 | conv_2d | 49x1x120 (int8) | 49x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 9 | add | 49x1x40 (int8) | 49x1x40 (int8) | Activation:Relu |
| | | 49x1x40 (int8) | | |
| 10 | conv_2d | 49x1x40 (int8) | 49x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 11 | depthwise_conv_2d | 49x1x120 (int8) | 49x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 12 | conv_2d | 49x1x120 (int8) | 49x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 13 | add | 49x1x40 (int8) | 49x1x40 (int8) | Activation:Relu |
| | | 49x1x40 (int8) | | |
| 14 | conv_2d | 49x1x40 (int8) | 49x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 15 | depthwise_conv_2d | 49x1x120 (int8) | 49x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 16 | conv_2d | 49x1x120 (int8) | 49x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 17 | add | 49x1x40 (int8) | 49x1x40 (int8) | Activation:Relu |
| | | 49x1x40 (int8) | | |
| 18 | conv_2d | 49x1x40 (int8) | 49x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 19 | depthwise_conv_2d | 49x1x120 (int8) | 25x1x120 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 20 | conv_2d | 25x1x120 (int8) | 25x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 21 | conv_2d | 49x1x40 (int8) | 25x1x40 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 40 (int32) | | |
| 22 | add | 25x1x40 (int8) | 25x1x40 (int8) | Activation:Relu |
| | | 25x1x40 (int8) | | |
| 23 | conv_2d | 25x1x40 (int8) | 25x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 24 | depthwise_conv_2d | 25x1x120 (int8) | 25x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 25 | conv_2d | 25x1x120 (int8) | 25x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 26 | add | 25x1x40 (int8) | 25x1x40 (int8) | Activation:Relu |
| | | 25x1x40 (int8) | | |
| 27 | conv_2d | 25x1x40 (int8) | 25x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 28 | depthwise_conv_2d | 25x1x120 (int8) | 25x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 29 | conv_2d | 25x1x120 (int8) | 25x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 30 | add | 25x1x40 (int8) | 25x1x40 (int8) | Activation:Relu |
| | | 25x1x40 (int8) | | |
| 31 | conv_2d | 25x1x40 (int8) | 25x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 32 | depthwise_conv_2d | 25x1x120 (int8) | 25x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 33 | conv_2d | 25x1x120 (int8) | 25x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 34 | add | 25x1x40 (int8) | 25x1x40 (int8) | Activation:Relu |
| | | 25x1x40 (int8) | | |
| 35 | conv_2d | 25x1x40 (int8) | 25x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 36 | depthwise_conv_2d | 25x1x120 (int8) | 13x1x120 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 37 | conv_2d | 13x1x120 (int8) | 13x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 38 | conv_2d | 25x1x40 (int8) | 13x1x40 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 40 (int32) | | |
| 39 | add | 13x1x40 (int8) | 13x1x40 (int8) | Activation:Relu |
| | | 13x1x40 (int8) | | |
| 40 | conv_2d | 13x1x40 (int8) | 13x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 41 | depthwise_conv_2d | 13x1x120 (int8) | 13x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 42 | conv_2d | 13x1x120 (int8) | 13x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 43 | add | 13x1x40 (int8) | 13x1x40 (int8) | Activation:Relu |
| | | 13x1x40 (int8) | | |
| 44 | conv_2d | 13x1x40 (int8) | 13x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 45 | depthwise_conv_2d | 13x1x120 (int8) | 13x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 46 | conv_2d | 13x1x120 (int8) | 13x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 47 | add | 13x1x40 (int8) | 13x1x40 (int8) | Activation:Relu |
| | | 13x1x40 (int8) | | |
| 48 | conv_2d | 13x1x40 (int8) | 13x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 49 | depthwise_conv_2d | 13x1x120 (int8) | 13x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 50 | conv_2d | 13x1x120 (int8) | 13x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 51 | add | 13x1x40 (int8) | 13x1x40 (int8) | Activation:Relu |
| | | 13x1x40 (int8) | | |
| 52 | conv_2d | 13x1x40 (int8) | 13x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 53 | depthwise_conv_2d | 13x1x120 (int8) | 7x1x120 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 54 | conv_2d | 7x1x120 (int8) | 7x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 55 | conv_2d | 13x1x40 (int8) | 7x1x40 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 40 (int32) | | |
| 56 | add | 7x1x40 (int8) | 7x1x40 (int8) | Activation:Relu |
| | | 7x1x40 (int8) | | |
| 57 | conv_2d | 7x1x40 (int8) | 7x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 58 | depthwise_conv_2d | 7x1x120 (int8) | 7x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 59 | conv_2d | 7x1x120 (int8) | 7x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 60 | add | 7x1x40 (int8) | 7x1x40 (int8) | Activation:Relu |
| | | 7x1x40 (int8) | | |
| 61 | conv_2d | 7x1x40 (int8) | 7x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 62 | depthwise_conv_2d | 7x1x120 (int8) | 7x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 63 | conv_2d | 7x1x120 (int8) | 7x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 64 | add | 7x1x40 (int8) | 7x1x40 (int8) | Activation:Relu |
| | | 7x1x40 (int8) | | |
| 65 | conv_2d | 7x1x40 (int8) | 7x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 66 | depthwise_conv_2d | 7x1x120 (int8) | 7x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 67 | conv_2d | 7x1x120 (int8) | 7x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 68 | add | 7x1x40 (int8) | 7x1x40 (int8) | Activation:Relu |
| | | 7x1x40 (int8) | | |
| 69 | conv_2d | 7x1x40 (int8) | 7x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 70 | depthwise_conv_2d | 7x1x120 (int8) | 4x1x120 (int8) | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 71 | conv_2d | 4x1x120 (int8) | 4x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 72 | conv_2d | 7x1x40 (int8) | 4x1x40 (int8) | Padding:Same stride:2x2 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 40 (int32) | | |
| 73 | add | 4x1x40 (int8) | 4x1x40 (int8) | Activation:Relu |
| | | 4x1x40 (int8) | | |
| 74 | conv_2d | 4x1x40 (int8) | 4x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 75 | depthwise_conv_2d | 4x1x120 (int8) | 4x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 76 | conv_2d | 4x1x120 (int8) | 4x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 77 | add | 4x1x40 (int8) | 4x1x40 (int8) | Activation:Relu |
| | | 4x1x40 (int8) | | |
| 78 | conv_2d | 4x1x40 (int8) | 4x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 79 | depthwise_conv_2d | 4x1x120 (int8) | 4x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 80 | conv_2d | 4x1x120 (int8) | 4x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 81 | add | 4x1x40 (int8) | 4x1x40 (int8) | Activation:Relu |
| | | 4x1x40 (int8) | | |
| 82 | conv_2d | 4x1x40 (int8) | 4x1x120 (int8) | Padding:Valid stride:1x1 activation:Relu |
| | | 1x1x40 (int8) | | |
| | | 120 (int32) | | |
| 83 | depthwise_conv_2d | 4x1x120 (int8) | 4x1x120 (int8) | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| | | 9x1x120 (int8) | | |
| | | 120 (int32) | | |
| 84 | conv_2d | 4x1x120 (int8) | 4x1x40 (int8) | Padding:Valid stride:1x1 activation:None |
| | | 1x1x120 (int8) | | |
| | | 40 (int32) | | |
| 85 | add | 4x1x40 (int8) | 4x1x40 (int8) | Activation:Relu |
| | | 4x1x40 (int8) | | |
| 86 | average_pool_2d | 4x1x40 (int8) | 1x1x40 (int8) | Padding:Valid stride:1x4 filter:1x4 activation:None |
| 87 | reshape | 1x1x40 (int8) | 40 (int8) | Type=none |
| | | 2 (int32) | | |
| 88 | fully_connected | 40 (int8) | 7 (int8) | Activation:None |
| | | 40 (int8) | | |
| | | 7 (int32) | | |
| 89 | softmax | 7 (int8) | 7 (int8) | Type=softmaxoptions |
+-------+-------------------+-----------------+-----------------+------------------------------------------------------+
Total MACs: 6.018 M
Total OPs: 12.181 M
Name: keyword_spotting_pacman_v3
Version: 1
Description: Keyword spotting classifier to detect: left, right, up, down, stop, go with Pac-Man video game background noise
Classes: left, right, up, down, stop, go, _unknown_
Runtime memory size (RAM): 78.804 k
hash: fa5f9b8ea1c47d942d537ef3702ffa76
date: 2023-03-01T20:30:54.940Z
fe.sample_rate_hz: 16000
fe.fft_length: 512
fe.sample_length_ms: 1000
fe.window_size_ms: 30
fe.window_step_ms: 10
fe.filterbank_n_channels: 104
fe.filterbank_upper_band_limit: 7500.0
fe.filterbank_lower_band_limit: 125.0
fe.noise_reduction_enable: True
fe.noise_reduction_smoothing_bits: 10
fe.noise_reduction_even_smoothing: 0.02500000037252903
fe.noise_reduction_odd_smoothing: 0.05999999865889549
fe.noise_reduction_min_signal_remaining: 0.4000000059604645
fe.pcan_enable: False
fe.pcan_strength: 0.949999988079071
fe.pcan_offset: 80.0
fe.pcan_gain_bits: 21
fe.log_scale_enable: True
fe.log_scale_shift: 6
fe.activity_detection_enable: False
fe.activity_detection_alpha_a: 0.5
fe.activity_detection_alpha_b: 0.800000011920929
fe.activity_detection_arm_threshold: 0.75
fe.activity_detection_trip_threshold: 0.800000011920929
fe.dc_notch_filter_enable: True
fe.dc_notch_filter_coefficient: 0.949999988079071
fe.quantize_dynamic_scale_enable: True
fe.quantize_dynamic_scale_range_db: 40.0
average_window_duration_ms: 300
detection_threshold_list: [216, 216, 234, 234, 252, 252, 255]
suppression_ms: 700
minimum_count: 2
volume_gain: 0.0
latency_ms: 10
verbose_model_output_logs: False
.tflite file size: 446.6kB
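The per-layer shapes in the summary table are enough to reproduce the multiply-accumulate (MAC) counts and the way the time dimension shrinks. The sketch below uses the usual textbook formulas. It is not how the MLTK computes its figures, and all function names are hypothetical.

```python
import math
from typing import List


def conv2d_macs(out_h: int, out_w: int, out_c: int, k_h: int, k_w: int, in_c: int) -> int:
    """MACs of a standard convolution: one k_h*k_w*in_c dot product per output value."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def depthwise_conv2d_macs(out_h: int, out_w: int, channels: int, k_h: int, k_w: int) -> int:
    """MACs of a depthwise convolution with multiplier 1."""
    return out_h * out_w * channels * k_h * k_w


def same_padding_output_length(length: int, stride: int) -> int:
    """Output length of a 'Same'-padded convolution."""
    return math.ceil(length / stride)


layer0_macs = conv2d_macs(98, 1, 40, 3, 1, 104)          # layer 0: 3x1 conv, 104 -> 40
layer1_macs = conv2d_macs(98, 1, 120, 1, 1, 40)          # layer 1: 1x1 conv, 40 -> 120
layer2_macs = depthwise_conv2d_macs(49, 1, 120, 9, 1)    # layer 2: 9x1 depthwise, stride 2

# Five stride-2 depthwise convolutions (layers 2, 19, 36, 53, 70)
time_lengths: List[int] = [98]
for _ in range(5):
    time_lengths.append(same_padding_output_length(time_lengths[-1], 2))
```

The first layer alone accounts for about 1.2 M of the 6.018 M total MACs, and the time dimension follows the sequence 98, 49, 25, 13, 7, 4 seen in the table.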
Model Profiling Report
# Profile on physical EFR32xG24 using MVP accelerator
mltk profile keyword_spotting_pacman_v3 --device --accelerator MVP
Profiling Summary
Name: keyword_spotting_pacman_v3
Accelerator: MVP
Input Shape: 1x98x1x104
Input Data Type: int8
Output Shape: 1x7
Output Data Type: int8
Flash, Model File Size (bytes): 446.5k
RAM, Runtime Memory Size (bytes): 76.7k
Operation Count: 12.4M
Multiply-Accumulate Count: 6.0M
Layer Count: 90
Unsupported Layer Count: 0
Accelerator Cycle Count: 5.3M
CPU Cycle Count: 953.0k
CPU Utilization (%): 16.6
Clock Rate (hz): 78.0M
Time (s): 73.4m
Ops/s: 168.9M
MACs/s: 81.9M
Inference/s: 13.6
Model Layers
+-------+-------------------+--------+--------+------------+------------+----------+--------------------------+--------------+------------------------------------------------------+
| Index | OpCode | # Ops | # MACs | Acc Cycles | CPU Cycles | Time (s) | Input Shape | Output Shape | Options |
+-------+-------------------+--------+--------+------------+------------+----------+--------------------------+--------------+------------------------------------------------------+
| 0 | conv_2d | 2.5M | 1.2M | 928.9k | 11.3k | 11.8m | 1x98x1x104,40x3x1x104,40 | 1x98x1x40 | Padding:Same stride:1x1 activation:None |
| 1 | conv_2d | 976.1k | 470.4k | 390.2k | 5.2k | 5.0m | 1x98x1x40,120x1x1x40,120 | 1x98x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 2 | depthwise_conv_2d | 123.5k | 52.9k | 96.4k | 78.7k | 1.6m | 1x98x1x120,1x9x1x120,120 | 1x49x1x120 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 3 | conv_2d | 472.4k | 235.2k | 185.3k | 5.3k | 2.4m | 1x49x1x120,40x1x1x120,40 | 1x49x1x40 | Padding:Valid stride:1x1 activation:None |
| 4 | conv_2d | 162.7k | 78.4k | 66.7k | 5.2k | 870.0u | 1x98x1x40,40x1x1x40,40 | 1x49x1x40 | Padding:Same stride:2x2 activation:Relu |
| 5 | add | 2.0k | 0 | 6.9k | 2.7k | 120.0u | 1x49x1x40,1x49x1x40 | 1x49x1x40 | Activation:Relu |
| 6 | conv_2d | 488.0k | 235.2k | 195.2k | 5.3k | 2.5m | 1x49x1x40,120x1x1x40,120 | 1x49x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 7 | depthwise_conv_2d | 123.5k | 52.9k | 94.5k | 78.5k | 1.6m | 1x49x1x120,1x9x1x120,120 | 1x49x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 8 | conv_2d | 472.4k | 235.2k | 185.3k | 5.3k | 2.4m | 1x49x1x120,40x1x1x120,40 | 1x49x1x40 | Padding:Valid stride:1x1 activation:None |
| 9 | add | 2.0k | 0 | 6.9k | 2.6k | 120.0u | 1x49x1x40,1x49x1x40 | 1x49x1x40 | Activation:Relu |
| 10 | conv_2d | 488.0k | 235.2k | 195.2k | 5.3k | 2.5m | 1x49x1x40,120x1x1x40,120 | 1x49x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 11 | depthwise_conv_2d | 123.5k | 52.9k | 94.5k | 78.5k | 1.6m | 1x49x1x120,1x9x1x120,120 | 1x49x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 12 | conv_2d | 472.4k | 235.2k | 185.3k | 5.3k | 2.4m | 1x49x1x120,40x1x1x120,40 | 1x49x1x40 | Padding:Valid stride:1x1 activation:None |
| 13 | add | 2.0k | 0 | 6.9k | 2.6k | 120.0u | 1x49x1x40,1x49x1x40 | 1x49x1x40 | Activation:Relu |
| 14 | conv_2d | 488.0k | 235.2k | 195.2k | 5.3k | 2.5m | 1x49x1x40,120x1x1x40,120 | 1x49x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 15 | depthwise_conv_2d | 123.5k | 52.9k | 94.5k | 78.5k | 1.6m | 1x49x1x120,1x9x1x120,120 | 1x49x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 16 | conv_2d | 472.4k | 235.2k | 185.3k | 5.3k | 2.4m | 1x49x1x120,40x1x1x120,40 | 1x49x1x40 | Padding:Valid stride:1x1 activation:None |
| 17 | add | 2.0k | 0 | 6.9k | 2.6k | 120.0u | 1x49x1x40,1x49x1x40 | 1x49x1x40 | Activation:Relu |
| 18 | conv_2d | 488.0k | 235.2k | 195.5k | 5.3k | 2.5m | 1x49x1x40,120x1x1x40,120 | 1x49x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 19 | depthwise_conv_2d | 63.0k | 27.0k | 47.9k | 40.6k | 810.0u | 1x49x1x120,1x9x1x120,120 | 1x25x1x120 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 20 | conv_2d | 241.0k | 120.0k | 95.3k | 5.3k | 1.3m | 1x25x1x120,40x1x1x120,40 | 1x25x1x40 | Padding:Valid stride:1x1 activation:None |
| 21 | conv_2d | 83.0k | 40.0k | 34.4k | 5.2k | 480.0u | 1x49x1x40,40x1x1x40,40 | 1x25x1x40 | Padding:Same stride:2x2 activation:Relu |
| 22 | add | 1.0k | 0 | 3.5k | 2.6k | 90.0u | 1x25x1x40,1x25x1x40 | 1x25x1x40 | Activation:Relu |
| 23 | conv_2d | 249.0k | 120.0k | 99.9k | 5.3k | 1.3m | 1x25x1x40,120x1x1x40,120 | 1x25x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 24 | depthwise_conv_2d | 63.0k | 27.0k | 46.4k | 40.5k | 810.0u | 1x25x1x120,1x9x1x120,120 | 1x25x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 25 | conv_2d | 241.0k | 120.0k | 95.3k | 5.3k | 1.2m | 1x25x1x120,40x1x1x120,40 | 1x25x1x40 | Padding:Valid stride:1x1 activation:None |
| 26 | add | 1.0k | 0 | 3.5k | 2.6k | 90.0u | 1x25x1x40,1x25x1x40 | 1x25x1x40 | Activation:Relu |
| 27 | conv_2d | 249.0k | 120.0k | 99.7k | 5.3k | 1.3m | 1x25x1x40,120x1x1x40,120 | 1x25x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 28 | depthwise_conv_2d | 63.0k | 27.0k | 46.4k | 40.5k | 810.0u | 1x25x1x120,1x9x1x120,120 | 1x25x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 29 | conv_2d | 241.0k | 120.0k | 95.3k | 5.3k | 1.3m | 1x25x1x120,40x1x1x120,40 | 1x25x1x40 | Padding:Valid stride:1x1 activation:None |
| 30 | add | 1.0k | 0 | 3.5k | 2.6k | 60.0u | 1x25x1x40,1x25x1x40 | 1x25x1x40 | Activation:Relu |
| 31 | conv_2d | 249.0k | 120.0k | 99.9k | 5.3k | 1.3m | 1x25x1x40,120x1x1x40,120 | 1x25x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 32 | depthwise_conv_2d | 63.0k | 27.0k | 46.4k | 40.5k | 810.0u | 1x25x1x120,1x9x1x120,120 | 1x25x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 33 | conv_2d | 241.0k | 120.0k | 95.3k | 5.3k | 1.3m | 1x25x1x120,40x1x1x120,40 | 1x25x1x40 | Padding:Valid stride:1x1 activation:None |
| 34 | add | 1.0k | 0 | 3.5k | 2.6k | 90.0u | 1x25x1x40,1x25x1x40 | 1x25x1x40 | Activation:Relu |
| 35 | conv_2d | 249.0k | 120.0k | 99.9k | 5.3k | 1.3m | 1x25x1x40,120x1x1x40,120 | 1x25x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 36 | depthwise_conv_2d | 32.8k | 14.0k | 23.8k | 21.6k | 420.0u | 1x25x1x120,1x9x1x120,120 | 1x13x1x120 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 37 | conv_2d | 125.3k | 62.4k | 49.6k | 5.3k | 690.0u | 1x13x1x120,40x1x1x120,40 | 1x13x1x40 | Padding:Valid stride:1x1 activation:None |
| 38 | conv_2d | 43.2k | 20.8k | 18.0k | 5.2k | 270.0u | 1x25x1x40,40x1x1x40,40 | 1x13x1x40 | Padding:Same stride:2x2 activation:Relu |
| 39 | add | 520.0 | 0 | 1.8k | 2.6k | 60.0u | 1x13x1x40,1x13x1x40 | 1x13x1x40 | Activation:Relu |
| 40 | conv_2d | 129.5k | 62.4k | 52.0k | 5.3k | 720.0u | 1x13x1x40,120x1x1x40,120 | 1x13x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 41 | depthwise_conv_2d | 32.8k | 14.0k | 22.4k | 21.6k | 420.0u | 1x13x1x120,1x9x1x120,120 | 1x13x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 42 | conv_2d | 125.3k | 62.4k | 49.6k | 5.3k | 660.0u | 1x13x1x120,40x1x1x120,40 | 1x13x1x40 | Padding:Valid stride:1x1 activation:None |
| 43 | add | 520.0 | 0 | 1.8k | 2.6k | 30.0u | 1x13x1x40,1x13x1x40 | 1x13x1x40 | Activation:Relu |
| 44 | conv_2d | 129.5k | 62.4k | 52.0k | 5.3k | 690.0u | 1x13x1x40,120x1x1x40,120 | 1x13x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 45 | depthwise_conv_2d | 32.8k | 14.0k | 22.4k | 21.6k | 390.0u | 1x13x1x120,1x9x1x120,120 | 1x13x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 46 | conv_2d | 125.3k | 62.4k | 49.6k | 5.3k | 690.0u | 1x13x1x120,40x1x1x120,40 | 1x13x1x40 | Padding:Valid stride:1x1 activation:None |
| 47 | add | 520.0 | 0 | 1.8k | 2.6k | 60.0u | 1x13x1x40,1x13x1x40 | 1x13x1x40 | Activation:Relu |
| 48 | conv_2d | 129.5k | 62.4k | 52.0k | 5.3k | 720.0u | 1x13x1x40,120x1x1x40,120 | 1x13x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 49 | depthwise_conv_2d | 32.8k | 14.0k | 22.4k | 21.6k | 420.0u | 1x13x1x120,1x9x1x120,120 | 1x13x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 50 | conv_2d | 125.3k | 62.4k | 49.6k | 5.3k | 660.0u | 1x13x1x120,40x1x1x120,40 | 1x13x1x40 | Padding:Valid stride:1x1 activation:None |
| 51 | add | 520.0 | 0 | 1.8k | 2.6k | 60.0u | 1x13x1x40,1x13x1x40 | 1x13x1x40 | Activation:Relu |
| 52 | conv_2d | 129.5k | 62.4k | 52.0k | 5.3k | 720.0u | 1x13x1x40,120x1x1x40,120 | 1x13x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 53 | depthwise_conv_2d | 17.6k | 7.6k | 11.8k | 12.1k | 240.0u | 1x13x1x120,1x9x1x120,120 | 1x7x1x120 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 54 | conv_2d | 67.5k | 33.6k | 26.7k | 5.3k | 390.0u | 1x7x1x120,40x1x1x120,40 | 1x7x1x40 | Padding:Valid stride:1x1 activation:None |
| 55 | conv_2d | 23.2k | 11.2k | 9.7k | 5.2k | 210.0u | 1x13x1x40,40x1x1x40,40 | 1x7x1x40 | Padding:Same stride:2x2 activation:Relu |
| 56 | add | 280.0 | 0 | 992.0 | 2.6k | 30.0u | 1x7x1x40,1x7x1x40 | 1x7x1x40 | Activation:Relu |
| 57 | conv_2d | 69.7k | 33.6k | 28.1k | 5.3k | 420.0u | 1x7x1x40,120x1x1x40,120 | 1x7x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 58 | depthwise_conv_2d | 17.6k | 7.6k | 10.3k | 12.1k | 210.0u | 1x7x1x120,1x9x1x120,120 | 1x7x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 59 | conv_2d | 67.5k | 33.6k | 26.7k | 5.3k | 420.0u | 1x7x1x120,40x1x1x120,40 | 1x7x1x40 | Padding:Valid stride:1x1 activation:None |
| 60 | add | 280.0 | 0 | 992.0 | 2.6k | 30.0u | 1x7x1x40,1x7x1x40 | 1x7x1x40 | Activation:Relu |
| 61 | conv_2d | 69.7k | 33.6k | 28.1k | 5.3k | 420.0u | 1x7x1x40,120x1x1x40,120 | 1x7x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 62 | depthwise_conv_2d | 17.6k | 7.6k | 10.3k | 12.1k | 210.0u | 1x7x1x120,1x9x1x120,120 | 1x7x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 63 | conv_2d | 67.5k | 33.6k | 26.7k | 5.3k | 420.0u | 1x7x1x120,40x1x1x120,40 | 1x7x1x40 | Padding:Valid stride:1x1 activation:None |
| 64 | add | 280.0 | 0 | 992.0 | 2.6k | 30.0u | 1x7x1x40,1x7x1x40 | 1x7x1x40 | Activation:Relu |
| 65 | conv_2d | 69.7k | 33.6k | 28.1k | 5.3k | 420.0u | 1x7x1x40,120x1x1x40,120 | 1x7x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 66 | depthwise_conv_2d | 17.6k | 7.6k | 10.3k | 12.1k | 210.0u | 1x7x1x120,1x9x1x120,120 | 1x7x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 67 | conv_2d | 67.5k | 33.6k | 26.7k | 5.3k | 420.0u | 1x7x1x120,40x1x1x120,40 | 1x7x1x40 | Padding:Valid stride:1x1 activation:None |
| 68 | add | 280.0 | 0 | 992.0 | 2.6k | 30.0u | 1x7x1x40,1x7x1x40 | 1x7x1x40 | Activation:Relu |
| 69 | conv_2d | 69.7k | 33.6k | 28.1k | 5.3k | 420.0u | 1x7x1x40,120x1x1x40,120 | 1x7x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 70 | depthwise_conv_2d | 10.1k | 4.3k | 5.8k | 7.4k | 150.0u | 1x7x1x120,1x9x1x120,120 | 1x4x1x120 | Multiplier:1 padding:Same stride:2x2 activation:Relu |
| 71 | conv_2d | 38.6k | 19.2k | 15.3k | 5.3k | 240.0u | 1x4x1x120,40x1x1x120,40 | 1x4x1x40 | Padding:Valid stride:1x1 activation:None |
| 72 | conv_2d | 13.3k | 6.4k | 5.6k | 5.2k | 150.0u | 1x7x1x40,40x1x1x40,40 | 1x4x1x40 | Padding:Same stride:2x2 activation:Relu |
| 73 | add | 160.0 | 0 | 572.0 | 2.6k | 30.0u | 1x4x1x40,1x4x1x40 | 1x4x1x40 | Activation:Relu |
| 74 | conv_2d | 39.8k | 19.2k | 16.1k | 5.3k | 270.0u | 1x4x1x40,120x1x1x40,120 | 1x4x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 75 | depthwise_conv_2d | 10.1k | 4.3k | 4.3k | 7.3k | 120.0u | 1x4x1x120,1x9x1x120,120 | 1x4x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 76 | conv_2d | 38.6k | 19.2k | 15.3k | 5.3k | 270.0u | 1x4x1x120,40x1x1x120,40 | 1x4x1x40 | Padding:Valid stride:1x1 activation:None |
| 77 | add | 160.0 | 0 | 572.0 | 2.6k | 30.0u | 1x4x1x40,1x4x1x40 | 1x4x1x40 | Activation:Relu |
| 78 | conv_2d | 39.8k | 19.2k | 16.1k | 5.3k | 270.0u | 1x4x1x40,120x1x1x40,120 | 1x4x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 79 | depthwise_conv_2d | 10.1k | 4.3k | 4.3k | 7.3k | 120.0u | 1x4x1x120,1x9x1x120,120 | 1x4x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 80 | conv_2d | 38.6k | 19.2k | 15.3k | 5.3k | 270.0u | 1x4x1x120,40x1x1x120,40 | 1x4x1x40 | Padding:Valid stride:1x1 activation:None |
| 81 | add | 160.0 | 0 | 572.0 | 2.6k | 30.0u | 1x4x1x40,1x4x1x40 | 1x4x1x40 | Activation:Relu |
| 82 | conv_2d | 39.8k | 19.2k | 16.1k | 5.3k | 270.0u | 1x4x1x40,120x1x1x40,120 | 1x4x1x120 | Padding:Valid stride:1x1 activation:Relu |
| 83 | depthwise_conv_2d | 10.1k | 4.3k | 4.3k | 7.3k | 120.0u | 1x4x1x120,1x9x1x120,120 | 1x4x1x120 | Multiplier:1 padding:Same stride:1x1 activation:Relu |
| 84 | conv_2d | 38.6k | 19.2k | 15.3k | 5.3k | 240.0u | 1x4x1x120,40x1x1x120,40 | 1x4x1x40 | Padding:Valid stride:1x1 activation:None |
| 85 | add | 160.0 | 0 | 572.0 | 2.6k | 30.0u | 1x4x1x40,1x4x1x40 | 1x4x1x40 | Activation:Relu |
| 86 | average_pool_2d | 200.0 | 0 | 154.0 | 3.8k | 60.0u | 1x4x1x40 | 1x1x1x40 | Padding:Valid stride:1x4 filter:1x4 activation:None |
| 87 | reshape | 0 | 0 | 0 | 640.0 | 0 | 1x1x1x40,2 | 1x40 | Type=none |
| 88 | fully_connected | 567.0 | 280.0 | 477.0 | 2.1k | 30.0u | 1x40,7x40,7 | 1x7 | Activation:None |
| 89 | softmax | 35.0 | 0 | 0 | 4.3k | 60.0u | 1x7 | 1x7 | Type=softmaxoptions |
+-------+-------------------+--------+--------+------------+------------+----------+--------------------------+--------------+------------------------------------------------------+
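The derived figures in the profiling summary can be cross-checked from its raw values. The sketch below reads the `m`, `k`, and `M` suffixes as SI prefixes (so `73.4m` seconds is 73.4 ms) and assumes that CPU utilization is the CPU cycle count divided by the total clock cycles elapsed during one inference. That definition is an assumption that happens to reproduce the reported number.

```python
# Values copied from the profiling summary
clock_rate_hz = 78.0e6
inference_time_s = 73.4e-3
cpu_cycles = 953.0e3
operation_count = 12.4e6

# One inference takes inference_time_s, so throughput is its reciprocal
inferences_per_s = 1.0 / inference_time_s

# Assumed definition: share of elapsed clock cycles spent on the CPU
elapsed_cycles = inference_time_s * clock_rate_hz
cpu_utilization_pct = 100.0 * cpu_cycles / elapsed_cycles

ops_per_s = operation_count / inference_time_s
```

These come out at roughly 13.6 inferences per second, 16.6% CPU utilization, and 168.9 M operations per second, in line with the summary after rounding.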
Model Diagram
mltk view keyword_spotting_pacman_v3 --tflite
Commands
# Do a "dry run" test training of the model
mltk train keyword_spotting_pacman_v3-test
# Train the model
mltk train keyword_spotting_pacman_v3
# Evaluate the trained .tflite model
mltk evaluate keyword_spotting_pacman_v3 --tflite
# Profile the model in the MVP hardware accelerator simulator
mltk profile keyword_spotting_pacman_v3 --accelerator MVP --estimates
# Profile the model on a physical development board
mltk profile keyword_spotting_pacman_v3 --accelerator MVP --device
# Run the model in the audio classifier on the local PC
mltk classify_audio keyword_spotting_pacman_v3 --verbose
# Run the model in the audio classifier on the physical device
mltk classify_audio keyword_spotting_pacman_v3 --device --verbose --accelerator MVP
Model Specification
# Import the Tensorflow packages
# required to build the model layout
import os
import math
from typing import Tuple, Dict, List
import numpy as np
from numpy.random import RandomState
import tensorflow as tf
import mltk.core as mltk_core
# Import the AudioFeatureGeneratorSettings which we'll configure
from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings
from mltk.core.preprocess.utils import tf_dataset as tf_dataset_utils
from mltk.core.preprocess.utils import audio as audio_utils
from mltk.core.preprocess.utils import image as image_utils
from mltk.core.keras.callbacks import SteppedLearnRateScheduler
from mltk.utils.path import create_user_dir
from mltk.core.preprocess.utils import (split_file_list, shuffle_file_list_by_group)
from mltk.utils.python import install_pip_package
from mltk.models.shared import tenet
from mltk.datasets import audio as audio_datasets
##########################################################################################
# Instantiate the MltkModel instance
#
# @mltk_model
class MyModel(
    mltk_core.MltkModel, # We must inherit the MltkModel class
    mltk_core.TrainMixin, # We also inherit the TrainMixin since we want to train this model
    mltk_core.DatasetMixin, # We also need the DatasetMixin mixin to provide the relevant dataset properties
    mltk_core.EvaluateClassifierMixin, # While not required, also inherit EvaluateClassifierMixin to help with generating evaluation stats for our classification model
    mltk_core.WeightsAndBiasesMixin # This allows for posting model info to https://wandb.ai
):
    pass
my_model = MyModel()
##########################################################################################
# General Settings
# For better tracking, the version should be incremented any time a non-trivial change is made
# NOTE: The version is optional and is not used directly by the MLTK
my_model.version = 1
# Provide a brief description about what this model models
# This description goes in the "description" field of the .tflite model file
my_model.description = 'Keyword spotting classifier to detect: left, right, up, down, stop, go with Pac-Man video game background noise'
##########################################################################################
# Training Basic Settings
# This specifies the maximum number of training epochs.
# We just set this to a large value since we're using SteppedLearnRateScheduler
# to control when training completes
my_model.epochs = 9999
# Specify how many samples to pass through the model
# before updating the training gradients.
# Typical values are 10-64
# NOTE: Larger values require more memory and may not fit on your GPU
my_model.batch_size = 100
##########################################################################################
# Define the model architecture
#
def my_model_builder(model: MyModel) -> tf.keras.Model:
    """Build the Keras model
    """
    input_shape = model.input_shape
    # NOTE: This model requires the input shape: <time, 1, features>
    # while the embedded device expects: <time, features, 1>
    # Since the <time> axis is still row-major, we can swap the <features> with 1 without issue
    time_size, feature_size, _ = input_shape
    input_shape = (time_size, 1, feature_size)
    keras_model = tenet.TENet12(
        input_shape=input_shape,
        classes=model.n_classes,
        channels=40,
        blocks=5,
    )
    keras_model.compile(
        loss='categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-8),
        metrics=['accuracy']
    )
    return keras_model
my_model.build_model_function = my_model_builder
# TENet uses a custom layer, be sure to add it to the keras_custom_objects
# so that we can load the corresponding .h5 model file
my_model.keras_custom_objects['MultiScaleTemporalConvolution'] = tenet.MultiScaleTemporalConvolution
##########################################################################################
# Training callback Settings
#
# The MLTK enables the tf.keras.callbacks.ModelCheckpoint by default.
my_model.checkpoint['monitor'] = 'val_accuracy'
# We use a custom learn rate schedule that is defined in:
# https://github.com/google-research/google-research/tree/master/kws_streaming
my_model.train_callbacks = [
    tf.keras.callbacks.TerminateOnNaN(),
    SteppedLearnRateScheduler([
        (100, .001),
        (100, .002),
        (100, .003),
        (100, .004),
        (10000, .005),
        (10000, .002),
        (5000, .0005),
        (5000, 1e-5),
        (5000, 1e-6),
        (5000, 1e-7),
    ])
]
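The schedule above is a list of (step count, learning rate) pairs. As a rough illustration of how such a stepped schedule can be read, the sketch below maps a global training step to a learning rate. It assumes each pair means "hold this rate for this many steps" and that the pairs apply in order. `lr_for_step` is a hypothetical helper, not part of the MLTK, and the real SteppedLearnRateScheduler may differ in details such as how it ends training.

```python
from typing import List, Optional, Tuple

# Same (n_steps, learning_rate) pairs as the schedule above
SCHEDULE: List[Tuple[int, float]] = [
    (100, .001), (100, .002), (100, .003), (100, .004),
    (10000, .005), (10000, .002),
    (5000, .0005), (5000, 1e-5), (5000, 1e-6), (5000, 1e-7),
]

def lr_for_step(step: int, schedule: List[Tuple[int, float]]) -> Optional[float]:
    """Return the learning rate for a 0-based global step.

    Returns None once the schedule is exhausted
    (ASSUMPTION: this is the point where training would stop).
    """
    boundary = 0
    for n_steps, lr in schedule:
        boundary += n_steps
        if step < boundary:
            return lr
    return None

total_steps = sum(n_steps for n_steps, _ in SCHEDULE)
print(total_steps)                         # 40400
print(lr_for_step(0, SCHEDULE))            # 0.001
print(lr_for_step(399, SCHEDULE))          # 0.004 (last warm-up step)
print(lr_for_step(400, SCHEDULE))          # 0.005 (peak rate begins)
print(lr_for_step(total_steps, SCHEDULE))  # None
```

Under this reading, the first 400 steps ramp the rate up to its peak, and the remaining 40,000 steps decay it. This is why `my_model.epochs` can be set to an arbitrarily large value.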
##########################################################################################
# Specify AudioFeatureGenerator Settings
# See https://siliconlabs.github.io/mltk/docs/audio/audio_feature_generator.html
#
frontend_settings = AudioFeatureGeneratorSettings()
frontend_settings.sample_rate_hz = 16000
frontend_settings.sample_length_ms = 1000 # A 1s buffer should be enough to capture the keywords
frontend_settings.window_size_ms = 30
frontend_settings.window_step_ms = 10
frontend_settings.filterbank_n_channels = 104 # We want this value to be as large as possible
# while still allowing for the ML model to execute efficiently on the hardware
frontend_settings.filterbank_upper_band_limit = 7500.0
frontend_settings.filterbank_lower_band_limit = 125.0 # The dev board mic seems to have a lot of noise at lower frequencies
frontend_settings.noise_reduction_enable = True # Enable the noise reduction block to help ignore background noise in the field
frontend_settings.noise_reduction_smoothing_bits = 10
frontend_settings.noise_reduction_even_smoothing = 0.025
frontend_settings.noise_reduction_odd_smoothing = 0.06
frontend_settings.noise_reduction_min_signal_remaining = 0.40 # This value is fairly large (which makes the background noise reduction small)
# But it has been found to still give good results
# i.e. There is still some background noise reduction,
# but the actual signal is still (mostly) untouched
frontend_settings.dc_notch_filter_enable = True # Enable the DC notch filter, to help remove the DC signal from the dev board's mic
frontend_settings.dc_notch_filter_coefficient = 0.95
frontend_settings.quantize_dynamic_scale_enable = True # Enable dynamic quantization, this dynamically converts the uint16 spectrogram to int8
frontend_settings.quantize_dynamic_scale_range_db = 40.0
# Add the Audio Feature generator settings to the model parameters
# This way, they are included in the generated .tflite model file
# See https://siliconlabs.github.io/mltk/docs/guides/model_parameters.html
my_model.model_parameters.update(frontend_settings)
##########################################################################################
# Specify the other dataset settings
#
my_model.input_shape = frontend_settings.spectrogram_shape + (1,)
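The input shape follows from the framing settings above. The sketch below works through the arithmetic under the common rule of one frame per full analysis window. This rule is an assumption: the MLTK computes `spectrogram_shape` internally and may round differently. `n_frames` is a hypothetical helper used only for illustration.

```python
def n_frames(sample_length_ms: int, window_size_ms: int, window_step_ms: int) -> int:
    """Number of full analysis windows that fit in the sample (assumed framing rule)."""
    return (sample_length_ms - window_size_ms) // window_step_ms + 1

# Values taken from the frontend settings above
time_size = n_frames(sample_length_ms=1000, window_size_ms=30, window_step_ms=10)
feature_size = 104  # filterbank_n_channels

# Layout expected by the embedded device: <time, features, 1>
input_shape = (time_size, feature_size, 1)
# Layout used by the Keras model builder: <time, 1, features>
keras_shape = (time_size, 1, feature_size)

print(input_shape)  # (98, 104, 1)
print(keras_shape)  # (98, 1, 104)
```

If the frontend follows this rule, each one-second sample becomes 98 time steps of 104 filterbank features.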
# Add the direction keywords plus an _unknown_ meta class
my_model.classes = ['left','right','up','down','stop','go', '_unknown_']
unknown_class_id = my_model.classes.index('_unknown_')
# Ensure the class weights are balanced during training
# https://towardsdatascience.com/why-weight-the-importance-of-training-on-balanced-datasets-f1e54688e7df
my_model.class_weights = 'balanced'
##########################################################################################
# TF-Lite converter settings
#
my_model.tflite_converter['optimizations'] = [tf.lite.Optimize.DEFAULT]
my_model.tflite_converter['supported_ops'] = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
my_model.tflite_converter['inference_input_type'] = np.int8
my_model.tflite_converter['inference_output_type'] = np.int8
# Automatically generate a representative dataset from the validation data
my_model.tflite_converter['representative_dataset'] = 'generate'
validation_split = 0.15
# Uncomment this to dump the augmented audio samples to the log directory
# DO NOT forget to disable this before training the model as it will generate A LOT of data
#data_dump_dir = my_model.create_log_dir('dataset_dump')
# This is the directory where the dataset will be extracted
dataset_dir = create_user_dir('datasets/direction_commands')
##########################################################################################
# Create the audio augmentation pipeline
#
# Install the other 3rd party packages required for preprocessing
install_pip_package('audiomentations')
import librosa
import audiomentations
def audio_pipeline_with_augmentations(
    path_batch:np.ndarray,
    label_batch:np.ndarray,
    seed:np.ndarray
) -> np.ndarray:
    """Augment a batch of audio clips and generate spectrograms
    This does the following, for each audio file path in the input batch:
    1. Read audio file
    2. Adjust its length to fit within the specified length
    3. Apply random augmentations to the audio sample using audiomentations
    4. Convert to the specified sample rate (if necessary)
    5. Generate a spectrogram from the augmented audio sample
    6. Dump the augmented audio and spectrogram (if necessary)
    NOTE: This will be executed in parallel across *separate* subprocesses.
    Arguments:
        path_batch: Batch of audio file paths
        label_batch: Batch of corresponding labels
        seed: Batch of seeds to use for random number generation,
            This ensures that the "random" augmentations are reproducible
    Return:
        Generated batch of spectrograms from augmented audio samples
    """
    batch_length = path_batch.shape[0]
    height, width = frontend_settings.spectrogram_shape
    x_shape = (batch_length, height, 1, width)
    x_batch = np.empty(x_shape, dtype=np.int8)
    # This is the amount of padding we add to the beginning of the sample
    # This allows for "warming up" the noise reduction block
    padding_length_ms = 1000
    padded_frontend_settings = frontend_settings.copy()
    padded_frontend_settings.sample_length_ms += padding_length_ms
    # For each audio sample path in the current batch
    for i, (audio_path, labels) in enumerate(zip(path_batch, label_batch)):
        class_id = np.argmax(labels)
        np.random.seed(seed[i])
        rn = np.random.random()
        # 3% of the time we want to replace the "unknown" sample with silence
        if class_id == unknown_class_id and rn < 0.03:
            original_sample_rate = frontend_settings.sample_rate_hz
            sample = np.zeros((original_sample_rate,), dtype=np.float32)
            audio_path = 'silence.wav'.encode('utf-8')
        else:
            # Read the audio file
            try:
                sample, original_sample_rate = audio_utils.read_audio_file(audio_path, return_numpy=True, return_sample_rate=True)
            except Exception as e:
                raise RuntimeError(f'Failed to read: {audio_path}, err: {e}') from e
        # Create a buffer to hold the padded sample
        padding_length = int((original_sample_rate * padding_length_ms) / 1000)
        padded_sample_length = int((original_sample_rate * padded_frontend_settings.sample_length_ms) / 1000)
        padded_sample = np.zeros((padded_sample_length,), dtype=np.float32)
        # Adjust the audio clip to the length defined in the frontend_settings
        out_length = int((original_sample_rate * frontend_settings.sample_length_ms) / 1000)
        sample = audio_utils.adjust_length(
            sample,
            out_length=out_length,
            trim_threshold_db=30,
            offset=np.random.uniform(0, 1)
        )
        padded_sample[padding_length:padding_length+len(sample)] += sample
        # Initialize the global audio augmentations instance
        # NOTE: We want this to be global so that we only initialize it once per subprocess
        audio_augmentations = globals().get('audio_augmentations', None)
        if audio_augmentations is None:
            audio_augmentations = audiomentations.Compose(
                p=1.0,
                transforms=[
                    audiomentations.Gain(min_gain_in_db=0.95, max_gain_in_db=1.2, p=1.0),
                    audiomentations.AddBackgroundNoise(
                        f'{dataset_dir}/_background_noise_/ambient',
                        min_snr_in_db=-1, # The lower the SNR, the louder the background noise
                        max_snr_in_db=35,
                        noise_rms="relative",
                        lru_cache_size=50,
                        p=0.80
                    ),
                    audiomentations.AddBackgroundNoise(
                        f'{dataset_dir}/_background_noise_/pacman',
                        min_absolute_rms_in_db=-60,
                        max_absolute_rms_in_db=-35,
                        noise_rms="absolute",
                        lru_cache_size=50,
                        p=0.50
                    ),
                    audiomentations.AddBackgroundNoise(
                        f'{dataset_dir}/_background_noise_/brd2601',
                        min_absolute_rms_in_db=-75.0,
                        max_absolute_rms_in_db=-60.0,
                        noise_rms="absolute",
                        lru_cache_size=50,
                        p=1.0
                    ),
                    #audiomentations.AddGaussianSNR(min_snr_in_db=25, max_snr_in_db=40, p=0.25),
                ])
            globals()['audio_augmentations'] = audio_augmentations
        # Apply random augmentations to the audio sample
        augmented_sample = audio_augmentations(padded_sample, original_sample_rate)
        # Convert the sample rate (if necessary)
        if original_sample_rate != frontend_settings.sample_rate_hz:
            augmented_sample = audio_utils.resample(
                augmented_sample,
                orig_sr=original_sample_rate,
                target_sr=frontend_settings.sample_rate_hz
            )
        # Ensure the sample values are within (-1,1)
        augmented_sample = np.clip(augmented_sample, -1.0, 1.0)
        # Generate a spectrogram from the augmented audio sample
        spectrogram = audio_utils.apply_frontend(
            sample=augmented_sample,
            settings=padded_frontend_settings,
            dtype=np.int8
        )
        # The input audio sample was padded with padding_length_ms of background noise
        # Drop the padded background noise from the final spectrogram used for training
        spectrogram = spectrogram[-height:, :]
        # The output spectrogram is 2D; insert a singleton axis to make it 3D,
        # converting the spectrogram dimensions from
        # <time, features> to
        # <time, 1, features>
        spectrogram = np.expand_dims(spectrogram, axis=-2)
        x_batch[i] = spectrogram
        # Dump the augmented audio sample AND corresponding spectrogram (if necessary)
        data_dump_dir = globals().get('data_dump_dir', None)
        if data_dump_dir:
            try:
                from cv2 import cv2
            except ImportError:
                import cv2
            fn = os.path.basename(audio_path.decode('utf-8'))
            audio_dump_path = f'{data_dump_dir}/{class_id}-{fn[:-4]}-{seed[0]}.wav'
            spectrogram_dumped = np.squeeze(spectrogram, axis=-2)
            # Transpose to put the time on the x-axis
            spectrogram_dumped = np.transpose(spectrogram_dumped)
            # Convert from int8 to uint8
            # (widen to int16 first so that adding 128 cannot overflow the int8 range)
            spectrogram_dumped = np.clip(spectrogram_dumped.astype(np.int16) + 128, 0, 255)
            spectrogram_dumped = spectrogram_dumped.astype(np.uint8)
            # Increase the size of the spectrogram to make it easier to see as a jpeg
            spectrogram_dumped = cv2.resize(spectrogram_dumped, (height*3,width*3))
            valid_sample_length = int((frontend_settings.sample_length_ms * frontend_settings.sample_rate_hz) / 1000)
            valid_augmented_sample = augmented_sample[-valid_sample_length:]
            audio_dump_path = audio_utils.write_audio_file(
                audio_dump_path,
                valid_augmented_sample,
                sample_rate=frontend_settings.sample_rate_hz
            )
            image_dump_path = audio_dump_path.replace('.wav', '.jpg')
            jpg_data = cv2.applyColorMap(spectrogram_dumped, cv2.COLORMAP_HOT)
            cv2.imwrite(image_dump_path, jpg_data)
    return x_batch
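The pipeline above pads each one-second clip with one second of leading audio so the noise reduction block can settle, then keeps only the trailing frames. The sketch below traces just the array shapes through that padding, cropping and axis insertion. It does not call the MLTK frontend: the spectrogram is a zero-filled stand-in, and the frame count uses the assumed one-frame-per-full-window rule. `ms_to_samples` and `n_frames` are hypothetical helpers.

```python
import numpy as np

sample_rate_hz = 16000
sample_length_ms = 1000
padding_length_ms = 1000
window_size_ms, window_step_ms, n_channels = 30, 10, 104

def ms_to_samples(ms: int) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return sample_rate_hz * ms // 1000

def n_frames(length_ms: int) -> int:
    """ASSUMPTION: one frame per full analysis window."""
    return (length_ms - window_size_ms) // window_step_ms + 1

# Stand-in for a length-adjusted keyword clip
keyword = np.full(ms_to_samples(sample_length_ms), 0.5, dtype=np.float32)

# Place the keyword after the padding region, as the pipeline does
padding_length = ms_to_samples(padding_length_ms)
padded = np.zeros(ms_to_samples(sample_length_ms + padding_length_ms), dtype=np.float32)
padded[padding_length:padding_length + len(keyword)] += keyword

# Stand-in for the frontend output computed over the padded clip
padded_spectrogram = np.zeros(
    (n_frames(sample_length_ms + padding_length_ms), n_channels), dtype=np.int8
)

# Keep only the trailing frames, then go from <time, features> to <time, 1, features>
height = n_frames(sample_length_ms)
spectrogram = padded_spectrogram[-height:, :]
model_input = np.expand_dims(spectrogram, axis=-2)

print(padded.shape)              # (32000,)
print(padded_spectrogram.shape)  # (198, 104)
print(spectrogram.shape)         # (98, 104)
print(model_input.shape)         # (98, 1, 104)
```

Because the background-noise augmentations are applied to the whole padded buffer, the discarded leading frames contain noise only. The noise reduction block can use them to build its estimate before the keyword frames arrive.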
##########################################################################################
# Define the MltkDataset object
# NOTE: This class is optional but is useful for organizing the code
#
class MyDataset(mltk_core.MltkDataset):
    def __init__(self):
        super().__init__()
        self.pools = []
        self.summary = ''
    def summarize_dataset(self) -> str:
        """Return a string summary of the dataset"""
        s = self.summary
        s += mltk_core.MltkDataset.summarize_class_counts(my_model.class_counts)
        return s
    def load_dataset(
        self,
        subset: str,
        test:bool = False,
        **kwargs
    ) -> Tuple[tf.data.Dataset, None, tf.data.Dataset]:
        """Load the dataset subset
        This is called automatically by the MLTK before training
        or evaluation.
        Args:
            subset: The dataset subset to return: 'training' or 'evaluation'
            test: This is optional, it is used when invoking a training "dryrun", e.g.: mltk train keyword_spotting_pacman_v3-test
                If this is true, then only return a small portion of the dataset for testing purposes
        Return:
            if subset == training:
                A tuple, (train_dataset, None, validation_dataset)
            else:
                validation_dataset
        """
        if subset == 'training':
            x = self.load_subset('training', test=test)
            validation_data = self.load_subset('validation', test=test)
            return x, None, validation_data
        else:
            x = self.load_subset('validation', test=test)
            return x
    def unload_dataset(self):
        """Unload the dataset by shutting down the processing pools"""
        for pool in self.pools:
            pool.shutdown()
        self.pools.clear()
    def load_subset(self, subset:str, test:bool) -> tf.data.Dataset:
        """Load the subset"""
        if subset in ('validation', 'evaluation'):
            split = (0, validation_split)
        elif subset == 'training':
            split = (validation_split, 1)
            data_dump_dir = globals().get('data_dump_dir', None)
            if data_dump_dir:
                print(f'\n\n*** Dumping augmented samples to: {data_dump_dir}\n\n')
        else:
            split = None
            my_model.class_counts = {}
        # Download the synthetic "direction_commands" dataset and extract into the dataset directory
        audio_datasets.direction_commands.download(dataset_dir, clean_dest_dir=True)
        # Download the Google speech commands dataset into the direction_commands dataset directory
        # This effectively combines the two datasets
        audio_datasets.speech_commands_v2.load_clean_data(dataset_dir, clean_dest_dir=False)
        # Download the mlcommons subset and extract into the dataset sub-directory: '_unknown/mlcommons_keywords'
        audio_datasets.mlcommons.ml_commons_keywords.download(f'{dataset_dir}/_unknown/mlcommons_keywords')
        # Download the ESC-50 dataset and extract into the dataset sub-directory: '_unknown/esc-50'
        audio_datasets.background_noise.esc50.download(f'{dataset_dir}/_unknown/esc-50')
        # Download the Pac-Man video game audio and add it to the _background_noise_/pacman of the dataset
        audio_datasets.utils.download_and_extract(
            dest_dir=f'{dataset_dir}/_background_noise_/pacman',
            urls=(
                'https://github.com/SiliconLabs/mltk_assets/raw/master/datasets/recorded_pacman_game_play.7z',
                '749F552BC2ABA11E618969D8B0F6E5BDD62AC7A2'
            )
        )
        # Download the BRD2601 background microphone audio and add it to the _background_noise_/brd2601 of the dataset
        audio_datasets.background_noise.brd2601.download(f'{dataset_dir}/_background_noise_/brd2601')
        # Download other ambient background audio and add it to the _background_noise_/ambient of the dataset
        audio_datasets.background_noise.ambient.download(
            f'{dataset_dir}/_background_noise_/ambient',
            sample_rate_hertz=frontend_settings.sample_rate_hz
        )
        # Create a tf.data.Dataset from the extracted dataset directory
        max_samples_per_class = my_model.batch_size if test else -1
        class_counts = my_model.class_counts[subset] if subset else my_model.class_counts
        features_ds, labels_ds = tf_dataset_utils.load_audio_directory(
            directory=dataset_dir,
            classes=my_model.classes,
            onehot_encode=True, # We're using categorical cross-entropy so one-hot encode the labels
            shuffle=True,
            seed=42,
            max_samples_per_class=max_samples_per_class,
            unknown_class_percentage=0, # We manually populate the "unknown" class in the add_unknown_samples() callback
            split=split,
            return_audio_data=False, # We only want to return the file paths
            class_counts=class_counts,
            list_valid_filenames_in_directory_function=self.list_valid_filenames_in_directory,
            process_samples_function=self.add_unknown_samples
        )
        if subset:
            # The number of batches to process in each subprocess
            per_job_batch_multiplier = 1000
            per_job_batch_size = my_model.batch_size * per_job_batch_multiplier
            # We use an incrementing counter as the seed for the random augmentations
            # This helps to keep the training reproducible
            try:
                seed_counter = tf.data.Dataset.counter()
            except:
                seed_counter = tf.data.experimental.Counter()
            features_ds = features_ds.zip((features_ds, labels_ds, seed_counter))
            # Usage of tf_dataset_utils.parallel_process()
            # is optional, but can speed-up training as the data augmentations
            # are spread across the available CPU cores.
            # Each CPU core gets its own subprocess,
            # and each subprocess executes audio_pipeline_with_augmentations() on batches of the dataset.
            features_ds = features_ds.batch(per_job_batch_size // per_job_batch_multiplier, drop_remainder=True)
            labels_ds = labels_ds.batch(per_job_batch_size // per_job_batch_multiplier, drop_remainder=True)
            features_ds, pool = tf_dataset_utils.parallel_process(
                features_ds,
                audio_pipeline_with_augmentations,
                dtype=np.int8,
                #n_jobs=84 if subset == 'training' else 32, # These are the settings for a 256 CPU core cloud machine
                #n_jobs=72 if subset == 'training' else 32, # These are the settings for a 128 CPU core cloud machine
                #n_jobs=44 if subset == 'training' else 16, # These are the settings for a 96 CPU core cloud machine
                #n_jobs=50 if subset == 'training' else 25, # These are the settings for a 84 CPU core cloud machine
                #n_jobs=36 if subset == 'training' else 12, # These are the settings for a 64 CPU core cloud machine
                #n_jobs=28 if subset == 'training' else 16, # These are the settings for a 48 CPU core cloud machine
                #n_jobs=.65 if subset == 'training' else .35,
                n_jobs=8,
                name=subset,
            )
            self.pools.append(pool)
            features_ds = features_ds.unbatch()
            labels_ds = labels_ds.unbatch()
            # Pre-fetching batches can help with throughput
            features_ds = features_ds.prefetch(per_job_batch_size)
        # Combine the augmented audio samples with their corresponding labels
        ds = tf.data.Dataset.zip((features_ds, labels_ds))
        # Shuffle the data for each sample
        # A perfect shuffle would use n_samples but this can slow down training,
        # so we just shuffle batches of the data
        #ds = ds.shuffle(n_samples, reshuffle_each_iteration=True)
        if not test:
            ds = ds.shuffle(per_job_batch_size, reshuffle_each_iteration=True)
        # At this point we have a flat dataset of x,y tuples
        # Batch the data as necessary for training
        ds = ds.batch(my_model.batch_size)
        # Pre-fetch a couple training batches to aid throughput
        ds = ds.prefetch(2)
        return ds
    def list_valid_filenames_in_directory(
        self,
        base_directory:str,
        search_class:str,
        white_list_formats:List[str],
        split:Tuple[float,float],
        follow_links:bool,
        shuffle_index_directory:str
    ) -> Tuple[str, List[str]]:
        """Return a list of valid file names for the given class
        This is called by the tf_dataset_utils.load_audio_directory() API.
        This uses the shuffle_file_list_by_group() helper function so that the same "voices"
        are only present in a particular subset.
        """
        assert shuffle_index_directory is None, 'Shuffling the index is not supported by this dataset'
        file_list = []
        index_path = f'{base_directory}/.index/{search_class}.txt'
        # If the index file exists, then read it
        if os.path.exists(index_path):
            with open(index_path, 'r') as f:
                for line in f:
                    file_list.append(line.strip())
        else:
            # Else find all files for the given class in the search directory
            class_base_dir = f'{base_directory}/{search_class}/'
            for root, _, files in os.walk(base_directory, followlinks=follow_links):
                root = root.replace('\\', '/') + '/'
                if not root.startswith(class_base_dir):
                    continue
                for fname in files:
                    if not fname.lower().endswith(white_list_formats):
                        continue
                    abs_path = os.path.join(root, fname)
                    if os.path.getsize(abs_path) == 0:
                        continue
                    rel_path = os.path.relpath(abs_path, base_directory)
                    file_list.append(rel_path.replace('\\', '/'))
            # Shuffle the voice groups
            # then flatten into list
            # This way, when the list is split into training and validation sets
            # the same voice only appears in one subset
            file_list = shuffle_file_list_by_group(file_list, get_sample_group_id_from_path)
            # Write the file list file
            mltk_core.get_mltk_logger().info(f'Generating index for "{search_class}" ({len(file_list)} samples): {index_path}')
            os.makedirs(os.path.dirname(index_path), exist_ok=True)
            with open(index_path, 'w') as f:
                for p in file_list:
                    f.write(p + '\n')
        if len(file_list) == 0:
            raise RuntimeError(f'No samples found for class: {search_class}')
        n_files = len(file_list)
        if split[0] == 0:
            start = 0
            stop = math.ceil(split[1] * n_files)
            # We want to ensure the same person isn't in both subsets,
            # so ensure that the split point does NOT
            # fall between file names with the same hash
            # (recall: same hash = same person saying the word)
            # Get the hash of the other subset
            other_subset_hash = get_sample_group_id_from_path(file_list[stop])
            # Keep moving the 'stop' index back while
            # its hash matches the other side's
            while stop > 0 and get_sample_group_id_from_path(file_list[stop-1]) == other_subset_hash:
                stop -= 1
        else:
            start = math.ceil(split[0] * n_files)
            # Get the hash of this subset
            this_subset_hash = get_sample_group_id_from_path(file_list[start])
            # Keep moving the 'start' index back while
            # its hash matches this side's
            while start > 0 and get_sample_group_id_from_path(file_list[start-1]) == this_subset_hash:
                start -= 1
            stop = n_files
        filenames = file_list[start:stop]
        return search_class, filenames
    def add_unknown_samples(
        self,
        directory:str,
        sample_paths:Dict[str,List[str]], # A dictionary: <class name>, [<sample paths relative to directory>],
        split:Tuple[float,float],
        follow_links:bool,
        white_list_formats:List[str],
        shuffle:bool,
        seed:int,
        **kwargs
    ):
        """Generate a list of all possible "unknown" samples for this given subset.
        Then populate the "_unknown_" class with a random subset of the "unknown" samples.
        The subset should be the approximate size of the "known" samples
        """
        mlcommons_keywords_dir = f'{dataset_dir}/_unknown/mlcommons_keywords'
        esc50_dir = f'{dataset_dir}/_unknown/esc-50/audio'
        # Create a list of all possible "unknown" samples
        file_list = []
        # Add all the mlcommons_keywords "unknown" samples that are not one of the "known" keywords
        all_keywords = []
        for kw in os.listdir(mlcommons_keywords_dir):
            if kw in my_model.classes:
                continue
            d = f'{mlcommons_keywords_dir}/{kw}'
            if not os.path.isdir(d):
                continue
            for fn in os.listdir(d):
                if fn.endswith('.wav'):
                    all_keywords.append(f'_unknown/mlcommons_keywords/{kw}/{fn}')
        # Get a random subset of the "unknown" samples
        # We only select 11k to balance with the "known" classes
        rng = RandomState(seed)
        all_keywords = sorted(all_keywords)
        rng.shuffle(all_keywords)
        file_list.extend(all_keywords[:11000])
        # Add all the samples from the ESC-50 dataset, which has 2k samples
        # This way, we have random keywords and random noises in the "unknown" class's sample list
        for fn in os.listdir(esc50_dir):
            if not fn.endswith('.wav'):
                continue
            file_list.append(f'_unknown/esc-50/audio/{fn}')
        # Sort the unknown samples by "voice"
        # This helps to ensure voices are only present in a given subset
        file_list = sorted(file_list)
        file_list = shuffle_file_list_by_group(file_list, get_sample_group_id_from_path)
        # Split the file list for the current subset
        sample_paths['_unknown_'] = split_file_list(file_list, split)
def get_sample_group_id_from_path(p:str) -> str:
    """Extract the "voice hash" from the sample path.
    This is used by shuffle_file_list_by_group() so that when we split
    the dataset for training and validation, the same "voice" only appears
    in one of the subsets.
    """
    fn = os.path.basename(p)
    fn = fn.replace('.wav', '').replace('.mp3', '')
    # If this sample is from the Google speech commands dataset
    # c53b335a_nohash_1.wav -> c53b335a
    if '_nohash_' in fn:
        toks = fn.split('_')
        return toks[0]
    # If this sample is from an mlcommons dataset
    # common_voice_en_20127845.wav -> 20127845
    if fn.startswith('common_voice_'):
        toks = fn.split('_')
        return toks[-1]
    # If this sample is from a silabs synthetic dataset
    # azure_af-ZA+AdriNeural+None+aww+medium+low+588b6ace.wav -> 588b6ace
    if fn.startswith(('gcp_', 'azure_', 'aws_')):
        toks = fn.split('+')
        return toks[-1]
    if '/esc-50/' in p:
        toks = fn.split('-')
        return toks[1]
    raise RuntimeError(f'Failed to get voice hash from {p}')
my_model.dataset = MyDataset()
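The dataset code above keeps every "voice" in a single subset by shuffling whole groups and then nudging the split point to a group boundary. The sketch below shows that idea in isolation, using only the standard library. `shuffle_by_group`, `split_on_group_boundary` and `voice_id` are hypothetical stand-ins for illustration, not the MLTK's `shuffle_file_list_by_group()` or `split_file_list()`. The file names are made up to mimic the Speech Commands naming pattern.

```python
import math
import os
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def voice_id(path: str) -> str:
    """Group id for Speech Commands style names: c53b335a_nohash_1.wav -> c53b335a"""
    return os.path.basename(path).split('_')[0]

def shuffle_by_group(paths: List[str], group_id: Callable[[str], str], seed: int = 42) -> List[str]:
    """Shuffle whole groups while keeping each group's files adjacent."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for p in sorted(paths):
        groups[group_id(p)].append(p)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    return [p for k in keys for p in groups[k]]

def split_on_group_boundary(
    paths: List[str],
    group_id: Callable[[str], str],
    validation_split: float
) -> Tuple[List[str], List[str]]:
    """Return (validation, training), moving the split point back
    until it no longer cuts through a group."""
    stop = math.ceil(validation_split * len(paths))
    while 0 < stop < len(paths) and group_id(paths[stop - 1]) == group_id(paths[stop]):
        stop -= 1
    return paths[:stop], paths[stop:]

# Hypothetical file names: 6 voices with 1 to 6 recordings each (21 files)
paths = [f'left/{v:08x}_nohash_{i}.wav' for v in range(1, 7) for i in range(v)]

ordered = shuffle_by_group(paths, voice_id)
validation, training = split_on_group_boundary(ordered, voice_id, validation_split=0.15)

val_voices = {voice_id(p) for p in validation}
train_voices = {voice_id(p) for p in training}
print(len(validation), len(training))
print(val_voices & train_voices)  # set(): no voice appears in both subsets
```

Moving the boundary backwards can shrink the validation subset slightly below the requested fraction. That is the price of never splitting one speaker across both subsets.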
#################################################
# Audio Classifier Settings
#
# These are additional parameters to include in
# the generated .tflite model file.
# The settings are used by the ble_audio_classifier app
# NOTE: Corresponding command-line options will override these values.
# Controls the smoothing.
# Drop all inference results that are older than <now> minus window_duration
# Longer durations (in milliseconds) will give a higher confidence that the results are correct, but may miss some commands
my_model.model_parameters['average_window_duration_ms'] = 300
# Define a specific detection threshold for each class
#my_model.model_parameters['detection_threshold'] = 235
my_model.model_parameters['detection_threshold_list'] = list(map(lambda x: int(x*255), [.85, .85, .92, .92, .99, .99, 1.0]))
# Amount of milliseconds to wait after a keyword is detected before detecting the SAME keyword again
# A different keyword may be detected immediately after
my_model.model_parameters['suppression_ms'] = 700
# The minimum number of inference results to average when calculating the detection value
my_model.model_parameters['minimum_count'] = 2
# Set the volume gain scaler (i.e. amplitude) to apply to the microphone data. If 0 or omitted, no scaler is applied
my_model.model_parameters['volume_gain'] = 0.0
# This is the amount of time in milliseconds between audio processing loops
# Since we're using the audio detection block, we want this to be as short as possible
my_model.model_parameters['latency_ms'] = 10
# Enable verbose inference results
my_model.model_parameters['verbose_model_output_logs'] = False
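To make the threshold list above concrete, the sketch below prints the per-class thresholds on the 0-255 scale. It then applies a simplified version of the averaging logic described by `average_window_duration_ms` and `minimum_count`. The `detect` function is a hypothetical illustration, not the audio classifier application's implementation. The assumptions are that a keyword is reported when its windowed average reaches its threshold and that `_unknown_` is never reported. Suppression (`suppression_ms`) is not modeled.

```python
from typing import List, Optional, Tuple

CLASSES = ['left', 'right', 'up', 'down', 'stop', 'go', '_unknown_']
# Same conversion as in the settings above: probabilities scaled to 0-255
THRESHOLDS = [int(x * 255) for x in [.85, .85, .92, .92, .99, .99, 1.0]]
AVERAGE_WINDOW_DURATION_MS = 300
MINIMUM_COUNT = 2

def detect(results: List[Tuple[int, List[int]]], now_ms: int) -> Optional[str]:
    """Return the detected keyword, or None.

    results holds (timestamp_ms, per-class scores in 0-255) tuples.
    ASSUMPTION: detection compares the windowed average to the class threshold.
    """
    # Drop results older than the averaging window
    recent = [scores for t, scores in results if now_ms - t <= AVERAGE_WINDOW_DURATION_MS]
    if len(recent) < MINIMUM_COUNT:
        return None
    # Average each class score over the remaining results
    averaged = [sum(column) / len(recent) for column in zip(*recent)]
    best = max(range(len(CLASSES)), key=lambda i: averaged[i])
    if CLASSES[best] == '_unknown_' or averaged[best] < THRESHOLDS[best]:
        return None
    return CLASSES[best]

print(dict(zip(CLASSES, THRESHOLDS)))
# {'left': 216, 'right': 216, 'up': 234, 'down': 234, 'stop': 252, 'go': 252, '_unknown_': 255}

left_scores = [
    (0,   [250, 0, 0, 0, 0, 0, 5]),
    (100, [240, 0, 0, 0, 0, 0, 15]),
    (200, [230, 0, 0, 0, 0, 0, 25]),
]
stop_scores = [
    (0,   [0, 0, 0, 0, 250, 0, 5]),
    (100, [0, 0, 0, 0, 240, 0, 15]),
]
print(detect(left_scores, now_ms=250))      # left (average 240 >= 216)
print(detect(stop_scores, now_ms=250))      # None (average 245 < 252)
print(detect(left_scores[:1], now_ms=250))  # None (fewer than MINIMUM_COUNT results)
```

The higher thresholds for "stop" and "go" mean those keywords need a near-certain averaged score before they are reported, while "left" and "right" are accepted at lower confidence.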
##########################################################################################
# The following allows for running this model training script directly, e.g.:
# python keyword_spotting_pacman_v3.py
#
# Note that this has the same functionality as:
# mltk train keyword_spotting_pacman_v3
#
if __name__ == '__main__':
    from mltk import cli
    # Setup the CLI logger
    cli.get_logger(verbose=True)
    # If this is true then this will do a "dry run" of the model training
    # If this is false, then the model will be fully trained
    test_mode_enabled = True
    # Train the model
    # This does the same as issuing the command: mltk train keyword_spotting_pacman_v3-test --clean
    train_results = mltk_core.train_model(my_model, clean=True, test=test_mode_enabled)
    print(train_results)
    # Evaluate the trained .h5 (i.e. float32, non-quantized) model
    # This does the same as issuing the command: mltk evaluate keyword_spotting_pacman_v3-test
    tflite_eval_results = mltk_core.evaluate_model(my_model, verbose=True, test=test_mode_enabled)
    print(tflite_eval_results)
    # Profile the model in the simulator
    # This does the same as issuing the command: mltk profile keyword_spotting_pacman_v3-test
    profiling_results = mltk_core.profile_model(my_model, test=test_mode_enabled)
    print(profiling_results)