An overview of how to detect keywords using Machine Learning (ML) on an embedded device
The basic idea behind keyword spotting is that a machine learning (ML) model is trained to detect specific words, e.g. "On" and "Off". On an embedded device, microphone audio is fed into the ML model, and if the model detects a keyword it notifies the firmware application, which reacts accordingly.
1) Capture microphone audio
2) Convert the audio into a 2D spectrogram image (see the sketch after step 3)
3) Provide the spectrogram image to the ML model.
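Step 2 is typically implemented with a short-time Fourier transform. The snippet below is a minimal sketch using scipy.signal.spectrogram on synthetic audio; the MLTK uses its own audio frontend on-device, so the sample rate, window size, and hop size here are illustrative assumptions only.

```python
import numpy as np
from scipy.signal import spectrogram

SAMPLE_RATE = 16000  # 16 kHz is a common rate for keyword spotting (assumption)

# Stand-in for step 1: 1s of captured microphone audio
audio = np.random.randn(SAMPLE_RATE).astype(np.float32)

# Step 2: convert the 1D audio into a 2D time/frequency image.
# The 30ms window and 20ms hop are illustrative values, not MLTK defaults.
freqs, times, spec = spectrogram(
    audio,
    fs=SAMPLE_RATE,
    nperseg=480,    # 30ms analysis window
    noverlap=160,   # 480 - 160 = 320 samples -> 20ms hop
)

# Log-scale the power values, as is typical for audio ML features
log_spec = np.log(spec + 1e-10)
print(f"Spectrogram shape: {log_spec.shape}")  # (frequency bins, time steps)
```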
The output of the ML model is the probability of each possible keyword being in the spectrogram, e.g.:
On: 95%, Off: 4%, Silence: 0.3%, Unknown: 0.7%
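As a small illustration, this output can be treated as a probability vector; the class ordering below is an assumption:

```python
import numpy as np

classes = ["on", "off", "silence", "unknown"]       # class order is an assumption
predictions = np.array([0.95, 0.04, 0.003, 0.007])  # values from the example above

best = int(np.argmax(predictions))
print(f"Most likely keyword: '{classes[best]}' ({predictions[best]:.0%})")
# -> Most likely keyword: 'on' (95%)
```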
Optionally, repeat steps 1-3 multiple times and calculate a running average of each keyword's probability.

4) If the probability of a given keyword exceeds a threshold, then the keyword is considered detected. Notify the application:

IF MAX(averaged_predictions) > threshold:
    keyword_id = ARGMAX(averaged_predictions)
    notify_application(keyword_id)
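A minimal Python sketch of this running-average and threshold logic is shown below. The threshold value, the averaging window length, and the notify_application stub are illustrative assumptions; on a real device this logic runs in firmware.

```python
import collections
import numpy as np

CLASSES = ["on", "off", "silence", "unknown"]
THRESHOLD = 0.80      # illustrative detection threshold
WINDOW_LENGTH = 5     # how many recent predictions to average (assumption)

# Holds the most recent model outputs; old entries fall off automatically
recent_predictions = collections.deque(maxlen=WINDOW_LENGTH)

def notify_application(keyword_id: int) -> None:
    """Stand-in for notifying the firmware application."""
    print(f"Detected keyword: '{CLASSES[keyword_id]}'")

def process_model_output(predictions: np.ndarray) -> None:
    """Add one model output to the running average and apply the threshold."""
    recent_predictions.append(predictions)
    averaged_predictions = np.mean(recent_predictions, axis=0)
    keyword_id = int(np.argmax(averaged_predictions))
    if averaged_predictions[keyword_id] > THRESHOLD:
        notify_application(keyword_id)

# Simulate three inference results as the audio window slides over a keyword
for p in ([0.60, 0.20, 0.10, 0.10],
          [0.90, 0.04, 0.03, 0.03],
          [0.95, 0.02, 0.01, 0.02]):
    process_model_output(np.array(p))
# Detection fires on the third call, once the average for "on" exceeds 0.80
```

Using a fixed-length deque means stale predictions automatically age out of the average as new audio windows are processed.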
The following video shows a 1s audio sample being processed in increments of 100ms, so the ML model "sees" the same audio sample shifted by 100ms each time. The 100ms increment simulates the time it takes the embedded device to generate the spectrogram and run it through the ML model. In this case, many predictions contribute to the running average, giving high confidence in a keyword detection.
NOTE: This video was created using the view_audio MLTK command
The following video shows the same 1s audio sample being processed in increments of 400ms, so the ML model "sees" the audio sample shifted by 400ms each time. The 400ms increment simulates the time it takes the embedded device to generate the spectrogram and run it through the ML model. In this case, only a few predictions contribute to the running average, giving low confidence in a keyword detection.
The video plays in a loop; in reality, the embedded device would only "see" the audio 2 times.
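This relationship can be quantified directly: the shorter the processing loop, the more times the sliding window overlaps a spoken keyword, and the more predictions contribute to the running average. A rough calculation for the two loop times above, assuming a 1s keyword:

```python
KEYWORD_DURATION_MS = 1000  # the 1s audio sample from the videos above

for loop_time_ms in (100, 400):
    # How many times the sliding window overlaps the keyword, i.e. how many
    # predictions contribute to the running average
    n_predictions = KEYWORD_DURATION_MS // loop_time_ms
    print(f"{loop_time_ms}ms loop -> ~{n_predictions} predictions averaged")

# Output:
# 100ms loop -> ~10 predictions averaged
# 400ms loop -> ~2 predictions averaged
```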
The following are typical approaches to reducing processing time: