Synthetic Audio Dataset Generation

This tutorial describes how to use the Text-to-Speech (TTS) features of the Google, Microsoft, and Amazon clouds to generate synthetic keyword audio datasets.

With this, keyword spotting machine learning models may be trained to detect custom keywords.

In this tutorial, we use the AudioDatasetGenerator Python package to generate a synthetic “Alexa” dataset.

Content

This tutorial is divided into the following sections:

  • Overview

  • Google Cloud Platform (GCP) Setup

  • Microsoft Azure Setup

  • Amazon Web Services (AWS) Setup

  • Alexa Example

  • Next Steps

Overview

A quality keyword spotting dataset should have the following characteristics:

  • Lots of samples - At least 3k+ samples for each keyword (10-100k+ is recommended)

  • Lots of different voices - The more people speaking the keywords the better, as this will help account for different accents

  • Same voice, different pronunciations - The same speaker saying the keyword in different ways (e.g. fast, slow, “happy”, “sad”, “excited”, etc.)

  • Lots of negative samples - Words that are not the keyword(s) but sound similar or would be commonly heard in the field

All of these characteristics help to make a representative dataset. The more representative the dataset (i.e. the more similar the data is to what would be heard in the field), the better the machine learning model will likely perform.

Recording real people

Ideally, the dataset would be generated by recording 5-50k+ different people saying the keyword(s), as this would help to make the most representative dataset. While third-party services are available to aid with dataset generation, they can be expensive ($20k+).

Pros

  • More representative samples

Cons

  • Expensive (time and money)

  • Harder to generate “negative” samples (harder to record non-keywords)

  • More data cleaning required (need to verify audio samples as humans make errors)

Synthetic generation

An alternative approach is to synthetically generate the audio samples using Text-to-Speech (TTS) services. Cloud services like Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS) allow for converting written text to audio samples. While these services are primarily intended for converting long strings of text, they also work for single words.

A major concern with this approach is that the generated audio samples may not be representative (i.e. they may sound too “robotic”). However, in many cases the audio generated by these TTS services sounds realistic thanks to modern AI techniques.

Pros

  • Less expensive (time and money)

  • Easier to generate “negative” samples

  • Little or no data cleaning required

Cons

  • Less representative samples (limited by the number of “voices” offered by the TTS services)

Note about synthetic augmentations

Many of the TTS services allow for augmenting the audio by adjusting the “pitch” or “speaking rate”. While this can help to increase the number of audio samples, these augmentations should be used sparingly as they do not fundamentally change the underlying audio. As such, a large number of samples with small increments in the augmentation settings will likely not help the machine learning model to generalize (in fact, they could cause the ML model to overfit the data).
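To make this concrete, rate and pitch augmentations ultimately map onto SSML prosody attributes, so every augmented sample is still rendered by the same underlying synthesized voice. The following is a minimal sketch; the make_ssml helper is hypothetical and not part of the MLTK API:

```python
# Hypothetical helper (not part of the MLTK API) showing how rate/pitch
# augmentations map onto SSML <prosody> attributes. Every combination is
# still the same synthesized voice, which is why many small augmentation
# increments add little new information for the model.
def make_ssml(word: str, rate: str = "medium", pitch: str = "medium") -> str:
    return (
        '<speak version="1.0" xml:lang="en-US">'
        f'<prosody rate="{rate}" pitch="{pitch}">{word}</prosody>'
        '</speak>'
    )

# 3 rates x 3 pitches = 9 variants of the same voice saying the same word
for rate in ("x-slow", "medium", "x-fast"):
    for pitch in ("low", "medium", "high"):
        print(make_ssml("alexa", rate=rate, pitch=pitch))
```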

Note about languages

Each TTS service supports numerous different languages. While there are exceptions, it has been found that English keywords generated with different languages are still valid (e.g. English text generated with a Russian “voice” still sounds like a valid English audio sample with a Russian accent). This is very useful as it allows for generating keywords with different accents and thus allows for generating a more representative dataset.

When using different languages, it is recommended to spot check the generated audio samples to ensure they’re valid.

Note about the “negative” class

Training a machine learning model to detect one or more keywords is (relatively) easy. The hard part is training a machine learning model to reliably detect the keywords and reject everything else.

For example, consider a “smart” trashcan whose lid opens with the keyword “open” and closes with the keyword “close”. The trashcan should only react to those two keywords and ignore all other words and sounds (i.e. it should have a low false-positive rate). For this application, a low false-positive rate is critical; otherwise the trashcan lid would constantly open and close during any nearby conversation.

Training an ML model to trigger on the “open” and “close” keywords is fairly simple. The hard part is getting the ML model to ignore everything else (e.g. words like “opening”, “closet”, etc.).

To help solve this problem, the dataset should have a large “negative” class – it should have lots of samples that sound similar to the keywords. Generating the negative samples is fairly trivial with synthetic dataset generation as it is just a matter of supplying the negative keywords to the generation script.

Note about cost

While using the cloud TTS services is relatively cheap, they are NOT FREE!

These services typically charge per character that is sent to the generation request. The TTS services usually offer a certain number of free characters per month and then charge once the limit is exceeded.

For instance, consider the Google Text-to-Speech pricing table (as of January 2023, see their website for the latest prices):

Feature                                    | Free per month            | Price after free usage limit is reached
-------------------------------------------|---------------------------|----------------------------------------
Standard (non-WaveNet, non-Neural2) voices | 0 to 4 million characters | $0.000004 USD per character ($4.00 USD per 1 million characters)
WaveNet voices                             | 0 to 1 million characters | $0.000016 USD per character ($16.00 USD per 1 million characters)
Neural2 voices                             | 0 to 1 million characters | $0.000016 USD per character ($16.00 USD per 1 million characters)

Even with this pricing, a lot of keyword audio samples may be generated for very little money.
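As a rough back-of-the-envelope check using the Standard-voice rate from the table above (this deliberately ignores the monthly free tier and any SSML markup overhead, both of which change the real bill):

```python
# Back-of-the-envelope cost estimate using the Standard-voice rate above.
# Ignores the monthly free tier and any SSML markup overhead.
PRICE_PER_CHAR = 4.00 / 1_000_000   # $4.00 USD per 1 million characters

keyword = "alexa"                   # 5 characters per request (text only)
n_samples = 10_000                  # target number of generated samples

total_chars = len(keyword) * n_samples
cost_usd = total_chars * PRICE_PER_CHAR
print(f"{total_chars} characters -> ${cost_usd:.2f} USD")
# 50000 characters -> $0.20 USD
```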

NOTE: The number of characters sent is not only dependent on the length of the keyword but also on the speech markup (SSML) that wraps it.

For instance, the following is an example request sent to the Microsoft cloud:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
            <prosody rate="fast" pitch="medium">
                open
            </prosody>
        </mstts:express-as>
    </voice>
</speak>

All of the characters in the request (minus the whitespace) contribute to the character count.

To help determine the character counts, the MLTK features the API: AudioDatasetGenerator.count_characters
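The logic can be approximated with a simple stand-in; the helper below is hypothetical (refer to the MLTK docs for the actual behavior of AudioDatasetGenerator.count_characters):

```python
# Hypothetical stand-in for AudioDatasetGenerator.count_characters:
# count every non-whitespace character in the request, since all
# markup characters (minus the whitespace) contribute to the bill.
def count_billable_characters(request: str) -> int:
    return sum(1 for c in request if not c.isspace())

# The Microsoft cloud request from the example above
ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">'
    '<mstts:express-as style="cheerful">'
    '<prosody rate="fast" pitch="medium">open</prosody>'
    '</mstts:express-as>'
    '</voice>'
    '</speak>'
)
print(count_billable_characters(ssml))
```

Note that the billable count is dominated by the markup, not the 4-character keyword itself.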

Google Cloud Platform (GCP) Setup

From Google Cloud Platform Text-to-Speech:

Convert text into natural-sounding speech using an API powered by the best of Google’s AI technologies.

Microsoft Azure Setup

From Microsoft Azure Text-to-Speech:

Text-to-speech enables your applications, tools, or devices to convert text into humanlike synthesized speech. The text-to-speech capability is also known as speech synthesis. Use humanlike prebuilt neural voices out of the box, or create a custom neural voice that’s unique to your product or brand.

Features

  • For a full list of supported voices, languages, and locales, see Language and voice support for the Speech service

  • New customers get $200 in free credits to spend on Text-to-Speech

Amazon Web Services (AWS) Setup

From Amazon Polly

Amazon Polly is a cloud service that converts text into lifelike speech. You can use Amazon Polly to develop applications that increase engagement and accessibility. Amazon Polly supports multiple languages and includes a variety of lifelike voices, so you can build speech-enabled applications that work in multiple locations and use the ideal voice for your customers. With Amazon Polly, you only pay for the text you synthesize.

Alexa Example

The MLTK features the Python package: AudioDatasetGenerator which allows for generating a synthetic keyword dataset using the Google, Microsoft, and Amazon clouds.

The Python script: alexa_dataset_generator.py demonstrates how to use this Python package to generate a synthetic “Alexa” dataset.

The following provides more details about this script.

NOTE: In the example below, the max_count setting is set to a small value to reduce cloud cost. In practice, this value should be set to a much larger value (e.g. 10000).

# Install the MLTK Python package into the local Notebook environment
%pip install silabs-mltk --upgrade
# Import the necessary Python packages
import os
import json
import tqdm
import tempfile
from mltk.utils.audio_dataset_generator import (
    AudioDatasetGenerator,
    Keyword,
    Augmentation,
    VoiceRate,
    VoicePitch
)
# NOTE: The following credentials are provided as an example.
#       You must generate your own credentials to run this example

###################################################################################################
# Configure your Azure credentials
# See: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?pivots=programming-language-python
os.environ['SPEECH_KEY'] = 'e8699507e7c04a4cb8afdba62986987c'
os.environ['SPEECH_REGION'] = 'westus2'
###################################################################################################
# Configure your Google credentials
# See: https://codelabs.developers.google.com/codelabs/cloud-text-speech-python3

# NOTE: The "service account" JSON was copied into this Python script for demonstration purposes.
#       You could also just set GOOGLE_APPLICATION_CREDENTIALS to point to your service account .json file
#       and remove the following.
#
#       If you do copy and paste into this script, be sure that the "private_key" is on a single line, e.g.:
#       "private_key_id": ...,
#       "private_key": "-----BEGIN PRIVATE KEY---- ....",
#       "client_email":  ...,
#
#       NOT:
#
#       "private_key": "-----BEGIN PRIVATE KEY--- ...
#       NEB6Y5ZODG2DYJmM+JdAHcNaPRD9/hAMRG3jl2jisVZO ...
#       03aEXJYOEWTbLWfPYxpNQyz4wKBgQDD+yVYWCrbXEECn ...
#       ... -----END PRIVATE KEY-----\n",
#       "client_email": ...,
#       "client_id": ...,
#
gcp_service_account_json_path = f'{tempfile.gettempdir()}/gcp_key.json'
gcp_service_account_json = """
{
  "type": "service_account",
  "project_id": "strange-firefly-374023",
  "private_key_id": "8e074b2dc4da026810d6b728e1588e79a745a08c",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCuQ4FpO6IlIB78\nmHYRHb1Ei2PCEgtthRlXbQwE6RsWppTtopQVpLSBXs30FRarpd6d4hgqeL46gC2d\nCRHH8oMrgKMB4pAGzHCEfJd/XjKckNsyIPLTqGBjAu3pT/wMukIHdYiYzDD6qjr3\nuGghP8HkT1gXgcGdFkpLWVoj9b3M6b5/3cVgBthciycCCYkHqnFOn6MTEe6OFMPZ\nWXY3FrEwyWjIWIIIvPbQaNoIJs92Gb+FFGsG2Ta63TgsZmBvVHjtd3A98EwmvwSz\nbIPXjqh5qLh3YCHdGT42MqBXrInN11kMyOC56A2Ic4mvrQ3I8oAPOs2L6ugLwX9J\nS6Sq5JW1AgMBAAECggEAT7pS2vKNnK61fpvCaNJSZangWkonMFRU48rgVN7RpetQ\n9+gKGFziuM3HLIT5ek7JKzLmG4higCFkvRQJLpGlsaGI8rPVcUbXs8XNCljujvM3\nVhf9ARln/+S3NKeDic8tpnv/oujI/+YiVHPqMEwbSXmDtD2Jd3VbSF34/7rOu5Dz\n56bGmBbNEB6Y5ZODG2DYJmM+JdAHcNaPRD9/hAMRG3jl2jisVZOgrleNelkZnrPe\n9t0uWqIv5EJItoVBZd+EzADFfjfTDrKfWv1QixeMiak1aTbs5bHKNK5ecYFFMpms\nCIVgp3wRxq7nFrJkTnWdJzeAFjQw4CKWLmN4xc2FgQKBgQDjobnxjGO7GQ1pfEiQ\nVsSuWJiXy63trU6jwrrhR1B9XUPh6VivH2dZ4lPfPywER9LX6oMTn6AIzihTPq1I\n19eskH0H6hwBw2yDzWgHZRMHB9xs5Ys+HiBKWrZ9NW77uWH1D9g/EJcN6A2ZL2ig\nK03aEXJYOEWTbLWfPYxpNQyz4wKBgQDD+yVYWCrbXEECnA0fohw8wIRo6dS6G84M\nMCzkr0YooxPb8zrIIm+mv7PAcCElaSz4LZbC2Hcb1mvV9p6o2IEUHqNgabWLFWiD\ng7CC7rm4qEE87p5U4oBUhPCiuZpA3UeAqBhxMWd1oXw5rJVXenNe+7G4JZKxERU/\nQIf7cw6zhwKBgBy5dctjWdpsSOL8yfNc36jYiTjufN43Nms30XlIFIIdWMmTNpuy\nrMoM42SShi1sGtEgSLYbOIij6zbF+/vrMM4X1Y9AHZSjYngnXW9Bc+s5NLmRJccK\n6iw30jtumLivJgtUmocqwsUAeWbRMrSzgjl4ZiN3xl/aIfkcPTGxfg7dAoGAVz6b\njmuZkJPOIRJFSVrKhUUS7P2DhOJR5N0hbyCT9A09DwKFnYiu+aWHqNiB+PyMV2M8\nJTtmMs9OrC6gzPus4r8M7iPA/Myn/TwHvRH3PbwxZqW3eIRoqrePxHEpuUyIwz6R\nuvpKW3RrL+WjihDqAVO89wRK/GZldgYNQyQiXEsCgYEAj+8nsq1UGod7SqfPiA/n\n3Wur4A+UYT8/nuaTK2WW/GTBC+eDDjRE1lZ3f/UQGTSXLSV7T1mw4a7EKrkFl36P\nLnVeFBTB3UCd8JJ0LPBtOqru9I8ns+a4FqOPMljoYElGtyT1Oy+vxfwYA7cmRz/d\n49bE21meuV3pRV1QWrrteEM=\n-----END PRIVATE KEY-----\n",
  "client_email": "my-tts-sa@strange-firefly-374023.iam.gserviceaccount.com",
  "client_id": "109154742213348109867",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/my-tts-sa%40strange-firefly-374023.iam.gserviceaccount.com"
}
"""
gcp_service_account_json = gcp_service_account_json.strip().replace(',\n', ',').replace('\n', '\\n').replace('{\\n', '{\n').replace('\\n}', '\n}')

with open(gcp_service_account_json_path, 'w') as f:
    f.write(gcp_service_account_json)

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = gcp_service_account_json_path
with open(os.environ['GOOGLE_APPLICATION_CREDENTIALS'], 'r') as f:
    credentials = json.load(f)
os.environ['PROJECT_ID'] = credentials['project_id']
###################################################################################################
# Configure your AWS credentials
# See: https://docs.aws.amazon.com/polly/latest/dg/get-started-what-next.html
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIATZWWZR5TWBUNF6IX'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'v0IRHPUGeNwj1CA7saVduF1uxW84bgkzQpOWLfdr'
os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
###################################################################################################
# Define the directory where the dataset will be generated
OUT_DIR = f'{tempfile.gettempdir()}/alexa_dataset'.replace('\\', '/')
###################################################################################################
# Define the keywords and corresponding aliases to generate
# For the _unknown_ class (e.g. negative class), we want words that sound similar to "alexa".
# NOTE: If the base word starts with an underscore, it is not included in the generation list.
# So the generation list will be:
# alexa, ehlexa, eelexa, aalexa
# ah, aag, a, o, uh, ...
#
# The dataset will have the directory structure:
# $TEMP/alexa_dataset/alexa/sample1.wav
# $TEMP/alexa_dataset/alexa/sample2.wav
# $TEMP/alexa_dataset/alexa/...
# $TEMP/alexa_dataset/_unknown_/sample1.wav
# $TEMP/alexa_dataset/_unknown_/sample2.wav
# $TEMP/alexa_dataset/_unknown_/...
KEYWORDS = [
    Keyword('alexa',
        max_count=100, # In practice, the max count should be much larger (e.g. 10000)
        aliases=('ehlexa', 'eelexa', 'aalexa')
    ),
    Keyword('_unknown_',
        max_count=200, # In practice, the max count should be much larger (e.g. 20000)
        aliases=(
        'ah', 'aah', 'a', 'o', 'uh', 'ee', 'aww', 'ala',
        'alex', 'lex', 'lexa', 'lexus', 'alexus', 'exus', 'exa',
        'alert', 'alec', 'alef', 'alee', 'ales', 'ale',
        'aleph', 'alefs', 'alevin', 'alegar', 'alexia',
        'alexin', 'alexine', 'alencon', 'alexias',
        'aleuron', 'alembic', 'alice', 'aleeyah'
    ))
]

print('NOTE: In practice, the "max_count" KEYWORDS setting should be a much larger value (e.g. 10000)')
###################################################################################################
# Define the augmentations to apply to the keywords
AUGMENTATIONS = [
    Augmentation(rate=VoiceRate.xslow, pitch=VoicePitch.low),
    Augmentation(rate=VoiceRate.xslow, pitch=VoicePitch.medium),
    Augmentation(rate=VoiceRate.xslow, pitch=VoicePitch.high),
    Augmentation(rate=VoiceRate.medium, pitch=VoicePitch.low),
    Augmentation(rate=VoiceRate.medium, pitch=VoicePitch.medium),
    Augmentation(rate=VoiceRate.medium, pitch=VoicePitch.high),
    Augmentation(rate=VoiceRate.xfast, pitch=VoicePitch.low),
    Augmentation(rate=VoiceRate.xfast, pitch=VoicePitch.medium),
    Augmentation(rate=VoiceRate.xfast, pitch=VoicePitch.high),
]
###################################################################################################
# Instantiate the AudioDatasetGenerator
with AudioDatasetGenerator(
    out_dir=OUT_DIR,
    n_jobs=8 # We want to generate the keywords across 8 parallel jobs
) as generator:
    # Load the cloud backends, installing the Python packages if necessary
    generator.load_backend('aws', install_python_package=True)
    generator.load_backend('gcp', install_python_package=True)
    generator.load_backend('azure', install_python_package=True)

    print('Listing voices ...')
    voices = generator.list_voices()

    # Generate a list of all possible configurations, randomly shuffle, then truncate
    # based on the "max_count" specified for each keyword
    print('Listing configurations ...')
    all_configurations = generator.list_configurations(
        keywords=KEYWORDS,
        augmentations=AUGMENTATIONS,
        voices=voices,
        truncate=True,
        seed=42
    )
    n_configs = sum(len(x) for x in all_configurations.values())

    # Print a summary of the configurations
    print(generator.get_summary(all_configurations))

    input(
        '\nWARNING: Running this script is NOT FREE!\n\n'
        'Each cloud backend charges a different rate per character.\n'
        'The character counts are listed above.\n\n'
        'Refer to each backend\'s docs for the latest pricing:\n'
        '- AWS: https://aws.amazon.com/polly/pricing\n'
        '- Azure: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services\n'
        '- Google: https://cloud.google.com/text-to-speech/pricing\n'
        '\nPress "enter" to continue and generate the dataset\n'
    )

    # Generate the dataset (with pretty progress bars)
    print(f'Generating keywords at: {generator.out_dir}\n')
    with tqdm.tqdm(total=n_configs, desc='Overall'.rjust(10), unit='word', position=1) as pb_outer:
        for keyword, config_list in all_configurations.items():
            with tqdm.tqdm(desc=keyword.value.rjust(10), total=len(config_list), unit='word', position=0) as pb_inner:
                for config in config_list:
                    generator.generate(
                        config,
                        on_finished=lambda _: (pb_inner.update(1), pb_outer.update(1))
                    )
                generator.join() # Wait for the current keyword to finish before continuing to the next
Listing voices ...
Listing configurations ...
Voice Counts
---------------------
  aws   : 21
  azure : 112
  gcp   : 96
  Total : 229

Keyword Counts
---------------------
  alexa:
    aws   : 8.0
    azure : 60.0
    gcp   : 32.0
    Total : 100.0
  _unknown_:
    azure : 105.0
    gcp   : 80.0
    aws   : 15.0
    Total : 200.0
  Overall total: 300.0

Character Counts
---------------------
  alexa:
    aws   : 394.0
    azure : 10.8k
    gcp   : 184.0
  _unknown_:
    azure : 19.8k
    gcp   : 369.0
    aws   : 623.0
  Backend totals:
    aws   : 1.0k
    azure : 30.6k
    gcp   : 553.0

Generating keywords at: E:/alexa_dataset
     alexa: 100%|██████████| 100/100 [00:07<00:00, 13.52word/s]
 _unknown_: 100%|██████████| 200/200 [00:12<00:00, 16.17word/s]
   Overall: 100%|██████████| 300/300 [00:19<00:00, 15.17word/s]
# Convert the generated dataset directory into an archive (.tar.gz)
from mltk.utils.archive import gzip_directory_files

print(f'Generating archive from {OUT_DIR} (this may take a while) ...')
archive_path = gzip_directory_files(OUT_DIR)
print(f'Dataset archive path: {archive_path}')
try:
  # If this is executing from Google Colab, then download the dataset archive
  from google.colab import files
  files.download(archive_path)
except ImportError:
  pass

Next Steps

See the Keyword Spotting - Alexa tutorial for how to use this dataset to train a keyword spotting model.