mltk.datasets.audio.speech_commands.speech_commands_v2

Google Speech Commands v2

https://www.tensorflow.org/datasets/catalog/speech_commands

This is a set of one-second .wav audio files, each containing a single spoken English word. These words are from a small set of commands, and are spoken by a variety of different speakers. The audio files are organized into folders based on the word they contain, and this data set is designed to help train simple machine learning models. This dataset is covered in more detail at https://arxiv.org/abs/1804.03209.

It’s licensed under the Creative Commons BY 4.0 license. See the LICENSE file in this folder for full details. Its original location was at http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz.

History

Version 0.01 of the data set was released on August 3rd 2017 and contained 64,727 audio files.

This is version 0.02 of the data set containing 105,829 audio files, released on April 11th 2018.

Collection

The audio files were collected using crowdsourcing; see aiyprojects.withgoogle.com/open_speech_recording for some of the open source audio collection code we used (and please consider contributing to enlarge this data set). The goal was to gather examples of people speaking single-word commands, rather than conversational sentences, so they were prompted for individual words over the course of a five-minute session. Twenty core command words were recorded, with most speakers saying each of them five times. The core words are “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, “Zero”, “One”, “Two”, “Three”, “Four”, “Five”, “Six”, “Seven”, “Eight”, and “Nine”. To help distinguish unrecognized words, there are also ten auxiliary words, which most speakers only said once. These include “Bed”, “Bird”, “Cat”, “Dog”, “Happy”, “House”, “Marvin”, “Sheila”, “Tree”, and “Wow”.

Organization

The files are organized into folders, with each directory name labelling the word that is spoken in all the contained audio files. No details were kept of any of the participants' age, gender, or location, and random ids were assigned to each individual. These ids are stable though, and encoded in each file name as the first part before the underscore. If a participant contributed multiple utterances of the same word, these are distinguished by the number at the end of the file name. For example, the file path happy/3cfc6b3a_nohash_2.wav indicates that the word spoken was “happy”, the speaker’s id was “3cfc6b3a”, and this is the third utterance of that word by this speaker in the data set. The ‘nohash’ section is to ensure that all the utterances by a single speaker are sorted into the same training partition, to keep very similar repetitions from giving unrealistically optimistic evaluation scores.
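As a sketch, a sample's path can be split into these parts with a few lines of Python (parse_sample_path is a hypothetical helper, not part of the dataset tooling):

import os

def parse_sample_path(path):
    """Split '<label>/<speaker_id>_nohash_<n>.wav' into its parts"""
    label = os.path.basename(os.path.dirname(path))     # e.g. 'happy'
    name = os.path.splitext(os.path.basename(path))[0]  # e.g. '3cfc6b3a_nohash_2'
    speaker_id, _, utterance = name.split('_')
    return label, speaker_id, int(utterance)

print(parse_sample_path('happy/3cfc6b3a_nohash_2.wav'))
# -> ('happy', '3cfc6b3a', 2)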

Processing

The original audio files were collected in uncontrolled locations by people around the world. We requested that they do the recording in a closed room for privacy reasons, but didn’t stipulate any quality requirements. This was by design, since we wanted examples of the sort of speech data that we’re likely to encounter in consumer and robotics applications, where we don’t have much control over the recording equipment or environment. The data was captured in a variety of formats, for example Ogg Vorbis encoding for the web app, and then converted to a 16-bit little-endian PCM-encoded WAVE file at a 16,000 Hz sample rate. The audio was then trimmed to a one-second length to align most utterances, using the extract_loudest_section tool. The audio files were then screened for silence or incorrect words, and arranged into folders by label.
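A minimal sketch, using Python's standard wave module, of checking that a sample matches this format (the file path is illustrative):

import wave

with wave.open('happy/3cfc6b3a_nohash_2.wav', 'rb') as wav:
    assert wav.getsampwidth() == 2       # 16-bit samples
    assert wav.getnchannels() == 1       # mono
    assert wav.getframerate() == 16000   # 16 kHz sample rate
    assert wav.getnframes() <= 16000     # at most one second of audio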

Background Noise

To help train networks to cope with noisy environments, it can be helpful to mix in realistic background audio. The _background_noise_ folder contains a set of longer audio clips that are either recordings or mathematical simulations of noise. For more details, see the _background_noise_/README.md.
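A minimal sketch of such mixing, assuming 16-bit PCM samples loaded as NumPy arrays (mix_background is a hypothetical helper, not part of the dataset tooling):

import numpy as np

def mix_background(sample, noise, volume=0.1):
    """Mix a random window of a longer noise clip into a one-second sample"""
    offset = np.random.randint(0, len(noise) - len(sample) + 1)
    snippet = noise[offset:offset + len(sample)].astype(np.float32)
    mixed = sample.astype(np.float32) + volume * snippet
    return np.clip(mixed, -32768, 32767).astype(np.int16)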

Citations

If you use the Speech Commands dataset in your work, please cite it as:

@article{speechcommandsv2,
  author        = {{Warden}, P.},
  title         = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1804.03209},
  primaryClass  = "cs.CL",
  keywords      = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
  year          = 2018,
  month         = apr,
  url           = {https://arxiv.org/abs/1804.03209},
}

Credits

Massive thanks are due to everyone who donated recordings to this data set, I’m very grateful. I also couldn’t have put this together without the help and support of Billy Rutledge, Rajat Monga, Raziel Alvarez, Brad Krueger, Barbara Petit, Gursheesh Kour, and all the AIY and TensorFlow teams.

Pete Warden, petewarden@google.com

Variables

DOWNLOAD_URL

The public download URL

VERIFY_SHA1

The SHA1 hash of the dataset archive

CLASSES

The class labels supported by this dataset

Functions

list_valid_filenames_in_directory(...)

Return a list of valid file names for the given class

load_clean_data([dest_dir, dest_subdir, ...])

Load the data and remove all "invalid samples".

load_data([dest_dir, dest_subdir, ...])

Download and extract the Google Speech Commands dataset v2, and return the directory path to the extracted dataset

DOWNLOAD_URL = 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz'

The public download URL

VERIFY_SHA1 = '4264eb9753e38eef2ec1d15dfac8441f09751ca9'

The SHA1 hash of the dataset archive

CLASSES = ['Yes', 'No', 'Up', 'Down', 'Left', 'Right', 'On', 'Off', 'Stop', 'Go', 'Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Bed', 'Bird', 'Cat', 'Dog', 'Happy', 'House', 'Marvin', 'Sheila', 'Tree', 'Wow']

The class labels supported by this dataset

load_data(dest_dir=None, dest_subdir='datasets/speech_commands/v2', clean_dest_dir=False)[source]

Download and extract the Google Speech Commands dataset v2, and return the directory path to the extracted dataset

Parameters:
  • dest_dir (str) – Absolute path where the dataset should be extracted. If omitted, defaults to MLTK_CACHE_DIR/<dest_subdir>/ or ~/.mltk/<dest_subdir>/

  • dest_subdir – Sub-directory where the dataset should be extracted; only used if dest_dir is omitted. Default: datasets/speech_commands/v2

Return type:

str

Returns:

Directory path of extracted dataset
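For example (the printed paths are illustrative):

import os
from mltk.datasets.audio.speech_commands import speech_commands_v2

dataset_dir = speech_commands_v2.load_data()
print(dataset_dir)                      # e.g. ~/.mltk/datasets/speech_commands/v2
print(sorted(os.listdir(dataset_dir)))  # one folder per word, plus _background_noise_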

load_clean_data(dest_dir=None, dest_subdir='datasets/speech_commands/v2_cleaned', clean_in_place=True, clean_dest_dir=False)[source]

Load the data and remove all “invalid samples”, i.e. samples that were manually determined to be invalid. These samples are specified in invalid_samples.py

Parameters:
  • dest_dir (str) – Absolute path where the dataset should be extracted and cleaned. If omitted, defaults to MLTK_CACHE_DIR/<dest_subdir>/ or ~/.mltk/<dest_subdir>/

  • dest_subdir – Sub-directory where the dataset should be extracted and cleaned; only used if dest_dir is omitted. Default: datasets/speech_commands/v2_cleaned

  • clean_in_place – If true, the extracted dataset is cleaned in-place; if false, the cleaned samples are copied to <dest_dir>/<dest_subdir>/_cleaned

Return type:

str

Returns:

Directory path of extracted and cleaned dataset
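For example (the destination path is illustrative):

from mltk.datasets.audio.speech_commands import speech_commands_v2

# Extract to a custom directory and clean the extracted samples in-place
clean_dir = speech_commands_v2.load_clean_data(dest_dir='/tmp/speech_commands_cleaned')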

list_valid_filenames_in_directory(base_directory, search_class, white_list_formats, split, follow_links, shuffle_index_directory)[source]

Return a list of valid file names for the given class

Per the dataset README.md:

We want to keep files in the same training, validation, or testing sets even if new ones are added over time. This makes it less likely that testing samples will accidentally be reused in training when long runs are restarted for example. To keep this stability, a hash of the filename is taken and used to determine which set it should belong to. This determination only depends on the name and the set proportions, so it won’t change as other files are added.

It’s also useful to associate particular files as related (for example words spoken by the same person), so anything after ‘_nohash_’ in a filename is ignored for set determination. This ensures that ‘bobby_nohash_0.wav’ and ‘bobby_nohash_1.wav’ are always in the same set, for example.

Parameters:
  • base_directory (str)

  • search_class (str)

  • white_list_formats (List[str])

  • split (float)

  • follow_links (bool)

  • shuffle_index_directory (str)

Return type:

Tuple[str, List[str]]
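For reference, a sketch of this hash-based set assignment, modeled on the which_set function published in the dataset's README and the TensorFlow speech commands tutorial (the exact implementation used by this module may differ):

import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # ~134M; keeps the bucketing stable

def which_set(filename, validation_percentage, testing_percentage):
    """Deterministically assign a file to 'training', 'validation', or 'testing'"""
    base_name = os.path.basename(filename)
    # Ignore everything after '_nohash_' so all utterances from one
    # speaker land in the same partition
    hash_name = re.sub(r'_nohash_.*$', '', base_name)
    hash_hex = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = (int(hash_hex, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) * (100.0 / MAX_NUM_WAVS_PER_CLASS)
    if percentage_hash < validation_percentage:
        return 'validation'
    if percentage_hash < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'

print(which_set('bed/bobby_nohash_0.wav', 10, 10))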