Reference Datasets

The MLTK comes with datasets that are used by the reference models.

The source code for these datasets can be found on GitHub: https://github.com/siliconlabs/mltk/tree/master/mltk/datasets.

Audio Datasets

Google Speech Commands v2

https://www.tensorflow.org/datasets/catalog/speech_commands

This is a set of one-second .wav audio files, each containing a single spoken English word. These words are from a small set of commands, and are spoken by a variety of different speakers. The audio files are organized into folders based on the word they contain, and this data set is designed to help train simple machine learning models. This dataset is covered in more detail at https://arxiv.org/abs/1804.03209.

It’s licensed under the Creative Commons BY 4.0 license. See the LICENSE file in this folder for full details. Its original location was at http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz.

History

Version 0.01 of the data set was released on August 3rd, 2017, and contained 64,727 audio files.

This is version 0.02 of the data set, containing 105,829 audio files, released on April 11th, 2018.

Collection

The audio files were collected using crowdsourcing; see aiyprojects.withgoogle.com/open_speech_recording for some of the open-source audio collection code we used (and please consider contributing to enlarge this data set). The goal was to gather examples of people speaking single-word commands, rather than conversational sentences, so they were prompted for individual words over the course of a five-minute session. Twenty core command words were recorded, with most speakers saying each of them five times. The core words are “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, “Zero”, “One”, “Two”, “Three”, “Four”, “Five”, “Six”, “Seven”, “Eight”, and “Nine”. To help distinguish unrecognized words, there are also ten auxiliary words, which most speakers only said once. These include “Bed”, “Bird”, “Cat”, “Dog”, “Happy”, “House”, “Marvin”, “Sheila”, “Tree”, and “Wow”.
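A training script typically treats a subset of the core words as its target labels and maps everything else to an “unknown” class; the folder names in the archive are the lowercase forms of the words. A minimal sketch of such a label set (the choice of ten target words and the '_unknown_' label are common conventions, not part of the dataset itself):

CORE_WORDS = [
    'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go',
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
]
AUXILIARY_WORDS = [
    'bed', 'bird', 'cat', 'dog', 'happy', 'house', 'marvin', 'sheila', 'tree', 'wow',
]

# A common keyword-spotting setup: ten target words, everything else 'unknown'
TARGET_WORDS = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go']

def label_for(word):
    return word if word in TARGET_WORDS else '_unknown_'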

Organization

The files are organized into folders, with each directory name labelling the word that is spoken in all the contained audio files. No details were kept of any of the participants' age, gender, or location, and random ids were assigned to each individual. These ids are stable, though, and encoded in each file name as the first part before the underscore. If a participant contributed multiple utterances of the same word, these are distinguished by the number at the end of the file name. For example, the file path happy/3cfc6b3a_nohash_2.wav indicates that the word spoken was “happy”, the speaker’s id was “3cfc6b3a”, and this is the third utterance of that word by this speaker in the data set. The ‘nohash’ section is to ensure that all the utterances by a single speaker are sorted into the same training partition, to keep very similar repetitions from giving unrealistically optimistic evaluation scores.
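This naming convention makes a deterministic, speaker-stable split straightforward: hash the speaker id (everything before '_nohash_') and bucket files by the result, so repeated utterances never straddle partitions. A minimal Python sketch of this scheme, following the approach used in the TensorFlow speech commands tutorial (the 10%/10% split percentages are illustrative):

import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2**27 - 1  # bounds the hash range

def which_set(filename, validation_percentage=10.0, testing_percentage=10.0):
    # Strip the '_nohash_<n>' suffix so every utterance from the same
    # speaker hashes to the same value and lands in the same partition
    base_name = os.path.basename(filename)
    hash_name = re.sub(r'_nohash_.*$', '', base_name)
    hash_value = int(hashlib.sha1(hash_name.encode('utf-8')).hexdigest(), 16)
    percentage_hash = (hash_value % (MAX_NUM_WAVS_PER_CLASS + 1)) * (100.0 / MAX_NUM_WAVS_PER_CLASS)
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'

print(which_set('happy/3cfc6b3a_nohash_2.wav'))  # same result for _0, _1, _2, ...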

Processing

The original audio files were collected in uncontrolled locations by people around the world. We requested that they do the recording in a closed room for privacy reasons, but didn’t stipulate any quality requirements. This was by design, since we wanted examples of the sort of speech data that we’re likely to encounter in consumer and robotics applications, where we don’t have much control over the recording equipment or environment. The data was captured in a variety of formats, for example Ogg Vorbis encoding for the web app, and then converted to a 16-bit little-endian PCM-encoded WAVE file at a 16,000 Hz sample rate. The audio was then trimmed to a one-second length to align most utterances, using the extract_loudest_section tool. The audio files were then screened for silence or incorrect words, and arranged into folders by label.
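Since every processed file should be mono, 16-bit PCM at 16 kHz, and at most one second long, a quick check with Python's standard wave module can flag conversion problems. A minimal sketch (the file path is illustrative):

import wave

with wave.open('happy/3cfc6b3a_nohash_2.wav', 'rb') as wav:
    assert wav.getnchannels() == 1      # mono
    assert wav.getsampwidth() == 2      # 16-bit samples
    assert wav.getframerate() == 16000  # 16 kHz sample rate
    assert wav.getnframes() <= 16000    # at most one second of audio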

Background Noise

To help train networks to cope with noisy environments, it can be helpful to mix in realistic background audio. The _background_noise_ folder contains a set of longer audio clips that are either recordings or mathematical simulations of noise. For more details, see the _background_noise_/README.md.
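A common way to use these clips is to mix a random slice of background audio into each training utterance at a modest volume. A minimal NumPy sketch (the volume factor and the stand-in arrays are illustrative, not part of the dataset):

import numpy as np

def mix_background(speech, noise, noise_volume=0.1):
    # Both arrays are float32 audio in [-1, 1]; `noise` must be at least
    # as long as `speech` so a full-length slice can be taken
    offset = np.random.randint(0, len(noise) - len(speech) + 1)
    noise_slice = noise[offset:offset + len(speech)]
    # Clip to keep the mix within the valid audio range
    return np.clip(speech + noise_volume * noise_slice, -1.0, 1.0)

speech = np.zeros(16000, dtype=np.float32)                   # one second at 16 kHz (stand-in)
noise = np.random.uniform(-1, 1, 160000).astype(np.float32)  # stand-in background clip
augmented = mix_background(speech, noise, noise_volume=0.1)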

Citations

If you use the Speech Commands dataset in your work, please cite it as:

@article{speechcommandsv2,
  author        = {{Warden}, P.},
  title         = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1804.03209},
  primaryClass  = "cs.CL",
  keywords      = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
  year          = 2018,
  month         = apr,
  url           = {https://arxiv.org/abs/1804.03209},
}

Credits

Massive thanks are due to everyone who donated recordings to this data set; I’m very grateful. I also couldn’t have put this together without the help and support of Billy Rutledge, Rajat Monga, Raziel Alvarez, Brad Krueger, Barbara Petit, Gursheesh Kour, and all the AIY and TensorFlow teams.

Pete Warden, petewarden@google.com

Image Datasets

Rock, Paper, Scissors v1

Contains grayscale images of the hand gestures:

  • rock

  • paper

  • scissors

Rock, Paper, Scissors v2

Contains grayscale images of the hand gestures:

  • rock

  • paper

  • scissors

  • _unknown_
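Once either archive is extracted, the images can be loaded with standard tooling. A minimal Keras sketch, assuming the files are unpacked into one sub-folder per class (the extraction path and target image size below are illustrative):

import tensorflow as tf

# 'rock_paper_scissors' is a hypothetical extraction path; the class
# sub-folders (rock/, paper/, scissors/, ...) supply the labels
dataset = tf.keras.utils.image_dataset_from_directory(
    'rock_paper_scissors',
    color_mode='grayscale',  # the images are grayscale
    image_size=(96, 96),     # illustrative target size
    batch_size=32,
)
for images, labels in dataset.take(1):
    print(images.shape, labels.shape)  # e.g. (32, 96, 96, 1) (32,)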

MNIST

This is a dataset of 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images. More info can be found at the MNIST homepage: http://yann.lecun.com/exdb/mnist/

Arguments: path: path where to cache the dataset locally (relative to ~/.keras/datasets).

Returns: Tuple of NumPy arrays: (x_train, y_train), (x_test, y_test).

x_train: uint8 NumPy array of grayscale image data with shape (60000, 28, 28), containing the training data. Pixel values range from 0 to 255.

y_train: uint8 NumPy array of digit labels (integers in range 0-9) with shape (60000,) for the training data.

x_test: uint8 NumPy array of grayscale image data with shape (10000, 28, 28), containing the test data. Pixel values range from 0 to 255.

y_test: uint8 NumPy array of digit labels (integers in range 0-9) with shape (10000,) for the test data.

Example:

(x_train, y_train), (x_test, y_test) = mnist.load_data()
assert x_train.shape == (60000, 28, 28)
assert x_test.shape == (10000, 28, 28)
assert y_train.shape == (60000,)
assert y_test.shape == (10000,)

License:

Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset, which is a derivative work from the original NIST datasets. The MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license.

CIFAR10

This is a dataset of 50,000 32x32 color training images and 10,000 test images, labeled over 10 categories. See more info at the CIFAR homepage: https://www.cs.toronto.edu/~kriz/cifar.html

The classes are:

  • airplane

  • automobile

  • bird

  • cat

  • deer

  • dog

  • frog

  • horse

  • ship

  • truck

Returns: Tuple of NumPy arrays: (x_train, y_train), (x_test, y_test).

x_train: uint8 NumPy array of RGB image data with shape (50000, 32, 32, 3), containing the training data. Pixel values range from 0 to 255.

y_train: uint8 NumPy array of labels (integers in range 0-9) with shape (50000, 1) for the training data.

x_test: uint8 NumPy array of RGB image data with shape (10000, 32, 32, 3), containing the test data. Pixel values range from 0 to 255.

y_test: uint8 NumPy array of labels (integers in range 0-9) with shape (10000, 1) for the test data.

Example:

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
assert x_train.shape == (50000, 32, 32, 3)
assert x_test.shape == (10000, 32, 32, 3)
assert y_train.shape == (50000, 1)
assert y_test.shape == (10000, 1)

Fashion-MNIST

This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. This dataset can be used as a drop-in replacement for MNIST.

The classes are:

  • T-shirt/top

  • Trouser

  • Pullover

  • Dress

  • Coat

  • Sandal

  • Shirt

  • Sneaker

  • Bag

  • Ankle boot

Returns: Tuple of NumPy arrays: (x_train, y_train), (x_test, y_test).

x_train: uint8 NumPy array of grayscale image data with shape (60000, 28, 28), containing the training data.

y_train: uint8 NumPy array of labels (integers in range 0-9) with shape (60000,) for the training data.

x_test: uint8 NumPy array of grayscale image data with shape (10000, 28, 28), containing the test data.

y_test: uint8 NumPy array of labels (integers in range 0-9) with shape (10000,) for the test data.

Example:

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
assert x_train.shape == (60000, 28, 28)
assert x_test.shape == (10000, 28, 28)
assert y_train.shape == (60000,)
assert y_test.shape == (10000,)

License:

The copyright for Fashion-MNIST is held by Zalando SE. Fashion-MNIST is licensed under the MIT license.

Accelerometer Datasets

Tensorflow-Lite Micro Magic Wand

This dataset contains accelerometer recordings of hand gestures (“wing”, “ring”, and “slope”) used by the TensorFlow-Lite Micro magic wand example:

https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/examples/magic_wand