ParallelAudioDataGenerator

class mltk.core.preprocess.audio.parallel_generator.ParallelAudioDataGenerator(cores=0.25, debug=False, max_batches_pending=4, get_batch_function=None, noaug_preprocessing_function=None, preprocessing_function=None, postprocessing_function=None, samplewise_center=False, samplewise_std_normalization=False, samplewise_normalize_range=None, rescale=None, validation_split=0.0, validation_augmentation_enabled=True, dtype='float32', frontend_dtype=None, trim_threshold_db=20, noise_colors=None, noise_color_range=None, speed_range=None, pitch_range=None, vtlp_range=None, loudness_range=None, bg_noise_range=None, bg_noise_dir=None, offset_range=(0.0, 1.0), unknown_class_percentage=1.0, silence_class_percentage=0.6, disable_random_transforms=False, frontend_settings=None, frontend_enabled=True, sample_shape=None, disable_gpu_in_subprocesses=True, add_channel_dimension=True)[source]

Parallel Audio Data Generator

This class has similar functionality to the Keras ImageDataGenerator, except that instead of processing image files it processes audio files.

Additionally, batch samples are asynchronously processed using the Python ‘multiprocessing’ package. This allows for efficient use of multi-core systems, as upcoming batches can be processed concurrently while already-processed batches are used for training.

This class works as follows:

  1. Class instantiated with parameters (see below)

  2. flow_from_directory() called which lists each classes’ samples in the specified directory

  3. The return value of flow_from_directory() is a ‘generator’ which should be given to a model fit() method

  4. During fitting, batches of samples are concurrently processed using the following sequence:

    a0. If get_batch_function() is given, then call this function and skip the rest of these steps
    a. Read the sample's raw audio file
    b. If supplied, call noaug_preprocessing_function()
    c. Generate random transform parameters based on the parameters from step 1
    d. Trim silence from the raw audio sample based on trim_threshold_db
    e. Pad zeros before and after the trimmed audio based on sample_length_seconds and offset_range
    f. Augment the padded audio based on the randomly generated transform parameters from part c
    g. If supplied, call preprocessing_function()
    h. If frontend_enabled=True, pass the augmented audio through mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator and return the spectrogram as a 2D array
    i. If supplied, call postprocessing_function()
    j. If frontend_enabled=True, normalize based on samplewise_center, samplewise_std_normalization, and rescale
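
For example, a minimal end-to-end usage sketch is shown below. The dataset path, class names, and settings values are placeholders, and model is assumed to be a compiled Keras model:

    from mltk.core.preprocess.audio.audio_feature_generator import AudioFeatureGeneratorSettings
    from mltk.core.preprocess.audio.parallel_generator import ParallelAudioDataGenerator

    # Configure the AudioFeatureGenerator frontend (placeholder values)
    frontend_settings = AudioFeatureGeneratorSettings()
    frontend_settings.sample_rate_hz = 16000
    frontend_settings.sample_length_ms = 1000

    # 1) Instantiate the generator with the desired augmentation parameters
    datagen = ParallelAudioDataGenerator(
        frontend_settings=frontend_settings,
        validation_split=0.1,
        loudness_range=(0.8, 1.2),
        pitch_range=(-2, 2),
    )

    # 2) and 3) List each class's samples and retrieve the batch generator
    train_gen = datagen.flow_from_directory(
        '/path/to/dataset',  # placeholder path
        classes=['yes', 'no', '_unknown_', '_silence_'],
        subset='training',
    )

    # 4) During fitting, batches are processed concurrently in subprocesses
    model.fit(train_gen, epochs=10)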

Notes

  • If _unknown_ is added as a class to flow_from_directory(), then the generator will automatically add an ‘unknown’ class to the generated batches. Unused classes in the dataset directory will be randomly selected and used as an ‘unknown’ class. The other augmentation parameters will be applied to the ‘unknown’ samples. Use the unknown_class_percentage setting to control the size of this class.

  • If _silence_ is added as a class to flow_from_directory(), then the generator will automatically add ‘silence’ samples consisting of all zeros, with the background noise augmentations added. Use the silence_class_percentage setting to control the size of this class.

Parameters
  • cores – The number of CPU cores to use for spawned audio processing batch processes. This number can be either an integer, which specifies the exact number of CPU cores, or it can be a float < 1.0. The float is the percentage of CPU cores to use for processing. A large number of CPU cores will consume more system memory.

  • debug – If true, then use the Python threading library instead of multiprocessing. This is useful for debugging as it allows for single-stepping in the generator threads and callback functions

  • max_batches_pending – This is the number of processed batches to queue. A larger number can improve training times at the expense of increased system memory usage.

  • get_batch_function

    function that should return the transformed batch. If this is omitted, then iterator.get_batches_of_transformed_samples() is used. This function should have the following signature:

    def get_batches_of_transformed_samples(
       batch_index:int,
       filenames:List[str],
       classes:List[int],
       params:ParallelProcessParams
    ) -> Tuple[int, Tuple[np.ndarray, np.ndarray]]:
        ...
        return batch_index, (batch_x, batch_y)
    
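
    For illustration only, a minimal sketch of such a custom batch function follows; the 16 kHz rate, 1-second length, and NUM_CLASSES value are assumptions, not part of the API:

    import numpy as np
    import librosa

    NUM_CLASSES = 4  # hypothetical total number of classes

    def my_get_batch(batch_index, filenames, classes, params):
        batch_x = []
        for filename in filenames:
            # librosa-style loading: float32 scaled to [-1, 1]
            x, _ = librosa.load(filename, sr=16000, mono=True)
            x = librosa.util.fix_length(x, size=16000)  # crop/pad to 1s @ 16kHz
            batch_x.append(x)
        batch_x = np.asarray(batch_x, dtype=np.float32)
        # One-hot encode the batch's class ids ('categorical' class_mode)
        batch_y = np.eye(NUM_CLASSES, dtype=np.float32)[classes]
        return batch_index, (batch_x, batch_y)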

  • noaug_preprocessing_function

    function that will be applied to each input. The function runs before any augmentation is done on the audio sample. The ‘x’ argument has the shape [sample_length] and is a float32 scaled between [-1, 1]. See https://librosa.org/doc/main/generated/librosa.load.html. The function should take at least two arguments:

    def my_processing_func(
        params: ParallelProcessParams,
        x : np.ndarray,
        class_id: Optional[int],
        filename: Optional[str],
        batch_index: Optional[int],
        batch_class_ids: Optional[List[int]],
        batch_filenames: Optional[List[str]]
     ) -> np.ndarray:
        ...
        return processed_x
    
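
    For instance, a sketch of a callback that peak-normalizes the raw audio before augmentation; the optional arguments are absorbed via **kwargs for brevity, and the normalization itself is only an illustration:

    import numpy as np

    def peak_normalize(params, x, **kwargs):
        # x has shape [sample_length], float32 scaled to [-1, 1]
        peak = np.max(np.abs(x))
        if peak > 0:
            x = x / peak  # scale so the loudest point reaches +/-1.0
        return x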

  • preprocessing_function

    function that will be applied to each input. The function runs after the audio is augmented but before it goes through the AudioFeatureGenerator (if enabled). The ‘x’ argument has the shape [sample_length] and is a float32 scaled between [-1, 1]. See librosa.load(). The function should take at least two arguments and return the processed sample:

    def my_processing_func(
        params: ParallelProcessParams,
        x : np.ndarray,
        class_id: Optional[int],
        filename: Optional[str],
        batch_index: Optional[int],
        batch_class_ids: Optional[List[int]],
        batch_filenames: Optional[List[str]]
     ) -> np.ndarray:
        ...
        return processed_x
    

  • postprocessing_function

    function that will be applied to each input. The function runs after the audio is passed through the AudioFeatureGenerator (if enabled), so the ‘x’ argument is a spectrogram of shape [height, width, 1]. The function should take at least two arguments and return the processed sample with shape [height, width, 1]:

    def my_processing_func(
        params: ParallelProcessParams,
        x : np.ndarray,
        class_id: Optional[int],
        filename: Optional[str],
        batch_index: Optional[int],
        batch_class_ids: Optional[List[int]],
        batch_filenames: Optional[List[str]]
     ) -> np.ndarray:
        ...
        return processed_x
    
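
    As an illustration, a sketch of a spectrogram-domain callback that zeros out a random band of rows (in the spirit of SpecAugment; this augmentation is not built into the generator, and whether rows correspond to time or frequency depends on the frontend configuration):

    import numpy as np

    def random_mask(params, x, **kwargs):
        # x is a spectrogram of shape [height, width, 1]
        mask_size = 8  # arbitrary mask width
        start = np.random.randint(0, max(1, x.shape[0] - mask_size))
        x[start:start + mask_size, :, :] = 0
        return x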

  • samplewise_center – Center each sample’s processed data about its mean

  • samplewise_std_normalization – Divide each sample’s processed data by its standard deviation

  • samplewise_normalize_range – Normalize the input to the values in this range. For instance, if samplewise_normalize_range=(0,1), then each input will be scaled to values between 0 and 1. Note that this normalization is applied after all other enabled normalizations

  • rescale – Multiply the processed sample data by this value

  • validation_split – Percentage of sample data to use for validation

  • validation_augmentation_enabled – If True, then augmentations will be applied to validation data. If False, then no augmentations will be applied to validation data.

  • dtype – Output data type of the x samples. Default: float32

  • frontend_dtype

    Output data format of the audio frontend. If omitted, this defaults to the dtype argument. This is only used if frontend_enabled=True.

    • uint16: This is the raw value generated by the internal AudioFeatureGenerator library

    • float32: This is the uint16 value directly cast to a float32

    • int8: This is the int8 value generated by the TFLM “micro features” library.

    Refer to mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator for more details.

  • trim_threshold_db – Use to trim silence from samples; the threshold (in decibels) below reference to consider as silence

  • noise_colors

    List of noise colors to randomly add to samples, possible options:

    • [‘white’, ‘brown’, ‘blue’, ‘pink’, ‘violet’]

    • OR ‘all’ to use all

    • OR ‘none’ to use none

  • noise_color_range – Tuple (min, max) for randomly selecting the noise color’s loudness; 0.0 = no noise, 1.0 = 100% noise

  • speed_range – Tuple (min, max) for randomly augmenting audio’s speed, < 1.0 slow down, > 1.0 speed up

  • pitch_range – Tuple (min, max) for randomly augmenting the audio’s pitch; < 0 lowers the pitch, > 0 raises it. The values can be integers or floats: an integer represents the number of semitone steps, while a float is converted to semitone steps as <float>*12. For example, a range of (-.5, .5) is converted to (-6, 6)

  • vtlp_range – Tuple (min, max) for randomly augmenting audio’s vocal tract length perturbation

  • loudness_range – Tuple (min, max) for randomly augmenting audio’s volume, < 1.0 decrease loudness, > 1.0 increase loudness

  • bg_noise_range – Tuple (min, max) for randomly selecting background noise’s loudness, < 1.0 decrease loudness, > 1.0 increase loudness

  • bg_noise_dir – Path to a directory containing background noise audio files. A background noise file will be randomly selected and cropped, then applied to samples. Note: if noise_colors is also supplied, then either a background noise or a noise color will be randomly applied to each sample

  • offset_range

    Tuple (min, max) for randomly selecting the offset at which to pad a sample to sample_length_seconds. For instance, if offset_range=(0.0, 1.0), then:

    trimmed_audio = trim(raw_audio, trim_threshold_db)  # Trim silence
    required_padding = (sample_length_seconds * sample_rate) - len(trimmed_audio)
    pad_upto_index = required_padding * random.uniform(offset_range[0], offset_range[1])
    padded_audio = concat(zeros * pad_upto_index, trimmed_audio, zeros * (required_padding - pad_upto_index))

  • unknown_class_percentage – If an _unknown_ class is added to the class list, then ‘unknown’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples of the other classes. For instance, if another class has 1000 samples and unknown_class_percentage=0.8, then 800 ‘unknown’ class samples will be generated.

  • silence_class_percentage – If a _silence_ class is added to the class list, then ‘silence’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples of the other classes. For instance, if another class has 1000 samples and silence_class_percentage=0.8, then 800 ‘silence’ class samples will be generated.

  • disable_random_transforms – Disable random data augmentations

  • frontend_settings – AudioFeatureGenerator settings, see mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGeneratorSettings for more details

  • frontend_enabled – By default, the frontend is enabled: after augmenting an audio sample, it is passed through the AudioFeatureGenerator and the generated spectrogram is returned. If disabled, the augmented 1D audio sample is returned directly; in this case it is recommended to use the postprocessing_function callback to convert the samples to the required shape and data type. NOTE: You must also specify the sample_shape parameter if frontend_enabled=False (see the sketch below)

  • sample_shape – The shape of the generated sample. This is only used/required if frontend_enabled=False
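
    For illustration, a sketch of the frontend-disabled configuration referenced above; the 16000-sample shape is a placeholder that should match the model's expected input:

    datagen = ParallelAudioDataGenerator(
        frontend_enabled=False,
        sample_shape=(16000,),  # required when frontend_enabled=False
        # A postprocessing_function callback can convert each 1D sample
        # to the required shape and data type
    )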

  • disable_gpu_in_subprocesses – Disable GPU usage in spawned subprocesses, default: true

  • add_channel_dimension – If true and frontend_enabled=True, then automatically convert generated sample shape from [height, width] to [height, width, 1]. If false, then generated sample shape is [height, width].

Properties

default_transform

Return the default augmentation transform settings

sample_length

Return the length of the audio sample as the number of individual ADC samples

sample_length_ms

Return the AudioFeatureGeneratorSettings.sample_length_ms value

sample_rate_hz

Return the AudioFeatureGeneratorSettings.sample_rate_hz value

sample_shape

The shape of the sample as a tuple

__init__(cores=0.25, debug=False, max_batches_pending=4, get_batch_function=None, noaug_preprocessing_function=None, preprocessing_function=None, postprocessing_function=None, samplewise_center=False, samplewise_std_normalization=False, samplewise_normalize_range=None, rescale=None, validation_split=0.0, validation_augmentation_enabled=True, dtype='float32', frontend_dtype=None, trim_threshold_db=20, noise_colors=None, noise_color_range=None, speed_range=None, pitch_range=None, vtlp_range=None, loudness_range=None, bg_noise_range=None, bg_noise_dir=None, offset_range=(0.0, 1.0), unknown_class_percentage=1.0, silence_class_percentage=0.6, disable_random_transforms=False, frontend_settings=None, frontend_enabled=True, sample_shape=None, disable_gpu_in_subprocesses=True, add_channel_dimension=True)[source]

Methods

__init__

adjust_length

Adjust the audio sample length to fit the sample_length_seconds parameter. This will pad with zeros or crop the input sample as necessary

apply_frontend

Send the audio sample through the AudioFeatureGenerator and return the generated spectrogram

apply_transform

Apply the given transform parameters to the input audio sample

flow_from_directory

Create the ParallelAudioDataGenerator with the given dataset directory

get_random_transform

Generate random augmentation settings based on the configuration parameters

standardize

Applies the normalization configuration in-place to a batch of inputs.

property sample_shape: tuple

The shape of the sample as a tuple

Return type

tuple

property sample_length: int

Return the length of the audio sample as the number of individual ADC samples

Return type

int

property sample_length_ms: int

Return the AudioFeatureGeneratorSettings.sample_length_ms value

Return type

int

property sample_rate_hz: int

Return the AudioFeatureGeneratorSettings.sample_rate_hz value

Return type

int

flow_from_directory(directory, classes, class_mode='categorical', batch_size=32, shuffle=True, shuffle_index_dir=None, seed=None, follow_links=False, subset=None, max_samples_per_class=-1, list_valid_filenames_in_directory_function=None, class_counts=None, **kwargs)[source]

Create the ParallelAudioDataGenerator with the given dataset directory

Takes the path to a directory & generates batches of augmented data.

Parameters
  • directory – string, path to the target directory. It should contain one subdirectory per class. Any audio files inside each of the subdirectories’ directory tree will be included in the generator.

  • classes

    Required, list of class subdirectories (e.g. [‘dogs’, ‘cats’])

    • If _unknown_ is added as a class then the generator will automatically add an ‘unknown’ class to the generated batches. Unused classes in the dataset directory will be randomly selected and used as an ‘unknown’ class. The other augmentation parameters will be applied to the ‘unknown’ samples.

    • If _silence_ is added as a class then the generator will automatically add ‘silence’ samples consisting of all zeros, with the background noise augmentations added.

  • class_mode

    One of “categorical”, “binary”, “sparse”, “input”, or None. Default: “categorical”. Determines the type of label arrays that are returned:

    • categorical will be 2D one-hot encoded labels,

    • binary will be 1D binary labels,

    • sparse will be 1D integer labels,

    • input will be samples identical to the input samples (mainly used to work with autoencoders).

    • None: no labels are returned (the generator will only yield batches of sample data, which is useful to use with model.predict()).

    Please note that in case of class_mode None, the data still needs to reside in a subdirectory of directory for it to work correctly.

  • batch_size – Size of the batches of data (default: 32).

  • shuffle – Whether to shuffle the data (default: True). If set to False, the data is sorted in alphanumeric order.

  • shuffle_index_dir – If given, the dataset directory will be shuffled the first time it is processed, and an index file containing the shuffled file names is generated in the directory specified by shuffle_index_dir. The index file is reused to maintain the shuffled order for subsequent processing. If None, then the dataset samples are sorted alphabetically and saved to an index file in the dataset directory. The alphabetical index file is used for subsequent processing. Default: None

  • seed – Optional random seed for shuffling and transformations.

  • follow_links – Whether to follow symlinks inside class subdirectories (default: False).

  • subset – Subset of data (“training” or “validation”) if validation_split is set in ParallelAudioDataGenerator.

  • max_samples_per_class – The maximum number of samples to use for a given class. If -1 then use all available samples.

  • list_valid_filenames_in_directory_function

    This is a custom function, called for each class, that should return a list of valid filenames for the given class. It has the following function signature:

    def list_valid_filenames_in_directory(
            base_directory:str,
            search_class:str,
            white_list_formats:List[str],
            split:Tuple[float,float],
            follow_links:bool,
            shuffle_index_directory:str
    ) -> Tuple[str, List[str]]:
        ...
        return search_class, filenames
    
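
    A minimal sketch of such a function, which walks the class subdirectory and returns the matching filenames; the relative-path convention and split handling here are simplifying assumptions:

    import os
    from typing import List, Tuple

    def list_valid_filenames_in_directory(
            base_directory:str,
            search_class:str,
            white_list_formats:List[str],
            split:Tuple[float,float],
            follow_links:bool,
            shuffle_index_directory:str
    ) -> Tuple[str, List[str]]:
        filenames = []
        class_dir = os.path.join(base_directory, search_class)
        for root, _, files in os.walk(class_dir, followlinks=follow_links):
            for fname in sorted(files):
                # Keep files whose extension is in the white list (assumption)
                if any(fname.lower().endswith(fmt) for fmt in white_list_formats):
                    path = os.path.join(root, fname)
                    filenames.append(os.path.relpath(path, base_directory))
        if split:
            # Keep only the requested fraction of the (sorted) file list
            start = int(split[0] * len(filenames))
            stop = int(split[1] * len(filenames))
            filenames = filenames[start:stop]
        return search_class, filenames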

Returns

A DirectoryIterator yielding tuples of (x, y), where x is a numpy array containing a batch of processed samples and y is a numpy array of the corresponding labels.
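
For example, with validation_split set on the generator, a training and a validation iterator can be created from the same directory (the path and classes are placeholders):

    train_gen = datagen.flow_from_directory(
        '/path/to/dataset',
        classes=['dogs', 'cats'],
        subset='training',
    )
    val_gen = datagen.flow_from_directory(
        '/path/to/dataset',
        classes=['dogs', 'cats'],
        subset='validation',
    )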

property default_transform: dict

Return the default augmentation transform settings

Return type

dict

get_random_transform()[source]

Generate random augmentation settings based on the configuration parameters

Return type

dict

adjust_length(sample, orignal_sr, offset=0.0, whole_sample=False, out_length=None)[source]

Adjust the audio sample length to fit the sample_length_seconds parameter. This will pad with zeros or crop the input sample as necessary

apply_transform(sample, orignal_sr, params, whole_sample=False)[source]

Apply the given transform parameters to the input audio sample

apply_frontend(sample, dtype=<class 'numpy.float32'>)[source]

Send the audio sample through the AudioFeatureGenerator and return the generated spectrogram

Return type

ndarray

standardize(sample)[source]

Applies the normalization configuration in-place to a batch of inputs.

Parameters
  • sample – Input sample to normalize

  • rescale – sample *= rescale

  • samplewise_center – sample -= np.mean(sample, keepdims=True)

  • samplewise_std_normalization – sample /= (np.std(sample, keepdims=True) + 1e-6)

  • samplewise_normalize_range – sample = diff * (sample - np.min(sample)) / np.ptp(sample) + lower

  • dtype – The output dtype; if no dtype is given, then the sample is converted to float32

Returns

The normalized value of sample
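
Taken together, a NumPy sketch of the sequence above (samplewise_normalize_range is applied after the other normalizations, per the parameter documentation):

    import numpy as np

    def standardize_sketch(sample, rescale=None, samplewise_center=False,
                           samplewise_std_normalization=False,
                           samplewise_normalize_range=None):
        sample = np.asarray(sample, dtype=np.float32)
        if rescale:
            sample *= rescale
        if samplewise_center:
            sample -= np.mean(sample, keepdims=True)
        if samplewise_std_normalization:
            sample /= (np.std(sample, keepdims=True) + 1e-6)
        if samplewise_normalize_range:
            lower, upper = samplewise_normalize_range
            sample = (upper - lower) * (sample - np.min(sample)) / np.ptp(sample) + lower
        return sample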