mltk.core.preprocess.utils.tf_dataset

Utilities for processing Tensorflow Datasets

Functions

enable_numpy_behavior()

Enable NumPy behavior on Tensors.

load_audio_directory(directory, classes[, ...])

Load a directory of audio samples and return a tuple of Tensorflow Datasets (samples, label_ids)

load_image_directory(directory, classes[, ...])

Load a directory of image samples and return a tuple of Tensorflow Datasets (samples, label_ids)

parallel_process(dataset, callback[, dtype, ...])

Parallel process the dataset

load_audio_directory(directory, classes, unknown_class_percentage=1.0, silence_class_percentage=1.0, class_counts=None, onehot_encode=False, shuffle=False, seed=None, split=None, max_samples_per_class=-1, sample_rate_hz=None, return_audio_data=False, return_audio_sample_rate=False, white_list_formats=None, follow_links=False, shuffle_index_directory=None, list_valid_filenames_in_directory_function=None, process_samples_function=None)[source]

Load a directory of audio samples and return a tuple of Tensorflow Datasets (samples, label_ids)

The given audio directory should have the structure:

<class1>/sample1.wav
<class1>/sample2.wav
...
<class1>/optional sub directory/sample9.wav
<class2>/sample1.wav
<class2>/sample2.wav
...
<class3>/sample1.wav
<class3>/sample2.wav

Where each <class> is found in the given classes argument.

See also

See the tf.data.Dataset API for more details on how to use the returned datasets

Parameters:
  • directory (str) – Directory path to audio dataset

  • classes (List[str]) –

    List of class labels to include in the returned dataset

    • If _unknown_ is added as an entry to the classes, then this API will automatically add an ‘unknown’ class to the generated batches. Unused classes in the dataset directory will be randomly selected and used as an ‘unknown’ class. Use the unknown_class_percentage setting to control the size of this class.

    • If _silence_ is added as an entry to the classes, then this API will automatically add ‘silence’ samples with all zeros. Use the silence_class_percentage setting to control the size of this class.

  • unknown_class_percentage (float) – If an _unknown_ class is added to the class list, then ‘unknown’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples of the other classes. For instance, if another class has 1000 samples and unknown_class_percentage=0.8, then the number of ‘unknown’ class samples generated will be 800. Set this parameter to None to disable this feature

  • silence_class_percentage (float) – If a _silence_ class is added to the class list, then ‘silence’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples of the other classes. For instance, if another class has 1000 samples and silence_class_percentage=0.8, then the number of ‘silence’ class samples generated will be 800. Set this parameter to None to disable this feature

  • class_counts (Dict[str, int]) – Dictionary which will be populated with the sample counts for each class

  • onehot_encode – If true, then the audio labels are one-hot encoded. If false, then only the class id (corresponding to its index in the classes argument) is returned

  • shuffle (bool) – If true, then shuffle the dataset

  • seed – The seed to use for shuffling the dataset

  • split (Tuple[float, float]) – A tuple indicating the (start, stop) percentage of the dataset to return, e.g. (.75, 1.0) -> return the last 25% of the dataset. If omitted, then the entire dataset is returned

  • max_samples_per_class (int) – Maximum number of samples to return per class; this can be useful for debugging to limit the dataset size

  • sample_rate_hz (int) – Sample rate to which the audio samples are converted; if omitted, then the native sample rate is used

  • return_audio_data – If true, then the audio file data is returned; if false, then only the audio file path is returned

  • return_audio_sample_rate – If true and return_audio_data is True, then the audio file data and corresponding sample rate are returned; if false, then only the audio file data is returned

  • white_list_formats (List[str]) – List of file extensions to include in the search. If omitted, then only .wav files are returned

  • follow_links – If true then follow symbolic links when recursively searching the given dataset directory

  • shuffle_index_directory (str) – Path to a directory to hold the generated index of the dataset. If omitted, then an index is generated at <directory>/.index

  • list_valid_filenames_in_directory_function (Callable) –

    This is a custom dataset processing callback. It should return a list of valid file names for the given class. It has the following function signature:

    def list_valid_filenames_in_directory(
            base_directory:str,
            search_class:str,
            white_list_formats:List[str],
            split:Tuple[float,float],
            follow_links:bool,
            shuffle_index_directory:str
    ) -> Tuple[str, List[str]]:
        ...
        return search_class, filenames
    

  • process_samples_function (Callable[[str, Dict[str, str]], None]) –

    This allows for processing the samples BEFORE they’re returned by this API, e.g. for adding or removing samples. It has the following function signature:

    def process_samples(
        directory:str, # The provided directory to this API
        sample_paths:Dict[str,List[str]], # A dictionary: <class name> -> [<sample paths relative to directory>]
        split:Tuple[float,float],
        follow_links:bool,
        white_list_formats:List[str],
        shuffle:bool,
        seed:int,
        **kwargs
    ):
        ...
    

Return type:

Tuple[DatasetV2, DatasetV2]

Returns:

Returns a tuple of two tf.data.Dataset, (samples, label_ids)
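To make the percentage parameters above concrete, here is a small sketch of how the ‘unknown’ and ‘silence’ class sizes are derived (the class names and counts are hypothetical, and the exact rounding used by the API is an assumption):

```python
# Hypothetical per-class sample counts found in the dataset directory
class_counts = {"yes": 1500, "no": 1000, "up": 1200}

unknown_class_percentage = 0.8
silence_class_percentage = 0.5

# Per the parameter descriptions above, the generated class sizes are
# relative to the smallest sample count among the other classes
smallest = min(class_counts.values())  # 1000

num_unknown = int(smallest * unknown_class_percentage)
num_silence = int(smallest * silence_class_percentage)

print(num_unknown, num_silence)  # 800 500
```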

load_image_directory(directory, classes, unknown_class_percentage=1.0, class_counts=None, onehot_encode=False, shuffle=False, seed=None, split=None, max_samples_per_class=-1, return_image_data=False, white_list_formats=None, follow_links=False, shuffle_index_directory=None, list_valid_filenames_in_directory_function=None, process_samples_function=None)[source]

Load a directory of image samples and return a tuple of Tensorflow Datasets (samples, label_ids)

The given image directory should have the structure:

<class1>/sample1.png
<class1>/sample2.jpg
...
<class1>/optional sub directory/sample9.png
<class2>/sample1.jpg
<class2>/sample2.jpg
...
<class3>/sample1.jpg
<class3>/sample2.jpg

Where each <class> is found in the given classes argument.

See also

See the tf.data.Dataset API for more details on how to use the returned datasets

Parameters:
  • directory (str) – Directory path to the image dataset

  • classes (List[str]) –

    List of class labels to include in the returned dataset

    • If _unknown_ is added as an entry to the classes, then this API will automatically add an ‘unknown’ class to the generated batches. Unused classes in the dataset directory will be randomly selected and used as an ‘unknown’ class. Use the unknown_class_percentage setting to control the size of this class.

  • unknown_class_percentage (float) – If an _unknown_ class is added to the class list, then ‘unknown’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples of the other classes. For instance, if another class has 1000 samples and unknown_class_percentage=0.8, then the number of ‘unknown’ class samples generated will be 800. Set this parameter to None to disable this feature

  • class_counts (Dict[str, int]) – Dictionary which will be populated with the sample counts for each class

  • onehot_encode – If true, then the image labels are one-hot encoded. If false, then only the class id (corresponding to its index in the classes argument) is returned

  • shuffle (bool) – If true, then shuffle the dataset

  • seed – The seed to use for shuffling the dataset

  • split (Tuple[float, float]) – A tuple indicating the (start, stop) percentage of the dataset to return, e.g. (.75, 1.0) -> return the last 25% of the dataset. If omitted, then the entire dataset is returned

  • max_samples_per_class (int) – Maximum number of samples to return per class; this can be useful for debugging to limit the dataset size

  • return_image_data – If true, then the image file data is returned; if false, then only the image file path is returned

  • white_list_formats (List[str]) – List of file extensions to include in the search. If omitted, then only .png and .jpg files are returned

  • follow_links – If true then follow symbolic links when recursively searching the given dataset directory

  • shuffle_index_directory (str) – Path to a directory to hold the generated index of the dataset. If omitted, then an index is generated at <directory>/.index

  • list_valid_filenames_in_directory_function (Callable) –

    This is a custom dataset processing callback. It should return a list of valid file names for the given class. It has the following function signature:

    def list_valid_filenames_in_directory(
            base_directory:str,
            search_class:str,
            white_list_formats:List[str],
            split:Tuple[float,float],
            follow_links:bool,
            shuffle_index_directory:str
    ) -> Tuple[str, List[str]]:
        ...
        return search_class, filenames
    

  • process_samples_function (Callable) –

    This allows for processing the samples BEFORE they’re returned by this API, e.g. for adding or removing samples. It has the following function signature:

    def process_samples(
        directory:str, # The provided directory to this API
        sample_paths:Dict[str,List[str]], # A dictionary: <class name> -> [<sample paths relative to directory>]
        split:Tuple[float,float],
        follow_links:bool,
        white_list_formats:List[str],
        shuffle:bool,
        seed:int,
        **kwargs
    ):
        ...
    

Return type:

Tuple[DatasetV2, DatasetV2]

Returns:

Returns a tuple of two tf.data.Dataset, (samples, label_ids)
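For illustration, a minimal custom list_valid_filenames_in_directory_function could look like the following sketch. It implements only the recursive search and extension filter; the split and shuffle_index_directory arguments are ignored here, and the conventions that extensions include a leading dot and that returned paths are relative to base_directory are assumptions:

```python
import os
from typing import List, Tuple

def list_valid_filenames_in_directory(
    base_directory: str,
    search_class: str,
    white_list_formats: List[str],
    split: Tuple[float, float],
    follow_links: bool,
    shuffle_index_directory: str,
) -> Tuple[str, List[str]]:
    """Collect the sample filenames for one class (minimal sketch).

    Recursively walks <base_directory>/<search_class> and keeps files
    whose extension appears in white_list_formats. The split and
    shuffle_index_directory arguments are ignored for brevity.
    """
    filenames = []
    class_dir = os.path.join(base_directory, search_class)
    for root, _, files in os.walk(class_dir, followlinks=follow_links):
        for fname in files:
            # Assumes extensions are given with a leading dot, e.g. ".wav"
            if any(fname.lower().endswith(ext) for ext in white_list_formats):
                # Assumed convention: paths relative to base_directory
                rel = os.path.relpath(os.path.join(root, fname), base_directory)
                filenames.append(rel)
    return search_class, sorted(filenames)
```

The callback is invoked once per class, so returning the class name alongside its filenames lets the loader associate the list with the right label id.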

parallel_process(dataset, callback, dtype=<class 'numpy.float32'>, n_jobs=4, job_batch_size=None, pool=None, name='ParallelProcess', env=None, disable_gpu_in_subprocesses=True)[source]

Parallel process the dataset

This will invoke the given callback across the available CPUs in the system, which can greatly improve throughput.

Note

This uses the tf.numpy_function API which can slow processing in some instances.

Parameters:
  • dataset (DatasetV2) – The Tensorflow dataset to parallel process

  • callback (Callable) – The callback to invoke in the parallel processes. This callback must be defined at the root of a Python module (i.e. it cannot be nested or a class method)

  • dtype (Union[dtype, Tuple[dtype]]) – The data type that the callback returns. This can also be a list of dtypes if the callback returns multiple np.ndarrays

  • n_jobs (Union[int, float]) – The number of jobs (i.e. CPU cores) to use for processing. This can either be an integer or a float in the range (0, 1.0], interpreted as a fraction of the available CPU cores

  • job_batch_size (int) – The size of the batches to use for processing. If omitted, then the calculated n_jobs is used

  • pool (ProcessPool) – An existing processing pool. If omitted, then a new pool is created

  • name – The prefix to use in the model graph

  • env (Dict[str, str]) – Optional OS environment variables to export in the parallel subprocesses

  • disable_gpu_in_subprocesses – If true (the default), then the GPU is disabled in the parallel subprocesses

Return type:

Tuple[DatasetV2, ProcessPool]

Returns:

(tf.data.Dataset, ProcessPool), a tuple of the updated dataset with parallel processing applied and the associated process pool
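As an illustration of the n_jobs parameter, a plausible way to resolve it into an absolute job count is sketched below (resolve_n_jobs is a hypothetical helper, not part of the API, and the exact rounding used by parallel_process is an assumption):

```python
def resolve_n_jobs(n_jobs, cpu_count):
    """Resolve n_jobs into an absolute number of subprocesses.

    n_jobs may be an int (used as-is) or a float in (0, 1.0],
    interpreted as a fraction of cpu_count. The rounding here is an
    assumption, not necessarily what parallel_process does internally.
    """
    if isinstance(n_jobs, float):
        if not 0.0 < n_jobs <= 1.0:
            raise ValueError("float n_jobs must be in the range (0, 1.0]")
        # Always use at least one job, even for tiny fractions
        return max(1, int(cpu_count * n_jobs))
    return int(n_jobs)

print(resolve_n_jobs(4, 8))    # 4
print(resolve_n_jobs(0.5, 8))  # 4
print(resolve_n_jobs(0.1, 8))  # 1
```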

enable_numpy_behavior()[source]

Enable NumPy behavior on Tensors.

NOTE: This requires Tensorflow 2.5+

Enabling NumPy behavior has three effects:

  • It adds to tf.Tensor some common NumPy methods such as T, reshape and ravel.

  • It changes dtype promotion in tf.Tensor operators to be compatible with NumPy. For example, tf.ones([], tf.int32) + tf.ones([], tf.float32) used to throw a “dtype incompatible” error, but after this it will return a float64 tensor (obeying NumPy’s promotion rules).

  • It enhances tf.Tensor’s indexing capability to be on par with NumPy’s.

Refer to the Tensorflow docs for more details.

Return type:

bool

Returns:

True if NumPy behavior was enabled, False otherwise
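The promotion rule described above matches NumPy's own behavior, which can be checked with plain NumPy arrays (after enable_numpy_behavior(), tf.Tensor operators follow the same rule):

```python
import numpy as np

# int32 + float32 promotes to float64 under NumPy's promotion rules;
# this is the behavior tf.Tensor operators adopt once NumPy behavior
# is enabled.
a = np.ones(1, dtype=np.int32)
b = np.ones(1, dtype=np.float32)
result = a + b

print(result.dtype)  # float64
```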