mltk.core.preprocess.utils.list_directory

Utilities for listing dataset directories

Functions

list_dataset_directory(directory, classes[, ...])

Load a directory of samples and return a tuple of lists (sample paths, label_ids)

list_valid_filenames_in_directory(...[, ...])

Find all files in the search directory for the specified class

shuffle_file_list_by_group(paths, group_callback)

Shuffle the given file list by group

split_file_list(paths[, split])

Split list of file paths

list_dataset_directory(directory, classes, unknown_class_percentage=1.0, unknown_class_label=None, empty_class_percentage=1.0, empty_class_label=None, class_counts=None, shuffle=False, seed=None, split=None, max_samples_per_class=-1, white_list_formats=None, follow_links=False, shuffle_index_directory=None, return_absolute_paths=False, list_valid_filenames_in_directory_function=None, process_samples_function=None)[source]

Load a directory of samples and return a tuple of lists (sample paths, label_ids)

The given directory should have the structure:

<class1>/sample1
<class1>/sample2
...
<class1>/optional sub directory/sample9
<class2>/sample1
<class2>/sample2
...
<class3>/sample1
<class3>/sample2

Where each <class> is found in the given classes argument.
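As an illustration of how this layout maps to the returned tuple, here is a minimal sketch in plain Python. It is not the library's implementation: `sketch_list_samples` is a hypothetical helper, and it assumes each label id is the index of the class in `classes` and that paths are returned relative to the dataset directory.

```python
import os
from typing import List, Tuple


def sketch_list_samples(directory: str, classes: List[str]) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: walk <directory>/<class>/ and collect (paths, label_ids)."""
    sample_paths: List[str] = []
    label_ids: List[int] = []
    # Assumption: the label id of a class is its position in `classes`
    for label_id, class_name in enumerate(classes):
        class_dir = os.path.join(directory, class_name)
        for root, dirs, files in os.walk(class_dir):
            dirs.sort()  # visit optional sub directories in a stable order
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Keep paths relative to the dataset directory, with '/' separators
                rel_path = os.path.relpath(full_path, directory).replace(os.sep, "/")
                sample_paths.append(rel_path)
                label_ids.append(label_id)
    return sample_paths, label_ids
```

Class directories that are not named in `classes` are never visited by this sketch, which mirrors the statement above that only the listed classes are included.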

Parameters:
  • directory (str) – Directory path to audio dataset

  • classes (List[str]) –

    List of class labels to include in the returned dataset

    • If unknown_class_label is added as an entry to the classes, then this API will automatically add an ‘unknown’ class to the generated batches. Unused classes in the dataset directory will be randomly selected and used as an ‘unknown’ class. Use the unknown_class_percentage setting to control the size of this class.

    • If empty_class_label is added as an entry to the classes, then this API will automatically add ‘empty’ samples with all zeros. Use the empty_class_percentage setting to control the size of this class.

  • unknown_class_percentage (float) – If an unknown_class_label class is added to the class list, then ‘unknown’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples among the other classes. For instance, if the smallest other class has 1000 samples and unknown_class_percentage=0.8, then 800 ‘unknown’ class samples will be generated. Set this parameter to None to disable this feature.

  • unknown_class_label – Class label to be considered “unknown”. See the classes arg for more details

  • empty_class_percentage (float) – If an empty_class_label class is added to the class list, then ‘empty’ class samples will automatically be added to batches. This specifies the percentage of samples to generate relative to the smallest number of samples among the other classes. For instance, if the smallest other class has 1000 samples and empty_class_percentage=0.8, then 800 ‘empty’ class samples will be generated. Set this parameter to None to disable this feature.

  • empty_class_label (str) – Class label to be considered “empty”. See the classes arg for more details

  • class_counts (Dict[str, int]) – Dictionary which will be populated with the sample counts for each class

  • shuffle – If true, then shuffle the dataset

  • seed – The seed to use for shuffling the dataset

  • split (Tuple[float, float]) – A tuple indicating the (start, stop) percentage of the dataset to return, e.g. (.75, 1.0) returns the last 25% of the dataset. If omitted, the entire dataset is returned.

  • max_samples_per_class – Maximum number of samples to return per class. This can be useful for limiting the dataset size while debugging

  • return_audio_data – If true then the audio file data is returned, if false then only the audio file path is returned

  • white_list_formats (List[str]) – List of file extensions to include in the search.

  • follow_links – If true then follow symbolic links when recursively searching the given dataset directory

  • shuffle_index_directory (str) – Path to the directory that holds the generated index of the dataset. If omitted, an index is generated at <directory>/.index

  • return_absolute_paths (bool) – If true then return absolute paths to samples, if false then paths are relative to the given directory

  • list_valid_filenames_in_directory_function

    This is a custom dataset processing callback. It should return a list of valid file names for the given class. It has the following function signature:

    def list_valid_filenames_in_directory(
        base_directory:str,
        search_class:str,
        white_list_formats:List[str],
        split:Tuple[float,float],
        follow_links:bool,
        shuffle_index_directory:str,
    ) -> Tuple[str, List[str]]:
        ...
        return search_class, filenames
    

  • process_samples_function

    This allows for processing the samples BEFORE they’re returned by this API. This allows for adding/removing samples. It has the following function signature:

    def process_samples(
        directory:str, # The provided directory to this API
        sample_paths:Dict[str,List[str]], # A dictionary: <class name> -> [<sample paths relative to directory>]
        split:Tuple[float,float],
        follow_links:bool,
        white_list_formats:List[str],
        shuffle:bool,
        seed:int,
        **kwargs
    ):
        ...
    

Return type:

Tuple[List[str], List[int]]

Returns:

A tuple of two lists: (sample paths, label_ids)
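The sizing rule described for unknown_class_percentage and empty_class_percentage can be sketched as simple arithmetic. The helper below is hypothetical (`sketch_generated_class_size` is not part of the library). It assumes the generated count is the smallest per-class sample count multiplied by the percentage and rounded to the nearest integer, and that None means no samples are generated.

```python
from typing import Dict, Optional


def sketch_generated_class_size(
    class_counts: Dict[str, int],
    percentage: Optional[float],
) -> int:
    """Hypothetical helper: number of 'unknown' or 'empty' samples to generate."""
    # None disables the feature, so no samples are generated
    if percentage is None or not class_counts:
        return 0
    # Size is relative to the smallest of the other classes
    smallest = min(class_counts.values())
    # Assumption: the result is rounded to the nearest whole sample
    return int(round(smallest * percentage))


# Mirrors the example in the parameter description: 1000 samples at 0.8
print(sketch_generated_class_size({"yes": 1000, "no": 1500}, 0.8))
```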

list_valid_filenames_in_directory(base_directory, search_class, white_list_formats=None, split=None, follow_links=False, shuffle_index_directory=None)[source]

Find all files in the search directory for the specified class

If shuffle_index_directory is None, the filenames are sorted alphabetically and saved to the list file: <base_directory>/.index/<search_class>.txt

Otherwise, the files are randomly shuffled and saved to the list file: <shuffle_index_directory>/.index/<search_class>.txt

Parameters:
  • base_directory (str) – Search directory for the current class

  • search_class (str) – Label of class to search

  • white_list_formats (List[str]) – List of file extensions to search

  • split (Tuple[float, float]) – A tuple indicating the (start, stop) percentage of the dataset to return, e.g. (.75, 1.0) returns the last 25% of the dataset. If omitted, the entire dataset is returned.

  • follow_links (bool) – If true then follow symbolic links when recursively searching the given dataset directory

  • shuffle_index_directory (str) – Path to directory to hold generated index of the dataset

Return type:

Tuple[str, List[str]]

Returns:

A tuple (search_class, relative paths) containing the given search_class and a list of file paths relative to the base_directory
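Because list_dataset_directory accepts a replacement for this function through list_valid_filenames_in_directory_function, a custom callback with the same signature and return contract might look like the sketch below. `my_list_valid_filenames` is a hypothetical name. The sketch assumes the class files live under <base_directory>/<search_class>, that extensions may be given with or without a leading dot, and that split boundaries are truncated to integers. It does not write an index file, so shuffle_index_directory is ignored.

```python
import os
from typing import List, Optional, Tuple


def my_list_valid_filenames(
    base_directory: str,
    search_class: str,
    white_list_formats: Optional[List[str]] = None,
    split: Optional[Tuple[float, float]] = None,
    follow_links: bool = False,
    shuffle_index_directory: Optional[str] = None,  # unused in this sketch
) -> Tuple[str, List[str]]:
    """Hypothetical callback: list one class's files, sorted alphabetically."""
    class_dir = os.path.join(base_directory, search_class)
    # Normalize extensions to the form ".ext" so "wav" and ".wav" both work
    extensions = tuple(
        "." + ext.lower().lstrip(".") for ext in (white_list_formats or [])
    )
    filenames: List[str] = []
    for root, _dirs, files in os.walk(class_dir, followlinks=follow_links):
        for name in files:
            if extensions and not name.lower().endswith(extensions):
                continue
            # Return paths relative to base_directory, with '/' separators
            rel_path = os.path.relpath(os.path.join(root, name), base_directory)
            filenames.append(rel_path.replace(os.sep, "/"))
    filenames.sort()
    if split is not None:
        start = int(split[0] * len(filenames))
        stop = int(split[1] * len(filenames))
        filenames = filenames[start:stop]
    return search_class, filenames
```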

split_file_list(paths, split=None)[source]

Split list of file paths

Parameters:
  • paths (List[str]) – List of file paths

  • split (Tuple[float, float]) – A tuple indicating the (start, stop) percentage of the dataset to return, e.g. (.75, 1.0) returns the last 25% of the dataset. If omitted, the entire dataset is returned.

Return type:

List[str]

Returns:

Split file paths
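The (start, stop) semantics can be illustrated with a few lines of plain Python. This is a sketch rather than the library's code: `sketch_split` is a hypothetical stand-in, and truncating the boundaries with int() is an assumption about how fractional positions are rounded.

```python
from typing import List, Optional, Tuple


def sketch_split(paths: List[str], split: Optional[Tuple[float, float]] = None) -> List[str]:
    """Hypothetical stand-in: return the (start, stop) fraction of `paths`."""
    if split is None:
        # No split given: return the entire list
        return list(paths)
    start = int(split[0] * len(paths))
    stop = int(split[1] * len(paths))
    return paths[start:stop]


paths = [f"sample{i}.wav" for i in range(8)]
training = sketch_split(paths, (0.0, 0.75))    # first 75% of the list
validation = sketch_split(paths, (0.75, 1.0))  # last 25% of the list
```

Using complementary tuples such as (0.0, 0.75) and (0.75, 1.0) on the same ordered list yields two subsets that do not overlap.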

shuffle_file_list_by_group(paths, group_callback, seed=42)[source]

Shuffle the given file list by group

This uses the given ‘group_callback’ argument to determine the “group” that each file path in the given list belongs to. It then shuffles each group and returns the shuffled groups as a flat list.

This is useful as it allows for splitting the list into training and validation subsets while ensuring that the same group does not appear in both subsets.

Parameters:
  • paths (List[str]) – List of file paths

  • group_callback (Callable[[str], str]) – Callback that takes an element of the given ‘paths’ array and returns its corresponding “group”

  • seed (int) – Optional seed used for the random shuffle

Return type:

List[str]

Returns:

Shuffled list of groups of files
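A minimal sketch of the grouping idea follows, assuming that both the order of the groups and the order of files within each group are shuffled, and that each group stays contiguous in the output. `sketch_shuffle_by_group` and the speaker-prefix callback are hypothetical and only illustrate the technique, not the library's implementation.

```python
import random
from typing import Callable, Dict, List


def sketch_shuffle_by_group(
    paths: List[str],
    group_callback: Callable[[str], str],
    seed: int = 42,
) -> List[str]:
    """Hypothetical stand-in: shuffle paths while keeping each group contiguous."""
    # Bucket every path under the group returned by the callback
    groups: Dict[str, List[str]] = {}
    for path in paths:
        groups.setdefault(group_callback(path), []).append(path)
    rng = random.Random(seed)
    # Sort the keys first so the result depends only on the seed
    group_names = sorted(groups)
    rng.shuffle(group_names)
    shuffled: List[str] = []
    for name in group_names:
        members = groups[name]
        rng.shuffle(members)
        shuffled.extend(members)
    return shuffled


# Hypothetical naming scheme: "<speaker>_<index>.wav", grouped by speaker
files = ["alice_1.wav", "bob_1.wav", "alice_2.wav", "carol_1.wav", "bob_2.wav"]
result = sketch_shuffle_by_group(files, lambda p: p.split("_")[0])
```

Because every group is contiguous, a later (start, stop) split puts at most one group on both sides of the boundary. Fully disjoint training and validation subsets also require the boundary to fall between groups.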