mltk.core.ImageDatasetMixin

class ImageDatasetMixin[source]

Provides image dataset properties to the base MltkModel

Properties

class_counts

Dictionary of sample counts for each class

class_mode

Determines the type of label arrays that are returned.

class_weights

Specifies how class weights should be calculated.

color_mode

The type of image data to use

datagen

Training data generator.

datagen_context

Loaded data generator's context

dataset

Path to the image dataset's python module, a function that manually loads the dataset, or the file path to a directory of samples.

follow_links

Whether to follow symlinks inside class sub-directories

image_classes

Return a list of class labels the model should classify

image_input_shape

Return the image input shape as a tuple of integers

interpolation

Interpolation method used to resample the image if the target size is different from that of the loaded image

loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only).

shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for 'batch').

shuffle_dataset_enabled

Shuffle the dataset directory once

steps_per_epoch

Integer or None.

target_size

Return the target size of the generated images.

validation_batch_size

Integer or None.

validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch.

validation_datagen

Validation/evaluation data generator.

validation_freq

Only relevant if validation data is provided.

validation_split

Float between 0 and 1. Fraction of the training data to be used as validation data.

validation_steps

Only relevant if validation_data is provided and is a tf.data dataset.

x

Input data

y

Target data

Methods

__init__

get_datagen_creator

Return an object that creates a data generator for the given subset

get_shuffle_index_dir

The ParallelImageDataGenerator has the option to shuffle the dataset entries once before they're used.

load_dataset

Pre-process the dataset and prepare the model dataset attributes

summarize_dataset

Summarize the dataset

unload_dataset

Unload the dataset

property dataset

Path to the image dataset’s python module, a function that manually loads the dataset, or the file path to a directory of samples.

If a Python module is provided, it must implement the function:

def load_data():
   ...

The load_data() function should either return a tuple of the form (x_train, y_train), (x_test, y_test), or return the path to a directory containing the dataset's samples.

If a function is provided instead of a module, it should likewise return the tuple (x_train, y_train), (x_test, y_test), or return the path to a directory containing the dataset's samples.
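
For example, a dataset module or function might look like the following minimal sketch (the shapes, class count, and directory path are hypothetical, not part of the MLTK API):

import numpy as np

def load_data():
    # Option 1: return in-memory arrays as (x_train, y_train), (x_test, y_test).
    # Random data stands in for a real dataset here.
    x_train = np.random.rand(100, 96, 96, 1).astype('float32')
    y_train = np.random.randint(0, 3, size=(100,))
    x_test = np.random.rand(20, 96, 96, 1).astype('float32')
    y_test = np.random.randint(0, 3, size=(20,))
    return (x_train, y_train), (x_test, y_test)

    # Option 2: alternatively, return the file path to a directory of samples,
    # e.g.: return '/path/to/dataset_directory'

The model definition would then point my_model.dataset at this module, function, or directory path.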

property follow_links

Whether to follow symlinks inside class sub-directories

Default: True

property shuffle_dataset_enabled

Shuffle the dataset directory once

Default: False

  • If True, the dataset directory is shuffled the first time it is processed, and an index containing the shuffled file names is generated in the training log directory. The index is reused to maintain the shuffled order for subsequent processing.

  • If False, the dataset samples are sorted alphabetically and saved to an index in the dataset directory. The alphabetical index file is used for subsequent processing.

property image_classes

Return a list of class labels the model should classify

property image_input_shape

Return the image input shape as a tuple of integers

property target_size

Return the target size of the generated images. The image data generator will automatically resize all images to this size. If omitted, my_model.input_shape is used.

Note

This is only used if providing a directory image dataset

property class_mode

Determines the type of label arrays that are returned.

Default: categorical

May be one of the following:

  • categorical - 2D one-hot encoded labels

  • binary - 1D binary labels

  • sparse - 1D integer labels

  • input - images identical to input images (mainly used to work with autoencoders)
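
As an illustrative sketch of the label forms (assuming three hypothetical classes; this is not MLTK-specific code):

import numpy as np
from tensorflow.keras.utils import to_categorical

sparse_labels = np.array([0, 2, 1])                 # sparse: 1D integer labels
categorical_labels = to_categorical(sparse_labels, num_classes=3)
# categorical: 2D one-hot encoded labels:
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
binary_labels = np.array([0, 1, 1, 0])              # binary: 1D labels for 2 classes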

property color_mode

The type of image data to use

Default: auto

May be one of the following:

  • auto - Automatically determine the color mode based on the input shape channels

  • grayscale - Convert the images to grayscale (if necessary). The input shape must have only 1 channel

  • rgb - The input shape must have 3 channels

  • rgba - The input shape must have 4 channels

property interpolation

Interpolation method used to resample the image if the target size is different from that of the loaded image

Default: bilinear

Supported methods are none, nearest, bilinear, bicubic, lanczos, box, and hamming. If none is used, the generated images are not automatically resized. In this case, the mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator preprocessing_function argument should be used to reshape the image to the expected model input shape.
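
When none is used, the reshaping must be done in the preprocessing function. A minimal sketch of such a function (the center-crop logic, target size, and image-in/image-out signature are assumptions; consult the ParallelImageDataGenerator documentation for the exact callback signature):

import numpy as np

TARGET_HEIGHT = 96   # assumed model input height
TARGET_WIDTH = 96    # assumed model input width

def crop_to_input_shape(img: np.ndarray) -> np.ndarray:
    # Center-crop a HxWxC image down to the expected model input size
    h, w = img.shape[:2]
    y0 = (h - TARGET_HEIGHT) // 2
    x0 = (w - TARGET_WIDTH) // 2
    return img[y0:y0 + TARGET_HEIGHT, x0:x0 + TARGET_WIDTH, ...]

This function would then be passed as the data generator's preprocessing_function argument.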

property class_counts

Dictionary of sample counts for each class

This is used for generating a summary of the dataset or when calculating class weights when my_model.class_weights=balanced.

The dictionary may contain sub-dictionaries for each subset of the dataset, e.g.:

my_model.class_counts = dict(
    training = dict(
        cat = 100,
        dog = 200,
        goat = 500
    ),
    validation = dict(
        cat = 10,
        dog = 20,
        goat = 50
    ),
    evaluation = dict(
        cat = 10,
        dog = 20,
        goat = 50
    )
)

Or it may contain just class/counts, e.g.:

my_model.class_counts = dict(
    cat = 100,
    dog = 200,
    goat = 500
)

property class_weights

Specifies how class weights should be calculated. Default: None

This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

May be one of the following:

  • If balanced is given, class weights will be given by: n_samples / (n_classes * np.bincount(y)) (see the sketch after this list)

  • If a dictionary is given, keys are classes and values are corresponding class weights.

  • If None is given, the class weights will be uniform.
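
The balanced formula can be reproduced with a short numpy sketch (the label array is hypothetical):

import numpy as np

y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2])     # hypothetical integer labels

n_samples = len(y)                # 12
n_classes = len(np.unique(y))     # 3

# balanced: n_samples / (n_classes * np.bincount(y))
weights = n_samples / (n_classes * np.bincount(y))      # [1.0, 2.0, 0.667]
class_weights = {i: w for i, w in enumerate(weights)}   # under-represented class 1 is weighted most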

property datagen

Training data generator.

Should be a reference to a mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator instance OR tensorflow.keras.preprocessing.image.ImageDataGenerator
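
For example, using the standard Keras generator (the augmentation settings below are illustrative only, not recommended values):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

my_model.datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # scale pixel values to [0, 1]
    rotation_range=10,        # random rotations up to 10 degrees
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    horizontal_flip=True
)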

property datagen_context

Loaded data generator’s context

get_datagen_creator(subset)

Return an object that creates a data generator for the given subset

Parameters:

subset (str) –

get_shuffle_index_dir()

The ParallelImageDataGenerator has the option to shuffle the dataset entries once before they're used. The shuffled indices are then saved to a file. The saved indices file is added to the generated model archive. This function loads the indices file from the archive during evaluation and validation.

Note

We do NOT want to shuffle during eval/validation so that results are reproducible (hence we use the one-time-generated indices file)

Return type:

str

property loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

property sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
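
A minimal sketch of a flat (1D) weight array, assuming in-memory numpy labels (the labels and weighting factor are hypothetical):

import numpy as np

y_train = np.array([0, 0, 0, 1, 0, 0, 1, 0])   # hypothetical training labels

# Weight samples of class 1 twice as heavily in the loss function
my_model.sample_weight = np.where(y_train == 1, 2.0, 1.0)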

property shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

property steps_per_epoch

Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and ‘steps_per_epoch’ is None, the epoch will run until the input dataset is exhausted. When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument. This argument is not supported with array inputs.
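
When set manually, a common choice is the number of training samples divided by the batch size, rounded up (the sizes here are hypothetical):

import math

n_training_samples = 10000   # hypothetical dataset size
batch_size = 32

my_model.steps_per_epoch = math.ceil(n_training_samples / batch_size)   # 313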

summarize_dataset()

Summarize the dataset

If my_model.dataset is provided then this attempts to call my_model.dataset.summarize_dataset(). If my_model.dataset is not provided or does not have the summarize_dataset() method, then this attempts to generate a summary from my_model.class_counts.

Return type:

str

unload_dataset()

Unload the dataset

property validation_batch_size

Integer or None. Number of samples per validation batch. If unspecified, will default to batch_size. Do not specify the validation_batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

property validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. Note that the validation loss of data provided using validation_split or validation_data is not affected by regularization layers like noise and dropout. validation_data will override validation_split. validation_data could be:

  • A tuple (x_val, y_val) of Numpy arrays or tensors

  • A tuple (x_val, y_val, val_sample_weights) of Numpy arrays

  • A tf.data dataset

For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, e.g. dict, generator or keras.utils.Sequence.

property validation_freq

Only relevant if validation data is provided. Integer or collections_abc.Container instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. validation_freq=[1, 2, 10] runs validation at the end of the 1st, 2nd, and 10th epochs.

property validation_split

Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling. This argument is not supported when x is a dataset, generator or keras.utils.Sequence instance.
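
The "last samples, before shuffling" behavior can be illustrated with a short sketch (the array is hypothetical):

import numpy as np

x = np.arange(10)                          # hypothetical input data
validation_split = 0.2

n_val = int(len(x) * validation_split)     # 2 samples
x_val = x[-n_val:]                         # last samples: [8, 9]
x_train = x[:-n_val]                       # remaining samples: [0 .. 7]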

property validation_steps

Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If ‘validation_steps’ is None, validation will run until the validation_data dataset is exhausted. In the case of an infinitely repeated dataset, it will run into an infinite loop. If ‘validation_steps’ is specified and only part of the dataset will be consumed, the evaluation will start from the beginning of the dataset at each epoch. This ensures that the same validation samples are used every time.

property x

Input data

It could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).

  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).
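
For example, a tf.data dataset yielding (inputs, targets) tuples could be provided as follows (the shapes and class count are hypothetical):

import numpy as np
import tensorflow as tf

inputs = np.random.rand(100, 96, 96, 1).astype('float32')   # hypothetical images
targets = np.random.randint(0, 3, size=(100,))              # hypothetical labels

my_model.x = tf.data.Dataset.from_tensor_slices((inputs, targets)).batch(32)
# Each element yields an (inputs, targets) tuple; y is left unset since the
# targets are obtained from x.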

property y

Target data

Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).

property validation_datagen

Validation/evaluation data generator.

If omitted, then datagen is used for validation and evaluation.

Should be a reference to a mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator instance OR tensorflow.keras.preprocessing.image.ImageDataGenerator

load_dataset(subset, classes=None, max_samples_per_class=-1, test=False, **kwargs)[source]

Pre-process the dataset and prepare the model dataset attributes

Parameters:
  • subset (str) –

  • classes (Optional[List[str]]) –

  • max_samples_per_class (int) –

  • test (bool) –
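
A usage sketch based on the signature above (the subset string and class names follow the examples elsewhere on this page and are illustrative):

# Load and pre-process the evaluation subset, limiting each class to 100 samples
my_model.load_dataset(
    subset='evaluation',
    classes=['cat', 'dog', 'goat'],
    max_samples_per_class=100
)

print(my_model.summarize_dataset())   # generate a summary of the loaded dataset
my_model.unload_dataset()             # release the dataset when finished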