MltkModel

class mltk.core.MltkModel(model_script_path=None)[source]

The root MLTK Model object

This must be defined in a model specification file.

Refer to the Model Specification guide for more details.

property attributes

Return all model attributes

get_attribute(name)[source]

Return attribute value or None if unknown

property cli

Custom command CLI

This is used to register custom commands. The commands may be invoked with: mltk custom <model name> [command args]
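
For example, a custom command might be registered in the model specification script roughly as follows. This is a minimal sketch that assumes the cli property exposes a Typer-style command() decorator; the command name and body are hypothetical.

# my_model is the MltkModel instance defined in the model specification script
@my_model.cli.command('dump_params')
def dump_params_command():
    # Hypothetical custom command that prints the model parameters.
    # Invoked with: mltk custom <model name> dump_params
    for key, value in my_model.model_parameters.items():
        print(f'{key} = {value}')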

property model_specification_path

Return the absolute path to the model’s specification python script

property name

The name of this model. This is the filename of the model’s Python script.

property version

The model version, e.g. 3

property description

A description of this model and how it should be used. This is added to the .tflite model flatbuffer “description” field.

property log_dir

Path to directory where logs will be generated

create_log_dir(suffix='', delete_existing=False)[source]

Create a directory for storing model log files

Return type

str

create_logger(name, parent=None)[source]

Create a logger for this model

Return type

Logger

property h5_log_dir_path

Path to .h5 model file that is generated in the log directory at the end of training

property tflite_log_dir_path

Path to .tflite model file that is generated in the log directory at the end of training (if quantization is enabled)

property unquantized_tflite_log_dir_path

Path to unquantized/float32 .tflite model file that is generated in the log directory at the end of training (if enabled)

property classes

Return a list of the class name strings this model expects

property n_classes

Return the number of classes this model expects

property input_shape

Return the image input shape as a tuple of integers

property keras_custom_objects

Get/set custom objects that should be loaded with the Keras model

See https://keras.io/guides/serialization_and_saving/#custom-objects for more details.
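
For example, if the Keras model uses a custom activation or layer, it should be registered so that the saved .h5 model can be re-loaded later. A minimal sketch; the custom function is hypothetical.

import tensorflow as tf

def scaled_tanh(x):
    # Hypothetical custom activation used somewhere in the Keras model
    return 2.0 * tf.math.tanh(x)

# Register the custom object so the saved .h5 model can be re-loaded
my_model.keras_custom_objects = {'scaled_tanh': scaled_tanh}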

property archive_path

Return path to model archive file (.mdk.zip)

property h5_archive_path

Return path to .h5 model file automatically extracted from model archive file (.mdk.zip)

property model_parameters

Dictionary of model parameters to include in the generated .tflite
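
Since this is a dictionary, entries may be added directly in the model specification script. A minimal sketch with hypothetical parameter names; the key/value pairs are embedded in the generated .tflite so they can be read back at runtime.

# Hypothetical application-specific parameters stored in the .tflite
my_model.model_parameters['detection_threshold'] = 0.6
my_model.model_parameters['average_window_ms'] = 1000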

property test_mode_enabled

Return if testing mode has been enabled

property tflite_archive_path

Return path to .tflite model file automatically extracted from model’s archive file (.mdk.zip)

property tflite_metadata_entries

Return a list of registered metadata entries that will be included in the generated .tflite

enable_test_mode()[source]

Enable testing mode

summary()[source]

Return a summary of the model

Return type

str

class mltk.core.model.model_attributes.MltkModelAttributes[source]

Container to hold the various attributes of a MltkModel

register(key, value=None, readonly=False, dtype=None, override=False, normalize=None, setter=None)[source]

Register an attribute

contains(key)[source]

Return if an attribute with the given key has been previously registered

Return type

bool

get_value(key, **kwargs)[source]

Return the value of the attribute with the given key

set_value(key, value)[source]

Set the value of an attribute with the given key

value_is_set(key)[source]

Return if the value of the attribute with the given key has been previously set
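
Taken together, the methods above allow attributes to be registered once and then accessed by key. A minimal sketch using a hypothetical attribute key:

from mltk.core.model.model_attributes import MltkModelAttributes

attributes = MltkModelAttributes()

# Register a hypothetical attribute with a default value and a type constraint
attributes.register('my_mixin.threshold', value=0.5, dtype=float)

assert attributes.contains('my_mixin.threshold')
print(attributes.get_value('my_mixin.threshold'))  # -> 0.5

attributes.set_value('my_mixin.threshold', 0.75)
assert attributes.value_is_set('my_mixin.threshold')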

TrainMixin

class mltk.core.TrainMixin[source]

Provides training properties and methods to the base MltkModel

Refer to the Model Training guide for more details.

property build_model_function

Function that builds and returns a compiled mltk.core.KerasModel instance

Your model definition MUST provide this setting.

# Create a MltkModel instance with the 'train' mixin
class MyModel(
    MltkModel,
    TrainMixin,
    ImageDatasetMixin,
    EvaluateClassifierMixin
):
    pass
mltk_model = MyModel()

# Define the model build function
def my_model_builder(mltk_model):
    keras_model = Sequential()
    keras_model.add(Conv2D(8, kernel_size=(3,3), padding='valid', input_shape=mltk_model.input_shape))
    keras_model.add(Flatten())
    keras_model.add(Dense(mltk_model.n_classes, activation='softmax'))

    keras_model.compile(loss=mltk_model.loss, optimizer=mltk_model.optimizer, metrics=mltk_model.metrics)

    return keras_model

# Set the MltkModel's build_model function
mltk_model.build_model_function = my_model_builder

property on_training_complete

Callback to be invoked after the model has been successfully trained

def _on_training_completed(results:TrainingResults):
    ...

my_model.on_training_complete = _on_training_completed

Note

This is invoked after the Keras and .tflite model files are saved

property on_save_keras_model

Callback to be invoked after the model has been trained to save the KerasModel.

This callback may be used to modify the KerasModel that gets saved, e.g. remove layers of the model that were only used for training.

def _on_save_keras_model(mltk_model:MltkModel, keras_model:KerasModel, logger:logging.Logger) -> KerasModel:
    ...

my_model.on_save_keras_model = _on_save_keras_model

Note

This is invoked before the model is quantized. Quantization will use the KerasModel returned by this callback.

property epochs

Number of epochs to train the model.

Default: 100

An epoch is an iteration over the entire x and y data provided. Note that epochs is to be understood as “final epoch”. The model is not trained for a number of iterations given by epochs, but merely until the epoch of index epochs is reached.

If this is set to -1 then the epochs will be set to an arbitrarily large value. In this case, the early_stopping callback should be used to determine when to stop training the model.

Note

The larger this value is, the longer the model will take to train
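
For example, to train until early stopping triggers rather than for a fixed number of epochs (a sketch; the monitored metric and patience value are illustrative):

# Train for an arbitrarily large number of epochs and let
# the early_stopping callback decide when to stop
my_model.epochs = -1
my_model.early_stopping = dict(
    monitor='val_accuracy',
    patience=25,
)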

property batch_size

Number of samples per gradient update

Default: 32

Typical values are: 16, 32, 64.

Typically, the larger this value is, the more RAM that is required during training.

property optimizer

String (name of optimizer) or optimizer instance

Default: adam

property metrics

List of metrics to be evaluated by the model during training and testing

Default: ['accuracy']

property loss

String (name of objective function), objective function

Default: categorical_crossentropy

property checkpoints_enabled

If true, enable saving a checkpoint after each training epoch.

Default: True

This is useful as it allows for resuming training sessions with the --resume argument to the train command.

Note

This is independent of checkpoint. This saves each epoch’s weights to the logdir/train/checkpoints directory regardless of what’s configured in checkpoint.

property train_callbacks

List of keras.callbacks.Callback instances.

Default: []

List of callbacks to apply during training.

Note

If a callback is found in this list, then the corresponding callback setting is ignored, e.g. if a LearningRateScheduler callback is found in this list, then lr_schedule is ignored.

property lr_schedule

Learning rate scheduler

Default: None

dict(
    schedule, # a function that takes an epoch index (integer, indexed from 0)
              # and current learning rate (float) as inputs and returns a new learning rate as output (float).

    verbose=0 # int. 0: quiet, 1: update messages.
)

Note

Set to None to disable

At the beginning of every epoch, this callback gets the updated learning rate value from the schedule function provided, with the current epoch and current learning rate, and applies the updated learning rate on the optimizer.
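
For example, a schedule function that decays the learning rate might look like the following sketch (the decay values are illustrative):

def my_lr_schedule(epoch: int, lr: float) -> float:
    # Keep the initial learning rate for the first 10 epochs,
    # then decay it by 5% per epoch afterwards
    if epoch < 10:
        return lr
    return lr * 0.95

my_model.lr_schedule = dict(
    schedule=my_lr_schedule,
    verbose=1,
)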

property reduce_lr_on_plateau

Reduce learning rate when a metric has stopped improving

Default: None

Possible values:

dict(
    monitor="val_loss",   # quantity to be monitored.

    factor=0.1,           # factor by which the learning rate will be reduced. new_lr = lr * factor.

    patience=10,          # number of epochs with no improvement after which learning rate will be reduced.

    mode="auto",          # one of {'auto', 'min', 'max'}. In 'min' mode, the learning rate will be reduced
                          # when the quantity monitored has stopped decreasing; in 'max' mode it will be reduced
                          # when the quantity monitored has stopped increasing; in 'auto' mode, the direction is
                          # automatically inferred from the name of the monitored quantity.

    min_delta=0.0001,     # threshold for measuring the new optimum, to only focus on significant changes.

    cooldown=0,           # number of epochs to wait before resuming normal operation after lr has been reduced.

    min_lr=0,             # lower bound on the learning rate.

    verbose=1,            # int. 0: quiet, 1: update messages.
)

Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.

Note

  • Set to None to disable this callback

  • If lr_schedule is enabled then this callback is automatically disabled
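
For example, to halve the learning rate after 5 epochs without improvement of the validation loss (illustrative values):

my_model.reduce_lr_on_plateau = dict(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6,
)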

property tensorboard

Enable visualizations for TensorBoard

Default:

dict(
     histogram_freq=1,       # frequency (in epochs) at which to compute activation and weight histograms
                             # for the layers of the model. If set to 0, histograms won't be computed.
                             # Validation data (or split) must be specified for histogram visualizations.

     write_graph=True,       # whether to visualize the graph in TensorBoard. The log file can become quite large when write_graph is set to True.

     write_images=False,     # whether to write model weights to visualize as image in TensorBoard.

     update_freq="epoch",    # 'batch' or 'epoch' or integer. When using 'batch', writes the losses and metrics
                             # to TensorBoard after each batch. The same applies for 'epoch'.
                             # If using an integer, let's say 1000, the callback will write the metrics and losses
                             # to TensorBoard every 1000 batches. Note that writing too frequently to
                             # TensorBoard can slow down your training.

     profile_batch=2,        # Profile the batch(es) to sample compute characteristics.
                             # profile_batch must be a non-negative integer or a tuple of integers.
                             # A pair of positive integers signify a range of batches to profile.
                             # By default, it will profile the second batch. Set profile_batch=0 to disable profiling.
 )

This callback logs events for TensorBoard, including:

  • Metrics summary plots

  • Training graph visualization

  • Activation histograms

  • Sampled profiling

property checkpoint

Callback to save the Keras model or model weights at some frequency

Default:

dict(
     monitor="val_accuracy",   # The metric name to monitor. Typically the metrics are set by the Model.compile method.
                               # Note:
                               # - Prefix the name with "val_" to monitor validation metrics.
                               # - Use "loss" or "val_loss" to monitor the model's total loss.
                               # - If you specify metrics as strings, like "accuracy", pass the same string (with or without the "val_" prefix).
                               # - If you pass metrics.Metric objects, monitor should be set to metric.name
                               # - If you're not sure about the metric names you can check the contents of the history.history dictionary returned by history = model.fit()
                               # - Multi-output models set additional prefixes on the metric names.

     save_best_only=True,      # if save_best_only=True, it only saves when the model is considered the "best"
                               # and the latest best model according to the quantity monitored will not be overwritten.
                               # If filepath doesn't contain formatting options like {epoch} then filepath will be overwritten by each new better model.

     save_weights_only=True,   # if True, then only the model's weights will be saved (model.save_weights(filepath)),
                               # else the full model is saved (model.save(filepath)).

     mode="auto",              # one of {'auto', 'min', 'max'}. If save_best_only=True, the decision to overwrite
                               # the current save file is made based on either the maximization or the minimization of the
                               # monitored quantity. For val_acc, this should be max, for val_loss this should be min, etc.
                               # In auto mode, the direction is automatically inferred from the name of the monitored quantity.

     save_freq="epoch",        # 'epoch' or integer. When using 'epoch', the callback saves the model after each epoch.
                               # When using integer, the callback saves the model at end of this many batches.
                               # If the Model is compiled with steps_per_execution=N, then the saving criteria will be
                               # checked every Nth batch. Note that if the saving isn't aligned to epochs,
                               # the monitored metric may potentially be less reliable (it could reflect as little
                               # as 1 batch, since the metrics get reset every epoch). Defaults to 'epoch'.

     options=None,             # Optional tf.train.CheckpointOptions object if save_weights_only is true or optional
                               # tf.saved_model.SaveOptions object if save_weights_only is false.

     verbose=0,                # verbosity mode, 0 or 1.
 )

ModelCheckpoint callback is used in conjunction with training using model.fit() to save a model or weights (in a checkpoint file) at some interval, so the model or weights can be loaded later to continue the training from the state saved.

Note

  • Set to None to disable this callback

  • The checkpoint model weights are saved to MltkModel.log_dir/train/weights

  • This is independent of checkpoints_enabled.

property early_stopping

Stop training when a monitored metric has stopped improving

Default: None

Possible values:

dict(
    monitor="val_accuracy",     # Quantity to be monitored.

    min_delta=0,                # Minimum change in the monitored quantity to qualify as an improvement,
                                # i.e. an absolute change of less than min_delta, will count as no improvement.

    patience=25,                # Number of epochs with no improvement after which training will be stopped.

    mode="auto",                # One of {"auto", "min", "max"}. In min mode, training will stop when the quantity
                                # monitored has stopped decreasing; in "max" mode it will stop when the quantity monitored
                                # has stopped increasing; in "auto" mode, the direction is automatically inferred from
                                # the name of the monitored quantity.

    baseline=None,              # Baseline value for the monitored quantity. Training will stop if
                                # the model doesn't show improvement over the baseline.

    restore_best_weights=True,  # Whether to restore model weights from the epoch with the best value of the monitored quantity.
                                # If False, the model weights obtained at the last step of training are used.

    verbose=1,                  # verbosity mode.
)

Assuming the goal of a training is to minimize the loss. With this, the metric to be monitored would be ‘loss’, and mode would be ‘min’. A model.fit() training loop will check at end of every epoch whether the loss is no longer decreasing, considering the min_delta and patience if applicable. Once it’s found no longer decreasing, model.stop_training is marked True and the training terminates.

Note

  • Set to None to disable this callback

  • Set epochs to -1 to always train until early stopping is triggered

property tflite_converter

Converts a TensorFlow model into TensorFlow Lite model

Default:

dict(
    optimizations = [tf.lite.Optimize.DEFAULT],             # Experimental flag, subject to change.
                                                            # A list of optimizations to apply when converting the model. E.g. [Optimize.DEFAULT]

    supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8],  # Experimental flag, subject to change. Set of OpsSet options supported by the device.
                                                            # Add to the 'target_spec' option
                                                            # https://www.tensorflow.org/api_docs/python/tf/lite/TargetSpec

    inference_input_type = tf.float32,                      # Data type of the input layer. Note that integer types (tf.int8 and tf.uint8) are
                                                            # currently only supported for post training integer quantization and quantization aware training.
                                                            # (default tf.float32, must be in {tf.float32, tf.int8, tf.uint8})

    inference_output_type = tf.float32,                     # Data type of the output layer. Note that integer types (tf.int8 and tf.uint8) are currently only
                                                            # supported for post training integer quantization and quantization aware training.
                                                            # (default tf.float32, must be in {tf.float32, tf.int8, tf.uint8})

    representative_dataset = 'generate',                    # A representative dataset that can be used to generate input and output samples
                                                            # for the model. The converter can use the dataset to evaluate different optimizations.
                                                            # Note that this is an optional attribute but it is necessary if INT8 is the only
                                                            # supported builtin op in the target ops.
                                                            # If the keyword 'generate' is used, then up to 1000 samples from the model's
                                                            # validation dataset are used as the representative dataset

    allow_custom_ops = False,                               # Boolean indicating whether to allow custom operations. When False, any unknown operation is an error.
                                                            # When True, custom ops are created for any op that is unknown. The developer needs to provide these to the
                                                            # TensorFlow Lite runtime with a custom resolver. (default False)

    experimental_new_converter = True,                      # Experimental flag, subject to change. Enables MLIR-based conversion instead of TOCO conversion. (default True)

    experimental_new_quantizer = True,                      # Experimental flag, subject to change. Enables MLIR-based quantization conversion instead of Flatbuffer-based conversion. (default True)

    experimental_enable_resource_variables = False,         # Experimental flag, subject to change. Enables resource variables to be converted by this converter.
                                                            # This is only allowed if from_saved_model interface is used. (default False)

    generate_unquantized = True                             # Also generate a float32/unquantized .tflite model in addition to the quantized .tflite model
)

This is used after the model finishes training. The trained Keras .h5 model file is converted to a .tflite file using the TFLiteConverter using the settings specified by this field.

If generate_unquantized=True, then both a quantized AND an unquantized .tflite model file will be generated. If you ONLY want to generate an unquantized model, then set supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS].

Note

See on_training_complete to invoke a custom callback which may be used to perform custom quantization
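
For example, to generate a fully quantized model with int8 input and output tensors, the default converter settings may be overridden in the model specification script (a sketch; the keys follow the dictionary above):

import tensorflow as tf

my_model.tflite_converter['optimizations'] = [tf.lite.Optimize.DEFAULT]
my_model.tflite_converter['supported_ops'] = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
my_model.tflite_converter['inference_input_type'] = tf.int8
my_model.tflite_converter['inference_output_type'] = tf.int8
my_model.tflite_converter['representative_dataset'] = 'generate'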

property checkpoints_dir

Return path to directory containing training checkpoints

get_checkpoint_path(epoch=None)[source]

Return the file path to the checkpoint weights for the given epoch

If no epoch is provided, then the path to the best checkpoint weights file is returned. Returns None if no checkpoint is found.

Note

Checkpoints are only generated if checkpoints_enabled is True.

Return type

str

property weights_dir

Return path to directory containing training weights

property weights_file_format

Return the file format used to generate model weights files during training

get_weights_path(filename=None)[source]

Return the path to a Keras .h5 weights file

Return type

str

EvaluateMixin

class mltk.core.EvaluateMixin[source]

Provides generic evaluation properties and methods to the base MltkModel

Refer to the Model Evaluation guide for more details.

property eval_steps_per_epoch

Total number of steps (batches of samples) before declaring the prediction round finished. Ignored with the default value of None. If x is a tf.data dataset and steps is None, predict will run until the input dataset is exhausted.

property eval_custom_function

Custom evaluation callback

This is invoked during the mltk.core.evaluate_model() API.

The given function should have the following signature:

def my_custom_eval_function(my_model:MyModel, built_model: Union[KerasModel, TfliteModel]) -> EvaluationResults:
    results = EvaluationResults(name=my_model.name)

    if isinstance(built_model, KerasModel):
        results['overall_accuracy'] = calculate_accuracy(built_model)
    return results

EvaluateClassifierMixin

class mltk.core.EvaluateClassifierMixin[source]

Provides evaluation properties and methods to the base MltkModel

Note

This mixin is specific to “classification” models

Refer to the Model Evaluation guide for more details.

property eval_shuffle

Shuffle data during evaluation

Default: False

property eval_augment

Enable random augmentations during evaluation

Default: False

Note: This is only used if the DataGeneratorDatasetMixin or a sub-class is used by the MltkModel

property eval_custom_function

Custom evaluation callback

This is invoked during the mltk.core.evaluate_model() API.

The given function should have the following signature:

def my_custom_eval_function(my_model:MyModel, built_model: Union[KerasModel, TfliteModel]) -> EvaluationResults:
    results = EvaluationResults(name=my_model.name)

    if isinstance(built_model, KerasModel):
        results['overall_accuracy'] = calculate_accuracy(built_model)
    return results

property eval_steps_per_epoch

Total number of steps (batches of samples) before declaring the prediction round finished. Ignored with the default value of None. If x is a tf.data dataset and steps is None, predict will run until the input dataset is exhausted.

property eval_max_samples_per_class

The maximum number of samples for a given class to use during evaluation. If -1, then use all available samples.

Default: -1

EvaluateAutoEncoderMixin

class mltk.core.EvaluateAutoEncoderMixin[source]

Provides evaluation properties and methods to the base MltkModel

Note

This mixin is specific to “auto-encoder” models

Refer to the Model Evaluation guide for more details.

property scoring_function

The auto-encoder scoring function to use during evaluation

If None, then use the mltk_model.loss function

Default: None

property eval_classes

List of classes to use for evaluation. The first element should be considered the ‘normal’ class; every other class is considered abnormal and compared independently. This is used if the --classes argument is not supplied to the eval command.

Default: [normal, abnormal]

property eval_augment

Enable random augmentations during evaluation

Default: False

Note: This is only used if the DataGeneratorDatasetMixin or a sub-class is used by the MltkModel

property eval_custom_function

Custom evaluation callback

This is invoked during the mltk.core.evaluate_model() API.

The given function should have the following signature:

def my_custom_eval_function(my_model:MyModel, built_model: Union[KerasModel, TfliteModel]) -> EvaluationResults:
    results = EvaluationResults(name=my_model.name)

    if isinstance(built_model, KerasModel):
        results['overall_accuracy'] = calculate_accuracy(built_model)
    return results

property eval_max_samples_per_class

The maximum number of samples for a given class to use during evaluation. If -1, then use all available samples.

Default: -1

property eval_shuffle

Shuffle data during evaluation

Default: False

property eval_steps_per_epoch

Total number of steps (batches of samples) before declaring the prediction round finished. Ignored with the default value of None. If x is a tf.data dataset and steps is None, predict will run until the input dataset is exhausted.

get_scoring_function()[source]

Return the scoring function used during evaluation

Return type

Callable

DatasetMixin

class mltk.core.DatasetMixin[source]

Provides generic dataset properties to the base MltkModel

Refer to the Model Specification guide for more details.

property x

Input data

It could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).

  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.

property y

Target data

Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).

property validation_split

Float between 0 and 1 Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling. This argument is not supported when x is a dataset, generator or keras.utils.Sequence instance.

property validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. Thus, note the fact that the validation loss of data provided using validation_split or validation_data is not affected by regularization layers like noise and dropout. validation_data will override validation_split. validation_data could be:

  • tuple (x_val, y_val) of Numpy arrays or tensors

  • tuple (x_val, y_val, val_sample_weights) of Numpy arrays

  • dataset For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.

property shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

property class_weights

Specifies how class weights should be calculated. Default: None

This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

May be one of the following:

  • If balanced is given, class weights will be given by: n_samples / (n_classes * np.bincount(y))

  • If a dictionary is given, keys are classes and values are corresponding class weights.

  • If None is given, the class weights will be uniform.
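
For example (a sketch, assuming both the string 'balanced' and an explicit dictionary are accepted as described above):

# Weight classes inversely proportional to their frequency
my_model.class_weights = 'balanced'

# Or specify explicit weights per class (illustrative values)
my_model.class_weights = {0: 1.0, 1: 2.5}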

property sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.

property steps_per_epoch

Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and ‘steps_per_epoch’ is None, the epoch will run until the input dataset is exhausted. When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument. This argument is not supported with array inputs.

property validation_steps

Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If ‘validation_steps’ is None, validation will run until the validation_data dataset is exhausted. In the case of an infinitely repeated dataset, it will run into an infinite loop. If ‘validation_steps’ is specified and only part of the dataset will be consumed, the evaluation will start from the beginning of the dataset at each epoch. This ensures that the same validation samples are used every time.

property validation_batch_size

Integer or None. Number of samples per validation batch. If unspecified, will default to batch_size. Do not specify the validation_batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

property validation_freq

Only relevant if validation data is provided. Integer or collections_abc.Container instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. validation_freq=[1, 2, 10] runs validation at the end of the 1st, 2nd, and 10th epochs.

property loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

load_dataset(subset, test=False, **kwargs)[source]

Load the dataset

Note

By default this API does not do anything. It should be overridden by a parent class.

Parameters
  • subset (str) – The dataset subset: training, validation or evaluation

  • test (bool) – If true then only load a few samples for testing

unload_dataset()[source]

Unload the dataset

DataGeneratorDatasetMixin

class mltk.core.DataGeneratorDatasetMixin[source]

Provides generic data generator properties to the base MltkModel

property class_weights

Specifies how class weights should be calculated. Default: None

This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

May be one of the following:

  • If balanced is given, class weights will be given by: n_samples / (n_classes * np.bincount(y))

  • If a dictionary is given, keys are classes and values are corresponding class weights.

  • If None is given, the class weights will be uniform.

property loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

property sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.

property shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

property steps_per_epoch

Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and ‘steps_per_epoch’ is None, the epoch will run until the input dataset is exhausted. When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument. This argument is not supported with array inputs.

property validation_batch_size

Integer or None. Number of samples per validation batch. If unspecified, will default to batch_size. Do not specify the validation_batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

property validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. Thus, note the fact that the validation loss of data provided using validation_split or validation_data is not affected by regularization layers like noise and dropout. validation_data will override validation_split. validation_data could be:

  • tuple (x_val, y_val) of Numpy arrays or tensors

  • tuple (x_val, y_val, val_sample_weights) of Numpy arrays

  • dataset For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.

property validation_freq

Only relevant if validation data is provided. Integer or collections_abc.Container instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. validation_freq=[1, 2, 10] runs validation at the end of the 1st, 2nd, and 10th epochs.

property validation_split

Float between 0 and 1 Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling. This argument is not supported when x is a dataset, generator or keras.utils.Sequence instance.

property validation_steps

Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If ‘validation_steps’ is None, validation will run until the validation_data dataset is exhausted. In the case of an infinitely repeated dataset, it will run into an infinite loop. If ‘validation_steps’ is specified and only part of the dataset will be consumed, the evaluation will start from the beginning of the dataset at each epoch. This ensures that the same validation samples are used every time.

property x

Input data

It could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).

  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.

property y

Target data

Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).

property datagen_context

Loaded data generator’s context

get_datagen_creator(subset)[source]

Return an object that creates a data generator for the given subset

unload_dataset()[source]

Unload the dataset

get_shuffle_index_dir()[source]

The ParallelAudioDataGenerator and ParallelImageDataGenerator have the option to shuffle the dataset entries once before they’re used. The shuffled indices are then saved to a file. The saved indices file is added to the generated model archive. This function loads the indices file from the archive during evaluation and validation.

Note

We do NOT want to shuffle during eval/validation so that results are reproducible (hence we use the one-time-generated indices file)

Return type

str

AudioDatasetMixin

class mltk.core.AudioDatasetMixin[source]

Provides audio dataset properties to the base MltkModel

property dataset

Path to the audio dataset’s python module, a function that manually loads the dataset, or the file path to a directory of samples.

If a Python module is provided, it must implement the function:

def load_data():
   ...

which should return the file path to the dataset’s directory

If a function is provided, the function should return the path to a directory containing the dataset’s samples.

Whether to follow symlinks inside class subdirectories

Default: True
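
For example, a minimal sketch of a dataset callback (the path and helper are hypothetical):

def my_audio_dataset() -> str:
    # Hypothetically download/extract the dataset here, then return the
    # directory containing one sub-directory of audio samples per class
    return '/path/to/audio_dataset'

my_model.dataset = my_audio_dataset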

property shuffle_dataset_enabled

Shuffle the dataset directory once

Default: false

  • If true, the dataset directory will be shuffled the first time it is processed, and an index containing the shuffled file names is generated in the training log directory. The index is reused to maintain the shuffled order for subsequent processing.

  • If false, then the dataset samples are sorted alphabetically and saved to an index in the dataset directory. The alphabetical index file is used for subsequent processing.

property class_mode

Determines the type of label arrays that are returned.

Default: categorical

  • categorical - 2D one-hot encoded labels

  • binary - 1D binary labels

  • sparse - 1D integer labels

  • input - images identical to input images (mainly used to work with autoencoders)

property audio_classes

Return a list of class labels the model should classify

property audio_input_shape

Get the shape of the spectrogram generated by the mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGenerator as (height, width, 1)

Note

If frontend_enabled = True, then the input size is automatically calculated based on the mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGeneratorSettings. If frontend_enabled = False, then the input size must be manually specified.

property sample_length_ms

Get the data generator sample length in milliseconds

property sample_rate_hz

Get the data generator sample rate in hertz

property class_weights

Specifies how class weights should be calculated. Default: None

This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

May be one of the following:

  • If balanced is given, class weights will be given by: n_samples / (n_classes * np.bincount(y))

  • If a dictionary is given, keys are classes and values are corresponding class weights.

  • If None is given, the class weights will be uniform.

property datagen_context

Loaded data generator’s context

property frontend_settings

Get the data generator’s mltk.core.preprocess.audio.audio_feature_generator.AudioFeatureGeneratorSettings

property loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

property sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.

property shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

property steps_per_epoch

Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and ‘steps_per_epoch’ is None, the epoch will run until the input dataset is exhausted. When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument. This argument is not supported with array inputs.

property validation_batch_size

Integer or None. Number of samples per validation batch. If unspecified, will default to batch_size. Do not specify the validation_batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

property validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. Thus, note the fact that the validation loss of data provided using validation_split or validation_data is not affected by regularization layers like noise and dropout. validation_data will override validation_split. validation_data could be:

  • tuple (x_val, y_val) of Numpy arrays or tensors

  • tuple (x_val, y_val, val_sample_weights) of Numpy arrays

  • dataset For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.

property validation_freq

Only relevant if validation data is provided. Integer or collections_abc.Container instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. validation_freq=[1, 2, 10] runs validation at the end of the 1st, 2nd, and 10th epochs.

property validation_split

Float between 0 and 1 Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling. This argument is not supported when x is a dataset, generator or keras.utils.Sequence instance.

property validation_steps

Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If ‘validation_steps’ is None, validation will run until the validation_data dataset is exhausted. In the case of an infinitely repeated dataset, it will run into an infinite loop. If ‘validation_steps’ is specified and only part of the dataset will be consumed, the evaluation will start from the beginning of the dataset at each epoch. This ensures that the same validation samples are used every time.

property x

Input data

It could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).

  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.

property y

Target data

Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).

property datagen

Training data generator.

Should be a reference to a mltk.core.preprocess.audio.parallel_generator.ParallelAudioDataGenerator instance

property validation_datagen

Validation/evaluation data generator.

If omitted, then datagen is used for validation and evaluation.

Should be a reference to a mltk.core.preprocess.audio.parallel_generator.ParallelAudioDataGenerator instance

load_dataset(subset, classes=None, max_samples_per_class=-1, test=False, **kwargs)[source]

Pre-process the dataset and prepare the model dataset attributes

Parameters

subset (str) – Data subset name

ImageDatasetMixin

class mltk.core.ImageDatasetMixin[source]

Provides image dataset properties to the base MltkModel

property dataset

Path to the image dataset’s python module, a function that manually loads the dataset, or the file path to a directory of samples.

If a Python module is provided, it must implement the function:

def load_data():
   ...

The load_data() function should either return a tuple as: (x_train, y_train), (x_test, y_test) OR it should return the path to a directory containing the dataset’s samples.

If a function is provided, the function should return the tuple: (x_train, y_train), (x_test, y_test) OR it should return the path to a directory containing the dataset’s samples.

Whether to follow symlinks inside class sub-directories

Default: True
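
For example, a sketch of a dataset callback that returns the data tuples directly (the Keras MNIST dataset is used purely for illustration):

def my_image_dataset():
    # Returns (x_train, y_train), (x_test, y_test)
    from tensorflow.keras.datasets import mnist
    return mnist.load_data()

my_model.dataset = my_image_dataset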

property shuffle_dataset_enabled

Shuffle the dataset directory once

Default: false

  • If true, the dataset directory will be shuffled the first time it is processed, and an index containing the shuffled file names is generated in the training log directory. The index is reused to maintain the shuffled order for subsequent processing.

  • If false, then the dataset samples are sorted alphabetically and saved to an index in the dataset directory. The alphabetical index file is used for subsequent processing.

property image_classes

Return a list of class labels the model should classify

property image_input_shape

Return the image input shape as a tuple of integers

property target_size

Return the target size of the generated images. The image data generator will automatically resize all images to this size. If omitted, my_model.input_shape is used.

Note

This is only used if providing a directory image dataset

property class_mode

Determines the type of label arrays that are returned. Default: categorical

  • categorical - 2D one-hot encoded labels

  • binary - 1D binary labels

  • sparse - 1D integer labels

  • input - images identical to input images (mainly used to work with autoencoders)

property color_mode

The type of image data to use

Default: auto

May be one of the following:

  • auto - Automatically determine the color mode based on the input shape channels

  • grayscale - Convert the images to grayscale (if necessary). The input shape must only have 1 channel

  • rgb - The input shape must only have 3 channels

  • rgba - The input shape must have 4 channels

property interpolation

Interpolation method used to resample the image if the target size is different from that of the loaded image

Default: bilinear

Supported methods are none, nearest, bilinear, bicubic, lanczos, box and hamming. If none is used then the generated images are not automatically resized. In this case, the mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator preprocessing_function argument should be used to reshape the image to the expected model input shape.

property class_weights

Specifies how class weights should be calculated. Default: None

This can be useful to tell the model to “pay more attention” to samples from an under-represented class.

May be one of the following:

  • If balanced is given, class weights will be given by: n_samples / (n_classes * np.bincount(y))

  • If a dictionary is given, keys are classes and values are corresponding class weights.

  • If None is given, the class weights will be uniform.

property datagen_context

Loaded data generator’s context

property loaded_subset

The currently loaded dataset subset: training, validation, or evaluation

property sample_weight

Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.

property shuffle

Boolean (whether to shuffle the training data before each epoch) or str (for ‘batch’). This argument is ignored when x is a generator. ‘batch’ is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

property steps_per_epoch

Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and ‘steps_per_epoch’ is None, the epoch will run until the input dataset is exhausted. When passing an infinitely repeating dataset, you must specify the steps_per_epoch argument. This argument is not supported with array inputs.

property validation_batch_size

Integer or None. Number of samples per validation batch. If unspecified, will default to batch_size. Do not specify the validation_batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).

property validation_data

Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. Thus, note the fact that the validation loss of data provided using validation_split or validation_data is not affected by regularization layers like noise and dropout. validation_data will override validation_split. validation_data could be:

  • tuple (x_val, y_val) of Numpy arrays or tensors

  • tuple (x_val, y_val, val_sample_weights) of Numpy arrays

  • dataset For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.

property validation_freq

Only relevant if validation data is provided. Integer or collections_abc.Container instance (e.g. list, tuple, etc.). If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs. If a Container, specifies the epochs on which to run validation, e.g. validation_freq=[1, 2, 10] runs validation at the end of the 1st, 2nd, and 10th epochs.

property validation_split

Float between 0 and 1 Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling. This argument is not supported when x is a dataset, generator or keras.utils.Sequence instance.

property validation_steps

Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch. If ‘validation_steps’ is None, validation will run until the validation_data dataset is exhausted. In the case of an infinitely repeated dataset, it will run into an infinite loop. If ‘validation_steps’ is specified and only part of the dataset will be consumed, the evaluation will start from the beginning of the dataset at each epoch. This ensures that the same validation samples are used every time.

property x

Input data

It could be:

  • A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).

  • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

  • A dict mapping input names to the corresponding array/tensors, if the model has named inputs.

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

  • A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.

property y

Target data

Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).

property datagen

Training data generator.

Should be a reference to a mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator instance OR tensorflow.keras.preprocessing.image.ImageDataGenerator
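
For example, a sketch using the standard Keras ImageDataGenerator (the augmentation parameters are illustrative; a ParallelImageDataGenerator may be used in its place as noted above):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

my_model.datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    rotation_range=15,       # random rotations up to 15 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,
)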

property validation_datagen

Validation/evaluation data generator.

If omitted, then datagen is used for validation and evaluation.

Should be a reference to a mltk.core.preprocess.image.parallel_generator.ParallelImageDataGenerator instance OR tensorflow.keras.preprocessing.image.ImageDataGenerator

load_dataset(subset, classes=None, max_samples_per_class=-1, test=False, **kwargs)[source]

Pre-process the dataset and prepare the model dataset attributes