Models Examples

Storing label enumeration

Use the Task.set_model_label_enumeration() method to store class enumeration:

from trains import Task

Task.current_task().set_model_label_enumeration({"label": 0})
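
In practice the mapping usually covers every class the model predicts. The following is a minimal sketch of building such a dictionary, assuming a hypothetical list of class names. The names class_names and build_label_enumeration are illustrative and are not part of Trains.

```python
from typing import Dict, List


def build_label_enumeration(class_names: List[str]) -> Dict[str, int]:
    """Map each class name to a consecutive integer ID, starting at 0."""
    if len(set(class_names)) != len(class_names):
        raise ValueError("class names must be unique")
    return {name: index for index, name in enumerate(class_names)}


# Hypothetical class names, for illustration only
class_names = ["background", "cat", "dog"]
labels = build_label_enumeration(class_names)
print(labels)
```

The resulting dictionary has the same shape as the argument passed to set_model_label_enumeration above, so it can be passed to that method once a task is running.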

Manually configuring models

manual_model_config.py is an example of manually configuring a model, model storage, label enumeration values, and logging.

The features in this example appear in the following experiment results tabs:

  • ARTIFACTS
    • Output model (a link to the output model details on the Projects page, models table, including label enumeration values).
    • Model Configuration.
  • RESULTS
    • LOG - Console standard output/error.

Manually logging models

Use the InputModel.import_model and Task.connect methods to manually connect an input model. Use the OutputModel.update_weights method to manually connect a model weights file.

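from trains import Task, InputModel, OutputModel

# Register an existing model file as this experiment's input model
input_model = InputModel.import_model(link_to_initial_model_file)
Task.current_task().connect(input_model)

# Register a new weights file as this experiment's output model
OutputModel(Task.current_task()).update_weights(link_to_new_model_file_here)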

Checkpointing models

Trains supports several ways to set a destination location for checkpointing models, including:

  • In experiment code, pass the output_uri parameter to the Task.init method (see the example below).
  • In the Trains configuration file, set the default_output_uri option, either on your machine or on a Trains Agent worker.
  • In the Trains Server Web-App (UI), when rerunning, reproducing, or tuning experiments.

The storage destination can be a folder, URI, or bucket. If you use object storage such as S3, provide your storage credentials in the Trains configuration file.
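
To make the three kinds of destination concrete, here is a small sketch that classifies an output_uri value by its URI scheme. The helper describe_destination is hypothetical and is not part of Trains. It only illustrates how a local path, a server address, and a bucket URI differ, and it assumes POSIX-style paths.

```python
from urllib.parse import urlparse


def describe_destination(output_uri: str) -> str:
    """Classify a destination string by its URI scheme (illustrative only)."""
    scheme = urlparse(output_uri).scheme.lower()
    if scheme in ("", "file"):
        return "folder"
    if scheme in ("http", "https"):
        return "server URI"
    if scheme in ("s3", "gs"):
        return "bucket"
    return "unknown"


for uri in ("/mnt/shared/folder",
            "http://my_trains_server:8081/",
            "s3://bucket-name/folder"):
    print(uri, "->", describe_destination(uri))
```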

For example, set Trains Server as the destination location in code by specifying output_uri="http://my_trains_server:8081/" when you call the Task.init method. Trains automatically uploads every checkpoint to that server, where you can download it later.

As another example, set the destination location in code to a shared folder:

task = Task.init(project_name, task_name, output_uri="/mnt/shared/folder")

Trains will copy all stored snapshots into a subfolder under /mnt/shared/folder. The subfolder's name will contain the experiment's ID. If the experiment's ID is 6ea4f0b56d994320a713aeaf13a86d9d, the following folder will be used:

/mnt/shared/folder/task.6ea4f0b56d994320a713aeaf13a86d9d/models/
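
The folder naming described above can be sketched as a small helper. This is only an illustration of the documented layout, not Trains's internal implementation. The function name checkpoint_folder is hypothetical.

```python
import posixpath


def checkpoint_folder(output_uri: str, task_id: str) -> str:
    """Build the models folder for an experiment, following the layout
    <output_uri>/task.<experiment ID>/models/ described in the text."""
    return posixpath.join(output_uri, "task." + task_id, "models") + "/"


folder = checkpoint_folder("/mnt/shared/folder",
                           "6ea4f0b56d994320a713aeaf13a86d9d")
print(folder)
# prints /mnt/shared/folder/task.6ea4f0b56d994320a713aeaf13a86d9d/models/
```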

Trains supports other storage types for output_uri, including:

# AWS S3 bucket
task = Task.init(project_name, task_name, output_uri="s3://bucket-name/folder")
# Google Cloud Storage bucket
task = Task.init(project_name, task_name, output_uri="gs://bucket-name/folder")

To use cloud storage with Trains, configure the storage credentials in your ~/trains.conf file. For detailed information, see Trains Configuration Reference.

Continue previous training

After an experiment trains a model, you can continue training that same model in another experiment, and start the continued training at the iteration where the earlier experiment finished.

For example, if an experiment trains a model for 10000 iterations, a second experiment can run later and start training the same model at iteration 10001.
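
The iteration arithmetic in this example can be sketched in plain Python. The helper continued_iteration is hypothetical and assumes the second experiment counts its own steps from 1. It only shows how an offset equal to the first experiment's last iteration places the second run on the same iteration axis.

```python
def continued_iteration(last_iteration: int, local_step: int) -> int:
    """Map a 1-based step of the second run onto the overall iteration axis."""
    if local_step < 1:
        raise ValueError("local_step is 1-based")
    return last_iteration + local_step


last_iteration = 10000  # where the first experiment finished
first_three = [continued_iteration(last_iteration, step) for step in (1, 2, 3)]
print(first_three)
# prints [10001, 10002, 10003]
```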

Here is an example using torch.

The first experiment specifies output_uri as https://localhost:8081 in the call to the Task.init method. This saves the model to the Trains Server.

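import torch
from trains import Task

task = Task.init(project_name='demo', task_name='train stage1', output_uri='https://localhost:8081')
# ... build and train `model` (assumed here to be a torch.nn.Module) ...
# torch.save takes the object to save as well as the file name
torch.save(model.state_dict(), 'model.pt')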

The second experiment also specifies output_uri as https://localhost:8081, so its own model is saved to the same server. It calls the Task.get_task method to get the first experiment, and the Task.get_last_iteration method to find where the previous training ended. It then calls the Task.set_initial_iteration method so that the second experiment starts where the previous training ended, and the Model.get_local_copy method to retrieve the model into the second experiment.

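import torch
from trains import Task

task = Task.init(project_name='demo', task_name='train stage2', output_uri='https://localhost:8081')
previous_task = Task.get_task(project_name='demo', task_name='train stage1')
# Continue the iteration count from where the first experiment stopped
task.set_initial_iteration(previous_task.get_last_iteration())
# Download the last output model of the first experiment
local_model = previous_task.models['output'][-1].get_local_copy()
state_dict = torch.load(local_model)
# ... rebuild `model`, call model.load_state_dict(state_dict), and keep training ...
torch.save(model.state_dict(), 'model2.pt')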