Executing Experiments Remotely

Remote execution in Trains allows you to automatically execute experiments on a single remote machine or on multiple remote machines. Add any experiment whose status is Draft to a queue, and a worker listening to that queue will execute it.

On this page, we explain how to remotely execute experiments using the Trains Web-App (UI). You can also remotely execute experiments programmatically; see the Automation tutorial, the Hyperparameter optimization and Task pipelining examples, and the Trains Python Client reference page.
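
For a preview of the programmatic route, the following is a minimal sketch of the core mechanic described above: enqueuing a Draft experiment so that a worker executes it. The task ID and queue name are placeholders for your own values.

```python
from trains import Task

# Fetch an existing experiment whose status is Draft
# (replace the ID with your own task's ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Add it to a queue; a worker listening to that queue will execute it
Task.enqueue(task, queue_name='default')
```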

Remote execution in Trains supports several workflows, including:

  • Rerun an existing experiment, with or without changes.
    • For example, to rerun an experiment for more iterations on a machine with greater resources.
  • Reproduce an experiment, by creating an exact copy of it, and not modifying the copy.
    • For example, to replicate an experiment that previously ran on a different machine.
  • Tune an experiment, by creating an exact copy of it, and then modifying parts of the copy.
    • For example, to tweak an experiment, and then compare it to other experiments.

Requirements

Before executing experiments remotely, you need a running worker listening to a queue. You can set up Trains Agent yourself, or a DevOps or other IT group may set it up for you.

To set up Trains Agent, and for information about Trains Agent commands, see the Installing and Configuring Trains Agent page, Trains Agent Use Case Examples, and the Trains Agent Reference.
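
For example, once Trains Agent is installed and configured, running `trains-agent daemon --queue default` starts a worker that listens to the default queue (substitute the name of any queue you created).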

Rerunning experiments

Rerun an existing experiment, with or without changes. Rerunning does not create a new experiment in Trains; however, it does overwrite the existing Task object in Trains Server. To avoid overwriting the existing object, see Reproducing experiments and Tuning experiments.

To rerun an experiment:

  1. On the Projects page, click the project card or the All projects card.

    The project page appears, showing the experiments table, which contains all active experiments in the project (some inactive experiments may be in the archive).

  2. In the experiment table, right-click the experiment > Reset > RESET. The experiment's status becomes Draft.

    This image shows the experiment before resetting. The status is Completed from the previous run, and the log is also from the previous run.

    This image shows the experiment after resetting. The status is now Draft, and the log is empty because resetting deleted the log from the previous run.

  3. If you want to make changes to the experiment (for example, select different source code, or tune the hyperparameters), see Modifying experiments.
  4. Right-click the experiment > Enqueue > Select a queue > ENQUEUE.

    This image shows the experiment after enqueuing it. The status is now Pending, because the experiment is waiting for a worker to fetch it from the queue.

    This image shows the experiment after the worker fetched it from the queue. The status is now Running, because the worker began executing it. The worker is monitoring execution and updating Trains Server. Notice that the log shows the worker is installing Python packages in the Python virtual environment that the worker built in the cache.

    This image shows the experiment after the worker finishes executing it. The status is now Completed. Notice that the log shows output to stdout during training (accuracy at each step), and the final log entry indicates that the experiment finished uploading to Trains Server.

You can track the experiment and compare it to other experiments while it is running and after it completes.
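
The rerun flow can also be scripted with the Trains Python client. The following is a minimal sketch; the task ID and queue name are placeholders for your own values.

```python
from trains import Task

# Fetch the previously executed experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Reset it back to Draft status, clearing the previous run's outputs
task.reset()

# Enqueue it; a worker listening to the queue will execute it
Task.enqueue(task, queue_name='default')
```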

Reproducing experiments

Reproduce an existing experiment by creating an exact copy of it, without modifying the copy. This creates a new experiment in Trains, and a new Task object in Trains Server. In Trains, this is called cloning. To modify a cloned experiment, see Tuning experiments.

  1. On the Projects page, click the project card or the All projects card.

    The project page appears, showing the experiments table, which contains all active experiments in the project (some inactive experiments may be in the archive).

  2. In the experiment table, right-click the experiment > Clone.
  3. Select a project, type a name for the newly cloned experiment, and optionally type a description.
  4. Click CLONE. The newly cloned experiment's details pane appears. The experiment's status is Draft.

    This image shows the copied experiment (the clone) with its details pane open, and the original experiment below it.

  5. Right-click the experiment > Enqueue > Select a queue > ENQUEUE.
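
The clone-and-enqueue flow can also be scripted with the Trains Python client. The following is a minimal sketch; the task ID, experiment name, and queue name are placeholders for your own values.

```python
from trains import Task

# Fetch the experiment to reproduce (replace with your own task ID)
source = Task.get_task(task_id='aabbccdd11223344')

# Create an exact copy; the clone starts in Draft status
cloned = Task.clone(source_task=source, name='Clone of my experiment')

# Enqueue the copy without modifying it
Task.enqueue(cloned, queue_name='default')
```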

Tuning experiments

Tune an existing experiment by creating an exact copy of it, and then modifying parts of the copy. This creates a new experiment in Trains, and a new Task object in Trains Server.

  1. On the Projects page, click the project card or the All projects card.

    The project page appears, showing the experiments table, which contains all active experiments in the project (some inactive experiments may be in the archive).

  2. In the experiment table, right-click the experiment > Clone.
  3. Select a project, type a name for the newly cloned experiment, and optionally type a description.
  4. Click CLONE. The newly cloned experiment's details pane appears. The experiment's status is Draft.
  5. Make changes to the experiment (for example, select different source code, change the hyperparameters, select a new initial weights input model, or edit other editable experiment components); see Modifying experiments and the sketch after this list.
  6. Right-click the experiment > Enqueue > Select a queue > ENQUEUE.
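
Programmatically, tuning follows the same pattern as reproducing, with a modification step before enqueuing. The following is a minimal sketch; the task ID, hyperparameter names, and queue name are placeholders for your own values.

```python
from trains import Task

# Clone the experiment to tune (replace with your own task ID)
source = Task.get_task(task_id='aabbccdd11223344')
tuned = Task.clone(source_task=source, name='Tuned copy')

# Modify hyperparameters on the Draft copy
tuned.set_parameters({'learning_rate': 0.001, 'batch_size': 64})

# Enqueue the modified copy for remote execution
Task.enqueue(tuned, queue_name='default')
```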

Modifying experiments

When you rerun or tune experiments, you can modify any combination of the following:

Selecting source code

  1. In the experiment details pane, click the EXECUTION tab.
  2. Do any of the following:

    • Select a different repository, commit ID (a specific commit, a tag name, or the last commit in the branch), script, and working directory.
    • Discard or edit the uncommitted changes.
    • Change the installed Python packages and / or their versions.

Repository, commit ID, script, and working directory

  • Hover over SOURCE CODE > EDIT > Change the repository, commit ID, script file name, and / or the working directory > SAVE.

For example, this image shows the COMMIT ID list, where you can select a specific commit ID, a tag, or the last commit in the branch.

Uncommitted changes

  • Hover over UNCOMMITTED CHANGES > DISCARD (delete) or EDIT > SAVE.

Installed packages and versions

  • Hover over INSTALLED PACKAGES > EDIT > Change the packages and / or versions > SAVE.

Tuning hyperparameters

Tune your experiment by adding, changing, or deleting hyperparameters and their values.

  1. In the experiment details pane, click the HYPER PARAMETERS tab.
  2. Hover over the HYPER PARAMETERS area and then click EDIT.
  3. Add, change, or delete hyperparameters.
  4. Click SAVE.
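
Hyperparameters of a Draft experiment can also be read and updated with the Trains Python client. The following is a minimal sketch; the task ID and parameter name are placeholders for your own values.

```python
from trains import Task

# Fetch a Draft experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Read the current hyperparameters as a dictionary
params = task.get_parameters_as_dict()

# Change a value and write the parameters back to the experiment
params['epochs'] = 20
task.set_parameters_as_dict(params)
```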

Selecting an initial weights input model

Select a different initial weights input model. For example, test your data with a teammate's model.

  1. In the experiment details pane, click the ARTIFACTS tab.
  2. In the Input Model area, click EDIT.
  3. In the SELECT MODEL dialog, select a model and then click CONFIRM.
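
If you know the name or ID of the model you want, the initial weights input model can also be set with the Trains Python client. The following is a minimal sketch; the task ID and model name are placeholders for your own values.

```python
from trains import Task

# Fetch a Draft experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Point the experiment at a different initial weights input model,
# for example a model trained by a teammate
task.set_input_model(model_name='teammate_resnet50_baseline')
```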

Modifying the configuration

  1. In the experiment details pane, click the ARTIFACTS tab.
  2. In the MODEL CONFIGURATION area, click EDIT.
  3. Change the configuration.
  4. Click SAVE.
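
The model configuration can also be replaced with the Trains Python client. The following is a minimal sketch, assuming a plain-text design similar to the one shown in the Web-App; the task ID and configuration text are placeholders.

```python
from trains import Task

# Fetch a Draft experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Replace the model configuration text for the next run
task.set_model_config(config_text='hidden_layers: 3\ndropout: 0.25')
```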

Setting the output destination

Set the output destination for checkpointing models and storing artifacts. For more information, see Checkpointing models in the "Examples" section.

  1. In the experiment details pane, click the EXECUTION tab.
  2. Hover over OUTPUT > EDIT > In DESTINATION, type the destination path > SAVE.
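
In code, the output destination is usually set at task creation, in the experiment's own script. The following is a minimal sketch; the project name, task name, and destination URI are placeholders for your own values.

```python
from trains import Task

# output_uri sets where model checkpoints and artifacts are uploaded,
# for example an S3 bucket, a GCS bucket, or a shared folder
task = Task.init(
    project_name='my project',
    task_name='my experiment',
    output_uri='s3://my-bucket/training-output',
)
```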

Specifying a base Docker image

Execute the experiment in a pre-configured Docker container. To create a base Docker image, see Base Docker image on the "Trains Agent Use Case Examples" page.

  1. In the experiment details pane, click the EXECUTION tab.
  2. Hover over AGENT CONFIGURATION > EDIT > In BASE DOCKER IMAGE, type the URL of the Docker image > SAVE.
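
The base Docker image can also be set with the Trains Python client. The following is a minimal sketch; the task ID and image name are placeholders for your own values.

```python
from trains import Task

# Fetch a Draft experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# An agent running in Docker mode will execute the experiment
# inside this image
task.set_base_docker('nvidia/cuda:10.1-runtime-ubuntu18.04')
```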

Choosing a log level

  1. In the experiment details pane, click the EXECUTION tab.
  2. Hover over OUTPUT > EDIT > In LOG LEVEL, type the log level > SAVE.

Terminating running experiments

To terminate a running experiment, abort it (for example, if the experiment requires changes).

To terminate an experiment:

  • In the experiment table, right-click the experiment > Abort > ABORT. The experiment's status changes to Aborted. The Trains Web-App (UI) shows the results up to that point, including the last iteration performed.
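
From code, a task can also be marked as stopped with the Trains Python client. Note that aborting from the Web-App (UI) is the primary flow; this sketch only sets the task's status, and the task ID is a placeholder.

```python
from trains import Task

# Fetch the running experiment (replace with your own task ID)
task = Task.get_task(task_id='aabbccdd11223344')

# Mark the task as stopped
task.mark_stopped()
```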

Read-only experiments (Publishing)

When you want to prevent changes to an experiment, make it read-only by Publishing it.

  • In the experiment table, right-click the experiment > Publish. The status changes to Published.