Concepts and Architecture

Concepts

Tasks / Experiments

A Task in Trains is the internal representation of an experiment. More specifically, a Task is a class defining the functionality and features that connect your Python experiment script to Trains. All the parts of an experiment, such as the models, hyperparameters, and logging, connect to a Task. A Task object instantiated from the Task class is, effectively, your experiment in Trains.

Artifacts

In Trains, artifacts include input models, output models, and other artifacts you can store with your experiment.

Trains automatically logs input and output models for frameworks including PyTorch, TensorFlow, Keras, XGBoost, and scikit-learn. When an experiment runs, Trains logs the initial model weights file and the output model, and provides a link to each.

Registered artifacts are dynamically synchronized with Trains so that changes to the artifact are updated and appear in the experiment.

Uploaded artifacts are one-time, static uploads. Artifacts you can store include the following:

  • Pandas DataFrames - dynamically updated DataFrames and one-time, static uploads
  • Files of any type, including image files
  • Folders - stored as ZIP files
  • Images - stored as PNG files
  • Dictionaries - stored as JSONs
  • Numpy arrays - stored as NPZ files
  • Objects of any other type that you require

You can extend Trains artifact capabilities in many ways, such as manually specifying models, storing model snapshots, connecting configurations, and setting the class enumeration for experiments.
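
As an illustration of a one-time, static upload, the following is a minimal sketch rather than a definitive implementation. It assumes that a Task exposes an `upload_artifact(name, artifact_object)` method; `RecordingTask` and `store_run_summary` are hypothetical names introduced here so that the example runs without a Trains Server.

```python
import json


class RecordingTask:
    """Hypothetical stand-in for a Trains Task so the sketch runs offline."""

    def __init__(self):
        self.uploaded = {}

    def upload_artifact(self, name, artifact_object=None):
        # Assumption: a real Task stores the object with the experiment.
        # This stand-in only keeps the JSON text a dictionary would become.
        self.uploaded[name] = json.dumps(artifact_object, sort_keys=True)
        return True


def store_run_summary(task, summary):
    """Upload a one-time, static snapshot of a results dictionary."""
    return task.upload_artifact(name="run summary", artifact_object=summary)


# With Trains installed, the task would instead come from something like
# (assumed usage): Task.init(project_name="examples", task_name="artifacts")
task = RecordingTask()
uploaded_ok = store_run_summary(task, {"accuracy": 0.91, "epochs": 3})
print(task.uploaded["run summary"])
```

Because the upload is static, later changes to the dictionary are not reflected in the stored artifact; a registered artifact would be the choice when the stored copy should follow changes.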

Logging

Automagical logging

Trains automatically logs all of the following:

  • Git repository, branch, commit ID, entry point, and local Git diff.
  • Python environment, including specific packages and versions.
  • stdout and stderr.
  • Resource Monitoring (CPU/GPU utilization, temperature, IO, network, and more).
  • Hyperparameters:
    • ArgParser for command line parameters with currently used values.
    • TensorFlow Defines (absl-py).
  • Initial model weights file.
  • Model snapshots, with optional automatic upload to central storage. Storage options include shared folders, S3, GS, Azure, and HTTP.
  • Artifact logging and storage, including shared folders, S3, GS, Azure, and HTTP.
  • TensorBoard/TensorBoardX scalars, metrics, histograms, images (with audio coming soon).
  • Matplotlib & Seaborn.
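
To make "command line parameters with currently used values" concrete, here is a small standard-library sketch. It only shows how argparse resolves defaults and overrides into the values an experiment actually runs with; the argument names are hypothetical, and the assumption is that Trains captures these values once a Task has been created in the same script.

```python
import argparse


def build_parser():
    """Define the command-line hyperparameters of a hypothetical training script."""
    parser = argparse.ArgumentParser(description="training script")
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.01)
    return parser


# Only --lr is overridden here, so batch_size keeps its default value.
args = build_parser().parse_args(["--lr", "0.1"])

# The "currently used values" are the defaults merged with the overrides.
hyperparameters = vars(args)
print(hyperparameters)
```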

Explicit logging

In addition, Trains supports explicit logging. To add explicit reporting to your Python experiment script, get a Trains logger for your Task (this connects all explicit reporting to the experiment). Once you have a logger, you can add the following explicit reporting for tracking, analysis, and experiment comparisons:

  • Plot scalar metrics.
  • Plot any data using a variety of chart types, including histograms, confusion matrices, surface diagrams, 2D or 3D scatter diagrams.
  • Log messages with log levels, including errors, warnings, debugging, and information.
  • Upload images.
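
The following sketch shows what explicit scalar reporting might look like. It assumes the logger exposes `report_scalar(title, series, value, iteration)`; `RecordingLogger` and `report_losses` are hypothetical names used so the example runs without a Trains Server. In a real script, the logger would be the one obtained from your Task.

```python
class RecordingLogger:
    """Hypothetical stand-in for a Trains logger so the sketch runs offline."""

    def __init__(self):
        self.scalars = []

    def report_scalar(self, title, series, value, iteration):
        # Assumption: a real logger sends the point to the Trains backend.
        self.scalars.append((title, series, value, iteration))


def report_losses(logger, losses):
    """Report one scalar point per training iteration."""
    for iteration, loss in enumerate(losses):
        logger.report_scalar(
            title="loss", series="train", value=loss, iteration=iteration
        )


logger = RecordingLogger()
report_losses(logger, [0.9, 0.5, 0.2])
print(len(logger.scalars), "points reported")
```

Passing the logger in as a parameter keeps the reporting code independent of how the Task was created, which also makes it easy to exercise offline.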

Workers and queues

Using Trains, you can run your Python experiment scripts and then track, analyze, and compare experiments.

You can also use Trains to execute experiments in Docker containers, on your local computer, on remote computers, and in the cloud, using our agent. We refer to that agent as a worker.

Worker daemons fetch Tasks (the Trains internal representation of experiments) from the queues they listen to. Queues reside in the Trains backend and are managed by a Trains Server: either our demo Trains Server (https://demoapp.trains.allegro.ai/dashboard) or your own locally-hosted Trains Server. Workers are implemented with Trains Agent.

When a worker daemon fetches a Task from a queue, the worker does the following:

  1. Builds a cached Python virtual environment.
  2. Clones the experiment source code into that Python virtual environment.
  3. Installs the required Python package versions, if not previously installed and cached (workers reuse cached Python packages).
  4. Executes the experiment on a GPU machine, with logging and monitoring. The results are available to you in the Web-App of our demo Trains Server (https://demoapp.trains.allegro.ai/dashboard), or in the Trains Web-App of your own locally-hosted Trains Server, if you deploy one.
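
The four steps above can be modeled in a few lines. This is a conceptual sketch only, not the Trains Agent implementation: the queue is a plain in-memory deque, and the function name, dictionary keys, and step strings are all hypothetical.

```python
from collections import deque


def run_worker_once(queue, package_cache):
    """Fetch one task from the queue and return the ordered steps a worker takes."""
    if not queue:
        return []  # nothing enqueued, so the worker has nothing to do
    task = queue.popleft()
    steps = ["build cached virtual environment", "clone " + task["repo"]]
    for package in task["packages"]:
        if package in package_cache:
            # Previously installed and cached packages are reused.
            steps.append("reuse cached " + package)
        else:
            steps.append("install " + package)
            package_cache.add(package)
    steps.append("execute " + task["entry_point"])
    return steps


queue = deque([
    {"repo": "repo-a", "packages": ["numpy==1.18.1", "torch==1.4.0"],
     "entry_point": "train.py"},
    {"repo": "repo-b", "packages": ["numpy==1.18.1"], "entry_point": "eval.py"},
])
cache = set()
first_steps = run_worker_once(queue, cache)
second_steps = run_worker_once(queue, cache)
print(first_steps)
print(second_steps)
```

In this model the second task skips the install of a package the first task already cached, which is the reuse behavior described in step 3.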

Experiment states

In Trains, an experiment's state describes its status as it moves through your workflow, and indicates whether the experiment is read-only or editable for tuning.

The following are the states (statuses) of experiments using Trains:

  • Draft - The experiment is not running (a worker is not running it) and can be enqueued for a worker daemon to fetch and execute it. The experiment is editable. Only Draft experiments are editable.
  • Pending - The experiment is in a queue waiting to be run by a worker and can be dequeued.
  • Running - A worker is running the experiment. A user can terminate the experiment and its status will become Aborted.
  • Completed - The experiment ran and terminated successfully.
  • Failed - The experiment ran and terminated with an error.
  • Aborted - The experiment ran and was manually or programmatically terminated.
  • Published - The experiment is read-only to keep the inputs and outputs unchanged.
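
The states above can be read as a small state machine. The sketch below encodes only the transitions this section describes, plus one labeled assumption (publishing a Completed experiment). It is an illustrative model with hypothetical names, not the Trains API, and it leaves out any transition the text does not mention.

```python
from enum import Enum


class State(Enum):
    DRAFT = "Draft"
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    ABORTED = "Aborted"
    PUBLISHED = "Published"


TRANSITIONS = {
    State.DRAFT: {State.PENDING},                   # enqueue
    State.PENDING: {State.DRAFT, State.RUNNING},    # dequeue, or a worker fetches it
    State.RUNNING: {State.COMPLETED, State.FAILED, State.ABORTED},
    State.COMPLETED: {State.PUBLISHED},             # assumption: publish after completion
    State.FAILED: set(),
    State.ABORTED: set(),
    State.PUBLISHED: set(),
}


def is_editable(state):
    """Only Draft experiments are editable."""
    return state is State.DRAFT


def advance(current, target):
    """Return the target state, or raise if the model does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError("cannot move from %s to %s" % (current.value, target.value))
    return target


state = advance(State.DRAFT, State.PENDING)
state = advance(state, State.RUNNING)
state = advance(state, State.ABORTED)
print(state.value, is_editable(state))
```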

Architecture

Trains

The Trains Python Client Package is an SDK making the Trains backend service available to you in your Python experiment scripts. It is composed of classes and methods providing control over your experiments (Tasks) and all the parts of an experiment, such as artifacts (input models, output models, model snapshots, and other artifacts), logging (automagical and explicit reporting), hyperparameters, configurations, and class enumeration.

For detailed information about the Trains Python Client Package, see the reference pages in this documentation.

Trains Server

The Trains Server is the backend service infrastructure for Trains. It allows multiple users to collaborate and manage their experiments. Trains Server is composed of the following:

  • Web server, including the Trains Web-App, which is our user interface for tracking, comparing, and managing experiments.
  • API server, which is a RESTful API for:
    • Documenting and logging experiments, including information, statistics, and results.
    • Querying experiment history, logs, and results.
  • File server, which is a locally-hosted file server for storing images and models, making them easily accessible using the Trains Web-App.

The following diagram describes the Trains Server architecture.

For detailed information about deploying the Trains Server, see Deploying Trains Server.

Trains Web-App is the Trains user interface and is part of Trains Server. Use the Trains Web-App to:

  • track experiments
  • compare experiments
  • manage experiments

For detailed information about the Trains Web-App, see User Interface (Web-App).

Trains Agent

Trains Agent is a virtual environment and execution manager for DL/ML solutions on GPU machines. It integrates with the demo Trains Server or your own locally-hosted Trains Server, and provides experiment tracking, monitoring, and logging; configurable options, including GPU; and efficient Python package management and caching. Trains Agent runs a worker.

The following diagram describes the Trains Agent architecture.


                                                                           +-----------------+
                                                                           |  GPU  Machine   |
Development Machine                                                        |                 |
+------------------------+                                                 | +-------------+ |
|    Data Scientist's    |                         +--------------+        | |TRAINS Agent | |
|      DL/ML Code        |                         |    WEB UI    |        | |             | |
|                        |                         |              |        | | +---------+ | |
|                        |                         |              |        | | |  DL/ML  | | |
|                        |                         +--------------+        | | |  Code   | | |
|                        |    User Clones Exp #1  / . . . . . . . /        | | |         | | |
| +-------------------+  |        into Exp #2    / . . . . . . . /         | | +---------+ | |
| |      TRAINS       |  |      +---------------/-_____________-/          | |             | |
| +---------+---------+  |      |                                          | |      ^      | |
+-----------|------------+      |                                          | +------|------+ |
            |                   |                                          +--------|--------+
 Auto-Magically                 |                                                   |
 Creates Exp #1                 |                                      The TRAINS Agent
             \          User Change Hyper-Parameters                Pulls Exp #2, setup the
             |                  |                                   environment & clone code.
             |                  |                                   Start execution with the
+------------|------------+     |            +--------------------+        new set of 
|  +---------v---------+  |     |            |   TRAINS-SERVER    |    Hyper-Parameters.
|  | Experiment #1     |  |     |            |                    |                 |
|  +-------------------+  |     |            |  Execution Queue   |                 |
|            ||           |     |            |                    |                 |
|  +-------------------+<-------+            |                    |                 |
|  |                   |  |                  |                    |                 |
|  | Experiment #2     |  |                  |                    |                 |
|  +-------------------<---------\           |                    |                 |
|                         |       ------------->---------------+  |                 |
|                         | User Send Exp #2 | |Execute Exp #2 +--------------------+
|                         | For Execution    | +---------------+  |
|     TRAINS-SERVER       |                  |                    |
+-------------------------+                  +--------------------+