Concepts and Architecture

This page introduces concepts in Trains and the Trains architecture.

Tasks

The Task class is the multipurpose class of the Trains Python Client package. It supports experimentation and various workflows. In experimentation, a Task object connects your experiment code to Trains Server, where Trains stores it. All the parts of an experiment connect to a Task, including models, hyperparameters, and logging. A Task is, effectively, an experiment in Trains. Once it is stored in Trains Server, you can rerun the Task (experiment), reproduce it, and tune it.

Task types

Trains supports multiple Task types for different workflows. When initializing a Task, set the Task type using the task_type parameter of the Task.init method. The Task types you can set for task_type include:

  • Task.TaskTypes.training (Default)
  • Task.TaskTypes.testing
  • Task.TaskTypes.application
  • Task.TaskTypes.controller
  • Task.TaskTypes.data_processing
  • Task.TaskTypes.inference
  • Task.TaskTypes.monitor
  • Task.TaskTypes.optimizer
  • Task.TaskTypes.qc
  • Task.TaskTypes.service
  • Task.TaskTypes.custom
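
As a minimal sketch of setting the type at initialization, the helper below maps a type name to the matching Task.TaskTypes member and passes it to Task.init. The helper name init_typed_task and the project and task names in the usage comment are illustrative assumptions, not part of Trains. Actually creating a Task requires the trains package and a reachable Trains Server.

```python
# Task type names listed above; each is an attribute of Task.TaskTypes.
TASK_TYPE_NAMES = (
    "training", "testing", "application", "controller", "data_processing",
    "inference", "monitor", "optimizer", "qc", "service", "custom",
)


def init_typed_task(project_name, task_name, type_name="training"):
    """Hypothetical helper: initialize a Task with the given Task type."""
    if type_name not in TASK_TYPE_NAMES:
        raise ValueError("unknown Task type: %s" % type_name)
    # Imported here so the name check above works without trains installed.
    from trains import Task

    task_type = getattr(Task.TaskTypes, type_name)
    return Task.init(
        project_name=project_name,
        task_name=task_name,
        task_type=task_type,
    )

# Usage (placeholder names):
#   task = init_typed_task("examples", "data preparation", "data_processing")
```

Omitting the type gives Task.TaskTypes.training, the default noted in the list above.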

Logging

Trains supports automatic logging (enabled when you initialize a Task object) and explicit reporting (calls to the methods of the Logger class in the Trains Python Client package).

See the explicit reporting examples and the explicit reporting tutorial.

Automatic logging

Trains automatically logs all the following:

  • Git repository, branch, commit id, entry point and local git diff.
  • Python environment, including specific packages and versions.
  • stdout and stderr.
  • Resource Monitoring (CPU/GPU utilization, temperature, IO, network, and more).
  • Hyperparameters:

    • argparse command line parameters, with their currently used values.
    • TensorFlow Defines (absl-py).
  • Initial model weights file.

  • Model snapshots, with optional automatic upload to central storage. Storage options include shared folders, S3, GS, Azure, and HTTP.
  • Artifacts, logged and stored in any of the supported storage options, including shared folders, S3, GS, Azure, and HTTP.
  • TensorBoard/TensorBoardX scalars, metrics, histograms, media (images, audio, and video).
  • Matplotlib, Plotly, and Seaborn.
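
The sketch below illustrates how little code automatic logging needs: the script defines ordinary argparse parameters and calls Task.init, with no reporting calls of its own. The script layout, parameter names, and project and task names are assumptions made for illustration. Calling main requires the trains package and a reachable Trains Server.

```python
import argparse


def build_parser():
    """Command-line hyperparameters for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="training script")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, default=10, help="number of epochs")
    return parser


def main(argv=None):
    # Requires the trains package and a reachable Trains Server.
    from trains import Task

    # Initializing the Task is what switches automatic logging on.
    task = Task.init(project_name="examples", task_name="automatic logging")
    args = build_parser().parse_args(argv)
    # No reporting calls are made here; this line only goes to stdout.
    print("training with lr=%s for %d epochs" % (args.lr, args.epochs))
    return task, args
```

With this layout, the command line values and the printed line are the kind of items covered by the automatic logging list above.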

Explicit reporting

In addition, Trains supports explicit reporting, which includes the following:

  • Logging console messages, scalars and plots in several formats, tables, and media including images, audio, and video.
  • Tracking hyperparameters using parameter dictionaries.
  • Tracking environment variables.
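
A minimal sketch of explicit reporting follows. It assumes task is the object returned by Task.init, and uses the Task.connect, Task.get_logger, Logger.report_text, and Logger.report_scalar calls. The function name report_progress, the graph titles, and the parameter names are illustrative assumptions.

```python
def report_progress(task, params, losses):
    """Explicitly report hyperparameters, console text, and a scalar series.

    `task` is expected to be the object returned by Task.init.
    """
    # Track a parameter dictionary as the Task's hyperparameters.
    params = task.connect(params)

    logger = task.get_logger()
    logger.report_text("starting run with batch_size=%s" % params["batch_size"])

    # One scalar point per iteration, in series "train" of the graph "loss".
    for iteration, loss in enumerate(losses):
        logger.report_scalar(
            title="loss", series="train", value=loss, iteration=iteration
        )
    return params

# Usage, assuming trains is installed and a Trains Server is reachable:
#   from trains import Task
#   task = Task.init(project_name="examples", task_name="explicit reporting")
#   report_progress(task, {"batch_size": 32}, [0.9, 0.5, 0.3])
```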

Artifacts

Trains tracks models (input models and output models), and other objects in several formats as experiment artifacts. Other artifact objects can be uploaded and dynamically tracked, or uploaded without tracking.

Tracked model details include network configurations, class label enumeration, and tags associated with models. Once Trains stores a model in Trains Server, you can reuse it in any experiment. For example, reproduce an experiment, run an experiment with the same model and a different dataset, or run one experiment with the model from another experiment.

Additionally, Trains supports model checkpoints (snapshots), which you can use to save interim models. For example, save the best model, and continue experimentation with that best model in the same or a different experiment.

See the Keras, PyTorch, and TensorFlow model upload examples.
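
For artifacts other than models, a minimal sketch of uploading without dynamic tracking might look like the following. It assumes task is the object returned by Task.init and uses Task.upload_artifact. The function name and artifact names are illustrative assumptions, and dynamically tracked artifacts are not shown.

```python
def upload_run_artifacts(task, metrics, report_path):
    """Upload a Python dictionary and a local file as experiment artifacts."""
    # A dictionary is uploaded once, as is (uploaded without tracking).
    task.upload_artifact(name="metrics", artifact_object=metrics)
    # A string path to an existing local file uploads that file.
    task.upload_artifact(name="report", artifact_object=report_path)

# Usage, assuming trains is installed and a Trains Server is reachable:
#   from trains import Task
#   task = Task.init(project_name="examples", task_name="artifacts")
#   upload_run_artifacts(task, {"accuracy": 0.93}, "report.csv")
```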

Debug samples

Trains automatically logs debug samples, allowing you to track and analyze your development process. Debug samples include images, audio, and video. You can also report debug samples explicitly.

See the image, media (images, audio, and video), and text reporting examples.

Storage

Trains automatically logs the storage of models and debug samples locally. You can also configure Trains for storing models and debug samples, as well as other objects as artifacts, in any of the supported types of storage, which include local and shared folders, S3 buckets, Google Cloud Storage, and Azure Storage.

See the artifacts example, demonstrating storage for artifacts and models, and the media reporting example, demonstrating storage for debug samples.
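
One way to select a storage destination per experiment is the output_uri parameter of Task.init, sketched below. The function name, the optional init argument (present only so the sketch can be exercised without a server), and the destinations in the usage comment are assumptions. Credentials for cloud storage are assumed to be configured already.

```python
def init_task_with_storage(output_uri, init=None):
    """Initialize a Task whose outputs are uploaded to output_uri."""
    if init is None:
        # Requires the trains package and a reachable Trains Server.
        from trains import Task
        init = Task.init
    return init(
        project_name="examples",
        task_name="storage demo",
        output_uri=output_uri,
    )

# Usage (placeholder destinations):
#   init_task_with_storage("/mnt/shared/trains-models")
#   init_task_with_storage("s3://my-bucket/trains-models")
```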

Workers and queues

In addition to running your code locally, you can execute Tasks on remote computers, in the cloud, or on any local machine (the development machine or any other local machine) using workers. Use trains-agent to create workers.

Workers fetch Tasks from queues, which reside on Trains Server.

A worker daemon (trains-agent daemon) fetches a Task from the queue(s) it listens to, and the worker does the following:

  1. Builds a cached Python virtual environment.
  2. Clones the experiment source code into that Python virtual environment.
  3. Installs the required Python package versions, if not previously installed and cached (workers reuse cached Python packages).
  4. Executes the experiment on a GPU machine, with logging and monitoring. The results are available in the Web-App (UI) of the demo Trains Server (https://demoapp.trains.allegro.ai/dashboard), or of your own locally hosted Trains Server, if you deploy one.
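
Before a worker can carry out the steps above, a Task has to be placed in a queue it listens to. The following sketch clones an existing Task and enqueues the copy using Task.get_task, Task.clone, and Task.enqueue. The function name, the queue name "default", and the naming of the clone are assumptions; the Task class is passed in as task_cls so the sketch does not import trains itself.

```python
def clone_and_enqueue(task_cls, source_task_id, queue_name="default"):
    """Clone an existing Task and place the copy in a queue for a worker.

    Pass the trains Task class as task_cls.
    """
    source = task_cls.get_task(task_id=source_task_id)
    cloned = task_cls.clone(source_task=source, name="clone of " + source.name)
    task_cls.enqueue(task=cloned, queue_name=queue_name)
    return cloned

# Usage, assuming trains is installed and a Trains Server is reachable:
#   from trains import Task
#   clone_and_enqueue(Task, "<task id>", queue_name="default")
```

A worker started with trains-agent daemon --queue default would then fetch and execute the cloned Task.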

Experiment states

In Trains, an experiment state describes the experiment's status as it moves through your workflow, and indicates whether the experiment is read-only or editable for tuning.

The following are the states (statuses) of experiments:

  • Draft - The experiment is editable. Only experiments whose status is Draft are editable. The experiment is not running (a worker is not running it) and can be enqueued for a worker daemon to fetch and execute it.
  • Pending - The experiment is in a queue waiting for a worker to fetch and execute it. The experiment can be dequeued, removing it from the queue.
  • Running - A worker is running the experiment. A user can terminate the experiment and its status will become Aborted.
  • Completed - The experiment ran and terminated successfully.
  • Failed - The experiment ran and terminated with an error.
  • Aborted - The experiment ran and was manually or programmatically terminated.
  • Published - The experiment is read-only. Publish an experiment to prevent changes to its inputs and outputs. Later, you can clone a Published experiment, and make changes to the newly cloned experiment.
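
The states above can be read as a small state machine. The sketch below is a simplified model inferred from the descriptions, not the server's actual rules. In particular, the return to Draft on dequeue and publishing from Completed are assumptions, and experiments run locally do not pass through a queue.

```python
# Transitions inferred from the state descriptions above (an assumption;
# Trains Server may allow others, and locally run experiments skip the queue).
TRANSITIONS = {
    "Draft": {"Pending"},                # enqueued for a worker
    "Pending": {"Draft", "Running"},     # dequeued (assumed), or fetched
    "Running": {"Completed", "Failed", "Aborted"},
    "Completed": {"Published"},          # assumed publishing path
    "Failed": set(),
    "Aborted": set(),
    "Published": set(),                  # read-only; clone to make changes
}


def is_editable(status):
    """Only experiments whose status is Draft are editable."""
    return status == "Draft"


def can_transition(current, new):
    """Return True if this simplified model allows current -> new."""
    return new in TRANSITIONS.get(current, set())
```

For example, can_transition("Running", "Aborted") holds in this model, while a Published experiment has no outgoing transitions and must be cloned to be changed.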

Architecture

Trains

The Trains Python Client package is an SDK that makes the Trains backend service available to you in your Python experiment scripts. It is composed of classes and methods providing control over your experiments (Tasks) and all the parts of an experiment, such as artifacts (input models, output models, model snapshots, and other artifacts), logging (automatic logging and explicit reporting), hyperparameters, configurations, and class enumeration.

For detailed information about the Trains Python Client package, see the Trains Python Client Reference section.

Trains Server

The Trains Server is the backend service infrastructure for Trains. It allows multiple users to collaborate and manage their experiments. Trains Server is composed of the following:

  • Web server, which includes the Trains Web-App (UI), the user interface for tracking, comparing, and managing experiments.
  • API server, which is a RESTful API for:

    • Documenting and logging experiments, including information, statistics, and results.
    • Querying experiment history, logs, and results.
  • File server, which is a self-hosted file server for storing media and models, making them easily accessible using the Trains Web-App (UI).

For detailed information about deploying the Trains Server, see Deploying Trains Server.

Trains Web-App is the Trains user interface and is part of Trains Server.

Use the Trains Web-App (UI) to:

  • track experiments
  • compare experiments
  • manage experiments

For detailed information about the Trains Web-App (UI), see User Interface.

Trains Agent services container

As of Trains Server version 0.15, the dockerized deployment includes a Trains Agent services container, which runs as part of the Docker container collection. The Trains Agent services container can work in conjunction with Trains Agent services mode (see services-mode on the "Trains Agent Reference" page, and Launching Trains Agent in services mode on the "Trains Agent Use Case Examples" page).

Trains Agent services mode spins up any Task enqueued into the dedicated services queue. Each Task is launched in its own container and registered as a new node in the system, providing tracking and transparency capabilities. This makes it possible to launch long-lasting jobs which previously had to be executed on local or dedicated machines. It also allows a single agent to launch multiple Docker containers (Tasks) for different use cases. For example, use Trains Agent Services for an auto-scaler service (spinning up instances when the need arises, and the budget permits), a controller (implementing pipelines and more sophisticated DevOps logic), an optimizer (such as hyperparameter optimization or sweeping), and an application (such as interactive Bokeh apps for increased data transparency).

Training and inference Tasks

Do not enqueue training or inference Tasks into the services queue. They will put an unnecessary load on the server.
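
For setups that start a services-mode agent by hand rather than relying on the bundled container, the command line can be assembled as in the sketch below. The flags reflect the trains-agent daemon options as commonly documented and should be checked against the Trains Agent Reference for your installed version. The function names, the queue name "services", and the Docker image are assumptions.

```python
import subprocess


def services_agent_command(queue="services", docker_image="ubuntu:18.04"):
    """Build the argument list for a Trains Agent daemon in services mode."""
    return [
        "trains-agent", "daemon",
        "--services-mode",
        "--queue", queue,
        "--docker", docker_image,
        "--cpu-only",
    ]


def launch_services_agent(**kwargs):
    # Requires trains-agent to be installed and configured on this machine.
    return subprocess.Popen(services_agent_command(**kwargs))
```

The --cpu-only flag fits the guidance above: the services queue is meant for lightweight, long-lasting Tasks, not for training or inference.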

Trains Agent

Trains Agent is a virtual environment and execution manager for DL / ML solutions on GPU machines. It integrates with the demo Trains Server or a self-hosted Trains Server, and provides experiment tracking, monitoring, and logging; configurable options, including GPU; and efficient Python package management and caching. Trains Agent runs a worker.

The following diagram describes Trains Agent interaction with Trains Server for experiment management and ML-Ops.

[image: Trains Agent interaction with Trains Server]

For detailed information about installing and configuring Trains Agent, see Installing and Configuring Trains Agent in the "Deploying Trains" section, and Trains Agent Reference.