Concepts and Architecture
This page introduces concepts in Trains and the Trains architecture.
Tasks / Experiments
A Task in Trains is the internal representation of an experiment. More specifically, a Task is a code template defining the functionality and features which connect your Python experiment script to Trains. All the parts of an experiment connect to a Task, such as the models, hyperparameters, and logging. A Task code object instantiated from the Task class is, effectively, your experiment in Trains.
When running your Task code in a development environment for the first time, Trains initializes a main execution Task object on the Trains backend (Trains Server) and assigns it a new Task ID. Initialize a Task in code by calling the Task.init method.
If the same code runs again and the previous Task is not Published, Trains overwrites that Task's previous output data. If the previous Task is Published, Trains creates a new Task and assigns it a new Task ID. In this case, you see another Task in the Trains Web-App (UI), in the same project and with the same experiment name.
When running code, you can force Trains to create a new Task with a new Task ID on every run by calling the Task.init method and setting its reuse_last_task_id parameter to False.
When executing a Task remotely using Trains Agent, Trains does not create a new Task, even if reuse_last_task_id=False was specified in the code.
Trains supports multiple Task types for different workflows. When initializing a Task, set the Task type using the Task.init method, task_type parameter. The Task types you can set for task_type include:
- Task.TaskTypes.training (Default)
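For example, a minimal initialization might look like the following (project and task names are illustrative; this assumes the trains package is installed and a Trains Server, or the demo server, is reachable):

```python
from trains import Task

# Creates a Task on the Trains Server and returns it.
# reuse_last_task_id=False forces a new Task (and a new Task ID) on every run;
# task_type defaults to Task.TaskTypes.training, so passing it here is optional.
task = Task.init(
    project_name="my project",          # illustrative name
    task_name="my experiment",          # illustrative name
    task_type=Task.TaskTypes.training,
    reuse_last_task_id=False,
)
print(task.id)  # the Task ID assigned by the Trains backend
```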
Artifacts
In Trains, artifacts include input models, output models, and other artifacts you can store with your experiment.
Trains tracks models with experiments, but stores information about models separately. This information includes the location of the model, its configuration, class label enumeration, the experiment that created it, and tags. Once a model is stored, another experiment can connect to it. You can develop an experiment with one initial weights model on a dataset, and then run the experiment again with the same dataset but a different input model, and compare the results. You can also use the output model of one experiment as the input model of another. Publishing a model makes it read-only.
Trains automatically logs input and output models for frameworks including PyTorch, TensorFlow, Keras, XGBoost, and scikit-learn. When an experiment runs, Trains logs the initial model weights file and the output model, and provides a link to each.
Trains dynamically synchronizes registered artifacts with the backend, updating artifact changes in the experiment.
Uploaded artifacts are one-time, static uploads. These include the following:
- Pandas DataFrames (DataFrames can also be registered as dynamically synchronized artifacts, as described above)
- Files of any type, including image files
- Folders - stored as ZIP files
- Images - stored as PNG files
- Dictionaries - stored as JSONs
- Numpy arrays - stored as NPZ files
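The storage formats above can be sketched roughly with the standard library (this is an illustration of the formats, not the Trains implementation; Trains handles this serialization for you when you upload an artifact):

```python
import json
import zipfile
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Dictionaries are stored as JSON files.
    (tmp / "config.json").write_text(json.dumps({"lr": 0.01, "epochs": 10}))

    # Folders are stored as ZIP files: archive every file under the folder.
    folder = tmp / "dataset"
    folder.mkdir()
    (folder / "sample.txt").write_text("hello")
    with zipfile.ZipFile(tmp / "dataset.zip", "w") as zf:
        for path in folder.rglob("*"):
            zf.write(path, path.relative_to(folder))

    stored = sorted(p.name for p in tmp.iterdir() if p.is_file())
    print(stored)  # ['config.json', 'dataset.zip']
```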
You can extend Trains artifact capabilities in many ways, such as manually specifying models, checkpointing models, connecting configurations, and setting class label enumeration for experiments.
Logging
Trains automatically logs all the following:
- Git repository, branch, commit id, entry point and local git diff.
- Python environment, including specific packages and versions.
- stdout and stderr.
- Resource Monitoring (CPU/GPU utilization, temperature, IO, network, and more).
- ArgParser for command line parameters with currently used values.
- TensorFlow Defines (absl-py).
- Initial model weights file.
- Model snapshots, with optional automatic upload to central storage. Storage options include shared folders, S3, GS, Azure, and Http.
- Artifacts log & store, including shared folders, S3, GS, Azure, and Http.
- TensorBoard/TensorBoardX scalars, metrics, histograms, media (images, audio, and video).
- Matplotlib & Seaborn.
In addition, Trains supports explicit reporting. Explicitly report the following by creating a Logger object for a Task, and then calling a Logger method (see the Explicit Reporting tutorial):
- Logging console messages, scalars and plots in several formats, tables, and media including images, audio, and video.
- Tracking hyperparameters using parameter dictionaries.
- Tracking environment variables.
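A minimal explicit-reporting sketch might look like the following (project and task names are illustrative; this assumes the trains package is installed and a Trains Server, or the demo server, is reachable):

```python
from trains import Task

# Illustrative project and task names.
task = Task.init(project_name="examples", task_name="explicit reporting")
logger = task.get_logger()

# Report one point of a scalar series at a given iteration.
logger.report_scalar(title="loss", series="train", value=0.26, iteration=100)

# Report a console message.
logger.report_text("epoch finished")
```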
Workers and queues
In addition to running your code locally, you can execute Tasks on remote computers, in the cloud, or on any other local machine (including the development machine) using workers. Use trains-agent to create workers.
Workers fetch Tasks from queues which reside on trains-server.
A worker daemon (trains-agent daemon) fetches a Task from the queue(s) it listens to, and then does the following:
- Builds a cached Python virtual environment.
- Clones the experiment source code into that Python virtual environment.
- Installs the required Python package versions, if not previously installed and cached (the worker reuses cached Python packages whenever possible).
- Executes the experiment on a GPU machine, with logging and monitoring. The results are available in the demo Trains Server (https://demoapp.trains.allegro.ai/dashboard) through its user interface (Web-App), or in your own self-hosted Trains Server and Trains Web-App, if you deploy one.
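The daemon that drives these steps is launched from the command line; for example (these commands assume trains-agent is installed and configured, e.g. via trains-agent init):

```shell
# Install the agent (once per machine).
pip install trains-agent

# Launch a worker daemon that listens to the "default" queue.
trains-agent daemon --queue default

# Optionally pin the worker to a specific GPU.
trains-agent daemon --queue default --gpus 0
```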
Experiment states
In Trains, an experiment's state (status) describes its stage in your workflow and indicates whether the experiment is read-only or editable for tuning.
The following are the states (statuses) of experiments:
- Draft - The experiment is editable. Only Draft experiments are editable. The experiment is not running (a worker is not running it) and can be enqueued for a worker daemon to fetch and execute it.
- Pending - The experiment is in a queue waiting for a worker to fetch and execute it. The experiment can be dequeued, removing it from the queue.
- Running - A worker is running the experiment. A user can terminate the experiment and its status will become Aborted.
- Completed - The experiment ran and terminated successfully.
- Failed - The experiment ran and terminated with an error.
- Aborted - The experiment ran and was manually or programmatically terminated.
- Published - The experiment is read-only. Publishing an experiment sets its state (status) to Published. Publish an experiment to prevent changes to its inputs and outputs. Later, you can clone a Published experiment and make changes to the newly cloned experiment.
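The transitions above can be summarized as a small state machine. This is an illustrative sketch, not the Trains API; the names mirror the statuses listed above, and the assumption that publishing happens from the terminal states (Completed, Failed, Aborted) is ours, since the text does not enumerate exactly which states can be published:

```python
# Allowed transitions between experiment states, per the list above.
TRANSITIONS = {
    "Draft": {"Pending"},                           # enqueued for a worker
    "Pending": {"Draft", "Running"},                # dequeued, or fetched by a worker
    "Running": {"Completed", "Failed", "Aborted"},  # success, error, or termination
    "Completed": {"Published"},                     # assumption: publish from terminal states
    "Failed": {"Published"},
    "Aborted": {"Published"},
    "Published": set(),                             # read-only, terminal
}

EDITABLE = {"Draft"}  # only Draft experiments are editable

def can_transition(src: str, dst: str) -> bool:
    """Return True if an experiment may move from state src to state dst."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition("Draft", "Pending"))    # True
print(can_transition("Published", "Draft"))  # False
```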
Trains Python Client
The Trains Python Client package is an SDK making the Trains backend service available to you in your Python experiment scripts. It is composed of classes and methods providing control over your experiments (Tasks) and all the parts of an experiment, such as artifacts (input models, output models, model snapshots, and other artifacts), logging (automagical and explicit reporting), hyperparameters, configurations, and class enumeration.
For detailed information about the Trains Python Client package, see the Trains Configuration Reference section.
Trains Server
The Trains Server is the backend service infrastructure for Trains. It allows multiple users to collaborate and manage their experiments. Trains Server is composed of the following:
- Web server including the Trains Web-App which is our user interface for tracking, comparing, and managing experiments.
- API server providing a RESTful API for:
- Documenting and logging experiments, including information, statistics and results.
- Querying experiments history, logs and results.
- File server which is a self-hosted file server for storing media and models making them easily accessible using the Trains Web-App.
The following diagram describes the Trains Server architecture.
For detailed information about deploying the Trains Server, see Deploying Trains Server.
Trains Web-App
Trains Web-App is the Trains user interface and is part of Trains Server. Use the Trains Web-App to:
- track experiments
- compare experiments
- manage experiments
For detailed information about the Trains Web-App, see User Interface.
Trains Agent services container
As of Trains Server version 0.15, the dockerized deployment includes a Trains Agent services container which runs as part of the Docker container collection. The Trains Agent services container can work in conjunction with Trains Agent services mode (see services-mode on the "Trains Agent Reference" page, and Launching Trains Agent in services mode on the "Trains Agent Use Case Examples" examples page).
Trains Agent services mode will spin up any Task enqueued into the dedicated services queue. Each Task is launched in its own container and registered as a new node in the system, providing tracking and transparency capabilities. This makes it possible to launch long-lasting jobs which previously had to be executed on local or dedicated machines, and allows a single agent to launch multiple Docker containers (Tasks) for different use cases. For example, use Trains Agent services mode for an auto-scaler service (spinning up instances as the need arises, and the budget permits), a controller (implementing pipelines and more sophisticated DevOps logic), an optimizer (such as hyperparameter optimization or sweeping), or an application (such as interactive Bokeh apps for increased data transparency).
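Launching an agent in services mode against the dedicated services queue might look like the following (this assumes trains-agent is installed and configured; see the "Trains Agent Reference" page for the full option list):

```shell
# Launch the agent in services mode, listening to the "services" queue.
# Each fetched Task is spun up in its own Docker container.
trains-agent daemon --services-mode --queue services --docker
```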
Training and inference Tasks
Do not enqueue training or inference Tasks into the services queue. They will put an unnecessary load on the server.
Trains Agent
Trains Agent is a virtual environment and execution manager for DL / ML solutions on GPU machines. It integrates with the demo Trains Server or a self-hosted Trains Server, providing experiment tracking, monitoring, and logging; configurable options, including GPU allocation; and efficient management and caching of Python packages. Trains Agent runs a worker.
The following diagram describes the Trains Agent architecture.
(Diagram: A data scientist's DL/ML code, running on a development machine, auto-magically creates Experiment #1 on the TRAINS-SERVER. In the Web UI, the user clones Experiment #1 into Experiment #2, changes its hyperparameters, and sends Experiment #2 to the execution queue. The Trains Agent, running on a GPU machine, pulls Experiment #2 from the queue, sets up the environment, clones the code, and starts execution with the new set of hyperparameters.)