Trains Configuration Reference
Trains is now ClearML
This documentation applies to the legacy Trains versions. For the latest documentation, see ClearML.
This reference page provides detailed information about the configurable options for Trains and Trains Agent. Trains and Trains Agent use the same configuration file, trains.conf. Two of the three sections are used by both, and a single configuration file avoids duplication.
The three sections of the configuration file are as follows:
- agent - Contains Trains Agent configuration options.
- api - Contains Trains and Trains Agent configuration options for Trains Server.
- sdk - Contains Trains and Trains Agent configuration options for Trains Python Package and Trains Server.
The following are example configuration files from the trains repository and the trains-agent repository:
- trains repository configuration file example - This trains.conf example does not contain an agent section, because it is for Trains, which can run without Trains Agent.
- trains-agent repository configuration file example - This trains.conf example does contain an agent section, because it is for Trains Agent.
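For orientation, a minimal trains.conf skeleton showing the three top-level sections might look like the following sketch (the comments are illustrative, not actual option values):
api {
    # Trains Server URLs and credentials (used by Trains and Trains Agent)
}
sdk {
    # Trains Python Package and Trains Server options (used by Trains and Trains Agent)
}
agent {
    # Trains Agent options (only needed when running Trains Agent)
}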
Why is the same configuration file used for Trains and Trains Agent?
Trains and Trains Agent both use the api and sdk sections, as well as options for Trains credentials. A single configuration file avoids duplication and makes it easier for you to find the options you require.
Editing your configuration file
To add, change, or delete options, edit your configuration file.
To edit your Trains configuration file:
- Open your configuration file for editing, depending upon your operating system:
  - Linux - ~/trains.conf
  - Mac - $HOME/trains.conf
  - Windows - \Users\<username>\trains.conf
- In the required section (sections listed on this page), add, modify, or remove your required options.
- Save your configuration file.
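For example, to pin the CUDA version that Trains Agent uses, you might add the following to the agent section (the version shown is only illustrative):
agent {
    cuda_version: 10.1
}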
agent
The agent section contains options to configure Trains Agent for Git credentials, package managers, cache management, workers, and Docker for workers.
agent (dict) - Dictionary of top-level Trains Agent options.
agent.cuda_version (float) - The CUDA version to use.
- If specified, this is the CUDA version used.
- If not specified, the CUDA version is automatically detected.
Alternatively, override this option with the environment variable CUDA_VERSION.
agent.cudnn_version (float) - The cuDNN version to use.
- If specified, this is the cuDNN version used.
- If not specified, the cuDNN version is automatically detected.
Alternatively, override this option with the environment variable CUDNN_VERSION.
agent.docker_apt_cache (string) - The apt (Linux package tool) cache folder for mapping Ubuntu package caching into Docker.
agent.docker_force_pull (bool) - Always update the Docker image by forcing a Docker pull before running an experiment.
The values are:
- true - Always update the Docker image.
- false - Do not always update.
agent.docker_init_bash_script (string) - Specify an initial bash script to execute at the startup of any Docker container. All lines are executed regardless of their exit code.
For example:
docker_init_bash_script = [
"echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean",
"chown -R root /root/.cache/pip",
"apt-get update",
"apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0",
"(which {python_single_digit} && {python_single_digit} -m pip --version) || apt-get install -y {python_single_digit}-pip",
]
where {python_single_digit} translates to python3 or python2, based upon the requested Python version.
agent.docker_pip_cache (string) - The pip (Python package tool) cache folder for mapping Python package caching into Docker.
agent.extra_docker_arguments ([string]) - Optional arguments to pass to the Docker image. These are local for this agent, and will not be updated in the experiment's docker_cmd section. For example, ["--ipc=host", ].
agent.extra_docker_shell_script ([string]) - An optional shell script to run in the Docker container when it starts, before the experiment starts. For example, ["apt-get install -y bindfs", ].
agent.force_git_ssh_protocol (bool) - Force the Git protocol to use SSH regardless of the Git URL. This assumes the Git user/pass are blank.
The values are:
- true - Force
- false - Do not force
agent.force_git_ssh_port (int) - Force a specific SSH port when converting HTTP to SSH links. The domain remains unchanged.
agent.git_host (string) - Limit Git credentials usage to this host. The environment variable TRAINS_AGENT_GIT_HOST overrides this configuration option.
agent.git_pass (string) - Git repository password.
- If using Git SSH credentials, do not specify this option.
- If not using Git SSH credentials, use this option to specify a Git password for cloning your repositories.
agent.git_user (string) - Git repository username.
- If using Git SSH credentials, do not specify this option.
- If not using Git SSH credentials, use this option to specify a Git username for cloning your repositories.
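For example, HTTPS-based cloning credentials might be configured in the agent section like the following sketch (the username, password, and host are placeholders):
git_user: "myuser"
git_pass: "mypassword"
git_host: "github.com"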
agent.python_binary (string) - Set the Python version to use when creating the virtual environment, and when launching the experiment. For example, /usr/bin/python3 or /usr/local/bin/python3.6.
agent.reload_config (bool) - Indicates whether to reload the configuration each time the worker daemon is executed.
The values are:
- true
- false
agent.translate_ssh (bool)
agent.venvs_dir (string) - The target folder for virtual environment builds that are created when executing an experiment.
agent.worker_id (string) - When creating a worker, assign the worker a name.
- If specified, a unique name for the worker. For example, trains-agent-machine1:gpu0.
- If not specified, the following is used: <hostname>:<process_id>. For example, MyHost:12345.
Alternatively, specify the environment variable TRAINS_WORKER_ID to override this worker name.
agent.worker_name (string) - Use to replace the hostname when creating a worker, if agent.worker_id is not specified. For example, if worker_name is MyMachine and the process_id is 12345, then the worker is named MyMachine.12345.
Alternatively, specify the environment variable TRAINS_WORKER_NAME to override this worker name.
agent.default_docker - Dictionary containing the default options for workers in Docker mode.
agent.default_docker.arguments (string) - If running a worker in Docker mode, this option specifies the options to pass to the Docker container.
agent.default_docker.image (string) - If running a worker in Docker mode, this option specifies the default Docker image to use.
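For example, a default_docker block might look like the following sketch (the image and arguments are illustrative, not recommended values):
default_docker {
    image: "nvidia/cuda:10.1-runtime-ubuntu18.04"
    arguments: "--ipc=host"
}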
agent.package_manager
agent.package_manager (dict) - Dictionary containing the options for the Python package manager. The currently supported package managers are pip, conda, and, if the repository contains a poetry.lock file, poetry.
agent.package_manager.conda_channels ([string]) - If conda is used, then this is the list of conda channels to use when installing Python packages.
agent.package_manager.extra_index_url ([string]) - A list of URLs for additional artifact repositories when installing Python packages.
agent.package_manager.force_upgrade (bool) - Indicates whether to force an upgrade of Python packages.
The values are:
- true - Force
- false - Do not force
agent.package_manager.pip_version (string) - The pip version to use. For example, "<20", "==19.3.1", or "" (an empty string installs the latest version).
agent.package_manager.post_optional_packages (string) - A list of optional packages that will be installed after the required packages. If the installation of an optional post package fails, the package is ignored and the virtual environment process continues.
agent.package_manager.post_packages ([string]) - A list of packages that will be installed after the required packages.
agent.package_manager.system_site_packages (bool) - Indicates whether Python packages for virtual environments are inherited from the system when building a virtual environment for an experiment.
The values are:
- true - Inherit
- false - Do not inherit (load Python packages)
agent.package_manager.torch_nightly (bool) - Indicates whether to support installing PyTorch Nightly builds.
The values are:
- true - If a stable torch wheel is not found, install the nightly build.
- false - Do not install.
PyTorch Nightly Builds
torch nightly builds are ephemeral and are deleted from time to time.
agent.package_manager.type (string) - Indicates the type of Python package manager to use.
The values are:
- pip - Use pip as the package manager or, if a poetry.lock file exists in the repository, use poetry as the package manager.
- conda - Use conda as the package manager.
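For example, a package_manager block selecting pip might look like the following sketch (the pinned version, index URL, and extra packages are placeholders):
package_manager {
    type: pip
    pip_version: "<20.2"
    extra_index_url: ["https://my.artifactory.example.com/simple"]
    post_packages: ["horovod"]
    system_site_packages: false
}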
agent.pip_download_cache
agent.pip_download_cache (dict) - Dictionary containing pip download cache options.
agent.pip_download_cache.enabled (bool) - Indicates whether to use a specific cache folder for Python package downloads.
The values are:
- true - Use a specific folder, which is specified in the option agent.pip_download_cache.path.
- false - Do not use a specific folder.
agent.pip_download_cache.path (string) - If agent.pip_download_cache.enabled is true, then this specifies the cache folder.
agent.vcs_cache
agent.vcs_cache (dict) - Dictionary containing version control system clone cache folder options.
agent.vcs_cache.enabled (bool) - Indicates whether the version control system cache is used.
The values are:
- true - Use cache
- false - Do not use cache
agent.vcs_cache.path (string) - The version control system cache clone folder used when executing experiments.
agent.venv_update
agent.venv_update (dict) - Dictionary containing virtual environment update options.
agent.venv_update.enabled (bool) - Indicates whether to use accelerated Python virtual environment building (this is a beta feature).
The values are:
- true - Accelerate
- false - Do not accelerate (default value)
api
The api section contains configuration options for the Trains Server API, web, and file servers, and for credentials.
api.api_server (string) - The URL of your Trains API server. For example, https://api.MyDomain.com.
api.web_server (string) - The URL of your Trains web server. For example, https://app.MyDomain.com.
api.files_server (string) - The URL of your Trains file server. For example, https://files.MyDomain.com.
You must use a secure protocol
For api.api_server, api.web_server, and api.files_server, you must use a secure protocol, "https". Do not use "http".
api.credentials
api.credentials (dict) - Dictionary of API credentials.
api.credentials.access_key (string) - Your Trains access key.
api.credentials.secret_key (string) - Your Trains secret key.
api.verify_certificate (bool) - Indicates whether to verify the host SSL certificate.
The values are:
- True - Verify
- False - Do not verify
Set to False only if required.
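Putting these options together, an api section might look like the following sketch (the URLs and keys are placeholders):
api {
    api_server: https://api.MyDomain.com
    web_server: https://app.MyDomain.com
    files_server: https://files.MyDomain.com
    credentials {
        access_key: "YOUR_ACCESS_KEY"
        secret_key: "YOUR_SECRET_KEY"
    }
    verify_certificate: true
}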
sdk
The sdk section contains configuration options for the Trains Python Package and related options, including storage, metrics, network, AWS S3 buckets and credentials, Google Cloud Storage, Azure Storage, log, and development.
sdk.aws
sdk.aws.boto3
sdk.aws.boto3 (dict) - Dictionary of AWS Storage, Boto3 options.
sdk.aws.boto3.pool_connections (integer) - For AWS Boto3, the maximum number of Boto3 pool connections.
sdk.aws.boto3.max_multipart_concurrency (integer) - For AWS Boto3, the maximum number of threads making requests for a transfer.
sdk.aws.s3
sdk.aws.s3 (dict) - Dictionary of AWS Storage, AWS S3 options.
sdk.aws.s3.key (string) - For AWS S3, the default access key for any bucket that is not specified in the sdk.aws.s3.credentials section.
sdk.aws.s3.region (string) - For AWS S3, the default region name for any bucket that is not specified in the sdk.aws.s3.credentials section.
sdk.aws.s3.secret (string) - For AWS S3, the default secret access key for any bucket that is not specified in the sdk.aws.s3.credentials section.
sdk.aws.s3.credentials ([dict]) - List of dictionaries. For AWS S3, each dictionary can contain the credentials for an individual S3 bucket, or for an individual host serving multiple buckets.
sdk.aws.s3.credentials.bucket (string) - For AWS S3, if specifying credentials for individual buckets, then this is the bucket name for an individual bucket.
See the AWS documentation for restrictions and limitations on bucket naming.
sdk.aws.s3.credentials.host (string) - For AWS S3, if specifying credentials for individual buckets by host, then this option is the host URL and, optionally, the port number.
sdk.aws.s3.credentials.key (string) - For AWS S3:
- If specifying credentials for an individual bucket, then this is the access key for the bucket.
- If specifying credentials for individual buckets by host, then this is the access key for all buckets on the host.
sdk.aws.s3.credentials.multipart (bool) - For AWS S3, if specifying credentials for individual buckets by host, then this indicates whether to allow multipart upload of a single object (uploading an object as a set of parts).
The values are:
- true - Enabled
- false - Disabled
sdk.aws.s3.credentials.secret (string) - For AWS S3:
- If specifying credentials for a specific bucket, then this is the secret key for the bucket.
- If specifying credentials for individual buckets by host, then this is the secret key for all buckets on the host.
sdk.aws.s3.credentials.secure (string) - For AWS S3, if specifying credentials for individual buckets by host, then this indicates whether the host is secure.
The values are:
- true - Secure
- false - Not secure
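For example, default S3 credentials plus per-bucket and per-host credentials might look like the following sketch (the bucket names, host, region, and keys are placeholders):
aws {
    s3 {
        key: "DEFAULT_ACCESS_KEY"
        secret: "DEFAULT_SECRET_KEY"
        region: "us-east-1"
        credentials: [
            {
                bucket: "my-trains-bucket"
                key: "BUCKET_ACCESS_KEY"
                secret: "BUCKET_SECRET_KEY"
            },
            {
                host: "my-minio-host:9000"
                key: "HOST_ACCESS_KEY"
                secret: "HOST_SECRET_KEY"
                multipart: false
                secure: false
            }
        ]
    }
}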
sdk.azure
sdk.azure.storage.containers ([dict]) - List of dictionaries; each dictionary contains credentials for an Azure Storage container.
sdk.azure.storage.containers.account_key (string) - For Azure Storage, this is the credentials key.
sdk.azure.storage.containers.account_name (string) - For Azure Storage, this is the account name.
sdk.azure.storage.containers.container_name (string) - For Azure Storage, this is the container name.
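For example, credentials for a single Azure Storage container might look like the following sketch (the account name, key, and container name are placeholders):
azure.storage {
    containers: [
        {
            account_name: "myaccount"
            account_key: "AZURE_ACCOUNT_KEY"
            container_name: "trains-artifacts"
        }
    ]
}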
sdk.development
sdk.development (dict) - Dictionary of development mode options.
sdk.development.default_output_uri (string) - The default output destination for model checkpoints (snapshots) and artifacts. If the output_uri parameter is not provided when calling the Task.init method, then use the destination in default_output_uri.
sdk.development.detect_with_conda_freeze (bool) - If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with conda freeze.
sdk.development.detect_with_pip_freeze (bool) - If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with pip freeze.
sdk.development.force_analyze_entire_repo (bool) - Support optimization of requirements analysis.
The values are:
- true - Always analyze the entire repository.
- false - If the entry point script does not contain other local files, then analyze only the entry point script (do not analyze the entire repository).
sdk.development.store_code_diff_from_remote (bool) - Store the uncommitted code diff from the remote HEAD instead of the local HEAD.
The values are:
- true - Diff from remote HEAD.
- false - Diff from local HEAD (default).
sdk.development.store_uncommitted_code_diff (bool) - For development mode, indicates whether to store the uncommitted git diff or hg diff in the experiment manifest.
The values are:
- true - Store the diff in the script.requirements.diff section.
- false - Do not store the diff.
sdk.development.support_stopping (bool) - For development mode, indicates whether to allow stopping an experiment if the experiment was aborted externally, its status was changed, or it was reset.
The values are:
- true - Allow
- false - Do not allow
sdk.development.task_reuse_time_window_in_hours (float) - For development mode, the number of hours after which an experiment with the same project name and experiment name is reused. This setting allows you to control reuse of old experiments.
sdk.development.vcs_repo_detect_async (bool) - For development mode, indicates whether to run version control repository detection asynchronously.
The values are:
- true - Run asynchronously
- false - Do not run asynchronously
sdk.development.worker
sdk.development.worker (dict) - Dictionary of development mode options for workers.
sdk.development.worker.log_stdout (bool) - For development mode workers, indicates whether all stdout and stderr messages are logged.
The values are:
- true - Log all
- false - Do not log all
sdk.development.worker.ping_period_sec (integer) - For development mode workers, the interval in seconds for a worker to ping the server, testing connectivity.
sdk.development.worker.report_global_mem_used (bool) - Indicates whether to report memory usage for the entire machine, or only for the running process and its subprocesses.
The values are:
- true - Report for the entire machine.
- false - Report for the running process and its subprocesses only (default).
sdk.development.worker.report_period_sec (integer) - For development mode workers, the interval in seconds for a development mode Trains worker to report.
sdk.google.storage
sdk.google.storage (dict) - Dictionary of Google Cloud Storage credentials.
sdk.google.storage.project (string) - For Google Cloud Storage, the name of the project.
sdk.google.storage.credentials_json (string) - For Google Cloud Storage, the file path for the default Google storage credentials JSON file.
sdk.google.storage.credentials.bucket (string) - For Google Cloud Storage, if specifying credentials by the individual bucket, the name of the bucket.
sdk.google.storage.credentials.credentials_json (string) - For Google Cloud Storage, if specifying credentials by the individual bucket, the file path for the Google storage credentials JSON file.
sdk.google.storage.credentials.project (string) - For Google Cloud Storage, if specifying credentials by the individual bucket, the name of the project.
sdk.google.storage.credentials.subdir (string) - For Google Cloud Storage, if specifying credentials by the individual bucket, a subdirectory within the bucket.
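For example, a project-level default plus per-bucket Google Cloud Storage credentials might look like the following sketch (the project, bucket, and file paths are placeholders):
google.storage {
    project: "my-gcp-project"
    credentials_json: "/path/to/credentials.json"
    credentials: [
        {
            bucket: "my-trains-bucket"
            subdir: "experiments"
            project: "my-gcp-project"
            credentials_json: "/path/to/bucket_credentials.json"
        }
    ]
}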
sdk.log
sdk.log (dict) - Dictionary of log options.
sdk.log.disable_urllib3_info (bool) - Indicates whether to disable urllib3 info messages.
The values are:
- true - Disable
- false - Do not disable
sdk.log.null_log_propagate (bool) - As a debugging feature, indicates whether to allow null log messages to propagate to the root logger (so they appear as stdout).
The values are:
- true - Allow
- false - Do not allow
sdk.log.task_log_buffer_capacity (integer) - The maximum capacity of the log buffer.
sdk.metrics
sdk.metrics (dict) - Dictionary of metrics options.
sdk.metrics.file_history_size (string) - The history size for debug files per metric/variant combination. For each metric/variant combination, file_history_size indicates the number of files stored in the upload destination. Files are recycled so that file_history_size is the maximum number of files at any time.
sdk.metrics.matplotlib_untitled_history_size (int) - The maximum history size for matplotlib imshow files per plot title. File names for the uploaded images are recycled so that no more than the value of matplotlib_untitled_history_size images are stored in the upload destination for each matplotlib plot title.
sdk.metrics.plot_max_num_digits (int) - The maximum number of digits after the decimal point in plot reporting. This can reduce the report size.
sdk.metrics.tensorboard_single_series_per_graph (bool) - Indicates whether plots appear using TensorBoard behavior, where each series is plotted in its own graph (plot-per-graph).
The values are:
- true - Support TensorBoard behavior
- false - Do not
sdk.metrics.images
sdk.metrics.images (dict) - Dictionary of metrics images options.
sdk.metrics.images.format (string) - The image file format for generated debug images (e.g., JPEG).
sdk.metrics.images.quality (integer) - The image quality for generated debug images.
sdk.metrics.images.subsampling (integer) - The image subsampling for generated debug images.
sdk.network
sdk.network.iteration (dict) - Dictionary of network iteration options.
sdk.network.iteration.max_retries_on_server_error (integer) - For retries when getting frames from the server, if the server returned an error (HTTP code 500), then this is the maximum number of retries.
sdk.network.iteration.retry_backoff_factor_sec - For retries when getting frames from the server, this is the backoff factor for consecutive retry attempts. It is used to determine the number of seconds between retries. The retry backoff factor is calculated as {backoff factor} * (2 ^ ({number of total retries} - 1)).
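For example, with a backoff factor of 10 seconds, the formula above yields waits of roughly 10, 20, and 40 seconds for the first three retries. A corresponding snippet might look like the following sketch (the values are illustrative):
network {
    iteration {
        max_retries_on_server_error: 5
        retry_backoff_factor_sec: 10
    }
}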
sdk.network.metrics
sdk.network.metrics (dict) - Dictionary of network metrics options.
sdk.network.metrics.file_upload_starvation_warning_sec (integer) - The number of seconds before a warning is issued when file-bearing events are sent for upload, but no uploads occur.
sdk.network.metrics.file_upload_threads (integer) - The number of threads allocated to uploading files when transmitting metrics for a specific iteration.
sdk.storage
sdk.storage.cache
sdk.storage.cache (dict) - Dictionary of storage cache options.
sdk.storage.cache.path_substitution.local_prefix (string) - The prefix of the local directory structure that replaces the registered prefix.
sdk.storage.cache.path_substitution.registered_prefix (string) - Use to replace the prefix of a registered local path with the prefix matching the local directory structure. Path substitution rules are a list of dictionaries, and during a lookup, the first match executes. The Windows path separator must be escaped ("\\").
The replacement is text-based. The replacement ignores logical parts of a path.
For example, the rule:
{
registered_prefix: "/opt/mydir"
local_prefix: "/tmp/data"
}
evaluates the path /opt/mydirnew/hello.txt by matching it to the path /tmp/datanew/hello.txt.
sdk.storage.cache.path_substitution.replace_linux_sep (string) - Indicates whether to enable path separator replacement for Linux.
The values are:
- true - Enable
- false - Disable
sdk.storage.cache.path_substitution.replace_windows_sep (string) - Indicates whether to enable path separator replacement for Windows.
The values are:
- true - Enable
- false - Disable
You cannot set both replace_linux_sep and replace_windows_sep to True. If both are set to True, an exception is raised.
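Putting the path substitution options together, a single rule might look like the following sketch (the paths are illustrative and the nesting is assumed to follow the list-of-dictionaries structure described above):
storage {
    cache {
        path_substitution: [
            {
                registered_prefix: "/opt/mydir"
                local_prefix: "/tmp/data"
                replace_windows_sep: false
                replace_linux_sep: false
            }
        ]
    }
}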
sdk.storage.cache.default_base_dir (string) - The default base directory for caching. The default is the system temp folder.
sdk.storage.cache.size.cleanup_margin_percent (integer) - The percentage of the cache to clean up during a cleanup pass. For example, if the cache size is 30 GB and cleanup_margin_percent is 10%, then the cache will contain at most 27 GB after the cleanup.
sdk.storage.cache.size.min_free_bytes (integer) - The minimum free space (GB) required on the cache drive. For no minimum, use 0 or a negative number.
sdk.storage.cache.size.max_used_bytes (integer) - The maximum size (GB) of a file to cache. For no limit, use 0 or a negative number.
sdk.storage.direct_access
sdk.storage.direct_access (dict) - Dictionary of storage direct access options.
sdk.storage.direct_access.url (string) - Specify a list of direct access objects using glob patterns, which match sets of files using wildcards. Direct access objects are not downloaded or cached.
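For example, to access local files directly instead of downloading or caching them, a direct access rule might look like the following sketch (the pattern is illustrative):
storage {
    direct_access: [
        { url: "file://*" }
    ]
}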