FAQ
Trains is now ClearML
This documentation applies to the legacy Trains versions. For the latest documentation, see ClearML.
General Information
Models
- How can I sort models by a certain metric?
- Can I store more information on the models?
- Can I store the model configuration file as well?
- I am training multiple models at the same time, but I only see one of them. What happened?
- Can I log input and output models manually?
Experiments
- I noticed I keep getting the message "warning: uncommitted code". What does it mean?
- I do not use argparse for hyperparameters. Do you have a solution?
- I noticed that all of my experiments appear as "Training". Are there other options?
- Sometimes I see experiments as running when in fact they are not. What's going on?
- My code throws an exception, but my experiment status is not "Failed". What happened?
- When I run my experiment, I get an SSL Connection error CERTIFICATE_VERIFY_FAILED. Do you have a solution?
- How do I modify experiment names once they have been created?
- Using Conda and the "typing" package, I get the error "AttributeError: type object 'Callable' has no attribute '_abc_registry'". How do I fix this?
- My Trains Server disk space usage is too high. What can I do about this?
- Can I change the random seed my experiment uses?
- In the Web UI, I can't access files that my experiment stored. Why not?
- I get the message "TRAINS Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start". What does it mean?
- Can I control what Trains automatically logs?
Graphs and Logs
- The first log lines are missing from the experiment log tab. Where did they go?
- Can I create a graph comparing hyperparameters vs model accuracy?
- I want to add more graphs, not just with TensorBoard. Is this supported?
- How can I report more than one scatter 2D series on the same plot?
GIT and Storage
- Is there something Trains can do about uncommitted code running?
- I read there is a feature for centralized model storage. How do I use it?
- When using PyCharm to remotely debug a machine, the Git repo is not detected. Do you have a solution?
Also, see Git is not well-supported in Jupyter...
Remote Debugging (Trains PyCharm Plugin)
Jupyter
- I am using Jupyter Notebook. Is this supported?
- Git is not well-supported in Jupyter, so we just gave up on committing our code. Do you have a solution?
scikit-learn
Trains Configuration
- How do I explicitly specify the Trains configuration file to be used?
- How can I override Trains credentials from the OS environment?
- How can I track OS environment variables with experiments?
Trains Server Deployment
- How do I deploy Trains Server on stand-alone Linux or macOS, Windows 10, AWS EC2 AMIs, and the Google Cloud Platform?
- Can I deploy Trains Server on Kubernetes clusters?
- Can I create a Helm Chart for Trains Server Kubernetes deployment?
- My Docker cannot load a local host directory on SELinux?
Trains Server Configuration
- How do I configure Trains Server for sub-domains and load balancers?
- Can I add web login authentication to Trains Server?
- Can I modify the non-responsive task watchdog settings?
Trains Server Troubleshooting
- How do I fix Docker upgrade errors?
- Why is web login authentication not working?
- How do I bypass a proxy configuration to access my local Trains Server?
- The Trains Server keeps returning HTTP 500 (or 400) errors. How do I fix this?
- Why is my Trains Web-App (UI) not showing any data?
Trains Agent
Trains API
General Information
How do I know a new version came out?
Starting with Trains v0.9.3, Trains issues a new version release notification whenever a Python experiment script is run. The notification appears in the log and is output to the console.
For example, when a new Trains Python Package version is available, the notification is:
TRAINS new package available: UPGRADE to vX.Y.Z is recommended!
When a new Trains Server version is available, the notification is:
TRAINS-SERVER new version available: upgrade to vX.Y is recommended!
Models
How can I sort models by a certain metric?
Trains associates models with the experiments that created them. To sort experiments by a metric, in the Trains Web-App (UI), add a custom column in the experiments table and sort by that metric column.
Can I store more information on the models?
Yes! For example, you can use the Task.set_model_label_enumeration method to store class enumeration:
Task.current_task().set_model_label_enumeration({"label": 0})
For more information about Task class methods, see the Task Class reference page.
Can I store the model configuration file as well?
Yes! Use the Task.set_model_config method:
Task.current_task().set_model_config("a very long text with the configuration file's content")
I am training multiple models at the same time, but I only see one of them. What happened?
Currently, the experiment info panel shows only the last associated model. In the Trains Web-App (UI), on the Projects page, the Models tab shows all models.
This will be improved in a future version.
Can I log input and output models manually?
Yes! Use the InputModel.import_model and Task.connect methods to manually connect an input model, and the OutputModel.update_weights method to manually connect a model weights file.
input_model = InputModel.import_model(link_to_initial_model_file)
Task.current_task().connect(input_model)
OutputModel(Task.current_task()).update_weights(link_to_new_model_file_here)
For more information about models, see InputModel and OutputModel classes.
Experiments
I noticed I keep getting the message "warning: uncommitted code". What does it mean?
This message is only a warning. Trains not only detects your current repository and git commit, but also warns you if you are using uncommitted code. Trains does this because uncommitted code means this experiment will be difficult to reproduce. You can see uncommitted changes in the Trains Web-App (UI), experiment info panel, EXECUTION tab.
I do not use argparse for hyperparameters. Do you have a solution?
Yes! Trains supports connecting hyperparameter dictionaries to experiments using the Task.connect method.
For example, to log the hyperparameters learning_rate, batch_size, display_step, model_path, n_hidden_1, and n_hidden_2:
# Create a dictionary of parameters
parameters_dict = { 'learning_rate': 0.001, 'batch_size': 100, 'display_step': 1,
'model_path': "/tmp/model.ckpt", 'n_hidden_1': 256, 'n_hidden_2': 256 }
# Connect the dictionary to your TRAINS Task
parameters_dict = Task.current_task().connect(parameters_dict)
I noticed that all of my experiments appear as "Training". Are there other options?
Yes! When creating experiments and calling Task.init, you can provide an experiment type. Trains supports multiple experiment types. For example:
task = Task.init(project_name, task_name, Task.TaskTypes.testing)
Sometimes I see experiments as running when in fact they are not. What's going on?
Trains monitors your Python process. When the process exits properly, Trains closes the experiment. When the process crashes and terminates abnormally, it sometimes misses the stop signal. In this case, you can safely right click the experiment in the Web-App and abort it.
My code throws an exception, but my experiment status is not "Failed". What happened?
This issue was resolved in v0.9.2. Upgrade Trains by executing the following command:
pip install -U trains
When I run my experiment, I get an SSL Connection error CERTIFICATE_VERIFY_FAILED. Do you have a solution?
Your firewall may be preventing the connection. Try one of the following solutions:
- Direct python "requests" to use the enterprise certificate file by setting the OS environment variables CURL_CA_BUNDLE or REQUESTS_CA_BUNDLE. For a detailed discussion of this topic, see https://stackoverflow.com/questions/48391750/disable-python-requests-ssl-validation-for-an-imported-module.
- Disable certificate verification (for security reasons, this is not recommended):
  1. Upgrade Trains to the current version:
     pip install -U trains
  2. Create a new trains.conf configuration file (see a sample trains.conf), containing:
     api { verify_certificate = False }
  3. Copy the new trains.conf file to ~/trains.conf (on Windows: C:\Users\your_username\trains.conf).
How do I modify experiment names once they have been created?
An experiment's name is a user-controlled property, which can be accessed via the Task.name variable. This allows you to use meaningful naming schemes for easily filtering and comparing experiments.
For example, to distinguish between different experiments, you can append the task ID to the task name:
task = Task.init('examples', 'train')
task.name += ' {}'.format(task.id)
Or, append the Task ID post-execution:
tasks = Task.get_tasks(project_name='examples', task_name='train')
for t in tasks:
    t.name += ' {}'.format(t.id)
Another example is to append a specific hyperparameter and its value to each task's name:
tasks = Task.get_tasks(project_name='examples', task_name='my_automl_experiment')
for t in tasks:
    params = t.get_parameters()
    if 'my_secret_parameter' in params:
        t.name += ' my_secret_parameter={}'.format(params['my_secret_parameter'])
This kind of experiment naming is useful when creating automation pipelines that follow a naming convention.
Using Conda and the "typing" package, I get the error "AttributeError: type object 'Callable' has no attribute '_abc_registry'". How do I fix this?
Conda and the typing package may have compatibility issues. However, since Python 3.5, the typing package is part of the standard library.
To resolve the error, uninstall typing and rerun your script. If this does not fix the issue, create a new Trains issue that includes the full error and your environment details.
My Trains Server disk space usage is too high. What can I do about this?
We designed the Trains open source suite, including Trains Server, to ensure experiment traceability. For this reason, the Trains Web-App (UI) does not include a feature to delete experiments. The Trains Web-App (UI) does allow you to archive experiments so that they appear only in the Archive area.
In rare instances, however, such as when a self-hosted Trains Server uses too much disk space because Elasticsearch is indexing unwanted experiments, you may choose to delete an experiment.
You can use the APIClient provided by Trains Agent and client.tasks.delete() to delete an experiment.
You cannot restore a deleted experiment.
For example, the following script deletes an experiment whose Task ID is 123456789.
from trains_agent import APIClient
client = APIClient()
client.tasks.delete(task='123456789')
Can I change the random seed my experiment uses?
Yes! By default, Trains initializes Tasks with a default seed. You can change that seed by calling the make_deterministic method.
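For example, a minimal sketch (the import path trains.utilities.seed is an assumption; check where make_deterministic is exposed in your Trains version):
from trains import Task
# Assumed import path; adjust if your Trains version exposes make_deterministic elsewhere
from trains.utilities.seed import make_deterministic

task = Task.init(project_name='examples', task_name='custom seed')

# Override the default seed Trains applied when the Task was initialized
make_deterministic(seed=1234)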
In the Web UI, I can't access files that my experiment stored. Why not?
Trains stores the locations of files, not the files themselves. The machine running your browser must have access to the location where the machine that ran the Task stored the file. This applies to debug samples and artifacts. If, for example, the machine running the browser does not have access, you may see "Unable to load image" instead of the image.
I get the message "TRAINS Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start". What does it mean?
If metrics reporting begins within the first three minutes, Trains reports resource monitoring by iteration. Otherwise, it reports resource monitoring by seconds from start, and logs a message, "TRAINS Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start".
However, if metrics reporting begins after three minutes and within thirty minutes, resource monitoring switches back to reporting by iteration, and Trains logs the message "TRAINS Monitor: Reporting detected, reverting back to iteration based reporting". After thirty minutes, the fallback remains unchanged.
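For example, a minimal sketch that reports a scalar from the start of training so Trains can detect iteration-based reporting (train_step is a hypothetical function standing in for your training code):
from trains import Task

task = Task.init(project_name='examples', task_name='iteration reporting')
logger = task.get_logger()

for iteration in range(1000):
    loss = train_step()  # hypothetical training step returning a loss value
    # Reporting within the first three minutes keeps resource monitoring by iteration
    logger.report_scalar(title="loss", series="train", value=loss, iteration=iteration)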
Can I control what Trains automatically logs?
Yes! Trains allows you to control automatic logging for stdout, stderr, and frameworks.
When initializing a Task by calling the Task.init method, provide the auto_connect_frameworks parameter to control framework logging, and the auto_connect_streams parameter to control stdout, stderr, and standard logging. The values are True, False, or a dictionary for fine-grained control. See Task.init.
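For example, a sketch using dictionaries for fine-grained control (the exact dictionary keys shown are assumptions; see Task.init for the keys your version supports):
from trains import Task

task = Task.init(
    project_name='examples',
    task_name='logging control',
    # Disable Matplotlib auto-logging, keep other frameworks (keys are assumed)
    auto_connect_frameworks={'matplotlib': False, 'tensorflow': True, 'pytorch': True},
    # Capture stdout/stderr, but not the standard logging module
    auto_connect_streams={'stdout': True, 'stderr': True, 'logging': False},
)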
Graphs and Logs
The first log lines are missing from the experiment log tab. Where did they go?
Due to speed/optimization considerations, we opted to display only the last several hundred log lines.
You can always download the full log as a file from the Trains Web-App (UI): in the experiment info panel, RESULTS tab, LOG sub-tab, use the download full log feature.
Can I create a graph comparing hyperparameters vs model accuracy?
Yes! You can manually create a single-point plot, with the hyperparameter value on the X-axis and the accuracy on the Y-axis. For example:
number_layers = 10
accuracy = 0.95
Task.current_task().get_logger().report_scatter2d(
"performance", "accuracy", iteration=0,
mode='markers', scatter=[(number_layers, accuracy)])
This assumes the hyperparameter is number_layers with a current value of 10, and the accuracy of the trained model is 0.95. The experiment comparison graph then shows each experiment as a single accuracy vs. number_layers point.
Another option is a histogram chart:
number_layers = 10
accuracy = 0.95
Task.current_task().get_logger().report_vector(
"performance", "accuracy", iteration=0, labels=['accuracy'],
values=[accuracy], xlabels=['number_layers %d' % number_layers])
I want to add more graphs, not just with TensorBoard. Is this supported?
Yes! The Logger module includes methods for explicit reporting. For examples of explicit reporting, see our Explicit Reporting tutorial, which includes a list of methods for explicit reporting.
How can I report more than one scatter 2D series on the same plot?
The Logger.report_scatter2d() method reports all series with the same title and iteration parameter values on the same plot.
For example, the following two scatter 2D series are reported on the same plot, because both have a title of example_scatter and an iteration of 1:
scatter2d_1 = np.hstack((np.atleast_2d(np.arange(0, 10)).T, np.random.randint(10, size=(10, 1))))
logger.report_scatter2d("example_scatter", "series_1", iteration=1, scatter=scatter2d_1,
xaxis="title x", yaxis="title y")
scatter2d_2 = np.hstack((np.atleast_2d(np.arange(0, 10)).T, np.random.randint(10, size=(10, 1))))
logger.report_scatter2d("example_scatter", "series_2", iteration=1, scatter=scatter2d_2,
xaxis="title x", yaxis="title y")
GIT and Storage
Is there something Trains can do about uncommitted code running?
Yes! Trains stores the git diff as part of the experiment's information. You can view the git diff in the Trains Web-App (UI), experiment info panel, EXECUTION tab.
I read there is a feature for centralized model storage. How do I use it?
When calling Task.init, providing the output_uri parameter allows you to specify the location in which model checkpoints (snapshots) will be stored.
For example, to store model checkpoints (snapshots) in /mnt/shared/folder:
task = Task.init(project_name, task_name, output_uri="/mnt/shared/folder")
Trains will copy all stored snapshots into a subfolder under /mnt/shared/folder. The subfolder's name contains the experiment's ID. If the experiment's ID is 6ea4f0b56d994320a713aeaf13a86d9d, the following folder is used:
/mnt/shared/folder/task.6ea4f0b56d994320a713aeaf13a86d9d/models/
Trains supports other storage types for output_uri, including:
# AWS S3 bucket
task = Task.init(project_name, task_name, output_uri="s3://bucket-name/folder")
# Google Cloud Storage bucket
task = Task.init(project_name, task_name, output_uri="gs://bucket-name/folder")
To use cloud storage with Trains, configure the storage credentials in your ~/trains.conf. For detailed information, see the Trains Configuration Reference.
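For example, a sketch of an S3 credentials section in ~/trains.conf (the section layout follows the sample trains.conf; the key, secret, and region values are placeholders):
sdk {
    aws {
        s3 {
            # Default credentials used for S3 buckets (placeholder values)
            key: "my-access-key"
            secret: "my-secret-key"
            region: "us-east-1"
        }
    }
}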
When using PyCharm to remotely debug a machine, the Git repo is not detected. Do you have a solution?
Yes! Since this is such a common occurrence, we created a PyCharm plugin that allows a remote debugger to grab your local repository / commit ID. For detailed information about using our plugin, see the Trains PyCharm Plugin on the "Trains Plugins" page.
Jupyter
I am using Jupyter Notebook. Is this supported?
Yes! You can run Trains in Jupyter Notebooks using either of the following:
- Option 1: Install Trains on your Jupyter Notebook host machine
- Option 2: Install Trains in your Jupyter Notebook and connect using Trains credentials
Option 1: Install Trains on your Jupyter host machine
1. Connect to your Jupyter host machine.
2. Install the Trains Python Package:
   pip install trains
3. Run the Trains initialization wizard:
   trains-init
4. In your Jupyter Notebook, you can now use Trains.
Option 2: Install Trains in your Jupyter Notebook
1. In the Trains Web-App (UI), Profile page, create credentials and copy your access key and secret key. These are required in Step 3.
2. Install the Trains Python Package:
   pip install trains
3. Use the Task.set_credentials method to specify the host, port, access key, and secret key (see Step 1):
   # Set your credentials using the trains apiserver URI and port, access_key, and secret_key.
   Task.set_credentials(host='http://localhost:8008', key='<access_key>', secret='<secret_key>')
   Note: host is the API server (default port 8008), not the web server (default port 8080).
4. You can now use Trains:
   # Create a task and start training
   task = Task.init('jupyter project', 'my notebook')
Git is not well-supported in Jupyter, so we just gave up on committing our code. Do you have a solution?
Yes! Use our Trains Jupyter Plugin. This plugin allows you to commit your notebook directly from Jupyter. It also saves the Python version of your code and creates an updated requirements.txt, so you know which packages you were using.
Remote Debugging (Trains PyCharm Plugin)
I am using your Trains PyCharm Plugin for remote debugging. I get the message "trains.Task - INFO - Repository and package analysis timed out (10.0 sec), giving up". What should I do?
Trains uses a background thread to analyze the script. This includes package requirements. At the end of the execution of the script, if the background thread is still running, Trains allows the thread another 10 seconds to complete. If the thread does not complete, it times out.
This can occur for scripts that do not import any packages, for example short test scripts.
To fix this issue, you could import the time package and add a time.sleep(20) statement at the end of your script.
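For example:
import time

# ... your short test script ...

# Give the Trains background analysis thread time to finish before the process exits
time.sleep(20)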
scikit-learn
Can I use Trains with scikit-learn?
Yes! scikit-learn is supported. Everything you do is logged, and Trains automatically logs models that are stored using joblib. See the scikit-learn examples.
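For example, a minimal sketch (the dataset and model are illustrative):
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from trains import Task

task = Task.init(project_name='examples', task_name='scikit-learn joblib example')

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Saving the model with joblib is picked up automatically and logged as an output model
joblib.dump(model, 'model.pkl')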
Trains Configuration
How do I explicitly specify the Trains configuration file to be used?
To override the default configuration file location, set the TRAINS_CONFIG_FILE OS environment variable.
For example:
export TRAINS_CONFIG_FILE="/home/user/mytrains.conf"
How can I override Trains credentials from the OS environment?
To override your configuration file / defaults, set the following OS environment variables:
export TRAINS_API_ACCESS_KEY="key_here"
export TRAINS_API_SECRET_KEY="secret_here"
export TRAINS_API_HOST="http://localhost:8008"
How can I track OS environment variables with experiments?
Set the OS environment variable TRAINS_LOG_ENVIRONMENT with the variables you need to track, using one of the following:
- All environment variables:
  export TRAINS_LOG_ENVIRONMENT="*"
- Specific environment variables, for example, to log PWD and PYTHONPATH:
  export TRAINS_LOG_ENVIRONMENT="PWD,PYTHONPATH"
- No environment variables:
  export TRAINS_LOG_ENVIRONMENT=
Trains Server Deployment
How do I deploy Trains Server on stand-alone Linux Ubuntu or macOS systems?
For detailed instructions, see Deploying Trains Server: Linux or macOS in the "Deploying Trains" section.
How do I deploy Trains Server on Windows 10?
For detailed instructions, see Deploying Trains Server: Windows 10 in the "Deploying Trains" section.
How do I deploy Trains Server on AWS EC2 AMIs?
For detailed instructions, see Deploying Trains Server: AWS EC2 AMIs in the "Deploying Trains" section.
How do I deploy Trains Server on the Google Cloud Platform?
For detailed instructions, see Deploying Trains Server: Google Cloud Platform in the "Deploying Trains" section.
How do I restart Trains Server?
For detailed instructions, see the "Restarting" section of our documentation page for your deployment format. For example, if you deployed to Linux, see Restarting on the "Deploying Trains Server: Linux or macOS" page.
Can I deploy Trains Server on Kubernetes clusters?
Yes! Trains Server supports Kubernetes. For detailed instructions, see Deploying Trains Server: Kubernetes in the "Deploying Trains" section.
Can I create a Helm Chart for Trains Server Kubernetes deployment?
Yes! You can create a Helm Chart of Trains Server Kubernetes deployment.
For detailed instructions, see Deploying Trains Server: Kubernetes using Helm in the "Deploying Trains" section.
My Docker cannot load a local host directory on SELinux?
If you are using SELinux, run the following command (see this discussion):
chcon -Rt svirt_sandbox_file_t /opt/trains
Trains Server Configuration
How do I configure Trains Server for sub-domains and load balancers?
For detailed instructions, see Configuring Sub-domains and load balancers on the "Configuring Trains Server" page.
Can I add web login authentication to Trains Server?
By default, anyone can log in to the Trains Server Web-App. You can configure Trains Server to allow only a specific set of users to access the system.
For detailed instructions, see Web Login Authentication on the "Configuring Trains Server" page in the "Deploying Trains" section.
Can I modify the non-responsive task watchdog settings?
The non-responsive experiment watchdog monitors experiments that were not updated for a specified time interval, and marks them as aborted. The watchdog is always active.
You can modify the following settings for the watchdog:
- The time threshold (in seconds) of task inactivity (default value is 7200 seconds which is 2 hours).
- The time interval (in seconds) between watchdog cycles.
For detailed instructions, see Modifying non-responsive Task watchdog settings on the "Configuring Trains Server" page.
Trains Server Troubleshooting
How do I fix Docker upgrade errors?
To resolve the Docker error:
... The container name "/trains-???" is already in use by ...
try removing the deprecated containers:
$ docker rm -f $(docker ps -a -q)
Why is web login authentication not working?
A port conflict between the Trains Server MongoDB and / or Elastic instances, and other instances running on your system may prevent web login authentication from working correctly.
Trains Server uses the following default ports which may be in conflict with other instances:
- MongoDB port
27017
- Elastic port
9200
You can check for port conflicts in the logs in /opt/trains/log.
If a port conflict occurs, change the MongoDB and / or Elastic ports in the docker-compose.yml
, and then run the Docker compose commands to restart the Trains Server instance.
To change the MongoDB and / or Elastic ports for your Trains Server, do the following:
1. Edit the docker-compose.yml file.
2. In the services/trainsserver/environment section, add the following environment variable(s):
   - For MongoDB:
     MONGODB_SERVICE_PORT: <new-mongodb-port>
   - For Elastic:
     ELASTIC_SERVICE_PORT: <new-elasticsearch-port>
   For example:
     MONGODB_SERVICE_PORT: 27018
     ELASTIC_SERVICE_PORT: 9201
3. For MongoDB, in the services/mongo/ports section, expose the new MongoDB port:
   <new-mongodb-port>:27017
   For example:
   27018:27017
4. For Elastic, in the services/elasticsearch/ports section, expose the new Elastic port:
   <new-elasticsearch-port>:9200
   For example:
   9201:9200
5. Restart Trains Server (see Restarting Trains Server).
How do I bypass a proxy configuration to access my local Trains Server?
A proxy server may block access to Trains Server configured for localhost.
To fix this, you may allow bypassing of your proxy server to localhost using a system environment variable, and configure Trains for Trains Server using it.
Do the following:
1. Allow bypassing of your proxy server to localhost using a system environment variable, for example:
   NO_PROXY=localhost
2. If a Trains configuration file (trains.conf) exists, delete it.
3. Open a terminal session.
4. In the terminal session, set the system environment variable to 127.0.0.1, for example:
   - Linux:
     no_proxy=127.0.0.1
     NO_PROXY=127.0.0.1
   - Windows:
     set no_proxy=127.0.0.1
     set NO_PROXY=127.0.0.1
5. Run the Trains wizard trains-init to configure Trains for Trains Server. The wizard prompts you to open the Trains Web-App (UI) at http://127.0.0.1:8080/ and create new Trains credentials.
   The wizard completes with:
   Verifying credentials ...
   Credentials verified!
   New configuration stored in /home/<username>/trains.conf
   TRAINS setup completed successfully.
The Trains Server keeps returning HTTP 500 (or 400) errors. How do I fix this?
The Trains Server returns HTTP error responses (5XX or 4XX) when some of its backend components are failing.
A common cause of such a failure is low available disk space: the Elasticsearch service used by your server goes into read-only mode when it reaches the Elasticsearch flood-stage watermark (by default, 95% of disk space used).
This can be readily fixed by making more disk space available to the Elasticsearch service, either by freeing up disk space or, if using dynamic cloud storage, by increasing the disk size.
A likely indication of this situation is the message "[FORBIDDEN/12/index read-only / allow delete (api)]" in your Trains logs.
Why is my Trains Web-App (UI) not showing any data?
If your Trains Web-App (UI) does not show anything, the cause may be an error authenticating with the server. Try clearing the application cookies for your Trains Server site in your browser's developer tools.
Trains Agent
How can I execute Trains Agent without installing packages each time?
Instead of installing the Python packages in the virtual environment created by Trains Agent, you can optimize execution time by inheriting the packages from your global site-packages directory: in the Trains configuration file, set the configuration option agent.package_manager.system_site_packages to true.
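For example, in the Trains configuration file (structure follows the sample trains.conf):
agent {
    package_manager {
        # Inherit packages from the global site-packages directory
        system_site_packages: true
    }
}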
Trains API
How can I use the Trains API to fetch data?
To fetch data using the Trains API, create an authenticated session and send requests for data using the Trains API services and methods. The responses to the requests contain your data.
For example, to get the metrics for an experiment and print metrics as a histogram:
1. Start an authenticated session.
2. Send a request for all projects named examples using the projects service GetAllRequest method.
3. From the response, get the IDs of all the projects named examples.
4. Send a request for all experiments (tasks) with those project IDs using the tasks service GetAllRequest method.
5. From the response, get the data for the experiment (task) at index 11, and print the experiment name.
6. Send a request for a metrics histogram for that experiment (task) using the events service ScalarMetricsIterHistogramRequest method, and print the histogram.

# Import Session from the trains backend_api
from trains.backend_api import Session
# Import the services for tasks, events, and projects
from trains.backend_api.services import tasks, events, projects

# Create an authenticated session
session = Session()

# Get projects matching the project name 'examples'
res = session.send(projects.GetAllRequest(name='examples'))
# Get all the project IDs matching the project name 'examples'
projects_id = [p.id for p in res.response.projects]
print('project ids: {}'.format(projects_id))

# Get all the experiments/tasks for those projects
res = session.send(tasks.GetAllRequest(project=projects_id))

# Do your work
# For example, get the experiment at index 11 in the response
task = res.response.tasks[11]
print('task name: {}'.format(task.name))

# For example, get the metric values for that experiment
res = session.send(events.ScalarMetricsIterHistogramRequest(task=task.id))
scalars = res.response_data
print('scalars {}'.format(scalars))