Simple Pipeline - Serialized Data

The pipeline_controller.py example demonstrates a simple pipeline in Trains. This pipeline is composed of three steps: download data, process data, and train a network. It is implemented using the automation.controller.PipelineController class, which provides the functionality to create a pipeline controller, add steps to the pipeline, pass data from one step to another, make a step begin only after the steps it depends upon complete, run the pipeline, wait for it to complete, and clean up afterwards.

This example implements the pipeline with four Tasks (each Task is created using a different script):

  • Controller Task (pipeline_controller.py) - Creates a pipeline controller, adds the steps (Tasks) to the pipeline, runs the pipeline.
  • Step 1 Task (step1_dataset_artifact.py) - Downloads data and stores the data as an artifact.
  • Step 2 Task (step2_data_processing.py) - Loads the stored data (from Step 1), processes it, and stores the processed data as artifacts.
  • Step 3 Task (step3_train_model.py) - Loads the processed data (from Step 2) and trains a network.

When the pipeline runs, the Step 1, Step 2, and Step 3 Tasks are cloned, and the newly cloned Tasks execute. The Tasks they are cloned from, called the base Tasks, do not execute. This allows the pipeline to run multiple times. The base Tasks must have already run at least once, so that they are stored in Trains Server and can be cloned. The controller Task itself can be run from a development environment (by running the script), or cloned, with the cloned Task executed remotely (provided the controller Task has already run at least once and is stored in Trains Server).
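
The clone-then-execute behavior can be pictured with a toy model (plain Python, not the Trains API): cloning copies the base Task's definition, and only the clone's state changes when it runs, so the base Task stays reusable.

```python
import copy

# Toy stand-in for a Task stored on the server (not the Trains API)
base_task = {
    "name": "pipeline step 1 dataset artifact",
    "status": "completed",   # the base Task must have run at least once
    "parameters": {"General/dataset_url": ""},
}

def clone_and_enqueue(task, overrides):
    """Copy the base Task, apply parameter overrides, and mark the clone queued."""
    cloned = copy.deepcopy(task)
    cloned["parameters"].update(overrides)
    cloned["status"] = "queued"
    return cloned

cloned_task = clone_and_enqueue(base_task, {"General/dataset_url": "https://..."})

# The clone is the Task that will execute; the base Task is untouched.
print(cloned_task["status"])   # queued
print(base_task["status"])     # completed
```

Because every run clones afresh, the same base Task can back any number of pipeline runs.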

The sections below describe in more detail what happens in the controller Task and in each step Task.

The pipeline controller

Create the pipeline controller object.

pipe = PipelineController(default_execution_queue='default', add_pipeline_tags=False)

Add Step 1. Call the automation.controller.PipelineController.add_step method.

pipe.add_step(name='stage_data', base_task_project='examples', base_task_name='pipeline step 1 dataset artifact')
  • name - The name of Step 1 (stage_data).
  • base_task_project and base_task_name - The Step 1 base Task to clone (the cloned Task will be executed when the pipeline runs).

Add Step 2.

pipe.add_step(name='stage_process', parents=['stage_data', ],
              base_task_project='examples', base_task_name='pipeline step 2 process dataset',
              parameter_override={'General/dataset_url': '${stage_data.artifacts.dataset.url}',
                                  'General/test_size': 0.25})
  • name - The name of Step 2 (stage_process).
  • base_task_project and base_task_name - The Step 2 base Task to clone.
  • parents - The start of Step 2 (stage_process) depends upon the completion of Step 1 (stage_data).
  • parameter_override - Pass the URL of the data artifact from Step 1 to Step 2. Override the value of the parameter whose key is dataset_url (in the parameter group named General). Override it with the URL of the artifact named dataset. Also override the test size.

    For the syntax of the parameter_override value, and additional examples, see automation.controller.PipelineController.add_step.
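
To make the ${stage_data.artifacts.dataset.url} reference syntax concrete, here is a toy resolver (plain Python, not Trains code) that substitutes ${...} references against a dictionary of hypothetical completed-step results; the URL is an illustrative placeholder.

```python
import re

# Hypothetical results of completed steps, keyed by step name
step_results = {
    "stage_data": {"artifacts.dataset.url": "https://files.example.com/iris_dataset.pkl"},
}

def resolve(value, results):
    """Replace ${step.path} references with values taken from completed steps."""
    def lookup(match):
        step, _, path = match.group(1).partition(".")
        return str(results[step][path])
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

url = resolve("${stage_data.artifacts.dataset.url}", step_results)
```

The first dotted component names the step (stage_data); the rest is a path into that step's results, here the URL of the artifact named dataset.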

Add Step 3.

pipe.add_step(name='stage_train', parents=['stage_process', ],
              base_task_project='examples', base_task_name='pipeline step 3 train model',
              parameter_override={'General/dataset_task_id': '${stage_process.id}'})

Provide the following add_step method parameters:

  • name - The name of Step 3 (stage_train).
  • parents - The start of Step 3 (stage_train) depends upon the completion of Step 2 (stage_process).
  • parameter_override - Pass the ID of the Step 2 Task to the Step 3 Task. This is the ID of the cloned Task, not the base Task.

Run the pipeline, wait for it to complete, and cleanup.

# Starting the pipeline (in the background)
pipe.start()
# Wait until pipeline terminates
pipe.wait()
# cleanup everything
pipe.stop()
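
The parents arguments define a dependency graph that the controller honors when scheduling steps. The resulting execution order can be sketched as a topological sort, here using the standard-library graphlib (Python 3.9+); this is a toy illustration, not the Trains scheduler.

```python
from graphlib import TopologicalSorter

# Step name -> set of parent steps, mirroring the add_step calls above
dependencies = {
    "stage_data": set(),
    "stage_process": {"stage_data"},
    "stage_train": {"stage_process"},
}

# Parents always come before their children in the resulting order
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['stage_data', 'stage_process', 'stage_train']
```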

Step 1 - Downloading the data

In the Step 1 Task (step1_dataset_artifact.py), the base Task is cloned and enqueued to execute.

task.execute_remotely()

The data is downloaded and stored as an artifact named dataset. This is the same artifact name used in parameter_override when the add_step method is called for the pipeline controller.

# simulate local dataset, download one, so we have something local
local_iris_pkl = StorageManager.get_local_copy(
    remote_url='https://github.com/allegroai/events/raw/master/odsc20-east/generic/iris_dataset.pkl')

# add and upload local file containing our toy dataset
task.upload_artifact('dataset', artifact_object=local_iris_pkl)

Step 2 - Processing the data

In the Step 2 Task (step2_data_processing.py), a parameter dictionary is created and connected to the Task.

args = {
    'dataset_task_id': '',
    'dataset_url': '',
    'random_state': 42,
    'test_size': 0.2,
}

# store arguments, later we will be able to change them from outside the code
task.connect(args)

The parameter dataset_url is the same parameter name used by parameter_override when the add_step method is called in the pipeline controller.

The base Task is cloned and enqueued to execute.

task.execute_remotely()

Later, the Step 2 Task uses that URL to get the data.

iris_pickle = StorageManager.get_local_copy(remote_url=args['dataset_url'])
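
The processing itself (splitting the dataset into train and test sets) is not reproduced here. A minimal stdlib-only sketch of such a split, honoring the test_size and random_state parameters from args, could look like the following; the actual example script may use a library routine such as scikit-learn's train_test_split instead.

```python
import random

def split_train_test(rows, test_size=0.2, random_state=42):
    """Shuffle rows deterministically, then split off a test fraction."""
    rows = list(rows)
    random.Random(random_state).shuffle(rows)
    n_test = int(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]   # train, test

# Toy data standing in for the loaded iris rows
data = list(range(100))
train, test = split_train_test(data, test_size=0.25, random_state=42)
```

Seeding with random_state makes the split reproducible across pipeline runs.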

After processing, it stores the processed data as artifacts.

task.upload_artifact('X_train', X_train)
task.upload_artifact('X_test', X_test)
task.upload_artifact('y_train', y_train)
task.upload_artifact('y_test', y_test)

Step 3 - Training the network

In the Step 3 Task (step3_train_model.py), a parameter dictionary is created and connected to the Task.

# Arguments
args = {
    'dataset_task_id': 'REPLACE_WITH_DATASET_TASK_ID',
}
task.connect(args)

The parameter dataset_task_id is overridden by the ID of the Step 2 Task (cloned Task, not base Task).

Clone the Step 3 base Task and enqueue it.

task.execute_remotely()

Later, use the Step 2 Task ID to get the processed data stored in artifacts.

dataset_task = Task.get_task(task_id=args['dataset_task_id'])
X_train = dataset_task.artifacts['X_train'].get()
X_test = dataset_task.artifacts['X_test'].get()
y_train = dataset_task.artifacts['y_train'].get()
y_test = dataset_task.artifacts['y_test'].get()

Finally, train the network and log plots explicitly, in addition to Trains automatic logging.
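
The training code itself is not reproduced here. As a stand-in for the idea of fitting a model on the retrieved splits, here is a toy nearest-centroid classifier in plain Python; the actual example trains a network, and these names and data are illustrative only.

```python
# Toy stand-in for model training on the retrieved splits (not the example's network)
def fit_centroids(X, y):
    """Compute the per-class mean feature vector (the class centroid)."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Pick the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], features)))

X_train = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]]
y_train = [0, 0, 1, 1]
model = fit_centroids(X_train, y_train)
```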

Running the pipeline

To run the pipeline:

  1. Run the script for each of the steps, if that script has not already run at least once.

    python step1_dataset_artifact.py
    python step2_data_processing.py
    python step3_train_model.py
    
  2. Run the pipeline controller one of the following two ways:

    • Run the script.

      python pipeline_controller.py
      
    • Remotely execute the Task - If the Task pipeline demo in the project examples already exists in Trains Server, clone it and enqueue it to execute.

    If you enqueue a Task, a worker must be listening to that queue for the Task to execute.

A plot describing the pipeline appears in RESULTS > PLOTS. Hover over a step in the pipeline to view the step's name and the parameters overridden by that step.
