ClearML AWS Autoscaler Service

The ClearML AWS autoscaler scales AWS EC2 instances according to the instance types you want to use and the budget you configure. In the budget, you set the maximum number of instances of each type to spin up for experiments awaiting execution in a specific queue. You can configure multiple instance types per queue, and multiple queues. The autoscaler spins down idle instances based on the maximum idle time and polling interval you configure. Its Task name is AWS Auto-Scaler, and it is associated with the DevOps project.
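
To make the budget and idle-time ideas concrete, here is a minimal Python sketch of the two decisions described above. It is an illustration only, not the autoscaler's actual implementation: the `Instance` type, both function names, the "one pending experiment per new instance" rule, and the `>=` idle comparison are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A running instance as this sketch models it (hypothetical type)."""
    instance_id: str
    resource_name: str
    idle_minutes: float  # minutes since the instance last ran a task


def instances_to_spin_up(pending_tasks, budget, running_counts):
    """Decide how many instances of each resource to start for one queue.

    pending_tasks  -- number of experiments waiting in the queue
    budget         -- list of (resource_name, max_instances) pairs, in priority order
    running_counts -- dict of resource_name -> instances already running
    Assumption of this sketch: each pending experiment needs one new instance.
    """
    plan = {}
    remaining = pending_tasks
    for resource_name, max_instances in budget:
        if remaining <= 0:
            break
        # Never exceed the budgeted maximum for this resource.
        free_slots = max(0, max_instances - running_counts.get(resource_name, 0))
        to_start = min(free_slots, remaining)
        if to_start:
            plan[resource_name] = to_start
            remaining -= to_start
    return plan


def instances_to_spin_down(instances, max_idle_time_min):
    """Return the IDs of instances idle for at least the allowed time."""
    return [i.instance_id for i in instances if i.idle_minutes >= max_idle_time_min]


# Hypothetical budget for one queue: up to 2 'aws4gpu' and 3 'aws1gpu' instances.
budget = [("aws4gpu", 2), ("aws1gpu", 3)]
plan = instances_to_spin_up(pending_tasks=4, budget=budget, running_counts={"aws4gpu": 1})
print(plan)

idle = instances_to_spin_down(
    [Instance("i-001", "aws4gpu", 20.0), Instance("i-002", "aws1gpu", 3.0)],
    max_idle_time_min=15,
)
print(idle)
```

In this sketch, the spin-down check would be repeated once per polling interval, which is why both the maximum idle time and the polling interval appear in the configuration.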

The ClearML AWS autoscaler is configurable and can execute in ClearML services mode. It is preloaded in ClearML Server with the status Draft (editable). You can set the instance types and configure your budget in the ClearML Web UI, and then enqueue the Task to the services queue. Alternatively, run the script aws_autoscaler.py, which contains a wizard that helps you configure everything and offers options to run locally or as a service.

Running the ClearML AWS autoscaler

You can run the ClearML AWS autoscaler in two ways:

  • By running the script

  • In the ClearML Web UI

Running using the script

The aws_autoscaler.py script includes a wizard that prompts you for the instance details and the budget configuration.

  1. The script can run in three ways:

    1. Configure and create a task without running it:

      python aws_autoscaler.py
      
    2. Configure and enqueue the autoscaler task to be executed by a ClearML Agent on a remote machine:

      Use the remote command line option:

      python aws_autoscaler.py --remote
      
    3. Configure and run the autoscaler task locally:

      Use the run command line option:

      python aws_autoscaler.py --run
      
  2. When the script runs, you can either use an existing configuration or use the configuration wizard, which prompts you for all the required information.
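
The three run modes above can be pictured as a small flag-to-mode mapping. The following is a hypothetical sketch and not the source of aws_autoscaler.py: the function name `parse_mode`, the returned mode labels, and the choice to make the two flags mutually exclusive are assumptions for illustration. Only the flag names `--remote` and `--run` come from the commands shown above.

```python
import argparse


def parse_mode(argv):
    """Map command-line flags to one of the three run modes described above."""
    parser = argparse.ArgumentParser(description="Hypothetical autoscaler launcher")
    # Assumption: the two flags cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--remote", action="store_true",
                       help="enqueue the task for a ClearML Agent on a remote machine")
    group.add_argument("--run", action="store_true",
                       help="run the autoscaler task locally")
    args = parser.parse_args(argv)
    if args.remote:
        return "enqueue-remote"
    if args.run:
        return "run-local"
    # No flag: configure and create the task without running it.
    return "create-only"


print(parse_mode([]))
print(parse_mode(["--remote"]))
print(parse_mode(["--run"]))
```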

Configuration wizard steps
  1. The setup wizard begins. Enter the AWS credentials and AWS region name.

    AWS Autoscaler setup wizard
    ---------------------------
    Follow the wizard to configure your AWS auto-scaler service.
    Once completed, you will be able to view and change the configuration in the clearml-server web UI.
    It means there is no need to worry about typos or mistakes :)
    
    Enter AWS Access Key ID : 
    Enter AWS Secret Access Key : 
    Enter AWS region name [us-east-1]:
    
  2. Enter Git credentials. These are required by ClearML Agent to set up a Task execution environment in an AWS EC2 instance.

    GIT credentials:
    Enter GIT username for repository cloning (leave blank for SSH key authentication): []
    Enter password for user '<username>':
    

    The wizard reports the Git credentials it will use.

    Git repository cloning will be using user=*************** password=***********
    
  3. Enter the default Docker image and parameters to use.

    Enter default docker image/parameters to use [nvidia/cuda:10.1-runtime-ubuntu18.04]:
    
  4. For each AWS EC2 instance type you will use in your budget, choose the instance type, whether to use spot instances, select an AMI, and define the Amazon EBS volume. Select as many instance types as you may need.

    Configure the machine types for the auto-scaler:
    ------------------------------------------------
    Select Amazon instance type ['g4dn.4xlarge']:
    Use spot instances? [y/N]: y
    Select availability zone ['us-east-1b']:
    Select the Amazon Machine Image id ['ami-04c0416d6bd8e4b1f']:
    Enter the Amazon EBS device ['/dev/sda1']:
    Enter the Amazon EBS volume size (in GiB) [100]:
    Enter the Amazon EBS volume type ['gp3']:
    Enter the Amazon Key Pair name :
    Enter Amazon Security Group ID :
    

    Name the instance type you configured. Later in the configuration, you use this name to create your budget.

    Select a name for this instance type (used in the budget section) For example 'aws4gpu':
    

    The wizard prompts you to define another instance type.

    Define another instance type? [y/N]:
    
  5. Enter any bash script you want to run on newly created instances, before ClearML Agent executes. To finish entering the script and continue to the next step, enter two consecutive empty lines.

    Enter any pre-execution bash script to be executed on the newly created instances []
    Note: two consecutive empty lines would terminate the input :
    

    The wizard reports the number of lines of data that were detected.

    Entered 0 lines of pre-execution bash script
    
  6. Enter any extra configurations you’d like to include in the instance’s clearml.conf file. To finish and continue to the next step, enter two consecutive empty lines.

    Enter anything you'd like to include in your clearml.conf file []
    Note: two consecutive empty lines would terminate the input :
    

    The wizard reports the number of lines of data that were detected.

    Entered 0 extra lines for clearml.conf file
    
  7. Configure the AWS autoscaler budget. For each queue that you want to use in your budget, select the queue, and the maximum number of each instance type that the ClearML AWS autoscaler can spin up to execute experiments awaiting execution in that queue.

    Define the machines budget:
    -----------------------------
      
    Select a queue name (for example: 'aws_4gpu_machines') : 
    Select a instance type to attach to the queue ['aws-g4dn.xlarge', 'aws-g4dn.8xlarge', 'aws-g4dn.16xlarge']:
    Enter maximum number of 'aws-g4dn.xlarge' instances to spin simultaneously (example: 3) :         
    
  8. If you want to add another queue for the autoscaler to listen to, add it. The previous step repeats.

    Add another queue? [y/N]:         
    
  9. The ClearML AWS autoscaler polls instances, and if they have been idle for the maximum idle time you specify, the autoscaler spins them down. You can accept or change the default values.

    Enter maximum idle time for the auto-scaler to spin down an instance (in minutes) [15]:
    Enter instances polling interval for the auto-scaler (in minutes) [5]:
    

    The configuration is complete. ClearML initializes the Task AWS Auto-Scaler, the service begins, and the script prints a hyperlink to the Task’s log.

    CLEARML Task: created new task id=d0ee5309a9a3471d8802f2561da60dfa
    CLEARML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
    CLEARML results page: https://app.clearml-master.hosted.allegro.ai/projects/142a598b5d234bebb37a57d692f5689f/experiments/d0ee5309a9a3471d8802f2561da60dfa/output/log
    Running AWS auto-scaler as a service
    Execution log https://app.clearml-master.hosted.allegro.ai/projects/142a598b5d234bebb37a57d692f5689f/experiments/d0ee5309a9a3471d8802f2561da60dfa/output/log
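
Steps 5 and 6 of the wizard end free-form input when you enter two consecutive empty lines. A minimal sketch of that input rule in Python might look like the following. The function name `read_until_two_blank_lines` is hypothetical, and keeping a single empty line as part of the input is an assumption of this sketch; the wizard's own handling may differ.

```python
def read_until_two_blank_lines(lines):
    """Collect lines until two consecutive empty lines are seen.

    `lines` is any iterable of strings (for example, sys.stdin).
    A single empty line is kept as part of the input; the two
    terminating empty lines are not.
    """
    collected = []
    previous_blank = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            if previous_blank:
                collected.pop()  # drop the first of the two terminating blanks
                break
            previous_blank = True
        else:
            previous_blank = False
        collected.append(line)
    return collected


# Simulated user input: two commands separated by one empty line,
# followed by the two empty lines that end the input.
typed = ["apt-get update\n", "\n", "pip install boto3\n", "\n", "\n"]
script = read_until_two_blank_lines(typed)
print(f"Entered {len(script)} lines of pre-execution bash script")
```

Passing an iterable rather than reading the terminal directly keeps the function easy to try with canned input. For interactive use, `sys.stdin` can be passed instead.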
    

Running using the ClearML Web UI

Edit the Task to set the instance type parameters and the budget configuration, and then enqueue the Task to run in ClearML Agent services mode.

  1. In the ClearML Web UI, go to Projects page > DevOps project > AWS Auto-Scaler Task.

  2. Set the AWS and Git credentials, parameters for idle AWS EC2 instances, and a worker prefix.

    • In the CONFIGURATIONS tab > HYPER PARAMETERS > Args > hover > EDIT.

      • cloud_credentials_key - AWS access key.

      • cloud_credentials_region - AWS region.

      • cloud_credentials_secret - AWS access secret.

      • cloud_provider - AWS.

      • default_docker_image - The default Docker image to use for the AWS EC2 instance.

      • git_pass - Git password.

      • git_user - Git username.

      • max_idle_time_min - The maximum time an AWS EC2 instance can be idle before the ClearML AWS autoscaler spins it down.

      • polling_interval_time_min - How often the ClearML AWS autoscaler checks for idle instances.

      • workers_prefix

  3. Configure the budget.

    • In CONFIGURATION OBJECTS > General, hover > EDIT. Edit the resource_configurations dictionary:

        resource_configurations {
            <resource-name> {
              instance_type = "<instance_type>"
              is_spot = <boolean>
              availability_zone = "<AWS-availability-zone>"
              ami_id = "<AMI-ID>"
              ebs_device_name = "<EBS-device-name>"
              ebs_volume_size = <EBS-size-in-GB>
              ebs_volume_type = "<EBS-vol-type>"
              key_name = "<key-name>"
              security_group_ids = ["<security-group-id>"]
              extra_configurations = {"<configuration-key>": "<configuration-value>"} 
            }
        }
        queues {
            <queue-name> = [["<resource-name>", <max-instances-of-resource-name>]]
        }
        extra_clearml_conf = """
        <ClearML-config>
        """
        extra_vm_bash_script = """
        <bash-script>
        """
      

      where,

      • <resource-name> - The name you assign to each resource (AWS EC2 instance type). Used in the budget.

      • key_name - Optional. The EC2 key pair to use.

      • security_group_ids - Optional. Security groups to add to the instance.

      • extra_configurations - Optional. Any extra configuration that the autoscaler settings do not explicitly cover. Make sure to use the EC2 request syntax. For example, use extra_configurations = {"SubnetId": "<subnet-id>"} to add a subnet to the resource.

      • queues - The ClearML AWS autoscaler will optimize scaling for experiments awaiting execution in these queues.

      • <queue-name> - A specific queue.

      • <max-instances-of-resource-name> - The maximum number of instances of the specified resource-name to spin up.

      • is_spot - If true, then use a spot instance. If false, then use an on-demand instance.

      • extra_clearml_conf - A ClearML configuration to use for executing experiments in ClearML Agent.

      • extra_vm_bash_script - A bash script to execute when creating an instance, before ClearML Agent executes.

  4. Set the Task to run in ClearML Agent services mode.

    1. In HYPER PARAMETERS > Args > hover > EDIT.

    2. Change the remote parameter to true.

  5. Click SAVE.

  6. In the experiments table, right-click the AWS Auto-Scaler Task > Enqueue > services queue > ENQUEUE.
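
Because the budget in the queues section refers to resources by the names defined in resource_configurations, a quick consistency check before saving can catch typos. The sketch below mirrors the two sections as plain Python dictionaries. The function name `check_budget`, the sample values, and the rule that a maximum must be a positive integer are assumptions of this example, not checks documented for the autoscaler.

```python
def check_budget(resource_configurations, queues):
    """Return a list of problems found in a budget; an empty list means none were found."""
    problems = []
    for queue_name, entries in queues.items():
        for resource_name, max_instances in entries:
            # Every resource named in a queue must be defined.
            if resource_name not in resource_configurations:
                problems.append(
                    f"queue '{queue_name}' references undefined resource '{resource_name}'"
                )
            # Assumption of this sketch: a budget entry needs a positive integer maximum.
            if not isinstance(max_instances, int) or max_instances < 1:
                problems.append(
                    f"queue '{queue_name}' has an invalid maximum for '{resource_name}': {max_instances!r}"
                )
    return problems


# Hypothetical values in the same shape as the configuration shown above.
resource_configurations = {
    "aws4gpu": {
        "instance_type": "g4dn.4xlarge",
        "is_spot": True,
        "availability_zone": "us-east-1b",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 100,
        "ebs_volume_type": "gp3",
    }
}
queues = {
    "aws_4gpu_machines": [["aws4gpu", 3]],
    "default": [["aws1gpu", 0]],  # undefined resource name and a zero maximum
}

for problem in check_budget(resource_configurations, queues):
    print(problem)
```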