Dataset

class clearml.Dataset
add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False)

Add a folder to the current dataset. Calculate each file's hash, compare it against the parent dataset, and mark files to be uploaded

Parameters
  • path – Path of the folder/file to add to the dataset

  • wildcard – Add only a specific set of files using wildcard matching; can be a single string or a list of wildcard patterns

  • local_base_folder – files will be located based on their relative path from local_base_folder

  • dataset_path – where in the dataset the folder/files should be located

  • recursive – If True, match wildcard files recursively

  • verbose – If True, print the added/modified files to the console

Returns

Number of files added
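
A minimal sketch of adding files to a new dataset version (the dataset name, project, and local folder are hypothetical):

from clearml import Dataset

# create a new dataset version and add every CSV file under ./data
ds = Dataset.create(dataset_name='sample_dataset', dataset_project='examples')
num_added = ds.add_files(path='./data', wildcard='*.csv', recursive=True)
print('files added:', num_added)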

classmethod create(dataset_name, dataset_project=None, parent_datasets=None, use_current_task=False)

Create a new dataset. Multiple dataset parents are supported; parents are merged in order, with each parent overriding overlapping files from the previous one

Parameters
  • dataset_name – Name of the new dataset

  • dataset_project – Project containing the dataset. If not specified, infer the project name from the parent datasets

  • parent_datasets – Expand a parent dataset by adding/removing files

  • use_current_task – If False (default), a new Dataset task is created. If True, the dataset is created on the current Task.

Returns

Newly created Dataset object
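
A sketch of creating a child version on top of an existing dataset (names are hypothetical):

from clearml import Dataset

parent = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
# the child starts with all of the parent's files and can add/remove on top of them
child = Dataset.create(
    dataset_name='sample_dataset',
    dataset_project='examples',
    parent_datasets=[parent.id],
)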

classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False)

Delete a dataset. An exception is raised if the dataset is used by other dataset versions; use force=True to forcefully delete the dataset

Parameters
  • dataset_id – Dataset id to delete

  • dataset_project – Project containing the dataset

  • dataset_name – Name of the dataset to delete

  • force – If True, delete even if other datasets depend on the specified dataset version
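
A sketch of deleting a dataset by ID (the ID is hypothetical):

from clearml import Dataset

# force=True removes the dataset even if other versions depend on it
Dataset.delete(dataset_id='aabbccdd11223344aabbccdd11223344', force=True)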

property file_entries_dict

Notice: this call returns an internal representation; do not modify!

Returns

Dict with relative file path as key and FileEntry as value

finalize(verbose=False, raise_on_error=True)

Finalize the dataset and publish the dataset Task. upload must be called first, to verify that there are no pending uploads. If files still need to be uploaded, an exception is raised (or False is returned)

Parameters
  • verbose – If True, print a verbose progress report

  • raise_on_error – If True, raise an exception if dataset finalization failed
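
A sketch of the upload-then-finalize flow (names and the local path are hypothetical):

from clearml import Dataset

ds = Dataset.create(dataset_name='sample_dataset', dataset_project='examples')
ds.add_files(path='./data')
ds.upload()    # must complete before finalize()
ds.finalize()  # raises if uploads are still pending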

classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, dataset_tags=None, only_completed=False, only_published=False)

Get a specific Dataset. If only dataset_project is given, return the last Dataset in the Dataset project

Parameters
  • dataset_id – Requested Dataset ID

  • dataset_project – Requested Dataset project name

  • dataset_name – Requested Dataset name

  • dataset_tags – Requested Dataset tags (list of tag strings)

  • only_completed – Return only if the requested dataset is completed or published

  • only_published – Return only if the requested dataset is published

Returns

Dataset object
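
A sketch of fetching the latest published version of a dataset (names are hypothetical):

from clearml import Dataset

ds = Dataset.get(
    dataset_project='examples',
    dataset_name='sample_dataset',
    only_published=True,
)
print(ds.id)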

get_default_storage()

Return the default storage location of the dataset

Returns

URL for the default storage location

get_dependency_graph()

Return the DAG of the dataset dependencies (all previous dataset versions and their parents)

Example:

{
    'current_dataset_id': ['parent_1_id', 'parent_2_id'],
    'parent_2_id': ['parent_1_id'],
    'parent_1_id': [],
}

Returns

Dict representing the genealogy DAG of the current dataset

get_local_copy(use_soft_links=None, raise_on_error=True)

Return a base folder with a read-only (immutable) local copy of the entire dataset

Downloads and copies / soft-links files from all the parent dataset versions

Parameters
  • use_soft_links – If True, use soft links (default: False on Windows, True on POSIX systems)

  • raise_on_error – If True, raise an exception if dataset merging failed on any file

Returns

A base folder for the entire dataset
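
A sketch of reading from the cached, read-only copy (names are hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
folder = ds.get_local_copy()  # do not modify files under this folder
print('dataset available at:', folder)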

get_logger()

Return a Logger object for the Dataset, allowing users to report statistics, metrics, and debug samples on the Dataset itself

Returns

Logger object
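
A sketch of reporting a statistic on the dataset task itself (the metric names and value are hypothetical):

from clearml import Dataset

ds = Dataset.create(dataset_name='sample_dataset', dataset_project='examples')
logger = ds.get_logger()
logger.report_scalar(title='stats', series='num_rows', value=10000, iteration=0)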

get_mutable_local_copy(target_folder, overwrite=False, raise_on_error=True)

Return a base folder with a writable (mutable) local copy of the entire dataset

Downloads and copies / soft-links files from all the parent dataset versions

Parameters
  • target_folder – Target folder for the writable copy

  • overwrite – If True, recursively delete the target folder before creating a copy. If False (default) and the target folder contains files, raise an exception (or return None)

  • raise_on_error – If True, raise an exception if dataset merging failed on any file

Returns

The target folder containing the entire dataset
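
A sketch of extracting a writable copy (the target folder and names are hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
# overwrite=True clears ./working_copy first instead of raising on existing files
folder = ds.get_mutable_local_copy(target_folder='./working_copy', overwrite=True)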

is_dirty()

Return True if the dataset has pending uploads (i.e. we cannot finalize it)

Returns

True means the dataset has pending uploads; call upload() to start the upload process.

is_final()

Return True if the dataset was finalized and cannot be changed any more.

Returns

True if the dataset is final
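
A sketch of using the two state checks to guard a workflow (the dataset ID is hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_id='aabbccdd11223344aabbccdd11223344')
if ds.is_dirty():
    ds.upload()  # flush pending uploads first
if not ds.is_final():
    ds.finalize()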

list_added_files(dataset_id=None)

Return a list of files added when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against; if None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)

classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True)

Query the list of datasets in the system

Parameters
  • dataset_project – Specify dataset project name

  • partial_name – Specify partial match to a dataset name

  • tags – Specify user tags

  • ids – List specific datasets by their IDs

  • only_completed – If False, also return datasets that are still in progress (uploading/being edited, etc.)

Returns

List of dictionaries with dataset information. Example:
[{'name': name, 'project': project_name, 'id': dataset_id, 'created': date_created}, ]
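
A sketch of listing the completed datasets in a project (the project name is hypothetical):

from clearml import Dataset

for d in Dataset.list_datasets(dataset_project='examples', only_completed=True):
    print(d['id'], d['name'], d['created'])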

list_files(dataset_path=None, recursive=True, dataset_id=None)

Return a list of files in the current dataset. If dataset_id is given, return a list of files that remained unchanged since the specified dataset version

Parameters
  • dataset_path – Only match files matching the dataset_path (including wildcards). Example: 'folder/sub/*.json'

  • recursive – If True (default), match dataset_path recursively

  • dataset_id – Filter list based on the dataset id containing the latest version of the file. Default: None, do not filter files based on parent dataset.

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)
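
A sketch of listing only the JSON files under a dataset sub-folder (names are hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
json_files = ds.list_files(dataset_path='folder/sub/*.json')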

list_modified_files(dataset_id=None)

Return a list of files modified when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against; if None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)

list_removed_files(dataset_id=None)

Return a list of files removed when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against; if None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)
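
A sketch of diffing the current version against its parent(s) with the three listing calls (names are hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
print('added:   ', ds.list_added_files())
print('modified:', ds.list_modified_files())
print('removed: ', ds.list_removed_files())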

publish(raise_on_error=True)

Publish the dataset. If the dataset is not finalized, raise an exception

Parameters

raise_on_error – If True, raise an exception if dataset publishing failed
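
A sketch of publishing a dataset (the dataset ID is hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_id='aabbccdd11223344aabbccdd11223344')
ds.publish()  # the dataset must already be finalized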

remove_files(dataset_path=None, recursive=True, verbose=False)

Remove files from the current dataset

Parameters
  • dataset_path – Remove files from the dataset. The path is always relative to the dataset (e.g. 'folder/file.bin')

  • recursive – If True, match all wildcard files recursively

  • verbose – If True, print the removed files to the console

Returns

Number of files removed
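
A sketch of removing inherited files from a child version (names and the parent ID are hypothetical):

from clearml import Dataset

ds = Dataset.create(
    dataset_name='sample_dataset',
    dataset_project='examples',
    parent_datasets=['aabbccdd11223344aabbccdd11223344'],
)
num_removed = ds.remove_files(dataset_path='*.tmp')  # drop temp files inherited from the parent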

classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)

Generate a new dataset from the squashed set of dataset versions. If a single version is given, it will squash to the root (i.e. create a single standalone version). If a set of versions is given, it will squash the versions' diff into a single version

Parameters
  • dataset_name – Target name for the newly generated squashed dataset

  • dataset_ids – List of dataset Ids (or objects) to squash. Notice order does matter. The versions are merged from first to last.

  • dataset_project_name_pairs – List of pairs (project_name, dataset_name) to squash. Notice order does matter. The versions are merged from first to last.

  • output_url – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

Returns

Newly created dataset object.
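
A sketch of squashing three incremental versions into one standalone dataset (the name and IDs are hypothetical):

from clearml import Dataset

squashed = Dataset.squash(
    dataset_name='sample_dataset_squashed',
    dataset_ids=['dataset_id_v1', 'dataset_id_v2', 'dataset_id_v3'],
)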

sync_folder(local_path, dataset_path=None, verbose=False)

Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path.

Parameters
  • local_path – Local folder to sync (assumes all files and recursive)

  • dataset_path – Target dataset path to sync with (default the root of the dataset)

  • verbose – If True, print the added/modified/removed files to the console

Returns

Number of files removed, number of files modified/added
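
A sketch of mirroring a local folder into a new version (names, the path, and the parent ID are hypothetical):

from clearml import Dataset

ds = Dataset.create(
    dataset_name='sample_dataset',
    dataset_project='examples',
    parent_datasets=['aabbccdd11223344aabbccdd11223344'],
)
# adds new files, marks changed files as modified, removes files missing from ./data
removed, modified_added = ds.sync_folder(local_path='./data', verbose=True)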

upload(show_progress=True, verbose=False, output_url=None, compression=None)

Start file uploading, the function returns when all files are uploaded.

Parameters
  • show_progress – If True, show an upload progress bar

  • verbose – If True, print a verbose progress report

  • output_url – Target storage for the compressed dataset (default: file server). Examples: s3://bucket/data, gs://bucket/data, azure://bucket/data, /mnt/share/data

  • compression – Compression algorithm for the zipped dataset file (default: ZIP_DEFLATED)
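
A sketch of uploading to object storage instead of the default file server (the bucket, names, and path are hypothetical):

from clearml import Dataset

ds = Dataset.create(dataset_name='sample_dataset', dataset_project='examples')
ds.add_files(path='./data')
ds.upload(output_url='s3://my-bucket/datasets', show_progress=True)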

verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)

Verify the current copy of the dataset against the stored hash

Parameters
  • local_copy_path – Specify a local path containing a copy of the dataset. If not provided, use the cached folder

  • skip_hash – If True, skip hash checks and verify file size only

  • verbose – If True, print errors while verifying the dataset file hashes

Returns

List of files with unmatched hashes
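
A sketch of verifying the cached copy of a dataset (names are hypothetical):

from clearml import Dataset

ds = Dataset.get(dataset_project='examples', dataset_name='sample_dataset')
ds.get_local_copy()  # make sure a cached copy exists
mismatched = ds.verify_dataset_hash(verbose=True)
if mismatched:
    print('files with unmatched hashes:', mismatched)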