Dataset

class clearml.Dataset
add_files(path, wildcard=None, local_base_folder=None, dataset_path=None, recursive=True, verbose=False)

Add a folder or file to the current dataset. Calculate each file's hash, compare it against the parent dataset, and mark new or changed files for upload.

Parameters
  • path – Path of a folder/file to add to the dataset

  • wildcard – Add only files matching the wildcard(s). Can be a single string or a list of wildcards

  • local_base_folder – files will be located based on their relative path from local_base_folder

  • dataset_path – where in the dataset the folder/files should be located

  • recursive – If True match all wildcard files recursively

  • verbose – If True, print the added/modified files to the console

Returns

number of files added
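
A minimal usage sketch (the dataset object ds, the local folder, wildcard, and dataset path below are hypothetical):

    # assumes `ds` is a writable Dataset (e.g. returned by Dataset.create)
    num_added = ds.add_files(
        path='data/images',     # hypothetical local folder
        wildcard='*.jpg',       # only add jpg files
        dataset_path='images',  # place the files under 'images' inside the dataset
    )
    print('files added:', num_added)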

classmethod create(dataset_name, dataset_project=None, parent_datasets=None)

Create a new dataset. Multiple parent datasets are supported; they are merged in order, where each parent can override overlapping files from the previous one.

Parameters
  • dataset_name – Name for the new dataset

  • dataset_project – Project containing the dataset. If not specified, infer the project name from the parent datasets

  • parent_datasets – Expand a parent dataset by adding/removing files

Returns

Newly created Dataset object
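
For example, a sketch of creating a new dataset version on top of an existing parent (the project and dataset names and the parent ID are placeholders):

    from clearml import Dataset

    ds = Dataset.create(
        dataset_name='raw-data-v2',
        dataset_project='data-registry',
        parent_datasets=['<parent_dataset_id>'],  # hypothetical parent version ID
    )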

classmethod delete(dataset_id=None, dataset_project=None, dataset_name=None, force=False)

Delete a dataset. Raise an exception if the dataset is used by other dataset versions; use force=True to forcefully delete the dataset.

Parameters
  • dataset_id – Dataset id to delete

  • dataset_project – Project containing the dataset

  • dataset_name – Name of the dataset to delete

  • force – If True delete even if other datasets depend on the specified dataset version
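
A short sketch of deleting a dataset by ID (the ID is a placeholder); force=True deletes it even if other dataset versions depend on it:

    from clearml import Dataset

    Dataset.delete(dataset_id='<dataset_id>', force=True)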

property file_entries_dict

Notice: this call returns an internal representation; do not modify!

Returns

Dict with relative file path as key and FileEntry as value
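
For example, iterating over the (read-only) entries; ds is assumed to be an existing Dataset object, and each FileEntry is assumed to expose fields such as size:

    for rel_path, entry in ds.file_entries_dict.items():
        print(rel_path, entry.size)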

finalize(verbose=False, raise_on_error=True)

Finalize the dataset (if upload was not called, it will be called automatically) and publish the dataset Task. If files still need to be uploaded, raise an exception (or return False).

Parameters
  • verbose – If True print verbose progress report

  • raise_on_error – If True raise exception if dataset finalizing failed
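
A minimal end-of-workflow sketch, assuming ds is a dataset that already has files added (the explicit upload() call is optional per the note above):

    ds.upload()        # make sure there are no pending uploads
    if ds.is_dirty():
        raise RuntimeError('upload did not complete')
    ds.finalize()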

classmethod get(dataset_id=None, dataset_project=None, dataset_name=None, only_completed=False, only_published=False)

Get a specific Dataset. If only dataset_project is given, return the last Dataset in the Dataset project

Parameters
  • dataset_id – Requested Dataset ID

  • dataset_project – Requested Dataset project name

  • dataset_name – Requested Dataset name

  • only_completed – Return only if the requested dataset is completed or published

  • only_published – Return only if the requested dataset is published

Returns

Dataset object
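
A sketch of fetching an existing dataset, either by project/name or directly by ID (all values below are placeholders):

    from clearml import Dataset

    # by project and name; returns the latest matching dataset
    ds = Dataset.get(
        dataset_project='data-registry',
        dataset_name='raw-data-v2',
        only_completed=True,
    )

    # or directly by ID
    ds = Dataset.get(dataset_id='<dataset_id>')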

get_default_storage()

Return the default storage location of the dataset

Returns

URL for the default storage location

get_dependency_graph()

Return the DAG of the dataset dependencies (all previous dataset versions and their parents).

Example:

    {
        'current_dataset_id': ['parent_1_id', 'parent_2_id'],
        'parent_2_id': ['parent_1_id'],
        'parent_1_id': [],
    }

Returns

Dict representing the genealogy DAG of the current dataset

get_local_copy(use_soft_links=None, raise_on_error=True)

Return a base folder with a read-only (immutable) local copy of the entire dataset. Files from all the parent dataset versions are downloaded and copied / soft-linked.

Parameters
  • use_soft_links – If True, use soft links. Default: False on Windows, True on POSIX systems

  • raise_on_error – If True raise exception if dataset merging failed on any file

Returns

A base folder for the entire dataset
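
For example (the project and dataset names are placeholders):

    from clearml import Dataset

    ds = Dataset.get(dataset_project='data-registry', dataset_name='raw-data-v2')
    local_folder = ds.get_local_copy()  # read-only cached copy
    print('dataset available at', local_folder)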

get_mutable_local_copy(target_folder, overwrite=False, raise_on_error=True)

Return a base folder with a writable (mutable) local copy of the entire dataset. Files from all the parent dataset versions are downloaded and copied / soft-linked.

Parameters
  • target_folder – Target folder for the writable copy

  • overwrite – If True, recursively delete the target folder before creating a copy. If False (default) and target folder contains files, raise exception or return None

  • raise_on_error – If True raise exception if dataset merging failed on any file

Returns

The target folder containing the entire dataset
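
For example, extracting a writable copy into a target folder (the path is a placeholder):

    target = ds.get_mutable_local_copy(
        target_folder='/tmp/my_dataset_copy',  # hypothetical target folder
        overwrite=True,                        # wipe the folder first if it already contains files
    )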

is_dirty()

Return True if the dataset has pending uploads (i.e. we cannot finalize it)

Returns

True means the dataset has pending uploads; call upload() to start the upload process.

is_final()

Return True if the dataset was finalized and cannot be changed any more.

Returns

True if the dataset is final

list_added_files(dataset_id=None)

Return a list of files added when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against. If None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)

classmethod list_datasets(dataset_project=None, partial_name=None, tags=None, ids=None, only_completed=True)

Query the list of datasets in the system

Parameters
  • dataset_project – Specify dataset project name

  • partial_name – Specify partial match to a dataset name

  • tags – Specify user tags

  • ids – List only datasets whose IDs appear in the given list

  • only_completed – If False, also return datasets that are still in progress (uploading/being edited, etc.)

Returns

List of dictionaries with dataset information. Example: [{'name': name, 'project': project name, 'id': dataset_id, 'created': date_created},]
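
A sketch of querying the completed datasets in a project (the project name is a placeholder):

    from clearml import Dataset

    datasets = Dataset.list_datasets(dataset_project='data-registry', only_completed=True)
    for d in datasets:
        print(d['name'], d['id'], d['created'])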

list_files(dataset_path=None, recursive=True, dataset_id=None)

Return a list of files in the current dataset. If dataset_id is given, return a list of files that remained unchanged since the specified dataset version

Parameters
  • dataset_path – Only match files matching the dataset_path (including wildcards). Example: folder/sub/*.json

  • recursive – If True (default), match dataset_path recursively

  • dataset_id – Filter list based on the dataset id containing the latest version of the file. Default: None, do not filter files based on parent dataset.

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)
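
For example, listing only the JSON files under a sub-folder of the dataset (the path pattern is hypothetical):

    json_files = ds.list_files(dataset_path='annotations/*.json')
    print(len(json_files), 'matching files')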

list_modified_files(dataset_id=None)

Return a list of files modified when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against. If None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)

list_removed_files(dataset_id=None)

Return a list of files removed when comparing to a specific dataset version

Parameters

dataset_id – Dataset ID (str) to compare against. If None, compare against the parent datasets

Returns

List of files with relative path (files might not be available locally until get_local_copy() is called)
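
The three list_*_files() calls can be used together to summarize the diff against the parent version(s); ds is assumed to be an existing Dataset object:

    print('added   :', ds.list_added_files())
    print('modified:', ds.list_modified_files())
    print('removed :', ds.list_removed_files())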

remove_files(dataset_path=None, recursive=True, verbose=False)

Remove files from the current dataset.

Parameters
  • dataset_path – Files to remove from the dataset. The path is always relative to the dataset (e.g. 'folder/file.bin')

  • recursive – If True match all wildcard files recursively

  • verbose – If True, print the removed files to the console

Returns

Number of files removed
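
For example, removing a single file by its dataset-relative path (the path is a placeholder):

    num_removed = ds.remove_files(dataset_path='images/corrupt_001.jpg')
    print('files removed:', num_removed)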

classmethod squash(dataset_name, dataset_ids=None, dataset_project_name_pairs=None, output_url=None)

Generate a new dataset from a squashed set of dataset versions. If a single version is given, it will be squashed to the root (i.e. create a single standalone version). If a set of versions is given, their diffs will be squashed into a single version.

Parameters
  • dataset_name – Target name for the newly generated squashed dataset

  • dataset_ids – List of dataset Ids (or objects) to squash. Notice order does matter. The versions are merged from first to last.

  • dataset_project_name_pairs – List of pairs (project_name, dataset_name) to squash. Notice order does matter. The versions are merged from first to last.

  • output_url – Target storage for the compressed dataset (default: file server) Examples: s3://bucket/data, gs://bucket/data , azure://bucket/data , /mnt/share/data

Returns

Newly created dataset object.
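
A sketch of squashing several versions into one standalone dataset (the name and version IDs are placeholders):

    from clearml import Dataset

    squashed = Dataset.squash(
        dataset_name='raw-data-squashed',
        dataset_ids=['<version_1_id>', '<version_2_id>'],  # merged from first to last
    )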

sync_folder(local_path, dataset_path=None, verbose=False)

Synchronize the dataset with a local folder. The dataset is synchronized from the relative_base_folder (default: dataset root) and deeper with the specified local path.

Parameters
  • local_path – Local folder to sync (assumes all files and recursive)

  • dataset_path – Target dataset path to sync with (default the root of the dataset)

  • verbose – If True, print the added/modified/removed files to the console

Returns

number of files removed, number of files modified/added
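
For example, mirroring a local folder into the dataset root (the local path is a placeholder); note the two return values:

    removed, added_or_modified = ds.sync_folder(local_path='data/images', verbose=True)
    print(removed, 'removed,', added_or_modified, 'added/modified')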

upload(show_progress=True, verbose=False, output_url=None, compression=None)

Start file uploading, the function returns when all files are uploaded.

Parameters
  • show_progress – If True show upload progress bar

  • verbose – If True print verbose progress report

  • output_url – Target storage for the compressed dataset (default: file server) Examples: s3://bucket/data, gs://bucket/data , azure://bucket/data , /mnt/share/data

  • compression – Compression algorithm for the Zipped dataset file (default: ZIP_DEFLATED)
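
For example, uploading the compressed dataset archive to object storage instead of the default file server (the bucket URL is a placeholder):

    ds.upload(show_progress=True, output_url='s3://my-bucket/datasets')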

verify_dataset_hash(local_copy_path=None, skip_hash=False, verbose=False)

Verify the current copy of the dataset against the stored hash

Parameters
  • local_copy_path – Specify a local path containing a copy of the dataset. If not provided, use the cached folder

  • skip_hash – If True, skip hash checks and verify file size only

  • verbose – If True, print errors while verifying the dataset file hashes

Returns

List of files with unmatched hashes
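
For example, verifying the cached local copy and reporting mismatches; ds is assumed to be an existing Dataset object:

    mismatched = ds.verify_dataset_hash(verbose=True)
    if mismatched:
        print('files failing verification:', mismatched)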