Tabular Data Preprocessing - Jupyter Notebook

The download_and_preprocessing.ipynb example demonstrates how Trains stores preprocessed tabular data as artifacts and explicitly reports tabular data in the Trains Web (UI). When the notebook runs, it creates an experiment named tabular preprocessing, which is associated with the Table Example project.

The preprocessed data is consumed by a second notebook, train_tabular_predictor.ipynb, which trains a network on it.

Artifacts

The example code preprocesses the downloaded data using Pandas DataFrames, and stores it as three artifacts:

  • Categories per column - Number of unique values per column of data.
  • Outcome dictionary - Class label enumeration for training.
  • Processed data - A dictionary containing the paths of the training and validation data.

Each artifact is uploaded by calling the Task.upload_artifact method. Artifacts appear in the ARTIFACTS tab.
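
The following is a minimal sketch of how the first two artifacts could be derived and uploaded; it is not the notebook's own code. The helper names summarize_for_artifacts and upload_summaries, and the column names in the usage note, are hypothetical. The task argument is assumed to be the object returned by Task.init(), and the artifacts are passed as plain dictionaries.

```python
import pandas as pd


def summarize_for_artifacts(df: pd.DataFrame, outcome_column: str):
    """Return (categories_per_column, outcome_dict) for a DataFrame."""
    # Number of unique values in each column, as plain Python ints
    categories_per_column = {
        column: int(count) for column, count in df.nunique().items()
    }
    # Enumerate the distinct outcome labels in sorted order: label -> class index
    labels = sorted(df[outcome_column].dropna().unique())
    outcome_dict = {label: index for index, label in enumerate(labels)}
    return categories_per_column, outcome_dict


def upload_summaries(task, categories_per_column, outcome_dict):
    """Upload both summaries as artifacts of the given Task."""
    # Assumption: `task` is a Trains Task, e.g. the return value of Task.init()
    task.upload_artifact(name='Categories per column',
                         artifact_object=categories_per_column)
    task.upload_artifact(name='Outcome dictionary',
                         artifact_object=outcome_dict)
```

With a DataFrame such as train_set, the helpers might be called as summarize_for_artifacts(train_set, 'OutcomeType') followed by upload_summaries(task, ...), where 'OutcomeType' is an assumed name for the label column.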

Plots (tables)

The example code explicitly reports the data in Pandas DataFrames by calling the Logger.report_table method.

For example, the raw data is read into a Pandas DataFrame named train_set, and the head of the DataFrame is reported.

# Read the raw training data and report its first rows as a table
train_set = pd.read_csv(Path(path_to_ShelterAnimal) / 'train.csv')
Logger.current_logger().report_table(title='Trainset - raw', series='pandas DataFrame', iteration=0, table_plot=train_set.head())

The tables appear in RESULTS > PLOTS.
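
When several preprocessing stages are reported, the same call can be wrapped in a small helper. The sketch below is illustrative only: report_dataframe_head is a hypothetical name, and the logger argument is assumed to be a Trains Logger such as the one returned by Logger.current_logger().

```python
import pandas as pd


def report_dataframe_head(logger, df: pd.DataFrame, title: str,
                          rows: int = 5, iteration: int = 0):
    """Report the first `rows` rows of `df` as a table plot."""
    # Assumption: `logger` exposes report_table, as a Trains Logger does
    logger.report_table(
        title=title,
        series='pandas DataFrame',
        iteration=iteration,
        table_plot=df.head(rows),
    )
```

For example, report_dataframe_head(Logger.current_logger(), train_set, 'Trainset - raw') would report the same table as the call shown above.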

Hyperparameters

A parameter dictionary is logged by connecting it to the Task using a call to the Task.connect method.

logger = task.get_logger()
# Connect the parameter dictionary to the Task so that its values are logged
configuration_dict = {'test_size': 0.1, 'split_random_state': 0}
configuration_dict = task.connect(configuration_dict)

Parameter dictionaries appear in the General subsection of the experiment's hyperparameters.
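
To illustrate how connected parameters might be consumed, here is a minimal sketch that assumes test_size and split_random_state control a train/validation split. Their names suggest this, but the text does not state it. The helper split_train_validation and the toy DataFrame are hypothetical, and DataFrame.sample is used here as a stand-in for whatever splitting routine the notebook uses.

```python
import pandas as pd


def split_train_validation(df: pd.DataFrame, config: dict):
    """Split `df` into (train, validation) according to `config`."""
    # Hold out a `test_size` fraction of the rows, reproducibly
    validation = df.sample(frac=config['test_size'],
                           random_state=config['split_random_state'])
    # Every row not sampled for validation is used for training
    train = df.drop(validation.index)
    return train, validation


# In the notebook, this dictionary would first pass through task.connect(...)
configuration_dict = {'test_size': 0.1, 'split_random_state': 0}

# Toy data standing in for the preprocessed table (hypothetical columns)
example = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})
train, validation = split_train_validation(example, configuration_dict)
print(len(train), len(validation))
```

Because the split reads its settings from the dictionary, the values recorded for the experiment are the ones that were actually applied.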

Log

Output to the console appears in RESULTS > LOG.
