October 5, 2020

7 Rules for Bulletproof, Reproducible Machine Learning R&D

Photo by Mina FC on Unsplash

So, if you’re a nose-to-the-keyboard developer, there’s a good chance that this analogy is outside your comfort zone … bear with me.

Imagine two Olympics-level figure skaters working together on the ice, day in and day out, to develop and perfect a medal-winning performance. Each has his or her role, and they work in sync to merge their actions and fine-tune the results. Each tiny change affects the other’s movements — hopefully, to improve their dance, but often to ruin it. Over time, they develop an ongoing communication channel to make sure that each knows what the other is doing for a consistent, always-improved result.

Machine learning represents a curiously similar dynamic, in which your models and code join the training data to work in tandem and produce the intended results. The path to optimization is — like that of the ice skaters — driven by small adjustments that need to be systematically tried and retried (and retried and retried), carefully and intentionally. But every change, every adjustment, and every new angle of attack also opens the door to error, confusion, and inconsistent inferences. In short, a lack of disciplined structure and planning leads to a deficit in reproducibility, quickly curtailing ML development.

Even if the metaphor above is too cliché for you, dealing with ML in any business use-case should have you nodding. Below, then, are my 7 guidelines for avoiding these pitfalls and optimizing reproducibility:

0. Your Code is Not “Self-Documenting”

Unless you are using extremely high-level solutions, producing ML models necessitates writing code. The basics for professional codebases are proper structure and solid documentation.

Despite this, as anyone who has glanced at open-sourced work on GitHub can attest to, we are constantly failing to do this properly. This “zeroth guideline” is not about simple adherence to PEP8 (please do!) or proper function naming (do you mind?).

One must recognize that since the processes we develop rely on heaps of configurations, their documentation and proper house-keeping should be enforced.
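One lightweight way to keep configuration documented is to attach the documentation to the configuration itself. The sketch below is only an illustration: `TrainingConfig`, its fields, and the helper functions are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training configuration in which every field carries its own doc."""

    learning_rate: float = field(default=1e-3, metadata={"doc": "Optimizer step size."})
    batch_size: int = field(default=32, metadata={"doc": "Samples per gradient update."})
    seed: int = field(default=42, metadata={"doc": "Random seed, fixed for reproducibility."})


def describe(config) -> str:
    """Render each field as 'name = value  # doc' so the config can be logged as-is."""
    lines = []
    for f in fields(config):
        doc = f.metadata.get("doc", "UNDOCUMENTED")
        lines.append(f"{f.name} = {getattr(config, f.name)!r}  # {doc}")
    return "\n".join(lines)


def undocumented_fields(config) -> list:
    """Return the names of fields that have no 'doc' entry in their metadata."""
    return [f.name for f in fields(config) if not f.metadata.get("doc")]
```

A check such as `undocumented_fields(TrainingConfig()) == []` can run in continuous integration, so a new parameter cannot be added without a line of explanation.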

There is value, however, in a balance between loss of productivity due to “premature-engineering” and usability of research code further down the line.

1. Keep Track and Carry On

The first house-rule of any successful data science team is effective versioning.

Machine Learning relies on versioning more than other development disciplines because we leverage it in the twin components of the process: code and data. The fluid nature of ML development requires frequent changes both to the models being refined and optimized by the data scientist and to the underlying data used to train these models.

Both the data scientist and the software itself must learn from each iteration, and then tweak the model to accommodate the anticipated data set. As such, each iteration must be documented and stored, ideally using an automatic mechanism to reduce manual logging overhead.

This critical aspect of ML development does not stop there; when working as a team, developers share their models and data to save time, learn from each other, and avoid repeated, redundant efforts.

Without efficient, ongoing versioning, the dynamic process of reproducibility is complicated at best, impossible at worst.
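As a minimal sketch of versioning both components together, the snippet below records the current Git commit next to a content hash of each data file. It assumes the code lives in a Git repository and the data fits in ordinary files; the function names are illustrative, and dedicated data-versioning tools do considerably more.

```python
import hashlib
import subprocess


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None when Git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def snapshot_versions(data_paths):
    """Bundle the code version and the data versions into one record for an experiment."""
    return {
        "code_commit": current_git_commit(),
        "data_sha256": {str(path): file_sha256(path) for path in data_paths},
    }
```

Storing the returned dictionary with every training run makes it possible to tell later whether two results differ because the code changed, the data changed, or both.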

2. Apples to Apples

Reproducibility, a core factor in ML model development, hinges on comparability. Without the ability to compare the results of a training session for Model A with those for Model B, there is no way to know what changes led to improvement or degradation of results.

Research iterations require, by definition, changes between rounds. Nevertheless, there is a practical limit to the architectural and logical differences between models when experimenting. If the models do not share a similar structure and the same core logic, differing only in nuanced ways, there’s no way to know which factor you need to focus on for improvement in an experiment.

Your pipeline design should expose your changes and actually facilitate connecting them with changes in performance.
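To make "exposing your changes" concrete, here is a small sketch that diffs the configurations of two experiments and flags pairs that differ in too many places to be compared meaningfully. It assumes configurations are flat dictionaries; the function names and the `max_changes` threshold are illustrative choices, not a standard.

```python
MISSING = "<missing>"  # placeholder for a key that exists in only one of the configs


def diff_configs(config_a, config_b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(config_a) | set(config_b)):
        value_a = config_a.get(key, MISSING)
        value_b = config_b.get(key, MISSING)
        if value_a != value_b:
            changes[key] = (value_a, value_b)
    return changes


def is_comparable(config_a, config_b, max_changes=1):
    """Treat two runs as comparable only if they differ in at most max_changes settings."""
    return len(diff_configs(config_a, config_b)) <= max_changes
```

For example, `diff_configs({"lr": 0.001, "layers": 4}, {"lr": 0.01, "layers": 4})` reports only the learning rate, so a change in the metric can be attributed to that one setting.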

3. You Can’t Spell Production Without ‘U’ and ‘I’

The previous guideline also extends to the interface between R&D and production. Even the most well-designed and carefully reviewed models can fail or return unexpected results if the training and serving (production) environments don’t match.

In practice, though, models face stringent performance requirements, along with the necessity to integrate with the rest of the business logic. These often result in a final implementation which is completely different from the research code.

When that happens, failures and even common, basic problems are hard for the model’s creator to track down. Unless there is sufficient justification to diverge, the production environment should faithfully mimic the one that nurtured the model.

But alas, since this rarely happens, data science teams should collaborate with DevOps as much as possible on the handoff. Start this collaboration early on in the project’s lifecycle.
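One concrete artifact for that handoff is a snapshot of the training environment that the serving side can be checked against. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the function names and the shape of the snapshot are assumptions for illustration, and container images or lock files are the more thorough option.

```python
import sys
from importlib import metadata


def capture_environment():
    """Record the interpreter version and the version of every installed distribution."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name.lower()] = dist.version
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "packages": packages,
    }


def environment_mismatches(training_env, serving_env):
    """Return {name: (training_version, serving_version)} for everything that differs."""
    mismatches = {}
    if training_env["python"] != serving_env["python"]:
        mismatches["python"] = (training_env["python"], serving_env["python"])
    train_pkgs = training_env["packages"]
    serve_pkgs = serving_env["packages"]
    for name in sorted(set(train_pkgs) | set(serve_pkgs)):
        if train_pkgs.get(name) != serve_pkgs.get(name):
            mismatches[name] = (train_pkgs.get(name), serve_pkgs.get(name))
    return mismatches
```

The snapshot from `capture_environment()` can be saved next to the trained model and compared at service start-up, so a mismatch is reported before it shows up as a puzzling prediction.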

4. Plan for Inconsistency in Data Sources

Research and Development are done on subsets of real-world data. The more your use-case relies on temporal data, the more you will have to deal with common problems of model mismatch.

Regardless of this, beware of the existence of ‘pipeline debt’: the mechanisms used to extract and format data can change over time. Even a small feature dependency in this process can alter the data enough to change the results, breaking the logic… and with it, the resulting inferences you can make based on a model’s success or failure.

As such, reproducibility really does start at the source. A seasoned ML data scientist never takes for granted that his or her data is “pure” and consistent, and checks (often!) that no critical dependencies have changed in the data’s acquisition and formatting.

While a sudden failure in a model’s performance is a lucky indicator, often a change in data simply triggers a confusing, distracting, and avoidable anomaly in results that must be investigated.

Clearly indicating all dependencies in your code will help prevent this; it is also recommended to implement pipeline testing.
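A pipeline test can be as simple as checking incoming records against the schema the model was trained on. The following is a minimal sketch assuming rows arrive as dictionaries; `EXPECTED_SCHEMA` and its columns are made-up examples, and dedicated data-validation libraries cover far more cases.

```python
# Hypothetical schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": int, "age": int, "country": str}


def validate_rows(rows, schema):
    """Return a list of human-readable problems; an empty list means the rows conform."""
    problems = []
    for index, row in enumerate(rows):
        # Columns the model expects but the source no longer provides.
        for column in sorted(set(schema) - set(row)):
            problems.append(f"row {index}: missing column '{column}'")
        # Columns that appeared upstream without anyone updating the schema.
        for column in sorted(set(row) - set(schema)):
            problems.append(f"row {index}: unexpected column '{column}'")
        # Columns that are present but changed type (for example, int became str).
        for column, expected_type in schema.items():
            if column in row and not isinstance(row[column], expected_type):
                actual = type(row[column]).__name__
                problems.append(
                    f"row {index}: column '{column}' has type {actual}, "
                    f"expected {expected_type.__name__}"
                )
    return problems
```

Running such a check on a sample of every new data extract turns a silent upstream change into an explicit, early failure.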

5. Testing is not just “metrics”

Our zeroth guideline alluded to software development best practices, and the previous one ended with pipeline testing: the testing of sub-components, not evaluation in the sense of Area-Under-the-Curve. This kind of testing is another fundamental aspect of software R&D that is usually eschewed in ML, but really should not be.

Loosely explained, at each stage of ML development, you must be able to make sure that your changes to the codebase or data have not triggered a problem that will need solving further down the road.

It is important to remember that this testing mindset applies to both aspects of your work. First is the ongoing review of individual, stand-alone components (units) to ensure that no change to a specific building block will have unexpected results. Code used for preprocessing is an excellent example of this, as a problem there creates a waterfall effect.

Just as importantly, (expensive!) tests of the entire codebase in a real-world simulation, with the model, its data, and its configuration, help prevent a combination of factors from creating suboptimal results.
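For the unit level, a preprocessing test might look like the sketch below. The `standardize` function is a hypothetical stand-in for whatever preprocessing step your pipeline uses; the point is that its edge cases (constant input, empty input) are pinned down by tests rather than discovered during training.

```python
import math
import unittest


def standardize(values):
    """Hypothetical preprocessing step: scale values to zero mean and unit variance."""
    if not values:
        raise ValueError("cannot standardize an empty sequence")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        # A constant feature carries no information; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


class StandardizeTest(unittest.TestCase):
    def test_zero_mean_unit_variance(self):
        result = standardize([1.0, 2.0, 3.0])
        mean = sum(result) / len(result)
        variance = sum((v - mean) ** 2 for v in result) / len(result)
        self.assertAlmostEqual(mean, 0.0)
        self.assertAlmostEqual(variance, 1.0)

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(standardize([5.0, 5.0, 5.0]), [0.0, 0.0, 0.0])

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            standardize([])
```

Saved in a test module, these run with `python -m unittest` and take milliseconds, which makes them cheap enough to run on every commit, unlike the full-pipeline tests.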

6. Track, Track again

The retraining and resulting improvement of a model rely heavily on tracking, recording, and comparing results. The pitfalls of manual logging are numerous: it leads to human error, localized (and therefore inaccessible) storage of results, variation in the formatting or framing of data, and, worst of all, failure to log results completely, or at all.

Use an automated pipeline to provide ongoing collection of results in the background, with no researcher or developer effort.

This meta-data is standardized for consistency, and can often be compared automatically using a graphical charting tool to quickly make clear how one set of results differed from another.

Finally, these results are centralized, so that, together with their standardized format, they can be leveraged by others on your team.
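The idea of automatic, standardized, centralized tracking can be sketched in a few lines. The decorator below appends one JSON record per training call to a shared log file; `tracked`, `load_runs`, and the record fields are hypothetical names, and it assumes parameters and metrics are JSON-serializable. A real experiment tracker adds storage, a UI, and much more.

```python
import functools
import json
import time


def tracked(log_path):
    """Decorator factory: log parameters and metrics of every call to log_path (JSON Lines)."""

    def decorator(train_fn):
        @functools.wraps(train_fn)
        def wrapper(**params):
            started_at = time.time()
            metrics = train_fn(**params)
            record = {
                "experiment": train_fn.__name__,
                "params": params,
                "metrics": metrics,
                "started_at": started_at,
                "duration_s": time.time() - started_at,
            }
            # Append-only log: one standardized record per run, no manual step.
            with open(log_path, "a", encoding="utf-8") as handle:
                handle.write(json.dumps(record, sort_keys=True) + "\n")
            return metrics

        return wrapper

    return decorator


def load_runs(log_path):
    """Read all recorded runs back as a list of dictionaries for comparison."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Decorating a training function with `@tracked("runs.jsonl")` and calling it with keyword arguments is then enough for every run to be recorded in the same format, and pointing `log_path` at shared storage is what makes the results available to the rest of the team.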

7. Using a Platform to Really Tie It All Together

Photo by Marvin Meyer on Unsplash

So we’ve covered the many ways versioning and tracking are important (in addition to testing). The infrastructure components needed to follow the previous guidelines are nowadays readily available as open-source solutions.

Still, assembling individual components into a platform from which to conduct collaborative research is not an easy task, and depends also on your team’s work ethic.

Opting for an off-the-shelf experimentation platform that automates most of the work needed to keep these guidelines is, in practice, a good choice. You already have incredible challenges pushing your creativity and technical skills to build something unique, powerful, and worth the money you are being paid.

There is no reason to lose focus, sacrifice quality, and reinvent the wheel when you can deploy a solution loaded with best practices and streamlined processes.


These were my 7 (actually 8) golden guidelines for Machine Learning R&D. The last one is actually the conclusion: Use a platform that keeps the guidelines for you.

There are, of course, more aspects to consider when actually deploying models, especially for monitoring and logging.

Every project and every developer is unique, so I would love to hear what matters most in your experience… leave a comment!
