Republished here with PyTorch & author’s permission. Original post here.
Authored by: Izik Golan — Trigo Deep Learning Researchers
Trigo is a provider of AI & computer vision based checkout-free systems for the retail market, enabling frictionless checkout and a range of other in-store operational and marketing solutions such as predictive inventory management, security and fraud prevention, pricing optimization and event-driven marketing.
The system is based on ceiling-mounted off-the-shelf cameras and sensors, powered by proprietary deep learning algorithms, built to map and analyze the location and movement of every object throughout the store. When shoppers pick an item from the shelf, the system automatically detects the event and adds it to a virtual shopping list. Once the shoppers are done, they can simply walk out with no need to go through conventional checkout. Upon leaving the store, customers are charged automatically and the receipt is sent to their mobile device.
The task we had to solve
To build checkout-free systems we employ multiple AI & PyTorch models and techniques, with application logic layered on top that combines all of their inferences into one coherent decision. This is very similar to an autonomous vehicle, which has separate systems for lane detection, object avoidance, GPS location, path finding, etc., with a unified logic layer on top that combines all the different sensors and models into a single decision for the vehicle to take at any given moment.
To deliver this, we have set up a very complex system to develop and deploy models such as object detection, tracking, pose estimation, etc. This is done by several teams working in full collaboration, each team on a different aspect of the system.
Workflow & infrastructure goals
Having strong data science and engineering teams, we had two focused goals for our infrastructure and workflow design. These goals were critical to us as they represented pain points often encountered by companies attempting to build deployable AI solutions:
- Quick data-science to engineering team turn-around. It was important for us to expedite the research-to-production cycle so that we could iterate and innovate faster. As such, it was critical to address the fault line between the data science team building the core algorithms and DL models and the engineering team tasked with building the application logic on top.
- Managing AI / DL lifecycle just like in software engineering. Although AI development and deployment is a very different process from traditional software development, as engineers we knew that we had to bring in some core concepts from software management to enable efficient AI / DL development and deployment. These include applying the same best practices of versioning, collaboration, DevOps and CI/CD to the AI / DL pipeline.
To accomplish these goals, the most important tasks were choosing the right tools for the job, and creating “special sauce” for our unique needs around them.
The first choice we had to make was which machine learning framework to use. The choice fell on PyTorch. We made this initial choice for the following main reasons:
- It was simple and quick to debug.
- It was very easy to build complex and flexible networks from code.
- By leveraging PyTorch's built-in flexibility we were able to quickly iterate over new ideas and test their performance. Later, to our delight, we discovered that by using PyTorch-Ignite we were also able to reduce our boilerplate code, which in turn allowed us to further accelerate this process.
We weren’t sure at first whether PyTorch would hold up as a production-ready engine, but our patience paid off when PyTorch 1.0 launched. It became clear that even with the additional flexibility in design and debugging capabilities, there was no hit in terms of the raw GPU performance of the framework.
Allegro AI Trains
For infrastructure tools, we chose Allegro AI Trains. And wow, did this pay dividends for us!
We tested all the possible open/free/paid offerings we could find for AI / DL management before choosing Trains. Trains was the last experiment manager we tested, just after its beta version was published on GitHub in the midst of CVPR’19.
By that time we already had quite a bit of experience with other experiment managers and knew our requirements. Nothing is perfect, so it was no surprise that the first version of Trains was far from perfect. However, it did get two things very right:
- Integrating it with our code was zero hassle (their automagic promise actually delivers).
- If it failed to log/do something, it did not crash our experiments.
At first we saw the logging of code and Python packages as a nice tool to better understand performance differences between model training sessions, but we soon learned the true value of using Trains. The major leap in the way we manage the development & deployment lifecycle came when we incorporated the Trains Agent into our workflow. The Trains Agent is basically a daemon that spins up a container for you and runs your code. The real trick is that YOU never have to build that container!
This was huge for our research team. Before Trains they would constantly harass the DevOps team whenever they needed a package updated in their container, the codebase changed, a driver mismatch appeared, etc. With Trains and the Trains Agent that was all gone!
Our development process
Implementing our core goals for workflow processes, and leveraging the best that PyTorch and Trains have to offer, here is an overview of the pipeline we built and found to be the most efficient way to develop and deploy into production:
Every researcher has their own dedicated machine, where they have full root privileges and can install whatever package / library they like. In fact, each of our data scientists has a separate Python virtual environment for every project they are working on.
Each researcher runs and debugs their code on their own machine, usually with a combination of PyTorch & TensorBoard running locally, including the Trains 2-line init calls at the beginning of their code. Internal debugging is usually done with TensorBoard and all of its nifty features.
Once the code executes correctly, we move to the Trains web UI, where we can clone the experiment (execution/run) with just two clicks, change a few parameters and schedule it for execution on one of the many GPUs on-prem / on our cloud machines. The best part is that we never have to package our code base. Trains Agent does that automatically. Additionally, it captures any uncommitted git changes and deploys them. Since PyTorch maintains a full set of torch builds for specific CUDA-capable devices, the Trains Agent matches the package to the HW it is running on at launch time, so we always get the PyTorch build best matched to the specific HW we are using (and yes, we have many different GPU models here).
The most critical item actually is the last link in the chain: How does one “publish” their model? Here is the problem in a nutshell: When you have a complex system relying on the performance of several modules, you cannot have one module merged into the master tree without first verifying that the entire system performance was not degraded by the change.
This means CI is a must if we want to quickly roll out updates and improvements. So we designed the following repository structure:
- We have one main repository holding the application logic itself. This repository uses the different models we train individually and combines them into one decision making engine.
- Then we have git submodules, each one for different trained models (for example object detection, pose estimation etc.).
For each repository (let’s say object detection), we’ve configured our CI to launch a test experiment every time a pull request is created. This test runs the model against a blind dataset (blind meaning you cannot access the dataset, so we know no one can accidentally train on it). If the performance (read: accuracy) result is lower than that of the current master, the test fails.
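The pass/fail rule described above can be sketched in a few lines of Python. This is only an illustration and not our actual CI code: the function names `quality_gate` and `ci_exit_code` are hypothetical, and the two accuracy numbers are assumed to have been measured elsewhere (by the candidate's test experiment and by the current master on the same blind dataset).

```python
def quality_gate(candidate_accuracy: float, master_accuracy: float) -> bool:
    """Return True when the candidate is not worse than the current master.

    Mirrors the rule in the text: the test fails only when the candidate's
    result is lower than master's.
    """
    return candidate_accuracy >= master_accuracy


def ci_exit_code(candidate_accuracy: float, master_accuracy: float) -> int:
    """Translate the gate into a process exit code a CI job could return
    (0 = pass, 1 = fail). Hypothetical helper for illustration."""
    return 0 if quality_gate(candidate_accuracy, master_accuracy) else 1
```

A CI job could call `sys.exit(ci_exit_code(candidate, master))` as its last step so that a lower score blocks the pull request.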
This process ensures that we cannot merge a codebase with a model that degrades our performance. Of course the model itself is not stored in the git repo, but only a link to the model file. In fact we don’t even do that. We just store the ID of the experiment that generated the model. The reason we do that is twofold:
- We want to quickly (i.e. from reading the code) understand by whom / where the model was created.
- We can leverage Trains to automatically upload our trained models to our servers.
This really is an end-to-end solution for us, as we don’t need to worry about where the actual artifact sits.
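To make the idea of committing an experiment ID instead of a model file concrete, here is a minimal sketch. The JSON layout, the `ModelReference` class and its field names are assumptions made for illustration; they are not a Trains file format, and resolving the ID to the actual weights (which Trains handles for us) is deliberately not shown.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelReference:
    """What gets committed to git instead of the weights (hypothetical layout)."""
    name: str            # e.g. "object_detection"
    experiment_id: str   # ID of the experiment that produced the model


def dump_reference(ref: ModelReference) -> str:
    """Serialize the reference to a small, diff-friendly JSON document."""
    return json.dumps(asdict(ref), indent=2, sort_keys=True)


def load_reference(text: str) -> ModelReference:
    """Parse a reference previously written by dump_reference."""
    data = json.loads(text)
    return ModelReference(name=data["name"], experiment_id=data["experiment_id"])
```

Because the file holds only an ID, a pull request that changes the model shows up as a one-line diff, and anyone reading the code can look the experiment up to see who created the model and where.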
Once the pull request is merged into the master branch of the specific project, we can commit the submodule into the application repository. This, in turn, triggers a system wide QC process that takes the entire system, runs it against a blind dataset (in our case a set of videos the system can never access) and makes sure that the overall performance is not degraded. With the successful completion of this process, the current state of the application is updated in its master branch.
Advantages of this setup
This process might sound like overkill, but once it is set up it is actually quite easy to manage.
For example, we very quickly understood that we needed to make sure each model is trained on all the different objects equally. Otherwise the models tended to underperform on rare objects. Let’s say our object-detection pull request failed on the quality-control pass (this is what we call the model inference test phase). The test itself created a new experiment in Trains, and the CI log contains a direct link to the experiment. Now we can view the actual performance of the model. We also added debug images, so that every time the model fails to detect an object we draw the image together with the expected / detected results. This debug information is essential for us to understand how the model performs in real life (read: on the blind dataset) and allows us to better understand fail points and eliminate them.
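The balancing requirement mentioned at the start of this example can be approached in several ways. One common option, shown here as a sketch and not necessarily what we run in production, is inverse-frequency sample weighting. The function name `balanced_sample_weights` and the product labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def balanced_sample_weights(labels: Sequence[Hashable]) -> List[float]:
    """Give every sample a weight so that each class has the same total weight.

    A sample of class c gets 1 / (n_classes * count(c)); the weights of all
    samples sum to 1 and every class contributes 1 / n_classes.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[label]) for label in labels]


# Three common items and one rare one: the rare item's single sample
# carries as much weight as the three common samples combined.
weights = balanced_sample_weights(["milk", "milk", "milk", "saffron"])
```

In a PyTorch training loop, weights like these could be passed to `torch.utils.data.WeightedRandomSampler` so that rare objects are drawn as often as common ones.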
Case in point: as trivial as it sounds, in one of our models we failed to properly balance the data. The DL CI process we set up saved us from deploying a model that, even though it performed better overall, would completely fail in individual cases. The debug information was valuable because it helped us quickly identify and fix the problem, and have everything ready for a successful beta with one of our customers.
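A failure of this kind, better on aggregate but broken for specific objects, is easy to miss when only a single accuracy number is compared. The sketch below shows one way to surface it by comparing per-class recall between master and a candidate. It is an illustration under stated assumptions: the function names, the `max_drop` threshold and the convention that a missed detection is recorded as `None` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def per_class_recall(expected: Sequence[str],
                     detected: Sequence[Optional[str]]) -> Dict[str, float]:
    """Recall per class; detected[i] is the predicted label or None on a miss."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for truth, pred in zip(expected, detected):
        totals[truth] += 1
        if pred == truth:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


def regressed_classes(master: Dict[str, float],
                      candidate: Dict[str, float],
                      max_drop: float = 0.05) -> List[str]:
    """Classes whose recall dropped by more than max_drop relative to master."""
    return sorted(label for label, recall in master.items()
                  if candidate.get(label, 0.0) < recall - max_drop)
```

With a check like this, a candidate that detects more items overall but never detects one rare product would still be flagged, because that product appears in the list returned by `regressed_classes`.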
I hope what we have built at Trigo will inspire you. We are now firm believers in doing CI for deep learning and are big fans of PyTorch and Allegro Trains, two tools that form the basis for the future state of the art in developing and deploying AI models.
We look forward to seeing you soon in one of our retail deployments 😉