Model Best Practices

Regular retraining allows models to adapt to real-world production conditions. Experienced ML teams use a variety of evaluation metrics to detect model drift, discover new edge cases, and decide when models need retraining. As models are updated, A/B testing, shadow deployments, and ensembling can be used to roll out retrained models methodically. Teams that apply the proper evaluation methods and deployment techniques are more likely to discover model issues and retrain their models faster.
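As one illustration of a drift metric, the sketch below computes a population stability index (PSI) between a reference sample (for example, training-time scores) and a recent production sample. The report does not say which drift metrics respondents use, so PSI, the function name, the bin count, and the small floor for empty bins are all assumptions made for this example.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    production: Sequence[float],
    n_bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature (or model score) using PSI."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # A small floor (an assumption) avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A PSI of zero means the two binned distributions match; larger values indicate a larger shift. The value at which a team triggers a review or a retraining run is a choice to tune for each model.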

Organizations that retrain existing models more frequently are the most likely to measure business impact and use aggregated model metrics to evaluate model performance.

More frequent retraining of existing models is associated with measuring business impact and using aggregate model metrics to evaluate model performance.

Teams that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and to measure business impact (67%) to evaluate model performance. Our survey also found that the software/Internet/telecommunications, financial services, and retail/e-commerce industries are the most likely to measure model performance by business impact.

Best Practice 1
Model Evaluation Methods vs. Time To Deployment
[Chart: % of respondents that agree (0% to 100%) by time to deploy new models (Daily, Weekly, Monthly, Quarterly, Yearly), broken out by method for evaluating model performance: aggregated model metrics, scenario-based validation, measure business impact, don’t evaluate.]

Figure 13: Companies that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and business impact measurement (67%) to evaluate their model performance.

When we confuse the research side, which is developing cooler and better algorithms, with the applied side, which is serving amazing models, predictions at scale to users, that’s when things go wrong. If you start hiring the wrong kind of worker, you build out your applied machine learning team only with the PhDs who develop algorithms, what could possibly go wrong? So, understanding what you’re selling. Understanding whether what you’re selling is general purpose tools for other people to use, or building solutions and selling the solutions. That’s really, really important.
Cassie Kozyrkov

Chief Decision Scientist, Google

You’ve got to take people with the data science skill set and put them next to whoever your professionals are. In one business, it might be doctors or medical scientists, and in another business, it might be salespeople, and in yet another it could be bankers, or it could be traders, or it could be portfolio managers. And they need to work side by side as equal partners in the business.
Martin Chavez

Partner and Vice Chairman, Sixth Street Partners

You’ll see lots of times in more complex businesses that there’s not just one application of artificial intelligence, but multiple ones in multiple quadrants that bounce back and forth and inform each other. Maybe a good way of thinking about a desired end state is to have pieces in each quadrant that can inform each other. Autonomous vehicles are about automatic action, but if you look under the hood, there are subsystems that are generating knowledge in their environment, and they then feed back to different parts of the models that decide on the actions. And I think that’s a good sort of paradigm for thinking about complex business use of artificial intelligence.
Catherine Williams

Global Head of iQ, Qualtrics

We’re measuring model performance daily to see the impact on the business. We measure model drift. If we want to evaluate further, we would go into the model and make sure the data is fresh and review any parameters that drifted over time. Work on getting a real, concrete problem statement every time you work on a project that leads to impactful benefits on the company so you’re not wasting your time on projects that aren’t going to get anywhere.
Operations Research Scientist

Nielsen


ML teams that identify issues fastest with their models are most likely to use A/B testing when deploying models.

Aggregate metrics are a useful baseline, but as enterprises develop a more robust modeling practice, techniques such as shadow deployments, ensembling, A/B testing, and scenario tests can help validate models on challenging edge cases and rare classes. We use the following definitions:

  • A/B testing: comparing one model to another in training, evaluation, or production. Model A and Model B might differ in architecture, training dataset size, hyperparameters, or some other factor.
  • Shadow deployment: the deployment of two different models simultaneously, where one delivers results to the customer and the developers, and the second only delivers results to the developers for evaluation and comparison.
  • Ensembling: the deployment of multiple models in an “ensemble,” often combined through conditionals or performance-based weighting.
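To make two of these definitions concrete, here is a minimal sketch of a shadow deployment and a performance-weighted ensemble. It assumes each model is a plain Python callable that maps a feature vector to a score; the function names, the in-memory log, and the weighting scheme are hypothetical and stand in for whatever serving and logging infrastructure a team actually uses.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a "model" is any callable from a feature vector to a numeric score.
Model = Callable[[Sequence[float]], float]


def serve_with_shadow(
    primary: Model,
    shadow: Model,
    features: Sequence[float],
    shadow_log: List[Tuple[float, float]],
) -> float:
    """Return the primary prediction; record the shadow prediction for developers only."""
    served = primary(features)
    try:
        shadow_log.append((served, shadow(features)))
    except Exception:
        # A failing shadow model must never affect the customer-facing response.
        pass
    return served


def weighted_ensemble(
    models: Dict[str, Model],
    weights: Dict[str, float],
    features: Sequence[float],
) -> float:
    """Combine model scores using weights, e.g. derived from validation performance."""
    total = sum(weights[name] for name in models)
    return sum(weights[name] * model(features) for name, model in models.items()) / total
```

In the shadow case only the primary model's output reaches the customer, while the logged pairs let developers compare the two models offline on identical traffic.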

Small, agile teams at smaller companies may find failure modes, problems in their models, or problems in their data earlier than teams at large enterprises, even though their validation, testing, and deployment strategies are typically less sophisticated. Their models are simpler and solve more uniform problems for customers and clients, so failures are easier to spot. As a system grows to serve a large market or involve a large engineering staff, complexity and technical debt grow with it. At scale, scenario tests become essential, and even then it may take longer to detect issues in a more complex system.
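The gap between an aggregate metric and a scenario test can be shown with a small sketch. It assumes each evaluation record carries a scenario tag, a label, and a prediction; the record layout and the function name are illustrative rather than taken from the survey.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_scenario(
    records: Iterable[Tuple[str, int, int]],
) -> Tuple[float, Dict[str, float]]:
    """Compute overall accuracy and accuracy per scenario.

    Each record is (scenario, label, prediction).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for scenario, label, prediction in records:
        total[scenario] += 1
        correct[scenario] += int(label == prediction)
    overall = sum(correct.values()) / sum(total.values())
    per_scenario = {s: correct[s] / total[s] for s in total}
    return overall, per_scenario
```

With 95 correct predictions on a common scenario and only 1 correct out of 5 on a rare one, overall accuracy is 96% while accuracy on the rare scenario is 20%. The aggregate number alone would hide that failure.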

Best Practice 2
Model Deployment Methods vs. Time To Discover Model Issues
[Chart: % of respondents that agree (0% to 100%) by time to discover model issues (Less Than 1 Week, 1 - 4 Weeks, 1 Month or More), broken out by method for deploying models: A/B testing, shadow deployment, ensembling, none of the above.]

Figure 14: Teams that take less than one week to discover model issues are more likely than those that take longer to use A/B testing for deploying models.
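When an A/B test compares a control model with a treatment model on a binary outcome such as a click or a conversion, one common way to judge the difference is a two-proportion z-test. The sketch below is an illustration under that assumption; the survey does not specify which statistical tests respondents use, and the function name and significance threshold are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both models perform the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Standard error is zero; the outcomes show no variation.")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control converts 100 of 1,000 users, treatment 150 of 1,000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
ship_treatment = z > 0 and p_value < 0.05  # 0.05 is an assumed threshold
```

The threshold, the minimum sample size, and the choice of outcome metric should be fixed before the test starts, so the decision does not depend on when someone happens to check the results.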

I became pretty open to considering A/B testing, king-of-the-hill testing, with AI-based algorithms. Because we got scalability, we got faster decision making. All of a sudden, there were systems that always had humans in the loop, humans doing repetitive tasks that didn’t really need humans in the loop anymore.
Jeff Wilke

Former CEO, Worldwide Consumer, Amazon; Chairman and Co-Founder of Re:Build Manufacturing

We’re testing a lot of different agents on our real users to see how they’re doing. What’s really interesting is that we try to compare agents that are trained against really good simulators with agents that are compared with less good simulators. What we want to see is, if your agent is trained on a good simulator, does it do really well in a real A/B test with real users. We’ve been really excited to see this working with real users in A/B tests. It’s very exciting and promising for us.
Oskar Stal

Senior Leader in Personalization, Spotify

We would actually deploy a few models into production as a test. It was like an A/B/C test of sorts, with different machine learning models running in production, and we would randomly send the outputs from different models to end users just to see how they perform in real-life scenarios.
Product Manager

Deloitte

We first do offline evaluation to look at the model metrics itself. Based on the results of offline tests, we’ll do online A/B testing. Basically, we set up the treatment and control and then compare the results based on some statistics and thresholds.
ML Engineer

Snap

