Model Best Practices
Organizations that retrain existing models more frequently are most likely to measure business impact and use aggregated model metrics to evaluate model performance.
Teams that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and to measure business impact (67%) to evaluate model performance. Our survey also found that the software/Internet/telecommunications, financial services, and retail/e-commerce industries are the most likely to measure model performance by business impact.
[Figure 13 chart: % of respondents that agree, by method for evaluating model performance and time to deploy new models]
Figure 13: Companies that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and business impact measurement (67%) to evaluate their model performance.
When we confuse the research side, which is developing cooler and better algorithms, with the applied side, which is serving amazing models and predictions at scale to users, that’s when things go wrong. If you start hiring the wrong kind of worker and build out your applied machine learning team only with the PhDs who develop algorithms, what could possibly go wrong? So, understanding what you’re selling. Understanding whether what you’re selling is general-purpose tools for other people to use, or building solutions and selling the solutions. That’s really, really important.
Chief Decision Scientist, Google
You’ve got to take people with the data science skill set and put them next to whoever your professionals are. In one business, it might be doctors or medical scientists, and in another business, it might be salespeople, and in yet another it could be bankers, or it could be traders, or it could be portfolio managers. And they need to work side by side as equal partners in the business.
Partner and Vice Chairman, Sixth Street Partners
You’ll see lots of times in more complex businesses that there’s not just one application of artificial intelligence, but multiple ones in multiple quadrants that bounce back and forth and inform each other. Maybe a good way of thinking about a desired end state is to have pieces in each quadrant that can inform each other. Autonomous vehicles are about automatic action, but if you look under the hood, there are subsystems that are generating knowledge in their environment, and they then feed back to different parts of the models that decide on the actions. And I think that’s a good sort of paradigm for thinking about complex business use of artificial intelligence.
Global Head of iQ, Qualtrics
We’re measuring model performance daily to see the impact on the business. We measure model drift. If we want to evaluate further, we would go into the model, make sure the data is fresh, and review any parameters that drifted over time. Work on getting a real, concrete problem statement every time you work on a project, so that it leads to impactful benefits for the company and you’re not wasting your time on projects that aren’t going to get anywhere.
ML teams that identify issues fastest with their models are most likely to use A/B testing when deploying models.
Aggregate metrics are a useful baseline, but as enterprises develop a more robust modeling practice, tools such as “shadow deployments,” ensembling, A/B testing, and even scenario tests can help validate models in challenging edge-case scenarios or rare classes. Here we provide the following definitions:
• A/B testing: comparing one model to another in training, evaluation, or production. Model A and Model B might differ in architecture, training dataset size, hyperparameters, or some other factor.
• Shadow deployment: the deployment of two different models simultaneously, where one delivers results to the customer and the developers, and the second delivers results only to the developers for evaluation and comparison.
• Ensembling: the deployment of multiple models in an “ensemble,” often combined through conditionals or performance-based weighting.
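The first two patterns can be sketched in a few lines. The following is a minimal, hypothetical request-routing function (the `champion`, `challenger`, and `shadow` predict callables and the 90/10 split are illustrative assumptions, not from the survey):

```python
import random

def route_request(features, champion, challenger, shadow=None, split=0.9):
    """Serve one scoring request under a simple A/B split.

    champion/challenger/shadow are hypothetical predict() callables;
    split is the fraction of traffic served by the champion model.
    """
    # A/B test: randomly assign each request to champion or challenger.
    served_by = "champion" if random.random() < split else "challenger"
    model = champion if served_by == "champion" else challenger
    record = {"served_by": served_by, "prediction": model(features)}

    if shadow is not None:
        # Shadow deployment: the shadow model scores the same request,
        # but its output is only logged for developer evaluation and
        # never returned to the user.
        record["shadow_prediction"] = shadow(features)
    return record
```

In a real serving system the assignment would typically be keyed on a stable user or session ID rather than a fresh random draw, so each user sees a consistent model.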
Small, agile teams at smaller companies may find failure modes, problems in their models, or problems in their data earlier than teams at large enterprises, even though their validation, testing, and deployment strategies are typically less sophisticated: with simpler models solving more uniform problems for customers and clients, failures are easier to spot. As the system grows to serve a large market, or even as the engineering staff grows, complexity and technical debt accumulate. At that scale, scenario tests become essential, and even then it may take longer to detect issues in a more complex system.
[Figure 14 chart: % of respondents that agree, by method for deploying models and time to discover model issues]
Figure 14: Teams that take less than one week to discover model issues are more likely than those that take longer to use A/B testing for deploying models.
I became pretty open to considering A/B testing, king-of-the-hill testing, with AI-based algorithms. Because we got scalability, we got faster decision making. All of a sudden, there were systems that always had humans in the loop, humans doing repetitive tasks that didn’t really need humans in the loop anymore.
Former CEO, Worldwide Consumer, Amazon; Chairman and Co-Founder of Re:Build Manufacturing
We’re testing a lot of different agents on our real users to see how they’re doing. What’s really interesting is that we try to compare agents that are trained against really good simulators with agents that are trained against less good simulators. What we want to see is, if your agent is trained on a good simulator, does it do really well in a real A/B test with real users? We’ve been really excited to see this working with real users in A/B tests. It’s very exciting and promising for us.
Senior Leader in Personalization, Spotify
We would actually deploy a few models into production as a test. It was like an A/B/C test of sorts, with different machine learning models running in production, and we would randomly send the outputs from different models to end users just to see how they perform in real-life scenarios.
We first do offline evaluation to look at the model metrics itself. Based on the results of offline tests, we’ll do online A/B testing. Basically, we set up the treatment and control and then compare the results based on some statistics and thresholds.
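The “compare the results based on some statistics and thresholds” step above is commonly a two-proportion z-test on a conversion-style metric. This is a minimal sketch of that comparison (the two-proportion test and the 1.96 threshold, roughly 95% confidence, are standard illustrative choices, not details from the quoted team):

```python
import math

def ab_significant(successes_a, n_a, successes_b, n_b, z_threshold=1.96):
    """Two-proportion z-test comparing treatment (B) against control (A).

    Returns (lift, z, significant), where lift is the absolute difference
    in observed rates and significant indicates |z| >= z_threshold.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, z, abs(z) >= z_threshold
```

For example, a control converting 100 of 1,000 users against a treatment converting 150 of 1,000 yields a 5-point lift with z well above the threshold; in practice teams also fix the sample size in advance to avoid peeking bias.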