
Model Best Practices
More frequent retraining of existing models is associated with measuring business impact and using aggregated model metrics to evaluate model performance.
Teams that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and to measure business impact (67%) to evaluate model performance. Our survey also found that the software/Internet/telecommunications, financial services, and retail/e-commerce industries are the most likely to measure model performance by business impact.
Figure 13: Companies that retrain daily are more likely than those that retrain less frequently to use aggregated model metrics (88%) and business impact measurement (67%) to evaluate their model performance. (Chart: % of respondents that agree, by method for evaluating model performance and time to deploy new models.)
ML teams that identify issues with their models fastest are most likely to use A/B testing when deploying models.
Aggregate metrics are a useful baseline, but as enterprises develop a more robust modeling practice, tools such as “shadow deployments,” ensembling, A/B testing, and even scenario tests can help validate models on challenging edge cases or rare classes. Here we provide the following definitions, with a minimal code sketch after the list:
• A/B testing: comparing one model to another in training, evaluation, or production. Model A and Model B might differ in architecture, training dataset size, hyperparameters, or some other factor.
• Shadow deployment: the deployment of two different models simultaneously, where one delivers results to the customer and the developers, and the second only delivers results to the developers for evaluation and comparison.
• Ensembling: the deployment of multiple models in an “ensemble,” often combined through conditionals or performance-based weighting.
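The sketch below is a minimal, hypothetical illustration of how these three strategies differ at serving time. The model classes, traffic split, and ensemble weights are placeholders for illustration only and are not drawn from the survey.

```python
import random

# Hypothetical stand-ins for two trained models; any object with a
# predict() method would work here.
class ModelA:
    def predict(self, features):
        return 0.72  # placeholder score

class ModelB:
    def predict(self, features):
        return 0.64  # placeholder score

model_a, model_b = ModelA(), ModelB()

def ab_test(features, traffic_to_b=0.1):
    """A/B testing: route a fraction of live traffic to the challenger model."""
    model = model_b if random.random() < traffic_to_b else model_a
    return model.predict(features)

def shadow_deploy(features, shadow_log):
    """Shadow deployment: only Model A's result reaches the customer;
    Model B's result is logged for offline evaluation and comparison."""
    served = model_a.predict(features)
    shadow_log.append({"served": served, "shadow": model_b.predict(features)})
    return served

def ensemble(features, weights=(0.6, 0.4)):
    """Ensembling: combine both models, here with performance-based weights."""
    scores = (model_a.predict(features), model_b.predict(features))
    return sum(w * s for w, s in zip(weights, scores))

if __name__ == "__main__":
    log = []
    features = {"example_feature": 1.0}
    print(ab_test(features), shadow_deploy(features, log), ensemble(features))
    print(log)  # shadow results accumulate here for later comparison
```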
Although their validation, testing, and deployment strategies are typically less sophisticated, small, agile teams at smaller companies may find failure modes and problems in their models or data earlier than teams at large enterprises: with simpler models solving more uniform problems for customers and clients, failures are easier to spot. When the system grows to serve a large market or involve a large engineering staff, complexity and technical debt begin to grow. At scale, scenario tests become essential, and even then it may take longer to detect issues in a more complex system.
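To make “scenario tests” concrete, here is a minimal pytest-style sketch. The fraud-scoring model, its interface, the threshold, and the edge cases are all assumptions chosen for illustration; the point is that behavior on rare but important cases gets pinned down explicitly rather than hidden inside an aggregate metric.

```python
import pytest

# Hypothetical fraud-scoring model wrapper; the interface is assumed
# for illustration only.
class FraudModel:
    def predict_proba(self, transaction):
        # Stand-in logic: flag large cross-border transactions.
        is_large = transaction["amount_usd"] > 10_000
        is_foreign = transaction["country"] != transaction["home_country"]
        return 0.9 if (is_large and is_foreign) else 0.1

model = FraudModel()

# Each scenario encodes an expected behavior on an edge case or rare class.
@pytest.mark.parametrize("transaction,expect_flag", [
    # Rare class: high-value cross-border transaction should be flagged.
    ({"amount_usd": 50_000, "country": "BR", "home_country": "US"}, True),
    # Common case: small domestic purchase should pass.
    ({"amount_usd": 12, "country": "US", "home_country": "US"}, False),
])
def test_fraud_scenarios(transaction, expect_flag):
    flagged = model.predict_proba(transaction) >= 0.5
    assert flagged == expect_flag
```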
Figure 14: Teams that take less than one week to discover model issues are more likely than those that take longer to use A/B testing for deploying models. (Chart: % of respondents that agree, by method for deploying models and time to discover model issues.)
Conclusion