Model Challenges
Even after a model is deployed, it is important to monitor performance in production. Precision and recall may be relevant to one business, while another might need custom scenario tests to ensure that model failure does not occur in critical circumstances. Tracking these metrics over time to ensure that model drift does not occur is also important, particularly as businesses ship their products to more markets or as changing environmental conditions cause a model to suddenly be out of date. In this chapter, we explore the key challenges ML teams encounter when developing, deploying, and monitoring models, and we discuss some best practices in these areas.
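As one minimal illustration of tracking such metrics over time (a sketch, not a prescribed implementation: it assumes predictions and ground-truth labels are logged to a pandas DataFrame with hypothetical 'timestamp', 'y_true', and 'y_pred' columns), precision and recall can be recomputed per week so that a sustained drop flags potential drift:

import pandas as pd
from sklearn.metrics import precision_score, recall_score

def weekly_metrics(log: pd.DataFrame) -> pd.DataFrame:
    # Compute precision and recall per calendar week from a prediction log.
    # Assumed (hypothetical) schema: 'timestamp', 'y_true', 'y_pred'.
    log = log.assign(week=pd.to_datetime(log["timestamp"]).dt.to_period("W"))
    rows = []
    for week, grp in log.groupby("week"):
        rows.append({
            "week": str(week),
            "precision": precision_score(grp["y_true"], grp["y_pred"], zero_division=0),
            "recall": recall_score(grp["y_true"], grp["y_pred"], zero_division=0),
            "n": len(grp),
        })
    return pd.DataFrame(rows)

A sustained week-over-week decline in either metric is a cheap first signal that the model may be drifting out of date, well before a full retraining or root-cause analysis.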
Feature engineering is the biggest challenge in model development.
The majority of respondents consider feature engineering to be the most challenging aspect of model development. Feature engineering is particularly relevant for creating models on structured data, such as predictive models and recommendation systems, while deep learning computer vision systems usually don’t rely on feature engineering.
Figure 9: Feature engineering is the biggest challenge to model development. (Bar chart; axis: % of respondents.)
For tabular models, however, feature engineering can require several permutations of logarithms, exponentials, or multiplications across columns. It is important to identify collinearity across columns, including “engineered” ones, and then keep the column with the better signal and discard the less relevant one. In interviews, ML teams expressed concern about choosing columns that border on personally identifiable information (PII), since more sensitive data sometimes leads to a higher-performing model. Yet in some cases, analogs for PII can be engineered from other nonsensitive columns.
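A minimal sketch of this workflow (the source column names 'income', 'total_spend', and 'visits' are hypothetical, and all feature columns are assumed to be numeric) adds log and interaction features to a pandas DataFrame, then drops whichever member of each highly correlated pair is less correlated with the target:

import numpy as np
import pandas as pd

def engineer_and_prune(df: pd.DataFrame, target: str, corr_threshold: float = 0.9) -> pd.DataFrame:
    # Add engineered columns, then prune collinear features.
    out = df.copy()
    out["log_income"] = np.log1p(out["income"])                                 # logarithm of a skewed column
    out["spend_per_visit"] = out["total_spend"] / out["visits"].clip(lower=1)   # cross-column ratio

    features = [c for c in out.columns if c != target]
    pairwise = out[features].corr().abs()               # collinearity across columns
    signal = out[features].corrwith(out[target]).abs()  # strength of each column's signal

    to_drop = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if pairwise.loc[a, b] > corr_threshold:
                # Keep the better signal, discard the less relevant column.
                to_drop.add(a if signal[a] < signal[b] else b)
    return out.drop(columns=sorted(to_drop))

Correlation-based pruning is only one heuristic; variance inflation factors or model-based feature importances serve the same purpose.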
We hypothesize that feature engineering is time-consuming and involves significant cross-functional alignment for teams building recommendation systems and other tabular models, while feature engineering remains a relatively foreign concept for those working on deep learning computer vision systems such as autonomous vehicles.
Measuring the business impact of models is a challenge, especially for smaller companies.
When asked how they evaluate model performance, a majority (80%) of respondents indicated they use aggregated model metrics, but far fewer perform scenario-based validation (56%) or evaluate business impact (51%). Our analyses showed that smaller companies (those with fewer than 500 employees) are least likely to evaluate model performance by measuring business impact (only 40% use this approach), while large companies (those with more than 10,000 employees) are most likely to do so (about 61%).
We hypothesize that as organizations grow, there is an increasing need to measure and understand the business impact of ML models. The responses also suggest, however, that there is significant room for improvement across organizations of all sizes to move beyond aggregated model metrics into scenario-based evaluation and business impact measurement.
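As a sketch of what scenario-based validation can look like in practice (the scenario labels, threshold, and column names here are hypothetical), the metric is recomputed per scenario slice of a held-out set so that a weak slice is flagged even when the aggregate number looks healthy:

import pandas as pd
from sklearn.metrics import recall_score

def scenario_report(test: pd.DataFrame, min_recall: float = 0.80) -> pd.DataFrame:
    # Per-scenario recall on a held-out set.
    # Assumed (hypothetical) schema: 'scenario', 'y_true', 'y_pred'.
    rows = []
    for scenario, grp in test.groupby("scenario"):
        recall = recall_score(grp["y_true"], grp["y_pred"], zero_division=0)
        rows.append({"scenario": scenario, "recall": recall, "passes": recall >= min_recall})
    return pd.DataFrame(rows)

A failing row for a critical scenario (say, a rare product category or a low-light driving condition) can block a release even when the aggregated metric clears its bar; tying the same slices to revenue or cost estimates is one path from scenario-based evaluation toward business impact measurement.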
Figure 10: Small companies are least likely and large companies are most likely to evaluate model performance by measuring business impact. (Chart of % of respondents that agree, by company size (# of employees) and method for evaluating model performance.)
Deployment at scale is the most commonly cited challenge among ML practitioners.
Deploying was cited by 38% of all respondents as the most challenging part of the deployment and monitoring phase of the ML lifecycle, followed closely by monitoring (34%) and optimizing compute (30%). B2B services rated deploying as most challenging, followed by retail and e-commerce.
Our interviews suggest that deploying models is a relatively rare skill set compared to data engineering, model training, and analytics. One reason: while most recent graduates of undergraduate and postgraduate programs have extensive experience training and tuning models, they have never had to deploy and support them at scale. This is a skill one develops only in the workplace. At larger organizations, one engineer is typically responsible for deploying multiple models that span different teams and projects, whereas multiple engineers and data scientists are responsible for data and training.
Monitoring model performance in production came in a close second, indicating that, like model deployment, monitoring a distributed set of inference endpoints requires a different skill set than training, tuning, and evaluating models in the experimentation phase. Lastly, optimizing compute was ranked third by all groups, a testament both to advances in the usability of cloud infrastructure and to the rapid proliferation of expertise in deploying services across cloud providers.
Figure 11: Deploying is the most challenging part of model deployment for all teams. (Chart of % of respondents that agree; note: this question was multi-select.)
Large companies take longer to identify issues in models.
Smaller organizations can usually identify issues with their models quickly (in less than one week). Larger companies (those with more than 10,000 employees) may be more likely to measure business impact, but they are also more likely than smaller ones to take longer (one month or more) to identify issues with their models.
It is not that larger companies are less capable; in fact, they may be more capable. But they operate complex systems at scale, and that scale hides subtler flaws. While a smaller company may serve the same model to five hypothetical customers and experience one failure, a larger enterprise might push out a model to tens of thousands of users across the globe. Failures might be regional or depend on multiple circumstantial factors that are challenging to anticipate or even simulate.
Larger ML infrastructure systems also develop technical debt such that even if engineers are monitoring the real-time performance of their models in production, there may be other infrastructure challenges that occur simply due to serving so many customers at the same time. These challenges might obscure erroneous classifications that a smaller team would have caught. For larger businesses, the fundamental question is: can good MLOps practices around observability overcome the burden of operating at scale?
Figure 12: Company size vs. time to discover model issues (less than 1 week, 1-4 weeks, 1 month or more; company sizes ranging from fewer than 500 to more than 25,000 employees). Smaller companies (those with fewer than 500 employees) are more likely to identify issues in less than one week; large companies (with 10,000 or more employees) are more likely to take more than one month to identify issues.