Even after a model is deployed, it is important to monitor performance in production. Precision and recall may be relevant to one business, while another might need custom scenario tests to ensure that model failure does not occur in critical circumstances. Tracking these metrics over time to ensure that model drift does not occur is also important, particularly as businesses ship their products to more markets or as changing environmental conditions cause a model to suddenly be out of date. In this chapter, we explore the key challenges ML teams encounter when developing, deploying, and monitoring models, and we discuss some best practices in these areas.
Feature engineering is the biggest challenge in model development.
The majority of respondents consider feature engineering to be the most challenging aspect of model development. Feature engineering is particularly relevant for creating models on structured data, such as predictive models and recommendation systems, while deep learning computer vision systems usually don’t rely on feature engineering.
Figure 9: Feature engineering is the biggest challenge to model development.
For tabular models, however, feature engineering can require several permutations of logarithms, exponentials, or multiplications across columns. It is important to identify collinearity across columns, including “engineered” ones, and then keep the stronger signal and discard the weaker one. In interviews, ML teams expressed concern about choosing columns that border on personally identifiable information (PII); in some cases, more sensitive data leads to a higher-performing model. Yet analogs for PII can sometimes be engineered from other, nonsensitive columns.
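As a rough sketch of this kind of workflow (the column names, the sample data, and the 0.95 correlation threshold are all hypothetical), engineered columns can be generated and then screened for collinearity with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical tabular dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [40_000, 85_000, 120_000, 52_000, 97_000],
    "debt":   [10_000, 30_000, 20_000, 25_000, 15_000],
})

# Engineer candidate features: a logarithm, a cross-column product, a ratio.
df["log_income"] = np.log(df["income"])
df["debt_x_income"] = df["debt"] * df["income"]
df["debt_to_income"] = df["debt"] / df["income"]

# Flag highly correlated (collinear) column pairs so the weaker signal
# can be dropped before training. Keep only the upper triangle of the
# correlation matrix to avoid duplicate and self pairs.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear_pairs = [
    (row, col)
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.95  # NaN comparisons are False, so skipped
]
```

Which member of each flagged pair to keep is a judgment call; typically the column with the stronger relationship to the target, or the one with fewer privacy concerns, survives.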
We hypothesize that feature engineering is time-consuming and involves significant cross-functional alignment for teams building recommendation systems and other tabular models, while feature engineering remains a relatively foreign concept for those working on deep learning computer vision systems such as autonomous vehicles.
All the science is infused with engineering, discipline, and careful execution. And all the engineering is infused with the scientific ideas. And the reason for that is because the field is becoming mature, and so it is hard to just do small-scale tinkering without having a lot of engineering skill and effort to really make something work.
Co-Founder and Chief Scientist, OpenAI
Turns out that people are just not very good at defining predictive, salient features. And that’s even in problems that they’re actually good at performing themselves, like labeling natural images.
CEO and Co-Founder, Insitro
Feature engineering takes a lot of time, but there’s a lot of scoping and stakeholder alignment that needs to happen before you even get to feature engineering. It’s not technical, but it’s things like working with business teams to make sure you’re understanding the underlying problem and working with legal teams to ensure that you can use certain data to address that problem.
A major aspect of feature engineering is collaborating with subject-matter experts who can actually define the different measures of success. As ML engineers, we are not always the best people to define what success looks like. Being able to partner effectively with these experts is a must-have skill.
For me, the parts that are challenging about feature engineering have to do with cleaning text that is not linguistic. For example, trying to clean Java stack traces or console logs of the instances where our service is run. Cleaning that is tricky, because we have to remove unique IDs and words that don’t have value. It’s hard to know what that is in a programming language. Sometimes I have to ask people, ‘What in this paragraph matters, or what doesn’t matter?’
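As a hedged sketch of that kind of log scrubbing (the regex patterns and placeholder tokens are assumptions, not a drop-in solution), unique IDs can be normalized before any downstream text processing:

```python
import re

# Illustrative patterns for "no-value" tokens in Java console/stack-trace text.
UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
MEM_ADDR = re.compile(r"@[0-9a-fA-F]+\b")    # Java object identity hashes
HEX_ID = re.compile(r"\b[0-9a-fA-F]{8,}\b")  # long bare hex IDs / hashes

def clean_log_line(line: str) -> str:
    """Replace unique identifiers with stable placeholder tokens."""
    line = UUID.sub("<uuid>", line)      # most specific pattern first
    line = MEM_ADDR.sub("@<addr>", line)
    line = HEX_ID.sub("<id>", line)
    return line

print(clean_log_line(
    "NullPointerException at com.acme.Svc@1a2b3c4d "
    "req=550e8400-e29b-41d4-a716-446655440000"
))
# → NullPointerException at com.acme.Svc@<addr> req=<uuid>
```

Replacing IDs with placeholder tokens, rather than deleting them, keeps line structure intact so that identical failures still deduplicate to identical strings.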
Measuring the business impact of models is a challenge, especially for smaller companies.
When asked how they evaluate model performance, a majority (80%) of respondents indicated they use aggregated model metrics, but far fewer do so by performing scenario-based validation (56%) or by evaluating business impact (51%). Our analyses showed that smaller companies (those with fewer than 500 employees) are least likely to evaluate model performance by measuring business impact (only 40% use this approach), while large companies (those with more than 10,000 employees) are most likely to evaluate business impact (about 61% do so).
We hypothesize that as organizations grow, there is an increasing need to measure and understand the business impact of ML models. The responses also suggest, however, that there is significant room for improvement across organizations of all sizes to move beyond aggregated model metrics into scenario-based evaluation and business impact measurement.
Figure 10: Small companies are least likely and large companies are most likely to evaluate model performance by measuring business impact.
We need to shift to business metrics—for example, cost savings or new revenue streams, customer satisfaction, employee productivity, or any other [metric] that involves bringing the technical departments and the business units together in a combined lifecycle, where we measure all the way through the process from design to production. MLOps is all about bringing all the roles that are involved into the same lifecycle, from data scientists to business owners. When you scale this approach throughout the company, you can go beyond those narrow use cases and truly transform your business.
General Manager, Artificial Intelligence and Innovation, Microsoft
I assumed rightly or wrongly that those around me would think that I was only interested in the abstract, interesting mathematical problems at hand, and not in the business. And so I’ve worked hard to prove the opposite and apply my thinking to how best to leverage math or really artificial intelligence and machine learning to drive impact strategically and for businesses.
Global Head of iQ, Qualtrics
There’s intrinsic evaluation and extrinsic evaluation. Let’s say you have a classification model. Intrinsic evaluation will include classification metrics like an F-1 score that’s just telling you whether the model you built is overfitting or underfitting. But you cannot stop there. You need to understand if this model is actually working for the business problem. Has the business improved? Are business metrics improving by using this new model? And that is what I mean by extrinsic evaluation. These two things don’t correlate all the time. A model may perform well on intrinsic metrics, but it’s possible that it’s not actually positively improving the business. The reverse is also possible.
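A minimal sketch of that distinction, with hypothetical confusion-matrix counts and A/B-test revenue numbers standing in for real ones:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Intrinsic metric: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Intrinsic evaluation: counts from a held-out test set (hypothetical).
f1 = f1_score(tp=90, fp=10, fn=20)

# Extrinsic evaluation: did the business metric move? Here, a hypothetical
# A/B test comparing revenue per user with and without the model.
revenue_per_user_control = 4.10
revenue_per_user_model = 4.05

intrinsically_good = f1 > 0.8
extrinsically_good = revenue_per_user_model > revenue_per_user_control
# In this sketch the model scores well intrinsically (F1 ≈ 0.86) yet fails
# extrinsically — exactly the non-correlation the interviewee describes.
```

The point of keeping both checks in the pipeline is that neither can substitute for the other: intrinsic metrics catch overfitting and underfitting, while only extrinsic measurement shows whether the business problem is actually being solved.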
The majority of ML practitioners cite deployment at scale as a challenge.
Deploying was cited by 38% of all respondents as the most challenging part of the deployment and monitoring phase of the ML lifecycle, followed closely by monitoring (34%) and optimizing compute (30%). B2B services rated deploying as most challenging, followed by retail and e-commerce.
Our interviews suggest that deploying models requires a relatively rare skill set compared to data engineering, model training, and analytics. One reason: while most recent graduates of undergraduate and postgraduate programs have extensive experience training and tuning models, they have never had to deploy and support them at scale. This is a skill one develops only in the workplace. At larger organizations, one engineer is typically responsible for deploying multiple models that span different teams and projects, whereas multiple engineers and data scientists are responsible for data and training.
Monitoring model performance in production came in a close second, indicating that, like model deployment, monitoring a distributed set of inference endpoints requires a different skill set than training, tuning, and evaluating models in the experimentation phase. Lastly, optimizing compute was ranked as the third challenge by all groups, which is a testament to the advances in usability of cloud infrastructure, and also the rapid proliferation of expertise in deploying services across cloud providers.
Figure 11: Deploying is the most challenging part of model deployment for all teams. [Note: this question was multi-select.]
A lot of people want to work on the cool AI models, but when it comes down to it, there’s just a lot of really hard work in getting AI into a production-level system. There’s just so much work in labeling the data, cleaning the data, preparing it, compared to standard engineering work and load balancing things and dealing with spikes and so on. A lot of folks want to say they’re working on AI, but they don’t want to do most of these really hard aspects of creating the AI system.
Modeling is the easier part. The hardest part is getting the business to buy in and actually implement the model. That’s what’s stalled some of our projects early on, the basic understanding of how it’s going to impact the end user, because we don’t necessarily work with the end user 100% of the time. Often, we don’t have the chance to get feedback from the user. A lot of the challenge is getting the higher-level stakeholder to buy in. If you don’t have buy-in, the model is forgotten.
Deployment is not a large problem for a large company with large infrastructure. The real issue is: can the model help achieve the business goal? There may be offline studies or experiments with a small number of users, but will the model perform as you expect? You never know. You have to ensure you have enough online metrics and that the model works correctly.
Large companies take longer to identify issues in models.
Smaller organizations can usually identify issues with their models quickly (in less than one week). Larger companies (those with more than 10,000 employees) may be more likely to measure business impact, but they are also more likely than smaller ones to take longer (one month or more) to identify issues with their models.
It is not that larger companies are less capable—in fact, they may be more capable—but they operate complex systems at scale that hide more complex flaws. While a smaller company may try to serve the same model to five hypothetical customers and experience one failure, a larger enterprise might push out a model to tens of thousands of users across the globe. Failures might be regional or prone to multiple circumstantial factors that are challenging to anticipate or even simulate.
Larger ML infrastructure systems also develop technical debt such that even if engineers are monitoring the real-time performance of their models in production, there may be other infrastructure challenges that occur simply due to serving so many customers at the same time. These challenges might obscure erroneous classifications that a smaller team would have caught. For larger businesses, the fundamental question is: can good MLOps practices around observability overcome the burden of operating at scale?
Company Size vs. Time to Discover Model Issues
Figure 12: Smaller companies (those with fewer than 500 employees) are more likely to identify issues in less than one week. Large companies (with 10,000 or more employees) are more likely to take more than one month to identify issues.
I’ve been at companies that are much smaller in size and companies that are way bigger. Silos get bigger as the companies get bigger. And so, what happens oftentimes at big companies is that the introductions to who our users are going to be, the stakeholders, and what their actual requirements are all happen at the beginning. And then off we go in a silo, build a solution, come back six months later, and now the target has moved because all of us are undergoing quick transformations. It’s important to involve the users at every step of the journey and establish those checkpoints, just like we would do for any other external-facing users as well.
Vice President of Machine Learning and Data Platform Engineering, CNN
It’s easy to create and deploy models. What is difficult is verifying that your model actually achieves your business goals.
04 - Model Development & Deployment Best Practices