Data Best Practices
Time to deploy new models and time to retrain existing ones are leading indicators of an ML team's maturity. Leading ML teams address data-centric challenges through infrastructure investments, collaboration with data partners, and synthetic data approaches that mitigate data quality and volume problems. Teams that get annotated data faster tend to retrain existing models and deploy them to production more frequently.
Companies that invest in data annotation infrastructure can deploy new models, and retrain and redeploy existing ones, faster.
Our analyses reveal a key AI readiness trend: There’s a linear relationship between how quickly teams can deploy new models, how frequently teams retrain existing models, and how long it takes to get annotated data. Teams that deploy new models quickly (especially in less than one month) tend to get their data annotated faster (in less than one week) than those that take longer to deploy. Teams that retrain their existing models more frequently (e.g., daily or weekly) tend to get their annotated data more quickly than teams that retrain monthly, quarterly, or yearly.
As discussed in the challenges section, however, it is not just the speed of getting annotated data that matters. ML teams must make sure the correct data is selected for annotation and that the annotations themselves are of high quality. Aligning all three factors—selecting the right data, annotating at high quality, and annotating quickly—takes a concerted effort and an investment in data infrastructure.
[Chart: Time to Get Annotated Data vs. Time to Deploy]
Figure 5: Teams that get annotated data faster tend to deploy new models to production faster. [Note: Some respondents included time to get annotated data as part of the time to deploy new models.]
[Chart: Time to Get Annotated Data vs. Time to Retrain Existing Models]
Figure 6: Teams that get annotated data faster tend to be able to retrain and deploy existing models to production more frequently. [Note: As in Figure 5, some respondents included time to get annotated data as part of the time to retrain. When retraining, however, teams can still retrain monthly, weekly, or even daily by annotating data in large batches.]
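The comparison behind Figures 5 and 6 can be illustrated with a small cross-tabulation. The sketch below is only an illustration: the rows, the bucket labels, and the helper name share_by_group are all hypothetical and are not the survey's data or method. It groups respondents by deployment time and reports what fraction of each group gets annotated data in under a week.

```python
from collections import Counter

# Hypothetical survey rows: (time to get annotated data, time to deploy a new model).
# These values are invented for illustration and are NOT the report's data.
RESPONSES = [
    ("<1 week", "<1 month"),
    ("<1 week", "<1 month"),
    ("<1 week", "1-3 months"),
    ("1-4 weeks", "<1 month"),
    ("1-4 weeks", "1-3 months"),
    ("1-4 weeks", "3-6 months"),
    (">1 month", "1-3 months"),
    (">1 month", "3-6 months"),
    (">1 month", "3-6 months"),
]


def share_by_group(rows, group_index, target_index, target_value):
    """For each group, return the fraction of rows whose target column equals target_value."""
    totals = Counter(row[group_index] for row in rows)
    hits = Counter(row[group_index] for row in rows if row[target_index] == target_value)
    # Counter returns 0 for groups with no hits, so every group gets a share.
    return {group: hits[group] / totals[group] for group in totals}


# Among teams in each deployment bucket, what fraction gets annotated data in under a week?
fast_annotation_share = share_by_group(RESPONSES, 1, 0, "<1 week")
for bucket, share in fast_annotation_share.items():
    print(f"{bucket}: {share:.0%} get annotated data in <1 week")
```

In this made-up sample the share falls as deployment time grows, which is the pattern the figures describe. A real analysis would use the actual response buckets and a much larger sample.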
Data is always king. More often than not, that results in better outcomes compared to fancier models. People like to publish papers on fancy models, and those obviously matter, but a lot less. If you can figure out how to get higher-quality data—and more of it—at scale, then fairly simple models will give you better results than the most advanced model in the world running on bad data.
You identify some scenes that you want to label, which may involve some human-in-the-loop labeling, and you can track the turnaround time, given a certain amount of data. And for model training, you can try to optimize with distributed training to make it faster. So you have a pretty good understanding of how long it takes to train a model on the amount of data that you typically train a model on. And the remaining parts, like evaluation, should be fairly quick.
Head of Autonomy Platform, Nuro
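The turnaround estimate described in the quote above can be sketched as a sum of stage durations. This is a minimal illustration under stated assumptions, not Nuro's tooling: the names CycleEstimate and estimate_cycle, the throughput figures, and the simple scaling model for distributed training (speedup = workers x efficiency, never below 1) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CycleEstimate:
    """Estimated duration, in days, of each stage of one retraining cycle."""
    labeling_days: float
    training_days: float
    evaluation_days: float

    @property
    def total_days(self) -> float:
        return self.labeling_days + self.training_days + self.evaluation_days


def estimate_cycle(num_items, items_labeled_per_day, single_worker_training_days,
                   num_workers=1, scaling_efficiency=1.0, evaluation_days=0.5):
    """Estimate a retraining cycle from tracked labeling throughput and training time.

    Assumption: distributed training speeds up by num_workers * scaling_efficiency,
    and never runs slower than a single worker.
    """
    labeling_days = num_items / items_labeled_per_day
    speedup = max(1.0, num_workers * scaling_efficiency)
    training_days = single_worker_training_days / speedup
    return CycleEstimate(labeling_days, training_days, evaluation_days)


# Hypothetical inputs: 10,000 scenes, 2,000 labeled per day, 4 days of
# single-worker training, 8 workers at 50% scaling efficiency.
estimate = estimate_cycle(
    num_items=10_000,
    items_labeled_per_day=2_000,
    single_worker_training_days=4.0,
    num_workers=8,
    scaling_efficiency=0.5,
)
print(estimate, estimate.total_days)
```

With these invented inputs, labeling takes five of the 6.5 days, which matches the report's point that annotation time often sets the pace of retraining.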
ML teams that work closely with annotation partners are most efficient in getting annotated data.
The majority of respondents (81%) said their engineering and ML teams are somewhat or closely integrated with their annotation partners. ML teams that are not at all engaged with annotation partners are the most likely to take more than three months to get annotated data (15%, vs. 9% for teams that work closely). Our results therefore suggest that working closely is the most efficient approach.
Furthermore, ML teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that do so later. ML teams are often neither incentivized nor organizationally structured to work closely with their annotation partners. However, our data suggests that working closely with those partners can help ML teams overcome challenges in data curation and annotation quality, accelerating model deployment.
[Chart: Time to develop new models vs. time to define annotation requirements. Categories shown: Less Than 1 Month, 1-3 Months, 3-6 Months, 6-9 Months, 9-12 Months.]
Figure 7: Teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that do so later in the process.
Document your labeling instructions clearly. Instructions illustrated with examples of the concept, such as showing examples of scratched pills, of borderline cases and near misses, and any other confusing examples—having that in your documentation often allows labelers to become much more consistent and systematic in how they label the data.
CEO, Landing AI
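One way to make the advice above concrete is to store labeling instructions as structured data, so that missing example types are easy to spot. The sketch below is hypothetical throughout: the LabelGuideline class, its field names, and the image file names are invented for illustration and do not describe any particular annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class LabelGuideline:
    """Labeling instructions for one class, with the example types the quote recommends."""
    name: str
    definition: str
    clear_examples: list[str] = field(default_factory=list)
    borderline_cases: list[str] = field(default_factory=list)
    near_misses: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of example sections that are still empty."""
        sections = {
            "clear_examples": self.clear_examples,
            "borderline_cases": self.borderline_cases,
            "near_misses": self.near_misses,
        }
        return [section for section, items in sections.items() if not items]


# Hypothetical guideline for the "scratched pill" concept; file names are placeholders.
scratched_pill = LabelGuideline(
    name="scratched_pill",
    definition="Visible linear surface damage on the pill coating.",
    clear_examples=["img_001.png"],
    near_misses=["img_017.png"],
)
gaps = scratched_pill.missing_sections()
print(gaps)
```

A check like this could run before a guideline is sent to labelers, flagging classes that still lack borderline cases or near misses.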
For this kind of role (ML engineers, applied scientists, data scientists), you need to wear multiple hats. If you think you can read a paper and implement a solution, that’s not good enough. You need to be aware of what the business needs, what data you have, what data you need, and what’s a priority for different people. You have to be that end-to-end owner of your project.
To address data quality and volume challenges, many respondents use synthetic data.
Among the whole sample, 73% of respondents leverage synthetic data for their projects. Of those, 51% use synthetic data to address insufficient examples of edge cases from real-world data, 29% use synthetic data to address privacy or legal issues with real-world data, and 28% use synthetic data to address long lead times to acquire real-world data.
[Chart: % of Respondents That Agree, by reason for using synthetic data]
Figure 8: About half of teams use synthetic data to address insufficient examples of edge cases from real-world data. [Note: this question was multi-select]
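Because the question was multi-select, the reported shares (51%, 29%, and 28%) sum to 108% rather than 100%. The sketch below shows how such shares are typically tallied. The answer sets and the helper name selection_shares are hypothetical and do not reproduce the survey's data.

```python
from collections import Counter

# Hypothetical multi-select answers: each respondent may pick several reasons.
# These rows are invented for illustration and are NOT the report's data.
ANSWERS = [
    {"edge cases"},
    {"edge cases", "privacy or legal"},
    {"long lead times"},
    {"edge cases", "long lead times"},
]


def selection_shares(answers):
    """Return, for each option, the fraction of respondents who selected it."""
    counts = Counter(option for selected in answers for option in selected)
    return {option: count / len(answers) for option, count in counts.items()}


shares = selection_shares(ANSWERS)
for option, share in sorted(shares.items()):
    print(f"{option}: {share:.0%}")
# Shares can total more than 100% because one respondent can count toward several options.
print(f"sum of shares: {sum(shares.values()):.0%}")
```

Each share is computed against the number of respondents, not the number of selections, so the totals are not expected to add up to 100%.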
The power of simulation and assimilation for us is not just to help us validate things, but to really help us provide an extensive set of synthetic data based on data that we’ve already collected and use that to dramatically extend into different cases that you normally wouldn’t see driving on the road. We can put in different weather conditions, vegetation, and environment around the same segment of road and just create a lot more types of data.
Senior Vice President of Software, Aurora
We know that we have gaps in our data. We also know how we would approach the problem if we had all the information we needed. We use synthetic data to augment that weakness. And we do so in a way that helps define what features to build or what products to build going forward.
03 - Model Development & Deployment Challenges