Data Best Practices

The time it takes to deploy new models and retrain existing ones is a leading indicator of how mature an ML team is. Leading ML teams address data-centric challenges through infrastructure investments, collaboration with data partners, and synthetic data that mitigates quality and volume problems. Teams that get annotated data faster tend to retrain and deploy models to production more frequently.

Companies that invest in data annotation infrastructure can develop new models, retrain existing ones, and deploy them into production faster.

Our analyses reveal a key AI readiness trend: there is a clear relationship among how quickly teams can deploy new models, how frequently they retrain existing models, and how long it takes to get annotated data. Teams that deploy new models quickly (especially in less than one month) tend to get their data annotated faster (in less than one week) than those that take longer to deploy.

Teams that retrain their existing models more frequently (e.g., daily or weekly) tend to get their annotated data more quickly than teams that retrain monthly, quarterly, or yearly.

As discussed in the challenges section, however, it is not just the speed of getting annotated data that matters. ML teams must make sure the correct data is selected for annotation and that the annotations themselves are of high quality. Aligning all three factors—selecting the right data, annotating at high quality, and annotating quickly—takes a concerted effort and an investment in data infrastructure.

Best Practice 1
[Figure 5 chart: Time to Deploy vs. Time to Get Annotated Data. X-axis: Time to Deploy (less than 1 month to 12+ months); Y-axis: Time to Get Annotated Data (under 1 week to more than 3 months); circle size = % of respondents that agree.]

Figure 5: Teams that get annotated data faster tend to deploy new models to production faster. [Note: Some respondents included time to get annotated data as part of the time to deploy new models.]

[Figure 6 chart: Time to Retrain vs. Time to Get Annotated Data. X-axis: Time to Retrain Existing Models (daily, weekly, monthly, quarterly, yearly); Y-axis: Time to Get Annotated Data (under 1 week to more than 3 months); circle size = % of respondents that agree.]

Figure 6: Teams that get annotated data faster tend to be able to retrain and deploy existing models to production more frequently. [Note: Similar to figure 5, some respondents included time to get annotated data as part of the time to retrain. In the case of retraining, however, teams can still retrain monthly, weekly, or even daily, by annotating data in large batches.]

Data is always king. More often than not, that results in better outcomes compared to fancier models. People like to publish papers on fancy models, and those obviously matter, but a lot less. If you can figure out how to get higher-quality data—and more of it—at scale, then fairly simple models will give you better results than the most advanced model in the world running on bad data.
ML Engineer, Meta
You identify some scenes that you want to label, which may involve some human-in-the-loop labeling, and you can track the turnaround time, given a certain amount of data. And for model training, you can try to optimize with distributed training to make it faster. So you have a pretty good understanding of how long it takes to train a model on the amount of data that you typically train a model on. And the remaining parts, like evaluation, should be fairly quick.
Jack Guo, Head of Autonomy Platform, Nuro
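As a rough illustration of the iteration-time accounting described in the quote above, the sketch below simply adds up labeling turnaround, training, and evaluation time for one retraining cycle. It is a minimal example; the class, field names, and durations are illustrative assumptions, not figures from the survey or Nuro's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class IterationEstimate:
    """Rough model of one retraining cycle (all durations in days)."""
    labeling_turnaround: float  # time for the annotation partner to return labeled data
    training_time: float        # wall-clock training time on the usual data volume
    evaluation_time: float      # offline evaluation and review

    def cycle_days(self) -> float:
        # Total time before an updated model is ready to ship.
        return self.labeling_turnaround + self.training_time + self.evaluation_time


# Example: labeling turnaround dominates the cycle, echoing the survey's findings.
estimate = IterationEstimate(labeling_turnaround=5.0, training_time=1.0, evaluation_time=0.5)
print(f"Estimated retraining cycle: {estimate.cycle_days():.1f} days")
```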

ML teams that work closely with annotation partners are the most efficient at getting annotated data.

The majority of respondents (81%) said their engineering and ML teams are somewhat or closely integrated with their annotation partners. ML teams that are not engaged with annotation partners at all are the most likely to take more than three months to get annotated data (15% vs. 9% for teams that work closely). Our results therefore suggest that working closely is the most efficient approach.

Furthermore, ML teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that do so later. ML teams are often not incentivized, or even organizationally structured, to work closely with their annotation partners. However, our data suggests that close collaboration with those partners can help ML teams overcome challenges in data curation and annotation quality, accelerating model deployment.
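One way to make "defining annotation requirements and taxonomies early" concrete is to capture them in a small, machine-readable spec that both the ML team and the annotation partner review before labeling starts. The sketch below is a hypothetical Python example; the label classes, quality targets, and field names are assumptions for illustration, not part of the survey.

```python
# A minimal, hypothetical annotation-requirements spec.
# Classes, thresholds, and field names are illustrative, not survey data.
ANNOTATION_SPEC = {
    "taxonomy": {
        "vehicle": ["car", "truck", "bus"],
        "vulnerable_road_user": ["pedestrian", "cyclist"],
    },
    "geometry": "2d_bounding_box",
    "quality_targets": {
        "min_inter_annotator_agreement": 0.9,  # reviewed on a sampled subset
        "max_turnaround_days": 7,              # aligned with a weekly retraining cadence
    },
    "edge_cases": ["occluded objects", "night scenes", "adverse weather"],
}


def validate_label(label: str) -> bool:
    """Check that a proposed label belongs to the agreed taxonomy."""
    return any(label in classes for classes in ANNOTATION_SPEC["taxonomy"].values())


assert validate_label("cyclist")
assert not validate_label("traffic_cone")  # not in the agreed taxonomy, needs a spec change
```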

Best Practice 2

[Figure 7 chart: Time to Develop New Models vs. Time to Define Annotation Requirements. Bars show % of respondents (0-100%) by time to deploy (less than 1 month to 9-12 months), split by whether annotation requirements were defined early, later, or never.]

Figure 7: Teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that do so later in the process.

Document your labeling instructions clearly. Instructions illustrated with examples of the concept, such as showing examples of scratched pills, of borderline cases and near misses, and any other confusing examples—having that in your documentation often allows labelers to become much more consistent and systematic in how they label the data.
Andrew Ng, CEO, Landing AI
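To show what instructions "illustrated with examples of the concept," borderline cases, and near misses might look like in practice, here is a minimal sketch. The defect class, file paths, and helper function are hypothetical placeholders chosen to echo the quote, not Landing AI's actual format.

```python
# Hypothetical labeling instructions: each class documents positives,
# borderline cases, and near misses, as the quote above recommends.
LABELING_INSTRUCTIONS = {
    "scratched_pill": {
        "definition": "Any visible scratch longer than 2 mm on the pill surface.",
        "positive_examples": ["examples/scratch_long.jpg"],
        "borderline_cases": ["examples/scratch_faint.jpg"],  # label as scratched
        "near_misses": ["examples/dust_speck.jpg"],          # do NOT label as scratched
        "notes": "When in doubt, escalate to a reviewer rather than guessing.",
    }
}


def instructions_complete(instructions: dict) -> bool:
    """Every class should document positives, borderline cases, and near misses."""
    required = {"definition", "positive_examples", "borderline_cases", "near_misses"}
    return all(required.issubset(spec) for spec in instructions.values())


assert instructions_complete(LABELING_INSTRUCTIONS)
```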
For this kind of role (ML engineers, applied scientists, data scientists), you need to wear multiple hats. If you think you can read a paper and implement a solution, that's not good enough. You need to be aware of what the business needs, what data you have, what data you need, and what's a priority for different people. You have to be that end-to-end owner of your project.
Applied Scientist, Amazon

To address data quality and volume challenges, many respondents use synthetic data.

Among the whole sample, 73% of respondents leverage synthetic data for their projects. Of those, 51% use synthetic data to address insufficient examples of edge cases from real-world data, 29% use synthetic data to address privacy or legal issues with real-world data, and 28% use synthetic data to address long lead times to acquire real-world data.

Best Practice 3
[Figure 8 chart: Top Uses of Synthetic Data. Y-axis: % of respondents that agree (0-100%); categories: Insufficient Examples of Edge Cases, Privacy or Legal Issues, Long Lead Times to Acquire Data.]

Figure 8: About half of teams use synthetic data to address insufficient examples of edge cases from real-world data. [Note: this question was multi-select]

The power of simulation and assimilation for us is not just to help us validate things, but to really help us provide an extensive set of synthetic data based on data that we've already collected and use that to dramatically extend into different cases that you normally wouldn't see driving on the road. We can put in different weather conditions, vegetation, and environment around the same segment of road and just create a lot more types of data.
Yangbing Li, Senior Vice President of Software, Aurora
We know that we have gaps in our data. We also know how we would approach the problem if we had all the information we needed. We use synthetic data to augment that weakness. And we do so in a way that helps define what features to build or what products to build going forward.
Data Scientist, Apple
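Both quotes above describe using synthetic variation to cover conditions that are rare in collected data. The sketch below shows one very simple flavor of that idea, assuming image data stored as NumPy arrays: darkening a scene and adding noise to approximate a low-light variant. The function name and parameter values are illustrative assumptions, not any company's actual simulation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_low_light(image: np.ndarray, darken: float = 0.4, noise_std: float = 8.0) -> np.ndarray:
    """Create a synthetic low-light variant of an image.

    A crude stand-in for the condition variation described above: scale the
    pixel intensities down and add sensor-like Gaussian noise. The default
    parameters are illustrative, not tuned values.
    """
    dark = image.astype(np.float32) * darken
    noisy = dark + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


# Example: expand one rare daytime scene into several synthetic low-light variants.
daytime_scene = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # placeholder image
synthetic_batch = [simulate_low_light(daytime_scene, darken=d) for d in (0.5, 0.35, 0.2)]
print(len(synthetic_batch), "synthetic low-light variants generated")
```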


Next chapter: 03 - Model Development & Deployment Challenges