Data Best Practices
Companies that invest in data annotation infrastructure can develop new models, retrain existing ones, and get them into production faster.
Our analyses reveal a key AI readiness trend: how quickly teams deploy new models and how frequently they retrain existing ones both track closely with how long it takes them to get annotated data. Teams that deploy new models quickly (especially in less than one month) tend to get their data annotated faster (in less than one week) than those that take longer to deploy.
Teams that retrain their existing models more frequently (e.g., daily or weekly) tend to get their annotated data more quickly than teams that retrain monthly, quarterly, or yearly.
As discussed in the challenges section, however, it is not just the speed of getting annotated data that matters. ML teams must make sure the correct data is selected for annotation and that the annotations themselves are of high quality. Aligning all three factors—selecting the right data, annotating at high quality, and annotating quickly—takes a concerted effort and an investment in data infrastructure.
Figure 5: Teams that get annotated data faster tend to deploy new models to production faster. (Axes: time to get annotated data vs. time to deploy.) [Note: Some respondents included time to get annotated data as part of the time to deploy new models.]
Figure 6: Teams that get annotated data faster tend to be able to retrain and deploy existing models to production more frequently. (Axes: time to get annotated data vs. time to retrain existing models.) [Note: Similar to Figure 5, some respondents included time to get annotated data as part of the time to retrain. In the case of retraining, however, teams can still retrain monthly, weekly, or even daily by annotating data in large batches.]
ML teams that work closely with their annotation partners are the most efficient at getting annotated data.
The majority of respondents (81%) said their engineering and ML teams are somewhat or closely integrated with their annotation partners. ML teams that are not at all engaged with annotation partners are the most likely (15% vs. 9% for teams that work closely) to take more than three months to get annotated data. Our results therefore suggest that close collaboration is the most efficient approach.
Furthermore, ML teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that define them later. ML teams are often not incentivized, or even organizationally structured, to work closely with their annotation partners. Our data suggests, however, that working closely with those partners can help ML teams overcome challenges in data curation and annotation quality, accelerating model deployment.
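As an illustration only (not drawn from the survey), "defining annotation requirements and taxonomies early" can be as simple as writing the taxonomy down as a small, version-controlled spec before any data is sent for labeling. The task name, label classes, attributes, and quality threshold below are hypothetical.

```python
# Hypothetical example: a minimal annotation spec a team might define early,
# before any data is sent to an annotation partner.
from dataclasses import dataclass, field


@dataclass
class LabelClass:
    name: str                                        # class name in the taxonomy
    attributes: list = field(default_factory=list)   # extra fields annotators must fill in


@dataclass
class AnnotationSpec:
    task: str                       # e.g., "2d_bounding_box"
    taxonomy: list                  # label classes annotators may use
    min_reviewer_agreement: float   # quality bar agreed with the annotation partner

    def validate(self):
        names = [c.name for c in self.taxonomy]
        assert len(names) == len(set(names)), "duplicate class names in taxonomy"
        assert 0.0 < self.min_reviewer_agreement <= 1.0


# All names and thresholds below are illustrative assumptions.
spec = AnnotationSpec(
    task="2d_bounding_box",
    taxonomy=[
        LabelClass("pedestrian", attributes=["occluded"]),
        LabelClass("cyclist"),
        LabelClass("vehicle", attributes=["occluded", "truncated"]),
    ],
    min_reviewer_agreement=0.9,
)
spec.validate()
```

Keeping a spec like this alongside the training code gives ML engineers and annotation partners a shared artifact to review and revise together as requirements evolve.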
Figure 7: Teams that define annotation requirements and taxonomies early in the process are likely to deploy new models more quickly than those that do so later. (Time to develop new models, from less than 1 month to 9-12 months, broken out by whether annotation requirements were defined early, later, or never.)
To address data quality and volume challenges, many respondents use synthetic data.
Across the full sample, 73% of respondents use synthetic data in their projects. Of those, 51% use it to address insufficient examples of edge cases in real-world data, 29% to address privacy or legal issues with real-world data, and 28% to address long lead times in acquiring real-world data.
Figure 8: About half of teams use synthetic data to address insufficient examples of edge cases from real-world data. (Y-axis: % of respondents that agree.) [Note: this question was multi-select, so percentages do not sum to 100%.]
03 - Model Development & Deployment Challenges