Data Challenges
Data quality is the most challenging part of acquiring data.
When it comes to acquiring training data, the barriers to success for ML teams include challenges with collection, quality, analysis, versioning, and storage. Respondents cited data quality as the most difficult part of acquiring data (33%), closely followed by data collection (31%). These problems have a significant downstream impact on ML efforts because teams often cannot model effectively without quality data.
% of Respondents That Agree
Figure 1: Most respondents cited data quality as the most challenging aspect of acquiring training data, followed by data collection.
Factors contributing to data quality challenges include variety, volume, and noise.
To learn more about what factors contribute to data quality, we explored how data type affects volume and variety. Over one-third (37%) of all respondents said they do not have the variety of data they need to improve model performance. Respondents working with unstructured data have the most difficulty obtaining that variety. Since a large share of the data generated today is unstructured, teams working in ML need strategies for managing data quality, particularly for unstructured data.
[Chart: Perception of Data Volume By Data Type. Response options: Too Little Data, Neither Too Little Nor Too Much, Too Much Data. Data types: Unstructured Data, Semi-structured Data, Structured Data.]
Figure 2: Respondents working with unstructured data are more likely than those working with semi-structured or structured data to have too little data.
The majority of respondents said they have problems with their training data. Most (67%) reported data noise as the biggest issue, followed by data bias (47%) and domain gaps (47%). Only 9% indicated their data is free from noise, bias, and gaps.
% of Respondents That Agree
Figure 3: The majority of respondents have problems with their training data. The top three issues are data noise (67%), data bias (47%), and domain gaps (47%). [Note: Sum total does not add up to 100% as respondents were asked to stack-rank options.]
Curating data to be annotated and annotation quality are the top two challenges for companies preparing data for training models.
When it comes to preparing data to train models, respondents cited curating data (33%) and annotation quality (30%) as their top two challenges. Curating data involves, among other things, removing corrupted data, tagging data with metadata, and identifying what data actually matters for models. Failure to properly curate data before annotation can result in teams spending time and budget annotating data that is irrelevant or unusable for their models. Annotating data involves adding context to raw data to enable ML models to generate predictions based on what they learn from the data. Failure to annotate data at high quality often leads to poor model performance, making annotation quality of paramount importance.
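The curation steps described above (removing corrupted data, tagging data with metadata, and identifying what data matters) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than a method taken from the survey: the `Sample` record, the `is_corrupted` check (which treats an empty payload as corrupted), and the `relevant_sources` filter are assumptions chosen only to make the three steps concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical raw sample awaiting annotation (illustrative only)."""
    sample_id: str
    payload: bytes
    source: str
    metadata: dict = field(default_factory=dict)


def is_corrupted(sample: Sample) -> bool:
    # Assumption: an empty payload stands in for "corrupted" data.
    # A real pipeline would use format-specific validation instead.
    return len(sample.payload) == 0


def curate(samples, relevant_sources):
    """Return the samples worth sending to annotation."""
    curated = []
    for sample in samples:
        # Step 1: drop corrupted data.
        if is_corrupted(sample):
            continue
        # Step 2: tag the sample with simple metadata.
        sample.metadata["source"] = sample.source
        sample.metadata["size_bytes"] = len(sample.payload)
        # Step 3: keep only data that matters for the model
        # (assumption: relevance is decided by the sample's source).
        if sample.source in relevant_sources:
            curated.append(sample)
    return curated


samples = [
    Sample("a", b"\x01\x02", "camera"),
    Sample("b", b"", "camera"),   # corrupted: empty payload
    Sample("c", b"\x03", "web"),  # valid, but not a relevant source
]
curated = curate(samples, relevant_sources={"camera"})
print([s.sample_id for s in curated])
```

Filtering before annotation, as in this sketch, is what keeps time and budget from being spent on samples that would later be discarded.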
% of Respondents That Agree
Figure 4: Curating data and annotation quality are the top challenges for companies preparing data for training models. [Note: Sum total does not add up to 100% as respondents were asked to stack-rank options.]
02 - Data Best Practices