Data is the foundation for all machine learning. Many teams spend most of their time on this step of the ML lifecycle, collecting, analyzing, and annotating data to ensure model performance. In this chapter, we explore some of the key challenges ML teams encounter when it comes to data for their AI initiatives and discuss some of the latest trends, including best practices and guidance from leading ML teams.
Data quality is the most challenging part of acquiring data.
When it comes to acquiring training data, the barriers to success for ML teams include challenges with collection, quality, analysis, versioning, and storage. Respondents cited data quality as the most difficult part of acquiring data (33%), closely followed by data collection (31%). These problems have a significant downstream impact on ML efforts because teams often cannot model effectively without quality data.
Figure 1: Most respondents cited data quality as the most challenging aspect of acquiring training data, followed by data collection.
We have to spend a lot of time sourcing and preparing high-quality data before we train the model. Just as a chef would spend a lot of time sourcing and preparing high-quality ingredients before cooking a meal, a lot of the emphasis of the work in AI should shift to systematic data preparation.
CEO and Founder, Landing AI
We work with third-party vendors to collect data, and often the data is not exactly what we asked for. We may have asked for people’s faces to be visible, and then they’re not. If we asked for data to be collected in a living room, we need to validate that the data was indeed collected in a living room, and we need to add the right tag to the metadata so that we can actually search for it later. Validating that the data collected is actually what we asked for is a fairly large challenge.
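The validation step this quote describes can be sketched as a simple check of each sample's metadata against the original collection request. This is a minimal illustration, not any vendor's actual pipeline; the spec fields (`scene`, `faces_visible`) are hypothetical.

```python
# Hypothetical collection spec: what we asked the vendor to collect.
REQUIRED_SPEC = {"scene": "living_room", "faces_visible": True}

def validate_sample(metadata: dict, spec: dict = REQUIRED_SPEC) -> list:
    """Return a list of mismatches between a sample's metadata and the spec."""
    problems = []
    for key, expected in spec.items():
        actual = metadata.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems

# Flag samples that do not match what was requested.
batch = [
    {"scene": "living_room", "faces_visible": True},
    {"scene": "kitchen", "faces_visible": False},
]
rejected = [(i, p) for i, m in enumerate(batch) if (p := validate_sample(m))]
```

Samples that pass can then be tagged with the validated metadata so they are searchable later, as the quote suggests.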
Factors contributing to data quality challenges include variety, volume, and noise.
To learn more about what factors contribute to data quality, we explored how data type affects volume and variety. Over one-third (37%) of all respondents said they do not have the variety of data they need to improve model performance. More specifically, respondents working with unstructured data have the biggest challenge getting the variety of data they need to improve model performance. Since a large amount of data generated today is unstructured, it is imperative that teams working in ML develop strategies for managing data quality, particularly for unstructured data.
Perception of Data Volume by Data Type (respondents reporting too little data, too much data, or neither)
Figure 2: Respondents working with unstructured data are more likely than those working with semi-structured or structured data to have too little data.
The majority of respondents said they have problems with their training data. Most (67%) reported that the biggest issue is data noise, followed by data bias (47%) and domain gaps (47%). Only 9% indicated their data is free from noise, bias, and gaps.
Figure 3: The majority of respondents have problems with their training data. The top three issues are data noise (67%), data bias (47%), and domain gaps (47%). [Note: Sum total does not add up to 100% as respondents were asked to stack-rank options.]
Five tips for data-centric AI development: Make labels consistent, use consensus labeling to spot inconsistencies, clarify labeling instructions, toss out noisy examples (because more data is not always better), and use error analysis to focus on a subset of data to improve.
CEO and Founder, Landing AI
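The consensus-labeling tip above can be sketched with a simple majority vote: examples where annotators disagree beyond a threshold are flagged for review, which often surfaces unclear labeling instructions. Labeler outputs and the agreement threshold below are illustrative.

```python
from collections import Counter

def consensus(labels: list, min_agreement: float = 0.66):
    """Return the majority label and whether agreement meets the threshold."""
    count = Counter(labels)
    label, votes = count.most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement >= min_agreement

# Each example labeled independently by three annotators (illustrative data).
examples = {
    "img_001": ["cat", "cat", "cat"],   # unanimous
    "img_002": ["cat", "dog", "dog"],   # 2/3 majority: accepted
    "img_003": ["cat", "dog", "bird"],  # no consensus: revisit instructions
}
needs_review = {k: v for k, v in examples.items() if not consensus(v)[1]}
```

Error analysis can then focus labeling effort on the flagged subset rather than on relabeling everything.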
If AI is anything, it’s something that normalizes an awful lot of data and provides new insights.
Co-Founder, Schmidt Ventures
We have troves of information in different formats and different languages across the world. However, some data is much more valuable to my project than others. And of course, our biggest concern is, how do you get and use data in a privacy-enabled way? If you remove all the PII and unique identifiers, you may not have the variety of data you need.
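The tension the quote describes, removing PII and unique identifiers while preserving useful data, can be illustrated with a toy scrubbing pass. This is not production-grade de-identification (real pipelines need much more than regexes); the patterns below are illustrative.

```python
import re

# Illustrative patterns for two common PII types; real de-identification
# covers many more identifier classes and locale formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text
```

For example, `scrub("Contact jane.doe@example.com or 555-123-4567.")` leaves the sentence structure intact while removing the identifiers, which is exactly the trade-off the quote raises: the more aggressively you scrub, the less variety survives.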
There are two main issues when it comes to data: quality and quantity. In terms of quality, because we are using customer data for my project, that data comes with a lot of noise. Sometimes we just have to adjust our modeling approach to handle that noisy data. In terms of quantity, our data sits in different places, but we need to have it all in one place.
Curating data to be annotated and annotation quality are the top two challenges for companies preparing data for training models.
When it comes to preparing data to train models, respondents cited curating data (33%) and annotation quality (30%) as their top two challenges. Curating data involves, among other things, removing corrupted data, tagging data with metadata, and identifying what data actually matters for models. Failure to properly curate data before annotation can result in teams spending time and budget annotating data that is irrelevant or unusable for their models. Annotating data involves adding context to raw data to enable ML models to generate predictions based on what they learn from the data. Failure to annotate data at high quality often leads to poor model performance, making annotation quality of paramount importance.
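The curation steps named above (dropping corrupted data, tagging with metadata, and keeping only what matters) can be sketched as a single filtering pass over raw records. The record schema and thresholds below are illustrative assumptions, not a standard pipeline.

```python
def curate(records: list) -> list:
    """Drop corrupted records, tag the rest, and keep only usable ones."""
    curated = []
    for rec in records:
        text = rec.get("text")
        if not text or not text.strip():          # corrupted / empty: drop
            continue
        # Tag with searchable metadata for later filtering.
        rec["tags"] = {
            "length": len(text),
            "language_hint": "en" if text.isascii() else "other",
        }
        if rec["tags"]["length"] < 5:             # too short to annotate usefully
            continue
        curated.append(rec)
    return curated

raw = [
    {"id": 1, "text": "The delivery arrived two days late."},
    {"id": 2, "text": None},   # corrupted
    {"id": 3, "text": "ok"},   # too short to be worth annotating
]
ready = curate(raw)            # only record 1 survives
```

Running a pass like this before annotation avoids spending labeling budget on data that is irrelevant or unusable for the model.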
Figure 4: Curating data and annotation quality are the top challenges for companies preparing data for training models. [Note: Sum total does not add up to 100% as respondents were asked to stack-rank options.]
It’s a lot harder for us to apply automation to the data that we obtain, because we get it from external service providers in the healthcare industry who don’t necessarily have maturity on their side in terms of how they think about data feeds. So we have to do a lot of manual auditing to ensure that our data that’s going into these machine learning models is actually the kind of data that we want to be using.
ML Engineer, One Hot Labs
We joke that 80% of data science is data cleaning, but it's also true. If 80% of a machine learning engineer's or data scientist's work is data cleaning, then maybe data cleaning has to be a core part of that work.
CEO and Founder, Landing AI
It’s hard to actually see inside of the data. We have large files coming in, and you can do small things like create a GIF file from the data or create some sort of preview to quickly glimpse what’s in your data, but this process requires a fair amount of babysitting from engineers or project managers, and more automation, more visualization tools, and dashboards can really help.
If a type of data is more subjective, it is more difficult to label, and we need labelers who match our quality standards. Labeling is difficult and often ambiguous; to be honest, the ambiguity is not 100% resolvable, and depending on the question, some will always remain. In computer vision problems, for example, labelers may differ on what the ground truth actually is, but we can cross-validate their answers.
02 - Data Best Practices