Scale AI was founded on the belief that better data → better AI. There has recently been a renewed focus on “good” data and its impact on the quality of models in the industry. Andrew Ng, who spoke at our first Transform conference earlier this year, emphasized this point: “We spend a lot of time sourcing and preparing high-quality data before we train the model, just like a chef would spend a lot of time to source and prepare high-quality ingredients before they cook a meal. I feel like a lot of the emphasis of the work in AI should shift to systematic data preparation.”
While we have intuitively understood the importance of data quality, the Google Research paper “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI,” published earlier this year, was the first we’ve come across that attempts to quantify the impact of “bad” data and back it with empirical evidence.
Overview of the Paper
AI models are increasingly being applied in a broad range of domains, from healthcare to climate change. Based on interviews with 53 AI practitioners around the world working on a variety of applications, the paper’s researchers define data cascades, identify their causes, and present empirical evidence of the problem. Data cascades are compounding events caused by data issues that lead to negative downstream effects, resulting in the accumulation of tech debt over time. For more details on the paper’s findings, we encourage you to read the full research here.
The paper notes that 92% of the AI practitioners interviewed experienced at least one data cascade, meaning adverse outcomes caused by problems in a data processing pipeline. The paper cites several causes of cascades, but the number one cause, cited by 54.7% of the practitioners interviewed, was “interacting with physical world brittleness.” In other words, when AI systems move from their well-defined, static development environments into the physical world, they are challenged by the dynamic changes inherent in real-world conditions.
Many ML researchers and engineers today are taught to develop models using pristine datasets in academic settings. In real-world applications, however, the data ML engineers must work with is messy or even non-existent. While the paper’s focus is on high-stakes domains with limited resources, even in high-resource settings, ML engineers or data scientists will hack together a data pipeline leveraging whatever tools and services they can find. The lack of unified infrastructure to support comprehensive data operations creates structural failures that ultimately cause data cascades and their commensurate tech debt.
Exacerbating the issue, as noted by Sculley et al., 2015, complex models erode abstraction boundaries for ML systems. Developing and maintaining traditional code already results in tech debt. But that is significantly compounded by ML-specific issues — including data dependencies, analysis debt, and configuration debt — that are much harder to detect, diagnose, and fix. The longer tech debt accumulates, the more pervasive it becomes. The more pervasive the tech debt, the more time and effort ML teams ultimately must spend to resolve issues, resulting in productivity loss and slower deployments.
Where Scale Comes In: Minimizing Data Cascades by Focusing on Data Work
Scale was founded on the belief that for AI and ML, data is the new code. By providing a data-centric, end-to-end solution to manage the entire ML lifecycle, we bridge the skill gap between academic model development and real-world data work, so ML engineers don’t have to change how they work. By focusing on the data, we enable our customers to focus on the model work, not the data work, minimizing the risk of data cascades and making AI systems safer and more robust.
We started this journey by focusing on data annotation. When we first began providing data annotation solutions to our autonomous vehicle customers, our quality assurance (QA) systems were highly manual. Human QA Agents or Queue Managers would audit and re-check sample tasks and manually assign scores to individual Taskers. Over the years, as we expanded into new use cases and new industries, we have built out a variety of internal tools to automate many of these QA processes including:
With benchmarks, we programmatically take an annotation task we know is correct, introduce an error into it (usually a common mistake we’ve seen Taskers make), and serve it to our Taskers as a regular task. By leveraging benchmark tasks, we can automatically score Taskers based on whether they fixed the error before submitting. This allows us to understand how well each Tasker is performing without manually scoring each individual on a variety of tasks.
Linters are tools designed to catch errors and flag anomalies. At Scale, we use several types of linters. For example, a response-based linter looks at a submitted task and checks for possible errors such as annotations that should not exist, like duplicate bounding boxes for the same object. Over time, we have layered response-based linters into the UI to flag issues for our Taskers as they annotate, for example by warning when a bounding box appears too large.
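To make the two QA mechanisms above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Scale's implementation: the `Box` type, the function names `lint_response` and `tasker_scores`, and the thresholds `dup_iou` and `max_area_frac` are all hypothetical. The linter flags same-label boxes that overlap almost completely (likely duplicates) and boxes that cover most of the image; the benchmark helper turns "did the Tasker fix the injected error?" outcomes into a per-Tasker score.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical format: top-left corner plus size)."""
    label: str
    x: float
    y: float
    w: float
    h: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def lint_response(boxes, image_w, image_h, dup_iou=0.9, max_area_frac=0.8):
    """Return warnings for a submitted task. Thresholds are assumed values."""
    warnings = []
    for i, a in enumerate(boxes):
        # Size check: a box covering most of the image is suspicious.
        if a.w * a.h > max_area_frac * image_w * image_h:
            warnings.append(f"box {i} ({a.label}) appears too large")
        # Duplicate check: same label and near-total overlap with a later box.
        for j in range(i + 1, len(boxes)):
            b = boxes[j]
            if a.label == b.label and iou(a, b) >= dup_iou:
                warnings.append(f"boxes {i} and {j} look like duplicates ({a.label})")
    return warnings


def tasker_scores(outcomes):
    """outcomes: iterable of (tasker_id, fixed_injected_error) pairs.

    Returns the fraction of benchmark tasks each Tasker got right.
    """
    totals = defaultdict(lambda: [0, 0])  # tasker_id -> [fixed, served]
    for tasker_id, fixed in outcomes:
        totals[tasker_id][0] += int(fixed)
        totals[tasker_id][1] += 1
    return {t: fixed / served for t, (fixed, served) in totals.items()}
```

In a real pipeline the same lint checks could run both on submission and live in the annotation UI, and the benchmark scores would feed whatever review policy the QA team uses; both of those choices are outside what this sketch covers.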
Moving beyond our core expertise in data annotation, our team has been hard at work building products to support the entire ML lifecycle. These include:
Collecting diverse and representative data, which is the first step in developing a machine learning model, can be a challenge, especially for content and language use cases where there is much more subjectivity. Leveraging our existing global Tasker base, our text and audio collection workflows are API-supported and integrate directly with our data labeling pipeline for seamless end-to-end dataset creation.
While more training data is generally better than less data, once data is collected, ML teams need to select what data to annotate. Today, most ML teams do this either by randomly selecting data to label, or manually parsing through their data to identify valuable examples. Scale Nucleus uses active learning and advanced querying to help ML teams visualize their data, understand over- or under-represented labels and attributes to help mitigate bias, and identify edge cases to address the long-tail challenges of machine learning development. Nucleus also enables model-assisted QA, using model predictions (model errors in particular) to identify errors in labeled data and to improve data quality.
In the video below, we show how active learning powers the Autotag feature of Nucleus to allow users to automatically search for and tag similar images — in this case flowers.
The ability to understand how ML models will react to rare or dangerous real-world scenarios is crucial for safety-critical computer vision applications. However, it’s not always feasible to collect enough examples of edge cases in the real world. With data generation, which is currently in alpha, Scale customers can augment ground-truth training data or generate scenarios from scratch to simulate a variety of cameras, sensors, or environmental conditions such as fog or night-time lighting. This lets ML teams synthesize specific, troublesome scenarios (often pre-labeled) to expose their models to more edge cases and under-represented scenarios than they could otherwise collect.
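Of the products above, the data-selection step is the easiest to illustrate in code. The sketch below is not the Nucleus API or Scale's method. It shows one common active-learning strategy, uncertainty sampling, in which the unlabeled items whose model predictions have the highest entropy are sent for annotation first, plus a small helper that reports label shares so over- or under-represented classes stand out. The function names and the input format (a dict of per-class probabilities) are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` items the model is least certain about.

    predictions: dict mapping item id -> list of class probabilities
    (hypothetical format). Higher entropy means a less confident model.
    """
    ranked = sorted(predictions, key=lambda item: entropy(predictions[item]), reverse=True)
    return ranked[:budget]


def label_shares(labels):
    """Fraction of the labeled set taken up by each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Compared with labeling a random sample, ranking by uncertainty spends the annotation budget on examples the current model finds hardest, which is where long-tail cases tend to show up. Entropy is only one possible score; margin-based or disagreement-based scores would slot into `select_for_labeling` the same way.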
As we noted in our Series E Announcement, we’ve already seen the impact that the right data infrastructure can bring to our customers. From 10X’ing throughput for Toyota to improving States Title’s document processing model to more than 95%, we have freed up many of our customers’ ML teams to focus on the model work instead of the data work. We’re proud to be trusted by the world’s most ambitious AI teams to build the data foundation that will set up their organizations for success for decades to come.
In the coming weeks, we’ll also dive deeper into some of the topics — like how to solve for the long-tail distributions of data with better data management — on our blog. In the meantime, if you or your team are looking for an end-to-end solution for your datasets, reach out to us. We’d love to hear from you.