Scale AI was founded on the belief that better data → better AI. There has
recently been a renewed focus on “good” data and its impact on the quality of
models in the industry. Andrew Ng, who spoke at our first Transform conference
earlier this year, emphasized this point: “We spend a lot of time sourcing and
preparing high-quality data before we train the model, just like a chef would
spend a lot of time to source and prepare high-quality ingredients before they
cook a meal. I feel like a lot of the emphasis of the work in AI should shift
to systematic data preparation.”
While we have intuitively understood the importance of data quality, the
Google Research team's paper
"'Everyone wants to do the model work, not the data work': Data Cascades in
High-Stakes AI," published earlier this year, was the first we've come across
that attempts to quantify and provide empirical evidence of "bad" data's impacts.
Overview of the Paper
AI models increasingly are being applied in a broad range of domains from
healthcare to climate change. Based on interviews conducted with 53 global AI
practitioners working on a variety of applications, the paper’s researchers
define, identify causes of, and present empirical evidence on the problem of
data cascades. Data cascades are compounding events caused by data issues that
lead to negative, downstream effects, resulting in the accumulation of tech
debt over time. For more details on the paper’s findings, we encourage you to
read the full research
here.
Key Takeaways
The paper notes that 92% of the AI practitioners interviewed
experienced at least one data cascade, i.e., adverse downstream outcomes
caused by problems in the data processing pipeline. The paper also cites
several triggers of cascades, but the most common one — cited by 54.7% of the
practitioners interviewed — was "interacting with physical world brittleness."
In other words, when AI systems leave their well-defined, static development
environments and are deployed in the physical world, they are challenged by
the dynamic changes inherent in the real world.
Table 1: Prevalence and distribution of data cascades, triggers, and
impacts. IN is short for India, EA & WA for East African and West African
countries respectively, and US for the United States.
Many ML researchers and engineers today are taught to develop models using
pristine datasets in academic settings. In real-world applications, however,
the data ML engineers must work with is messy or even non-existent. While
the paper’s focus is on high-stakes domains with limited resources, even in
high-resource settings, ML engineers or data scientists will hack together a
data pipeline leveraging whatever tools and services they can find. The lack
of unified infrastructure to support comprehensive data operations creates
structural failures that ultimately cause data cascades and their
commensurate tech debt.
Exacerbating the issue, as noted by
Sculley et al., 2015, complex models erode abstraction boundaries for ML systems. Developing
and maintaining traditional code already results in tech debt. But that is
significantly compounded by ML-specific issues — including data
dependencies, analysis debt, and configuration debt — that are much harder
to detect, diagnose, and fix. The longer tech debt accumulates, the more
pervasive it becomes. The more pervasive the tech debt, the more time and
effort ML teams ultimately must spend to resolve issues, resulting in
productivity loss and slower deployments.
Where Scale Comes In: Minimizing Data Cascades by Focusing on Data Work
Scale was founded on the belief that for AI and ML, data is the new code. By
providing a data-centric, end-to-end solution to manage the entire ML
lifecycle, we bridge that skill gap, so ML engineers don’t have to change
how they work. By focusing on the data, we enable our customers to focus on
the model work, not the data work, minimizing the risk of data cascades and
making AI systems safer and more robust.
Data Annotation
We started this journey by focusing on data annotation. When we first began
providing data annotation solutions to our autonomous vehicle customers, our
quality assurance (QA) systems were highly manual. Human QA Agents or Queue
Managers would audit and re-check sample tasks and manually assign scores to
individual Taskers. Over the years, as we expanded into new use cases and
new industries, we have built out a variety of internal tools to automate
many of these QA processes, including:
Automated Benchmarks
With benchmarks, we programmatically take an annotation task we know is
correct, introduce an error into it (usually a common mistake we've seen
Taskers make), and serve it to our Taskers as a regular task. By leveraging
benchmark tasks, we can automatically score Taskers based on whether they
fixed the error before submitting. This allows us to understand how well each
Tasker is performing without manually scoring each individual on a variety of
tasks.
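To make the idea concrete, here is a minimal sketch of how a submission might be scored against a known-correct benchmark task. The box representation, IoU threshold, and greedy matching below are illustrative assumptions, not Scale's internal implementation.

```python
# Hypothetical sketch of benchmark-based QA scoring (not Scale's internal code).
# A "benchmark" task is a known-correct annotation with one injected error;
# a Tasker passes if their submission matches the gold annotation, meaning
# they fixed the injected error before submitting.
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_submission(gold: list[Box], submitted: list[Box],
                     iou_threshold: float = 0.9) -> float:
    """Fraction of gold boxes matched by the submission (greedy one-to-one matching)."""
    remaining = list(submitted)
    matched = 0
    for g in gold:
        best = max(remaining, key=lambda s: iou(g, s), default=None)
        if best is not None and iou(g, best) >= iou_threshold:
            matched += 1
            remaining.remove(best)
    return matched / len(gold) if gold else 1.0


# Example: the benchmark shifted one gold box; the Tasker corrected it.
gold = [Box(10, 10, 50, 50), Box(60, 60, 90, 90)]
submitted = [Box(10, 10, 50, 50), Box(61, 60, 90, 91)]
print(score_submission(gold, submitted))  # 1.0 -> Tasker fixed the injected error
```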
Linters
Linters are tools designed to catch errors and flag anomalies. At Scale, we
leverage several types of linters. For example, a response-based linter
looks at a submitted task and checks for annotations that should not exist,
such as duplicate bounding boxes for the same object. Over time, we have
layered response-based linters into the UI to flag issues for our Taskers as
they annotate, for example by warning when a bounding box appears too large.
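For illustration, a response-based linter for the two checks described above might look roughly like the sketch below, assuming simple (x1, y1, x2, y2) boxes; the thresholds and warning messages are hypothetical, not Scale's actual rules.

```python
# Hypothetical sketch of a response-based linter (illustrative only, not Scale's code).
# It flags duplicate bounding boxes (near-identical boxes on the same object) and
# boxes that cover a suspiciously large fraction of the image.

def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def lint_annotations(boxes, image_w, image_h,
                     duplicate_iou=0.95, max_area_fraction=0.8):
    """Return a list of human-readable warnings for a submitted annotation."""
    warnings = []
    # Duplicate check: two boxes that overlap almost completely are likely duplicates.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_iou(boxes[i], boxes[j]) >= duplicate_iou:
                warnings.append(f"Boxes {i} and {j} look like duplicates of the same object.")
    # Size check: a box covering most of the image is usually an annotation mistake.
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if (x2 - x1) * (y2 - y1) > max_area_fraction * image_w * image_h:
            warnings.append(f"Box {i} covers most of the image; is it too large?")
    return warnings


# Example: one duplicated box and one oversized box are both flagged.
boxes = [(10, 10, 50, 50), (11, 10, 50, 51), (0, 0, 640, 470)]
print(lint_annotations(boxes, image_w=640, image_h=480))
```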
Moving beyond our core expertise in data annotation, our team has been hard
at work building products to support the entire ML lifecycle. These include:
Data Collection
Collecting diverse and representative data, which is the first step in
developing a machine learning model, can be a challenge, especially for
content and language use cases where there is much more subjectivity.
Leveraging our existing global Tasker base, our text and audio collection
workflows are API-supported and integrate directly with our data labeling
pipeline for seamless end-to-end dataset creation.
Data Management
While more training data is generally better than less data, once data is
collected, ML teams need to select what data to annotate. Today, most ML
teams do this either by randomly selecting data to label, or manually
parsing through their data to identify valuable examples. Scale Nucleus uses
active learning and advanced querying to help ML teams visualize their data,
understand over- or under-represented labels and attributes to help mitigate
bias, and identify edge cases to address the long-tail challenges of machine
learning development. Nucleus also enables model-assisted QA, using model
predictions (model errors in particular) to identify errors in labeled data
and to improve data quality.
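As a generic illustration of why smarter data selection beats random sampling, the sketch below ranks unlabeled examples by predictive entropy and picks the most uncertain ones for annotation. This is a plain uncertainty-sampling heuristic, not the Nucleus API or its actual algorithms.

```python
# Minimal sketch of uncertainty-based data selection (illustrative only):
# rank unlabeled examples by model uncertainty and send the most uncertain
# ones for annotation first, instead of sampling at random.
import numpy as np


def entropy(probs: np.ndarray) -> np.ndarray:
    """Per-example predictive entropy for an (N, num_classes) array of probabilities."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)


def select_for_labeling(pred_probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` most uncertain examples (highest entropy)."""
    scores = entropy(pred_probs)
    return np.argsort(-scores)[:budget]


# Example: model predictions over 5 unlabeled images and 3 classes.
pred_probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> low labeling priority
    [0.40, 0.35, 0.25],   # uncertain -> high labeling priority
    [0.85, 0.10, 0.05],
    [0.34, 0.33, 0.33],   # most uncertain
    [0.70, 0.20, 0.10],
])
print(select_for_labeling(pred_probs, budget=2))  # [3 1]
```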
In the video below, we show how active learning powers the Autotag feature
of Nucleus to allow users to automatically search for and tag similar images
— in this case flowers.
Data Generation
The ability to understand how ML models will react to rare or dangerous
real-world scenarios is crucial for safety-critical computer vision
applications. However, it’s not always feasible to collect enough examples
of edge cases in the real world. Currently in alpha, our data generation
product lets Scale customers augment ground-truth training data or generate
scenarios from scratch to simulate a variety of cameras, sensors, or
environmental conditions such as
fog or night-time lighting. With data generation, ML teams can synthesize
specific, troublesome scenarios (often pre-labeled) to expose their models
to more edge cases and under-represented scenarios than they could otherwise
collect.
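As a toy illustration of the augmentation side of this idea (not our generation pipeline), the sketch below blends a synthetic fog effect into an image while leaving its labels untouched; the haze color and the depth heuristic are arbitrary assumptions.

```python
# Toy sketch of synthetic augmentation for an under-represented condition
# (here, fog), not Scale's generation pipeline: blend an image toward a light
# haze color, with stronger fog for more distant (higher) pixels.
import numpy as np


def add_fog(image: np.ndarray, intensity: float = 0.5) -> np.ndarray:
    """Blend an HxWx3 uint8 image with a gray haze; `intensity` in [0, 1]."""
    h, w, _ = image.shape
    haze = np.full_like(image, 200, dtype=np.float32)  # light gray fog color (assumed)
    # Depth proxy: pixels near the top of the frame (farther away) get more fog.
    depth = np.linspace(1.0, 0.3, h, dtype=np.float32)[:, None, None]
    alpha = np.clip(intensity * depth, 0.0, 1.0)
    fogged = (1.0 - alpha) * image.astype(np.float32) + alpha * haze
    return fogged.astype(np.uint8)


# Example: fog a random "image" and keep its original labels, since the
# geometry of annotated objects does not change under this augmentation.
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
fogged = add_fog(image, intensity=0.6)
print(fogged.shape, fogged.dtype)  # (480, 640, 3) uint8
```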
Real-World Impact
As we noted in our
Series E Announcement, we’ve
already seen the impact that the right data infrastructure can bring to our
customers. From 10X'ing throughput for Toyota to improving the accuracy of
States Title's document processing model to more than 95%, we have freed up
many of our
customers’ ML teams to focus on the model work instead of the data work.
We’re proud to be trusted by the world’s most ambitious AI teams to build
the data foundation that will set up their organizations for success for
decades to come.
In the coming weeks, we’ll also dive deeper into some of the topics — like
how to solve for the long-tail distributions of data with better data
management — on our blog. In the meantime, if you or your team are looking
for an end-to-end solution for your datasets,
reach out to us. We’d love to hear
from you.