
Doing the Data Work to Mitigate Data Cascades

June 7, 2021

Scale AI was founded on the belief that better data → better AI. There has

recently been a renewed focus on “good” data and its impact on the quality of

models in the industry. Andrew Ng, who spoke at our first Transform conference

earlier this year, emphasized this point: “We spend a lot of time sourcing and

preparing high-quality data before we train the model, just like a chef would

spend a lot of time to source and prepare high-quality ingredients before they

cook a meal. I feel like a lot of the emphasis of the work in AI should shift

to systematic data preparation.”


While we have intuitively understood the importance of data quality, the

Google Research Team paper

"Everyone wants to do the model work, not the data work’: Data Cascades in

High-Stakes AI"

published earlier this year was the first we’ve come across that attempts to

quantify and provide empirical evidence of “bad” data’s impacts.

Overview of the Paper



AI models are increasingly being applied across a broad range of domains, from

healthcare to climate change. Based on interviews conducted with 53 global AI

practitioners working on a variety of applications, the paper’s researchers

define, identify causes of, and present empirical evidence on the problem of

data cascades. Data cascades are compounding events caused by data issues that

lead to negative downstream effects, resulting in the accumulation of tech

debt over time. For more details on the paper’s findings, we encourage you to

read the full research

here.

Key Takeaways



The paper notes that 92% of the AI practitioners interviewed experienced at least one data cascade, that is, an adverse outcome caused by problems in a data processing pipeline. The paper also cites several causes of

cascades, but the number one cause — cited by 54.7% of the practitioners

interviewed — was “interacting with physical world brittleness.” In other

words, when AI systems move from their well-defined, static development

environments and are deployed in the physical world, the systems are

challenged by the dynamic changes inherent in the real world.

Table 1: Prevalence and distribution of data cascades, triggers, and impacts. IN is short for India, EA & WA for East African and West African countries respectively, and US for the United States.



Many ML researchers and engineers today are taught to develop models using

pristine datasets in academic settings. In real-world applications, however,

the data ML engineers must work with is messy or even non-existent. While

the paper’s focus is on high-stakes domains with limited resources, even in

high-resource settings, ML engineers or data scientists will hack together a

data pipeline leveraging whatever tools and services they can find. The lack

of unified infrastructure to support comprehensive data operations creates

structural failures that ultimately cause data cascades and their

commensurate tech debt.


Exacerbating the issue, as noted by

Sculley et al., 2015, complex models erode abstraction boundaries for ML systems. Developing

and maintaining traditional code already results in tech debt. But that is

significantly compounded by ML-specific issues — including data

dependencies, analysis debt, and configuration debt — that are much harder

to detect, diagnose, and fix. The longer tech debt accumulates, the more

pervasive it becomes. The more pervasive the tech debt, the more time and

effort ML teams ultimately must spend to resolve issues, resulting in

productivity loss and slower deployments.



Where Scale Comes In: Minimizing Data Cascades by Focusing on Data Work


Scale was founded on the belief that for AI and ML, data is the new code. By

providing a data-centric, end-to-end solution to manage the entire ML

lifecycle, we bridge that infrastructure gap so ML engineers don’t have to change

how they work. By focusing on the data, we enable our customers to focus on

the model work, not the data work, minimizing the risk of data cascades and

making AI systems safer and more robust.

Data Annotation


We started this journey by focusing on data annotation. When we first began

providing data annotation solutions to our autonomous vehicle customers, our

quality assurance (QA) systems were highly manual. Human QA Agents or Queue

Managers would audit and re-check sample tasks and manually assign scores to

individual Taskers. Over the years, as we expanded into new use cases and

new industries, we have built out a variety of internal tools to automate

many of these QA processes, including:

Automated Benchmarks


With benchmarks, we programmatically take an annotation task we know is

correct, create an error in that task (usually a common mistake we’ve seen

Taskers make), and serve it to our Taskers as a regular task. By leveraging

benchmark tasks, we can automatically give Taskers scores based on whether

they fixed the error before submitting. This then allows us to understand

how well each Tasker is performing without manually scoring each individual

on a variety of tasks.
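
To make the mechanics concrete, here is a minimal sketch of the benchmark idea in Python. The Box dataclass, the near-duplicate error, and the exact-match scoring rule are illustrative assumptions, not our production logic, which covers many error types and tolerates small coordinate differences.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: top-left corner (x, y) plus width/height."""
    x: float
    y: float
    w: float
    h: float
    label: str

def make_benchmark_task(gold_boxes: list[Box]) -> list[Box]:
    """Start from a task we know is correct and inject a common Tasker
    mistake, here a near-duplicate of an existing box."""
    seed = random.choice(gold_boxes)
    duplicate = Box(seed.x + 2, seed.y + 2, seed.w, seed.h, seed.label)
    return gold_boxes + [duplicate]

def score_submission(gold_boxes: list[Box], submission: list[Box]) -> float:
    """Credit the Tasker only if the submission matches the gold annotation
    exactly, i.e. they found and removed the injected error."""
    return 1.0 if set(submission) == set(gold_boxes) else 0.0
```

Because benchmark tasks are indistinguishable from regular tasks, these scores accumulate passively as Taskers work, with no manual auditing required.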

Linters


Linters are tools designed to catch errors and flag anomalies. At Scale, we

leverage a variety of types of linters. For example, a response-based linter

looks at a submitted task and checks for possible errors such as annotations

that should not exist, like duplicate bounding boxes for the same object.

Over time, we have layered response-based linters into the UI to flag issues

for our Taskers as they annotate: for example, a warning when the size of a

bounding box appears too large.
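
As an illustration, a response-based linter covering the two checks just mentioned might look like the following sketch; the 0.9 IoU and 80%-of-image thresholds are hypothetical values chosen for the example.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def lint_response(boxes: list[tuple], image_area: float) -> list[str]:
    """Flag likely duplicate annotations and suspiciously large boxes."""
    warnings = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > 0.9:
                warnings.append(f"boxes {i} and {j} look like duplicates")
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if (x2 - x1) * (y2 - y1) > 0.8 * image_area:
            warnings.append(f"box {i} covers most of the image")
    return warnings
```

Run after submission, checks like these catch errors before they enter the dataset; surfaced in the UI, the same checks warn Taskers while they annotate.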


Moving beyond our core expertise in data annotation, our team has been hard

at work building products to support the entire ML lifecycle. These include:

Data Collection


Collecting diverse and representative data, which is the first step in

developing a machine learning model, can be a challenge, especially for

content and language use cases where there is much more subjectivity.

Leveraging our existing global Tasker base, our text and audio collection

workflows are API-supported and integrate directly with our data labeling pipeline for seamless end-to-end dataset creation.



Data Management


While more training data is generally better than less data, once data is

collected, ML teams need to select what data to annotate. Today, most ML

teams do this either by randomly selecting data to label, or manually

parsing through their data to identify valuable examples. Scale Nucleus uses

active learning and advanced querying to help ML teams visualize their data,

understand over- or under-represented labels and attributes to help mitigate

bias, and identify edge cases to address the long-tail challenges of machine

learning development. Nucleus also enables model-assisted QA, using model

predictions (model errors in particular) to identify errors in labeled data

and to improve data quality.
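
To illustrate the selection idea in isolation (this is generic uncertainty sampling, not Nucleus’s actual implementation), here is a short sketch that prioritizes the examples a model is least sure about:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of class probabilities; highest where the model
    is least certain, which is where a new label is most informative."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_annotation(ids: list, probs: np.ndarray, budget: int) -> list:
    """Rank the unlabeled pool by model uncertainty and return the `budget`
    most uncertain example ids, rather than sampling at random."""
    order = np.argsort(-predictive_entropy(probs))
    return [ids[i] for i in order[:budget]]
```

Given probs from a model’s forward pass over the unlabeled pool, select_for_annotation(pool_ids, probs, budget=500) would return the 500 examples most worth sending for labeling.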


In the video below, we show how active learning powers the Autotag feature

of Nucleus to allow users to automatically search for and tag similar images

— in this case flowers.



Data Generation


The ability to understand how ML models will react to rare or dangerous

real-world scenarios is crucial for safety-critical computer vision

applications. However, it’s not always feasible to collect enough examples

of edge cases in the real world. With our data generation offering, currently in alpha, Scale customers can

augment ground-truth training data or generate scenarios from scratch to

simulate a variety of cameras, sensors, or environmental conditions such as

fog or night-time lighting. With data generation, ML teams can synthesize

specific, troublesome scenarios (often pre-labeled) to expose their models

to more edge cases and under-represented scenarios than they could otherwise

collect.
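
The simplest version of this idea can be expressed in a few lines of image math. The sketch below is a rough stand-in for real sensor and atmospheric simulation: crude fog and low-light transforms on RGB images, with an arbitrary haze value and gamma chosen for illustration.

```python
import numpy as np

def add_fog(image: np.ndarray, density: float = 0.5) -> np.ndarray:
    """Blend an RGB uint8 image toward a light-gray haze. The original
    boxes and labels still apply, so the output arrives pre-labeled."""
    haze = np.full_like(image, 210)
    out = (1.0 - density) * image.astype(np.float32) + density * haze
    return out.astype(np.uint8)

def simulate_night(image: np.ndarray, gamma: float = 2.5) -> np.ndarray:
    """Darken an RGB uint8 image with a gamma curve as a crude stand-in
    for low-light conditions."""
    out = 255.0 * (image.astype(np.float32) / 255.0) ** gamma
    return out.astype(np.uint8)
```

Production data generation goes much further (camera models, sensor noise, full scenario synthesis), but even simple transforms like these let a team probe how a model degrades under conditions it rarely sees in collected data.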

Real-World Impact


As we noted in our

Series E Announcement, we’ve

already seen the impact that the right data infrastructure can bring to our

customers. From 10X’ing throughput for Toyota to improving States Title’s document processing model accuracy to more than 95%, we have freed up many of our

customers’ ML teams to focus on the model work instead of the data work.

We’re proud to be trusted by the world’s most ambitious AI teams to build

the data foundation that will set up their organizations for success for

decades to come.


In the coming weeks, we’ll also dive deeper into some of the topics — like

how to solve for the long-tail distributions of data with better data

management — on our blog. In the meantime, if you or your team are looking

for an end-to-end solution for your datasets,

reach out to us. We’d love to hear

from you.

