Scale AI and Research Partner MIT Media Lab Analyze Bias in Dermatological Datasets to Improve Diagnoses

Published on June 23, 2021

Artificial Intelligence (AI) and machine learning (ML) are increasingly being

applied to high-stakes domains. One emergent domain involves applying ML to

healthcare to assist doctors in providing better care for patients. In the

United States, however, people of color face disparities in access to

healthcare, the quality of care received, and overall health outcomes. In

dermatology, for example, darker skin tones are underrepresented in

dermatology residency programs, textbooks, research, and diagnoses.

We aren’t the first researchers to investigate this problem. Academics such as

Dr. Susan Taylor, Joy Buolamwini, and Timnit Gebru

have made great strides in identifying racial limitations around critical

issues such as facial recognition and healthcare. In reviewing these various

papers and studies, we recognized a need for a better understanding of the

underlying data fueling this research.

Specifically, we sought to understand how this underrepresentation impacts an

ML model’s ability to accurately diagnose various skin conditions. In

collaboration with our research partner, MIT Media Lab, we analyzed and

evaluated deep neural networks trained on clinical images in dermatology.


The team at Scale annotated 16,577 clinical images sourced from two

dermatology atlases — DermaAmin and Atlas Dermatologico — with Fitzpatrick

skin type labels. The Fitzpatrick labeling system, while not perfect, is a

six-point scale originally developed for classifying sun reactivity of skin

phenotype. It served as the basis for skin color in emojis and, more

recently, has been used in computer vision applications to evaluate

algorithmic fairness and model accuracy. The

annotated images represent 114 skin conditions with at least 53 images and a

maximum of 653 images per skin condition. The annotations, which we are

calling the Fitzpatrick 17k Dataset, have been open-sourced if you would

like to explore the dataset in greater detail.
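
The per-condition figures quoted above (number of conditions, smallest and largest image count) are the kind of summary that can be recomputed from any list of condition labels. The following is a minimal sketch, not the code used for the dataset: the function name `images_per_condition` and the toy labels are hypothetical and are not drawn from the Fitzpatrick 17k annotations.

```python
from collections import Counter


def images_per_condition(condition_labels):
    """Return (number of conditions, smallest count, largest count)."""
    counts = Counter(condition_labels)
    return len(counts), min(counts.values()), max(counts.values())


# Toy labels for illustration only; NOT taken from the real dataset.
toy_labels = ["psoriasis"] * 5 + ["lichen planus"] * 3 + ["vitiligo"] * 2

n_conditions, smallest, largest = images_per_condition(toy_labels)
print(n_conditions, smallest, largest)
```

Run against the real annotation file, the same summary would be expected to reproduce the 114 conditions and the 53 to 653 range reported here, assuming one label per image.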

The Fitzpatrick Skin Type Scale

After the dataset had been labeled, researchers at MIT trained a transfer

learning model based on a VGG-16 deep neural network architecture

pre-trained on the seminal ImageNet dataset to classify various skin

conditions. For more details on how the model was trained, we encourage you

to read the full paper here.

Research Findings:

The research found that the data used to train a model does matter. By

tagging the images with Fitzpatrick labels, researchers found that the

dataset contains 3.6 times more images of the two lightest Fitzpatrick skin

types than the two darkest Fitzpatrick skin types. The underrepresentation

of dark skin images in the dataset — and in dermatology atlases more broadly

— led to larger disparities in the model’s ability to correctly diagnose

skin conditions involving darker skin tones.
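
To make the representation ratio concrete, here is a small sketch of how such a figure could be computed from Fitzpatrick labels. It assumes each image carries an integer label from 1 (lightest) to 6 (darkest); the function name `light_to_dark_ratio` and the toy label list are hypothetical, and the toy counts are made up rather than taken from the dataset.

```python
from collections import Counter

LIGHTEST = {1, 2}  # the two lightest Fitzpatrick skin types
DARKEST = {5, 6}   # the two darkest Fitzpatrick skin types


def light_to_dark_ratio(fitzpatrick_labels):
    """Ratio of images labeled type 1 or 2 to images labeled type 5 or 6."""
    counts = Counter(fitzpatrick_labels)
    light = sum(counts[t] for t in LIGHTEST)
    dark = sum(counts[t] for t in DARKEST)
    if dark == 0:
        raise ValueError("no images labeled with the two darkest skin types")
    return light / dark


# Toy labels for illustration only; NOT the real dataset distribution.
toy_fitzpatrick = [1] * 4 + [2] * 5 + [3] * 3 + [4] * 2 + [5] * 2 + [6] * 1

ratio = light_to_dark_ratio(toy_fitzpatrick)
print(f"{ratio:.1f}x more light-skin images than dark-skin images")
```

A ratio well above 1, such as the 3.6 reported for this dataset, indicates that a model trained on it sees far fewer examples of conditions on dark skin.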

While more research is required to identify where accuracy disparities are

greatest across skin types, this research empirically shows that the data a

model is trained on matters. More importantly, it matters how data is

collected and what type of data is collected. Before developing and

deploying large-scale ML models in the healthcare sector, we encourage

researchers and practitioners to consider and examine biases in training

datasets to ensure healthcare disparities are not unintentionally amplified

by these models.
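
One simple way to examine a trained model for this kind of disparity is to break its accuracy down by skin type instead of reporting a single overall number. The sketch below assumes each evaluation record is a tuple of Fitzpatrick type, true condition, and predicted condition; the function name `accuracy_by_skin_type` and the records themselves are hypothetical and do not represent output from the model in this study.

```python
from collections import defaultdict


def accuracy_by_skin_type(records):
    """Map each Fitzpatrick type to the fraction of correct predictions.

    records: iterable of (fitzpatrick_type, true_condition, predicted_condition)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for skin_type, truth, predicted in records:
        total[skin_type] += 1
        correct[skin_type] += int(truth == predicted)
    return {t: correct[t] / total[t] for t in sorted(total)}


# Made-up evaluation records for illustration only; NOT real model output.
toy_records = [
    (1, "psoriasis", "psoriasis"),
    (1, "eczema", "eczema"),
    (2, "psoriasis", "eczema"),
    (2, "eczema", "eczema"),
    (5, "psoriasis", "eczema"),
    (6, "eczema", "eczema"),
]

per_type = accuracy_by_skin_type(toy_records)
for skin_type, acc in per_type.items():
    print(f"Fitzpatrick type {skin_type}: accuracy {acc:.2f}")
```

Reporting the per-type sample counts alongside these accuracies helps distinguish a genuine gap from noise caused by very small groups.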
