Artificial intelligence (AI) and machine learning (ML) are increasingly being
applied to high-stakes domains. One emergent domain involves applying ML to
healthcare to assist doctors in providing better care for patients. In the
United States, however, people of color face disparities in access to
healthcare, the quality of care received, and overall health outcomes. In
dermatology, for example, darker skin tones are underrepresented in
residency programs, textbooks, research, and diagnoses.
We aren’t the first researchers to investigate this problem. Academics such as
Joy Buolamwini and others
have made great strides in identifying racial bias in critical
areas such as facial recognition and healthcare. In reviewing these various
papers and studies, we recognized a need for a better understanding of the
underlying data fueling this research.
Specifically, we sought to understand how this underrepresentation impacts an
ML model’s ability to accurately diagnose various skin conditions. In
collaboration with our research partner, MIT Media Lab, we analyzed and
evaluated deep neural networks trained on clinical images in dermatology.
The team at Scale annotated 16,577 clinical images sourced from two
dermatology atlases — DermaAmin and Atlas Dermatologico — with Fitzpatrick
skin type labels. The Fitzpatrick labeling system, while not perfect, is a
six-point scale originally developed to classify how skin reacts to sun
exposure. It served as the basis for skin tones in emojis and, more
recently, has been used in computer vision applications to evaluate
algorithmic fairness and model accuracy. The
annotated images represent 114 skin conditions, with between 53 and 653
images per condition. The annotations, which we are calling the Fitzpatrick
17k Dataset, have been open-sourced here if you would like to explore the
dataset in greater detail.
After the dataset had been labeled, researchers at MIT trained a transfer
learning model based on a VGG-16 deep neural network architecture
pre-trained on the seminal ImageNet dataset to classify various skin
conditions. For more details on how the model was trained, we encourage you
to read the full paper here.
The research found that the data used to train a model does matter. By
tagging the images with Fitzpatrick labels, researchers found that the
dataset contains 3.6 times more images of the two lightest Fitzpatrick skin
types than the two darkest Fitzpatrick skin types. The underrepresentation
of dark skin images in the dataset — and in dermatology atlases more broadly
— led to larger disparities in the model’s ability to correctly diagnose
skin conditions involving darker skin tones.
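
To make the imbalance figure concrete, here is a minimal sketch of how a ratio like this can be computed from per-image skin type labels. It assumes the labels are integers from 1 (lightest) to 6 (darkest); the function name `skin_type_ratio` and the toy labels are hypothetical and are not taken from the Fitzpatrick 17k annotations or the researchers' code.

```python
from collections import Counter


def skin_type_ratio(labels):
    """Return (# images of types 1-2) / (# images of types 5-6).

    `labels` is an iterable of integer Fitzpatrick types, 1 through 6.
    """
    counts = Counter(labels)
    lightest = counts[1] + counts[2]
    darkest = counts[5] + counts[6]
    if darkest == 0:
        raise ValueError("no images labeled with Fitzpatrick types 5 or 6")
    return lightest / darkest


# Hypothetical toy labels for illustration only (not the real dataset).
toy_labels = [1] * 4 + [2] * 5 + [3] * 3 + [4] * 2 + [5] * 2 + [6] * 1

print(skin_type_ratio(toy_labels))  # 9 lightest / 3 darkest = 3.0
```

Running the same calculation on a dataset before training gives a quick first check on how evenly skin types are represented.
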
While more research is required to identify where accuracy disparities are
greatest across skin types, this research empirically shows that the data a
model is trained on matters. More importantly, it matters how data is
collected and what type of data is collected. Before developing and
deploying large-scale ML models in the healthcare sector, we encourage
researchers and practitioners to consider and examine biases in training
datasets to ensure healthcare disparities are not unintentionally amplified
by these models.
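
One simple way to look for accuracy disparities is to break a model's accuracy down by skin type rather than reporting a single overall number. The sketch below assumes each evaluation record is a tuple of (Fitzpatrick type, true condition, predicted condition); the function name `accuracy_by_skin_type` and the toy records are hypothetical and do not reflect the paper's evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_skin_type(records):
    """Compute classification accuracy separately for each skin type.

    `records` is an iterable of (skin_type, true_label, predicted_label).
    Returns a dict mapping skin type to accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for skin_type, truth, predicted in records:
        total[skin_type] += 1
        correct[skin_type] += int(truth == predicted)
    return {t: correct[t] / total[t] for t in sorted(total)}


# Hypothetical toy predictions for illustration only (not real results).
toy_records = [
    (1, "eczema", "eczema"),
    (1, "psoriasis", "psoriasis"),
    (2, "eczema", "eczema"),
    (2, "psoriasis", "eczema"),
    (6, "eczema", "psoriasis"),
    (6, "psoriasis", "psoriasis"),
]

print(accuracy_by_skin_type(toy_records))  # {1: 1.0, 2: 0.5, 6: 0.5}
```

A large gap between groups in a breakdown like this is a signal to revisit how the training data was collected before deploying the model.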