Artificial Intelligence (AI) and machine learning (ML) are increasingly being applied to high-stakes domains. One emergent domain involves applying ML to healthcare to assist doctors in providing better care for patients. In the United States, however, people of color face disparities in access to healthcare, the quality of care received, and overall health outcomes. In dermatology, for example, darker skin tones are underrepresented in dermatology residency programs, textbooks, research, and diagnoses.
We aren’t the first researchers to investigate this problem. Researchers such as Dr. Susan Taylor, Joy Buolamwini, and Timnit Gebru have made great strides in identifying racial disparities in critical areas such as facial recognition and healthcare. In reviewing their papers and studies, we recognized a need for a better understanding of the underlying data fueling this research.
Specifically, we sought to understand how this underrepresentation impacts an ML model’s ability to accurately diagnose various skin conditions. In collaboration with our research partner, MIT Media Lab, we analyzed and evaluated deep neural networks trained on clinical images in dermatology.
The team at Scale annotated 16,577 clinical images sourced from two dermatology atlases, DermaAmin and Atlas Dermatologico, with Fitzpatrick skin type labels. The Fitzpatrick labeling system, while not perfect, is a six-point scale originally developed for classifying the sun reactivity of skin phenotypes. It served as the basis for skin color in emojis and, more recently, has been used in computer vision applications to evaluate algorithmic fairness and model accuracy. The annotated images represent 114 skin conditions, each with at least 53 and at most 653 images. The annotations, which we are calling the Fitzpatrick 17k Dataset, have been open-sourced here if you would like to explore the dataset in greater detail.
After the dataset had been labeled, researchers at MIT used transfer learning to train a skin condition classifier, starting from a VGG-16 deep neural network pre-trained on the ImageNet dataset. For more details on how the model was trained, we encourage you to read the full paper here.
The research found that the composition of the training data matters. By tagging the images with Fitzpatrick labels, researchers found that the dataset contains 3.6 times more images of the two lightest Fitzpatrick skin types than the two darkest Fitzpatrick skin types. The underrepresentation of dark skin images in the dataset, and in dermatology atlases more broadly, led to larger disparities in the model’s ability to correctly diagnose skin conditions involving darker skin tones.
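To make the representation comparison concrete, here is a minimal sketch of how such a ratio could be computed from a list of Fitzpatrick labels. The function name `light_to_dark_ratio` and the label counts are hypothetical and chosen only for illustration. They are not the real Fitzpatrick 17k counts, and this is not necessarily how the researchers computed their figure.

```python
from collections import Counter


def light_to_dark_ratio(labels):
    """Return (# images of types 1-2) / (# images of types 5-6).

    `labels` is an iterable of Fitzpatrick skin types encoded as the
    integers 1 through 6 (an assumed encoding for this sketch).
    """
    counts = Counter(labels)
    lightest = counts[1] + counts[2]
    darkest = counts[5] + counts[6]
    if darkest == 0:
        raise ValueError("no images labeled with Fitzpatrick types 5 or 6")
    return lightest / darkest


# Toy labels for illustration only; NOT the real dataset distribution.
example_labels = [1] * 20 + [2] * 40 + [3] * 25 + [4] * 15 + [5] * 12 + [6] * 8

ratio = light_to_dark_ratio(example_labels)
print(f"lightest-to-darkest ratio: {ratio:.1f}")
```

Running the same calculation on a dataset's actual labels is a quick first check of how skewed it is before any model is trained.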
While more research is required to identify where accuracy disparities are greatest across skin types, this research shows empirically that the data a model is trained on matters. More importantly, it matters how the data is collected and what type of data is collected. Before developing and deploying large-scale ML models in the healthcare sector, we encourage researchers and practitioners to examine their training datasets for biases so that these models do not unintentionally amplify healthcare disparities.
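One simple way to look for accuracy disparities is to break a model's accuracy down by Fitzpatrick skin type instead of reporting a single overall number. The sketch below assumes predictions are available as `(skin_type, true_label, predicted_label)` tuples. The function name, the record format, and the toy records are all hypothetical and do not reflect results from the paper.

```python
from collections import defaultdict


def accuracy_by_skin_type(records):
    """Compute classification accuracy separately for each skin type.

    `records` is an iterable of (skin_type, true_label, predicted_label)
    tuples; this format is an assumption made for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for skin_type, truth, prediction in records:
        total[skin_type] += 1
        if truth == prediction:
            correct[skin_type] += 1
    return {t: correct[t] / total[t] for t in sorted(total)}


# Toy predictions for illustration only; NOT results from the study.
toy_records = [
    (1, "psoriasis", "psoriasis"),
    (1, "eczema", "eczema"),
    (2, "psoriasis", "psoriasis"),
    (2, "eczema", "psoriasis"),
    (5, "psoriasis", "eczema"),
    (5, "eczema", "eczema"),
    (6, "psoriasis", "eczema"),
    (6, "eczema", "psoriasis"),
]

per_type = accuracy_by_skin_type(toy_records)
# Largest gap in accuracy between any two skin types.
gap = max(per_type.values()) - min(per_type.values())
for skin_type, acc in per_type.items():
    print(f"Fitzpatrick type {skin_type}: accuracy {acc:.2f}")
print(f"largest accuracy gap: {gap:.2f}")
```

A breakdown like this makes it visible when a respectable overall accuracy hides much weaker performance on the skin types with the fewest training images.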