
Innovations in the field of Machine Learning (ML) continue to be driven by a dynamic research community. The ML team at Scale is proud to contribute to the field through its own research, through dataset partnerships with leading universities such as Oxford (DEBAGREEMENT: Reddit 50K Dataset) and the University of Waterloo and the University of Toronto (CADC Dataset), and through collaborations with Stanford University, the University of California, Berkeley, and the University of Washington.
As NeurIPS 2021 gets underway, we are pleased to share details on three papers the ML team has published as part of the conference:
DEBAGREEMENT is a dataset of 42,894 comment-reply pairs from the popular discussion website Reddit, each labeled as agree, neutral, or disagree. We gathered interactions from five different forums: r/BlackLivesMatter, r/Brexit, r/Climate, r/Democrats, and r/Republican. Comment pairs for each forum were selected so that, taken together, they form a user interaction graph.
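The sketch below shows one way such comment-reply pairs could be represented and aggregated into a user interaction graph. The field names and the use of networkx are illustrative assumptions, not the dataset's actual schema or tooling.

```python
# A minimal sketch: aggregating labeled comment-reply pairs into a user
# interaction graph. Field names (parent_author, reply_author, label) are
# hypothetical and not necessarily the dataset's real schema.
from dataclasses import dataclass

import networkx as nx


@dataclass
class CommentPair:
    parent_author: str   # author of the original comment
    reply_author: str    # author of the reply
    parent_text: str
    reply_text: str
    label: str           # one of "agree", "neutral", "disagree"


def build_interaction_graph(pairs):
    """Each labeled reply becomes a directed edge from replier to original author."""
    graph = nx.MultiDiGraph()
    for pair in pairs:
        graph.add_edge(pair.reply_author, pair.parent_author, label=pair.label)
    return graph


pairs = [
    CommentPair("alice", "bob", "Original comment ...", "I completely agree.", "agree"),
    CommentPair("bob", "carol", "Another thread ...", "That's not quite right.", "disagree"),
]
graph = build_interaction_graph(pairs)
print(graph.number_of_nodes(), graph.number_of_edges())
```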
DEBAGREEMENT offers a challenge for Natural Language Processing (NLP) systems because it includes slang, irony, and topic-specific humor, all of which are common in online discussions.
We compared the performance of state-of-the-art language models on a (dis)agreement detection task and examined the use of contextual information available to the models during training (graph structure, authorship, and temporal information).
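One natural way to frame this task is three-way sequence-pair classification with a pretrained language model. The sketch below is an illustrative setup, not the exact configuration used in the paper; the model choice and label ordering are assumptions, and the classification head would still need fine-tuning on DEBAGREEMENT.

```python
# A minimal sketch of (dis)agreement detection as three-class sequence-pair
# classification. Model name and label order are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["disagree", "neutral", "agree"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # classifier head is randomly initialized; fine-tune before trusting outputs

parent = "The new policy will reduce emissions significantly."
reply = "There is no evidence it will make any difference."

# Encode the comment-reply pair as a single sequence with a separator token.
inputs = tokenizer(parent, reply, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

print(LABELS[logits.argmax(dim=-1).item()])
```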
In light of recent research showing that context, such as social context or knowledge graph information, improves language model performance on downstream NLP tasks, DEBAGREEMENT provides novel opportunities for combining graph-based and text-based machine learning techniques to detect agreement and disagreement online.
Although state-of-the-art object detection methods have shown impressive performance, models are often vulnerable to adversarial attacks and out-of-distribution data.
Natural Adversarial Objects (NAO) is a new dataset for assessing the robustness of object detection methods. NAO comprises 7,934 images and 9,943 objects that are unmodified and depict real-world scenes, yet cause state-of-the-art detection methods to misclassify with high confidence. When evaluated on NAO, the mean average precision (mAP) of EfficientDet-D7 drops by 74.5 percent compared to the standard MS-COCO validation set.
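Because NAO is designed as a drop-in robustness test, a detector can be scored on it with standard COCO-style evaluation. The sketch below assumes ground truth and detections have been exported in COCO JSON format; the file names are placeholders, not files shipped with the dataset.

```python
# A minimal sketch of COCO-style mAP evaluation on a NAO-like benchmark,
# assuming annotations and detections are in COCO JSON format.
# File names are hypothetical placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("nao_annotations.json")                      # ground-truth boxes
dt = gt.loadRes("efficientdet_d7_detections.json")     # model detections

evaluator = COCOeval(gt, dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR, including mAP @ IoU=0.50:0.95
```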
Furthermore, when evaluating a range of object detection architectures, we found that better performance on the MS-COCO validation set does not necessarily transfer to better performance on NAO, implying that robustness cannot be gained simply by training a more accurate model.
We also investigate why NAO examples are difficult to detect and classify. Experiments with shuffled image patches show that models are overly sensitive to local texture. In addition, using integrated gradients and background replacement, we found that the detection model relies on pixel information inside the bounding box and is largely insensitive to background context when predicting class labels.
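The sketch below shows the kind of patch-shuffling probe described above: cutting an image into a grid of patches and permuting them, so that local texture is preserved while global structure is destroyed. The grid size and image handling are illustrative choices, not the paper's exact protocol.

```python
# A minimal sketch of a patch-shuffling probe for texture sensitivity.
# Grid size and image shape are illustrative assumptions.
import random

import numpy as np


def shuffle_patches(image: np.ndarray, grid: int = 4, seed: int = 0) -> np.ndarray:
    """Shuffle a (H, W, C) image as a grid x grid arrangement of patches."""
    rng = random.Random(seed)
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    # Cut into equal patches, dropping any remainder pixels at the borders.
    patches = [
        image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
        for i in range(grid)
        for j in range(grid)
    ]
    rng.shuffle(patches)
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
    return np.concatenate(rows, axis=0)


image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
shuffled = shuffle_patches(image, grid=4)
# Comparing a detector's confidences on `image` versus `shuffled` indicates
# how much the model relies on local texture rather than global structure.
```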
Evaluation issues often undermine the validity of results in machine learning research. In collaboration with researchers from Stanford University, the University of California, Berkeley, and the University of Washington, we conducted a meta-review of more than 100 survey papers to identify common benchmark evaluation problems across subfields. In some cases, several years' worth of reported progress in a field may be misstated.
Our meta-review covers evaluation papers spanning a broad range of subfields, from computer vision and deep reinforcement learning to recommender systems and natural language processing. We found a consistent set of failure modes, which we organized into a systematic taxonomy. This taxonomy helps researchers understand where different types of failures are common and provides a framework for future research.
Our taxonomy distinguishes between “internal” and “external” validity failures. Internal validity failures occur within the context of a single benchmark and raise questions about the consistency or reliability of measurements on that benchmark. For example, running baseline methods with less hyperparameter tuning than a newly proposed method can underestimate the performance of existing methods and hence lead to overly optimistic claims of improvement. External validity failures, in contrast, arise when dealing with multiple benchmarks: results from one benchmark do not generalize to other settings. One example is dataset misalignment: if the data collected for a benchmark is not representative of the setting where models will be used, then results on that benchmark may not reliably predict performance in the real world.
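One simple safeguard against the internal-validity failure mentioned above is to give the baseline and the proposed method the same hyperparameter search budget before comparing them. The sketch below illustrates this idea with placeholder models, search spaces, and data; none of it reflects any specific experiment from the paper.

```python
# A minimal sketch of an equal-budget hyperparameter comparison: both the
# "baseline" and the "proposed" method sample the same number of
# configurations. All models and search spaces are illustrative placeholders.
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

BUDGET = 20  # identical number of sampled configurations for both methods

searches = {
    "baseline": RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": loguniform(1e-3, 1e3)},
        n_iter=BUDGET,
        random_state=0,
    ),
    "proposed": RandomizedSearchCV(
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": randint(50, 300), "learning_rate": loguniform(1e-2, 3e-1)},
        n_iter=BUDGET,
        random_state=0,
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    print(name, round(search.score(X_test, y_test), 3))
```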
As we look ahead to 2022, we are excited to support more university research teams and to drive more research initiatives within Scale. To learn more about our work and the work of many of our partners, our team hosts regular TechTalks on Scale Exchange, our community of ML practitioners. Join the community to learn more and register for upcoming talks. If you’re interested in joining our growing ML team, take a look at our careers page for open positions.