Innovations in the field of Machine Learning (ML) continue to be driven by a dynamic research community. The ML team at Scale is proud to contribute to the field through its own research, through dataset partnerships with leading universities such as Oxford (DEBAGREEMENT: Reddit 50K Dataset) and the University of Waterloo and the University of Toronto (CADC Dataset), and through collaborations with Stanford University, the University of California, Berkeley, and the University of Washington.
As NeurIPS 2021 gets underway, we are pleased to share details on three papers the ML team has published as part of the conference:
- DEBAGREEMENT: A Comment-Reply Dataset for (Dis)agreement Detection in Online Debates Authors: John Pougué-Biyong, Valentina Semenova, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, Doyne Farmer
- Natural Adversarial Objects Authors: Felix Lau, Sasha Harrison, Nishant Subramani, Aerin Kim, Elliot Branson, Rosanne Liu
- Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning Authors: Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, Ludwig Schmidt
DEBAGREEMENT is a dataset of 42,894 comment-reply pairs from the popular debate website Reddit, each labeled as agree, neutral, or disagree. We gathered interactions from five different forums: r/BlackLivesMatter, r/Brexit, r/Climate, r/Democrats, and r/Republican. The comment pairs for each forum were chosen so that, taken together, they form a user interaction graph.
DEBAGREEMENT offers a challenge for Natural Language Processing (NLP) systems because it includes slang, irony, and topic-specific humor, all of which are common in online discussions.
We compared the performance of state-of-the-art language models on a (dis)agreement detection task and examined how they use the contextual information available during training (graph, authorship, and temporal information).
DEBAGREEMENT provides novel opportunities for combining graph-based and text-based machine learning techniques to detect agreements as well as disagreements online, in light of recent research showing that context, such as social context or knowledge graph information, enables language models to perform better on downstream NLP tasks.
- Language models trained on existing datasets underperform when run on real world data. This discrepancy is exposed when evaluating the models using the new DEBAGREEMENT dataset.
- DEBAGREEMENT presents new opportunities for modeling diverse online interactions with text and context (authorship, graph, temporal information). Its graph structure allows for the combination of text-based machine learning (ML) with graph representation learning (GRL) approaches.
- By modeling online discussion forums as graphs of user interactions, researchers can: 1) transform the agreement/disagreement detection problem into a signed link prediction task, and 2) use existing signed graph embedding techniques evaluated on publicly accessible signed graphs such as Epinions and Slashdot. The sentiment and polarization of topics are not static over time, and this dataset, for the first time, exposes how realistic topic sentiment evolves.
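The graph view described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the tuple layout, the `LABEL_TO_SIGN` encoding, and the `build_signed_graph` helper are all assumptions made for the example. It collapses labeled comment-reply pairs into signed user-to-user edges, the form a link-sign predictor or signed graph embedding method would consume.

```python
from collections import defaultdict

# Hypothetical label encoding (an assumption for this sketch):
# disagree -> -1, neutral -> 0, agree -> +1.
LABEL_TO_SIGN = {"disagree": -1, "neutral": 0, "agree": 1}


def build_signed_graph(pairs):
    """Aggregate labeled comment-reply pairs into signed user-to-user edges.

    `pairs` is an iterable of (reply_author, parent_author, label) tuples.
    Returns a dict mapping (reply_author, parent_author) to the sign of the
    summed labels: +1, -1, or 0 when agreement and disagreement cancel out.
    """
    totals = defaultdict(int)
    for reply_author, parent_author, label in pairs:
        totals[(reply_author, parent_author)] += LABEL_TO_SIGN[label]
    # (total > 0) - (total < 0) is the integer sign of `total`.
    return {edge: (total > 0) - (total < 0) for edge, total in totals.items()}


# Made-up interactions, purely for illustration.
pairs = [
    ("alice", "bob", "disagree"),
    ("alice", "bob", "disagree"),
    ("alice", "bob", "agree"),
    ("carol", "bob", "agree"),
    ("bob", "carol", "neutral"),
]
signed_edges = build_signed_graph(pairs)
```

Summing labels per user pair is only one possible aggregation. Keeping one edge per comment, with its timestamp, would preserve the temporal information the dataset also provides.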
Although state-of-the-art object detection methods have shown impressive performance, models are often vulnerable to adversarial attacks and out-of-distribution data.
Natural Adversarial Objects (NAO) is a novel dataset for assessing the robustness of object detection models. NAO comprises 7,934 images and 9,943 objects that are unmodified and depict real-world scenes, yet cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet-D7 drops by 74.5% when it is evaluated on NAO rather than on the standard MS-COCO validation set.
Furthermore, when evaluating a range of object detection architectures, we found that better performance on the MS-COCO validation set does not always translate into better performance on NAO, implying that robustness cannot be gained merely by training a more accurate model.
We also examine why NAO images are hard to detect and classify. Experiments with shuffled image patches show that models are overly sensitive to local texture. Furthermore, using integrated gradients and background replacement, we found that the detection model relies on pixel information inside the bounding box and is insensitive to background context when predicting class labels.
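To make the patch-shuffle idea concrete, here is a minimal sketch in plain Python. It is an illustration under our own assumptions, not the code used in the paper: the `shuffle_patches` helper, the grid size, and the toy image are hypothetical. The transformation scrambles global shape while leaving the texture inside each patch untouched, so a model whose predictions barely change is likely leaning on local texture.

```python
import random


def shuffle_patches(image, grid, seed=0):
    """Split `image` (a list of equal-length rows) into grid x grid patches
    and return a new image with the patches randomly permuted.

    Height and width must both be divisible by `grid`.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image dimensions must be divisible by grid")
    ph, pw = height // grid, width // grid

    # Cut the image into patches, scanning row-major over the grid.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]
    random.Random(seed).shuffle(patches)

    # Stitch the permuted patches back into a full image.
    shuffled = []
    for r in range(grid):
        band = patches[r * grid:(r + 1) * grid]
        for y in range(ph):
            shuffled.append([pixel for patch in band for pixel in patch[y]])
    return shuffled


# Toy 4x4 "image" whose pixel values are just their flat index.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
shuffled = shuffle_patches(image, grid=2)
```

In a real experiment the rows would hold RGB pixels and the shuffled image would be passed to the detector. Comparing predictions before and after shuffling gives a rough measure of texture reliance.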
- The NAO dataset presents a difficult robustness assessment challenge for detection models trained on MS-COCO. As part of our evaluation, we tested seven state-of-the-art detection models from a variety of families and found that, compared with the MS-COCO validation set, they consistently failed to perform accurately on NAO, for both in-distribution and out-of-distribution objects.
- We also describe the process we used to build the dataset, which can be used to create similar datasets for any target dataset or pre-trained model.
- The popular MS-COCO dataset has "blind spots" that make it challenging for models trained on it to correctly classify these naturally adversarial objects.
- Additionally, we show that detection models are overly sensitive to local texture but insensitive to background change, which makes them vulnerable to natural adversarial objects.
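The background-insensitivity finding can be probed with a simple masking experiment. The sketch below is a hedged illustration only: the `replace_background` helper, the constant `fill` value, and the box convention are assumptions for the example, and the paper's actual replacement procedure may differ. It keeps the pixels inside a bounding box and overwrites everything else.

```python
def replace_background(image, box, fill):
    """Return a copy of `image` with every pixel outside `box` set to `fill`.

    `image` is a list of rows; `box` is (x_min, y_min, x_max, y_max) in pixel
    coordinates, with exclusive upper bounds (an assumed convention).
    """
    x_min, y_min, x_max, y_max = box
    return [
        [
            pixel if (x_min <= x < x_max and y_min <= y < y_max) else fill
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


# Toy 4x4 "image" whose pixel values are just their flat index.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
masked = replace_background(image, box=(1, 1, 3, 3), fill=0)
# masked == [[0, 0, 0, 0], [0, 5, 6, 0], [0, 9, 10, 0], [0, 0, 0, 0]]
```

If a detector's class prediction for the boxed object is unchanged after the background is replaced, that is consistent with the model ignoring background context.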
Evaluation issues often undermine the validity of results in machine learning research. In collaboration with researchers from Stanford University, the University of California, Berkeley, and the University of Washington, we conducted a meta-review of 100+ survey papers to identify common benchmark evaluation problems across subfields. In some cases, several years’ worth of progress in certain fields may be misstated.
Our meta-review surveys evaluation papers covering a broad range of subfields, including computer vision, deep reinforcement learning, recommender systems, and natural language processing. We found a consistent set of failure modes, which we organized into a systematic taxonomy. This taxonomy helps researchers understand where different types of failures are common and provides a framework for future research.
Our taxonomy distinguishes between “internal” and “external” validity failures. Internal validity failures occur within the context of a single benchmark and raise questions about the consistency or reliability of measurements on that specific benchmark. For example, running baseline methods with less hyperparameter tuning than a newly proposed method can underestimate the performance of existing methods and hence lead to overly optimistic claims of performance improvements. External validity failures, in contrast, arise across multiple benchmarks: they occur when results from one benchmark do not generalize to other settings. One example is dataset misalignment: if data collected for a benchmark is not representative of the setting where models will be used, then results from that benchmark may not reliably predict performance in the real world.
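The hyperparameter-tuning example above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not an analysis from the paper: the score distribution, the budgets of 5 and 50 trials, and the `best_of` helper are invented for illustration. Both "methods" draw from the same stream of trial scores, so any reported gap comes from the tuning budget alone.

```python
import random


def best_of(scores, budget):
    """Best validation score found within the first `budget` tuning trials."""
    return max(scores[:budget])


# Toy assumption: the baseline and the proposed method are equally good, so
# each tuning trial for either one draws its score from the same distribution.
rng = random.Random(0)
trial_scores = [rng.gauss(0.80, 0.02) for _ in range(50)]

baseline_reported = best_of(trial_scores, budget=5)    # lightly tuned baseline
proposed_reported = best_of(trial_scores, budget=50)   # heavily tuned new method

# Never negative here, because the larger budget sees a superset of trials.
apparent_gain = proposed_reported - baseline_reported
```

Because the heavily tuned run searches a superset of the baseline's trials, its reported score can only match or exceed the baseline's, even though neither method is actually better. Giving every method the same tuning budget removes this particular source of optimism.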
- Evaluation failures pose a risk to the validity of research findings in machine learning. Our meta-review of 100+ survey papers observed common failure modes across a wide range of subfields.
- While internal validity failures can be identified in the context of a single benchmark, external validity failures only appear when looking at multiple datasets.
As we look ahead to 2022, we are excited to support more university research teams and to drive more research initiatives within Scale. To share our work and the work of many of our partners, our team hosts regular TechTalks on Scale Exchange, our community of ML practitioners. Join the community to learn more and register for upcoming talks. If you’re interested in joining our growing ML team, take a look at our careers page for open positions.