Scale AI And Oxford University's Reddit Dataset Provides Comprehensive Online Discourse Data

November 22, 2021

Natural Language Processing and Online Discourse:

Online debates shape our society and have a profound impact on how people perceive the world. Over 4.2 billion people actively use social media, and many of them participate in online discussions. The breadth and depth of online discourse provide insights and data for researchers and curious people alike.

One way to understand data from online discussions is through the field of Natural Language Processing (NLP). NLP offers techniques to understand textual interactions, and to gain a better understanding of polarization online.

However, current NLP models often struggle with the nuanced language of online exchanges, such as slang, sarcasm, and topic-specific jokes, making it difficult to understand diverse online interactions. The task of stance detection helps address this challenge.

Stance detection is a text understanding task that consists of classifying the position (or stance) of the author of a piece of text towards an initial message or statement. In an online setting, these statements might include comments or messages in a thread. The stance is classified into one of three classes: in favor, against, or neutral.


Recent work has shown that stance detection benefits from context-sensitive approaches. However, we were not able to find realistic, large, open datasets that include contextual information without relying on platform-specific features.

Thus, we created DEBAGREEMENT with conversations from Reddit to better understand the evolution of online debates, and to provide annotated data for agreement and disagreement detection.

The comment or reply function on social media sites does not provide sufficient information to classify an interaction. So, we looked at actual online conversations. Unlike existing datasets for stance detection, DEBAGREEMENT provides realistic online discussions, with diverse writing styles, genres and topics of discussion.

The set comprises 42,894 comment-reply pairs from Reddit, each annotated with an agree, neutral, or disagree label.
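In code, each example reduces to a small record: a parent comment, a reply, and the reply's stance towards the comment. A minimal sketch, assuming hypothetical field names and example text (the actual dataset schema may differ):

```python
from dataclasses import dataclass

# The three stance labels used in the dataset.
LABELS = ("agree", "neutral", "disagree")

@dataclass
class CommentReplyPair:
    """One annotated interaction: a parent comment, a reply, and the reply's stance."""
    comment: str
    reply: str
    label: str  # one of LABELS

# Toy example, invented for illustration.
pair = CommentReplyPair(
    comment="Renewables are now cheaper than coal in most markets.",
    reply="That ignores storage costs entirely.",
    label="disagree",
)
assert pair.label in LABELS
```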

The dataset collects data from five forums on Reddit:

  1. r/BlackLivesMatter
  2. r/Brexit
  3. r/Climate
  4. r/Democrats
  5. r/Republican

We pulled from three subreddits centered on social movements (r/Brexit, r/BlackLivesMatter, and r/Climate) and two centered on political affiliations (r/Republican and r/Democrats).

To ensure we were only using high-quality interactions, we excluded: empty comments, comments from deleted authors, comments that were hidden for user privacy reasons, and comments containing hyperlinks. Further, to focus annotation on impactful discussions within a given subreddit, we removed posts with fewer than 10 words or with few comments.
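The comment-level filters can be sketched as a single predicate over raw comment records. The field names and sentinel values below (e.g. `"[deleted]"`) are assumptions for illustration, not the exact preprocessing code used to build the dataset:

```python
import re

def keep_comment(comment: dict) -> bool:
    """Return True if a comment passes the quality filters described above."""
    body = comment.get("body", "")
    if not body.strip():                               # empty comment
        return False
    if comment.get("author") in (None, "[deleted]"):   # author deleted
        return False
    if body.strip() == "[removed]":                    # hidden for privacy/moderation
        return False
    if re.search(r"https?://", body):                  # contains a hyperlink
        return False
    return True

assert keep_comment({"body": "I disagree with this.", "author": "alice"})
assert not keep_comment({"body": "see https://example.com", "author": "bob"})
assert not keep_comment({"body": "some text", "author": "[deleted]"})
```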

This dataset presents opportunities to detect (dis)agreements by leveraging context beyond just text. The set does not rely on platform-specific features, such as retweets or hashtags. The data was collected such that, for each subreddit, the resulting set of interactions formed a multi-edge, temporal graph where nodes are users, and edges represent a comment-reply interaction between two users.
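One way to sketch such a graph is an adjacency map that keeps parallel edges, since the same two users may interact repeatedly over time. The field names here are hypothetical:

```python
from collections import defaultdict

def build_graph(interactions):
    """Multi-edge temporal graph: nodes are users; each edge is one
    comment-reply interaction, keeping its timestamp and stance label."""
    graph = defaultdict(list)  # replier -> list of (parent_author, timestamp, label)
    for row in interactions:
        graph[row["reply_author"]].append(
            (row["comment_author"], row["timestamp"], row["label"])
        )
    return graph

# Two interactions between the same pair of users at different times:
interactions = [
    {"comment_author": "alice", "reply_author": "bob",
     "timestamp": 1609459200, "label": "disagree"},
    {"comment_author": "alice", "reply_author": "bob",
     "timestamp": 1609462800, "label": "agree"},
]
g = build_graph(interactions)
assert len(g["bob"]) == 2  # parallel edges are preserved, not collapsed
```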


Despite the preprocessing we performed to ensure high-quality annotations, the dataset annotation came with certain obstacles.

We used methods developed in-house, such as dynamic consensus, to ensure high-quality annotations. (For more details on dynamic consensus, take a look at our blog on ML + Human Consensus.) Annotating short online exchanges is tough; only 33% of the pairs were annotated with full agreement among annotators. Accounting for nuanced statements, sarcasm, diverse writing genres, and specific socio-economic contexts is a tall order.
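The 33% figure refers to the fraction of pairs on which every annotator chose the same label. That quantity can be computed with a simple helper like the following (toy data, not the dataset's actual annotations):

```python
def full_agreement_rate(annotations):
    """Fraction of items whose annotators all chose the same label."""
    unanimous = sum(1 for labels in annotations if len(set(labels)) == 1)
    return unanimous / len(annotations)

# Three annotators per pair; only the first item is unanimous.
batch = [
    ["agree", "agree", "agree"],
    ["agree", "disagree", "neutral"],
    ["disagree", "disagree", "neutral"],
]
assert full_agreement_rate(batch) == 1 / 3
```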


This research presents a challenge for NLP systems and argues two points:

  • Current language models struggle with messy data, whether scraped directly from social media or drawn from natural human dialogue.
  • Clean and tailored datasets may not have transferable insights for online (dis)agreement detection.

We evaluated the performance of state-of-the-art language models against our new stance detection dataset and assessed the transfer learning potential with other stance detection datasets already available.

State-of-the-art language models fare well on simple examples but fail on more complex ones. These models tend to rely too heavily on superficial cues, such as the first word of the reply.
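The first-word shortcut can be made concrete with a deliberately naive baseline that classifies stance from the reply's opening word alone. The cue-word lists are invented for illustration; the point is that such a cue breaks as soon as the signal is sarcastic or buried mid-sentence:

```python
# Hypothetical cue-word lists for a first-word-only baseline.
AGREE_CUES = {"yes", "agreed", "exactly", "true"}
DISAGREE_CUES = {"no", "wrong", "nope", "false"}

def first_word_baseline(reply: str) -> str:
    """Classify stance using only the reply's first word."""
    words = reply.strip().split()
    first = words[0].lower().strip(".,!?") if words else ""
    if first in AGREE_CUES:
        return "agree"
    if first in DISAGREE_CUES:
        return "disagree"
    return "neutral"

assert first_word_baseline("Exactly, that's my point.") == "agree"
# The cue fails on sarcastic agreement phrased without an opening cue word:
assert first_word_baseline("Sure, if you ignore all the evidence.") == "neutral"
```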

Moreover, it appears that models trained on clean stance datasets behave poorly on our new dataset, which speaks to the new challenges that our dataset offers.

This research presents promising avenues. Our findings highlight ways to improve language models with contextual information.

We see promise in DEBAGREEMENT as it presents an opportunity to train socially aware language models.

Further Research and Recommendations:

We believe expanding annotations to other areas of discussion, such as consumer products reviews, investor forums or wider political forums could lead to further discoveries and insights about future consumer preferences or investor decision-making.

The dataset also makes it possible to investigate social theories, understand polarization, study online social movements, and examine how people express their opinions and change their views on social media.

Read The Full Research:

This work was achieved in collaboration with Oxford University, whose researchers are the paper's first authors.

The paper has been accepted to the “Datasets and Benchmarks Track” of NeurIPS 2021. The paper details each of these issues further and is accessible on OpenReview.
