ML & Human Consensus: The Best of Both Worlds

by Tim Dingman, Rachel Han, James Lennon, Thomas Liao and Nikhil Gahlot on September 27th, 2021


Introduction to Taxonomy Categorization

Have you ever had trouble finding an item in a grocery store? Maybe you were looking for hash browns in the frozen breakfast section, but they were actually in the frozen vegetable section. Or you were looking for soy sauce in the Asian food section, but it was with the condiments.

One of the biggest challenges for retailers and marketplaces is building a catalog by categorizing their millions of products into a complex system of thousands of categories, also referred to as taxonomies. In physical stores, a detailed categorization is required to organize store shelves logically. In this new age of e-commerce and digital marketplaces, proper categorization offers numerous benefits, including enhanced search recommendations, more relevant replacement suggestions, and better compliance with local and federal regulations.

Taxonomy categorization is hard not only because the countless products in the world create deeply nested hierarchical categories, but also because taxonomies are constantly changing. No single subject matter expert can understand the entire catalog well enough to categorize each individual item, and training a team of experts is neither scalable nor fast enough for today’s demands. What’s more, input data is never perfect, and missing information can make it impossible to identify where a product belongs in a given taxonomy.

[Figure: Product Classification Hierarchy]

At Scale, we have worked with leading e-commerce and marketplace companies to label and categorize millions of products into complex taxonomy hierarchies. In this blog post, we describe some of the techniques and lessons we’ve learned to scale these pipelines.

Dynamic Consensus

For these subjective, difficult tasks, we leverage what we call dynamic consensus. With consensus, multiple Taskers are independently prompted to select an answer for the same task. If the Taskers agree, the probability that the answer is correct increases. Not every response counts equally, however: we give more weight to the response of a Tasker with a historically high accuracy rate than to the response of a Tasker with a lower one. Incorporating historical accuracy scores when computing consensus improves the accuracy of the consensus answer.
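To make the weighting idea concrete, here is a minimal sketch of accuracy-weighted consensus. It is not our production scoring: the function name `weighted_consensus`, the uniform prior over categories, and the assumption that a wrong answer is equally likely to land on any other category are all simplifications made for illustration. Real errors cluster on confusable categories, so this model overstates confidence.

```python
import math
from collections import defaultdict

def weighted_consensus(judgments, num_categories):
    """Return (best_label, confidence) for a non-empty list of judgments.

    judgments: list of (label, historical_accuracy) pairs.
    Assumption (ours, for illustration): Taskers err independently, and a
    wrong answer is spread uniformly over the other categories.
    """
    weights = defaultdict(float)
    for label, accuracy in judgments:
        acc = min(max(accuracy, 0.01), 0.99)  # keep the logarithm finite
        # Log-likelihood ratio: this label is true vs. one specific other label is true.
        weights[label] += math.log(acc * (num_categories - 1) / (1.0 - acc))
    # Every category that received no vote contributes exp(0) = 1 to the normalizer.
    unvoted = num_categories - len(weights)
    total = sum(math.exp(w) for w in weights.values()) + unvoted
    best = max(weights, key=weights.get)
    return best, math.exp(weights[best]) / total

# Two Taskers disagree; the one with the stronger track record carries more weight.
label, confidence = weighted_consensus(
    [("Condiments", 0.9), ("Asian Foods", 0.6)], num_categories=4000
)
print(label, round(confidence, 3))
```

A useful sanity check on this model: with a single judgment, the confidence equals that Tasker's historical accuracy.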

[Figure: Traditional Dynamic Consensus]

This is where the dynamic part of dynamic consensus comes into play. We algorithmically adjust the number of attempts we request on a task until a certain confidence level is met. If, for example, three Taskers have different responses to the same task, we would ask a fourth or even fifth Tasker to keep attempting that task until we have a high degree of confidence in the answer. This enables us to scale labeling rapidly while maintaining the same quality as would be found using a small team of well-trained subject matter experts.
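The loop below sketches the "dynamic" part under stated assumptions. The `consensus` helper uses a simplified scoring model (independent Taskers, wrong answers spread uniformly over the other categories), and `request_attempt`, `min_attempts`, `max_attempts`, and the 0.97 default target are illustrative names and values rather than our actual configuration.

```python
import math
from collections import defaultdict

def consensus(judgments, num_categories):
    """Accuracy-weighted vote; returns (best_label, confidence). Illustrative model only."""
    weights = defaultdict(float)
    for label, accuracy in judgments:
        acc = min(max(accuracy, 0.01), 0.99)
        weights[label] += math.log(acc * (num_categories - 1) / (1.0 - acc))
    total = sum(math.exp(w) for w in weights.values()) + (num_categories - len(weights))
    best = max(weights, key=weights.get)
    return best, math.exp(weights[best]) / total

def dynamic_consensus(request_attempt, num_categories,
                      target=0.97, min_attempts=2, max_attempts=5):
    """Request attempts one at a time until the confidence target is met.

    request_attempt: hypothetical callable returning (label, historical_accuracy).
    Returns (label, confidence, attempts_used, needs_review).
    """
    judgments = []
    label, conf = None, 0.0
    while len(judgments) < max_attempts:
        judgments.append(request_attempt())
        label, conf = consensus(judgments, num_categories)
        if len(judgments) >= min_attempts and conf >= target:
            break
    # Tasks that never reach the target are flagged for a Reviewer.
    return label, conf, len(judgments), conf < target

# Three Taskers disagree, so a fourth attempt is requested and settles the task.
queue = iter([("Condiments", 0.85), ("Asian Foods", 0.8),
              ("Sauces", 0.8), ("Condiments", 0.9)])
result = dynamic_consensus(lambda: next(queue), num_categories=4000)
print(result)
```

Capping the number of attempts keeps cost bounded: a task that still lacks agreement after the cap is better spent on a Reviewer than on more first-pass attempts.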

As the volume of products to be categorized increases, however, the number of nuanced tasks where Taskers do not agree also increases. We needed a way to reduce the number of tasks with disagreements so that our Reviewers (Taskers who have been promoted as a result of their accuracy) can spend their time most efficiently, reviewing only the most difficult tasks.

ML and Human Consensus

For some time, our machine learning (ML) team has leveraged ML-powered pre-labeling and automation in our computer vision annotation pipelines to achieve high-quality labeling. (See this post to learn more about how we scale ML for our computer vision pipeline.) Pre-labeling is a common approach in hybrid machine + human labeling processes: a model predicts an answer, and humans then audit or fix the model's predictions to improve quality. For discrete tasks such as taxonomy categorization, however, asking Taskers to confirm or change model predictions did little to improve their efficiency, because reviewing a categorization task took as much time as attempting it from scratch. Instead, we experimented with using the model's prediction as a separate judgment in our dynamic consensus workflow.

We started by training a classifier to predict a product’s category from a list of over 4,000 categories, using a golden dataset. A golden dataset is a subset of data that has already been labeled and subjected to multiple levels of review to achieve extremely high (98%+) quality. The model trained on the golden dataset achieved 75% accuracy without any hyperparameter tuning or changes to the model architecture. From there, we ran a set of experiments with different combinations of ML + human agreement aggregation logic, such as:

  • 1 ML attempt only (with the highest confidence)
  • 1 human attempt + 1 ML attempt (selected based on various confidence thresholds)

For each combination, we asked our highly trusted Reviewers to evaluate accuracy. In our experiments, we found that one human attempt plus one ML attempt, accepted at a confidence threshold of 0.8, provided the highest level of accuracy. Combined with other proprietary quality levers, this lets us deliver full batches of data with 97%+ accuracy at the deepest category level. Such an accuracy rate is extremely hard to reach, particularly on a complex taxonomy with more than 4,000 nodes and four to six levels of nested labels. Nonetheless, by incorporating the model into our pipeline, we were able to meet this quality bar with a single human judgment in most cases.
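An experiment of this kind can be sketched as a threshold sweep. The snippet below is illustrative only: the four records are made-up toy data, not our experimental results, and the field names (`human`, `ml`, `ml_conf`, `truth`) are hypothetical. For each threshold it reports how many tasks would be finalized by human + ML agreement (coverage) and how often those finalized answers match the Reviewer's label (accuracy).

```python
def sweep_thresholds(records, thresholds):
    """Return a list of (threshold, coverage, accuracy) tuples.

    A task is finalized when the ML confidence clears the threshold and the
    ML label matches the human label. Accuracy is measured against the
    Reviewer-provided label and is None when nothing was finalized.
    """
    results = []
    for threshold in thresholds:
        finalized = [r for r in records
                     if r["ml_conf"] >= threshold and r["human"] == r["ml"]]
        correct = sum(r["human"] == r["truth"] for r in finalized)
        coverage = len(finalized) / len(records)
        accuracy = correct / len(finalized) if finalized else None
        results.append((threshold, coverage, accuracy))
    return results

# Toy data for illustration; "truth" stands in for the Reviewer's label.
records = [
    {"human": "Condiments", "ml": "Condiments", "ml_conf": 0.95, "truth": "Condiments"},
    {"human": "Frozen Vegetables", "ml": "Frozen Vegetables", "ml_conf": 0.85,
     "truth": "Frozen Vegetables"},
    {"human": "Frozen Breakfast", "ml": "Frozen Breakfast", "ml_conf": 0.60,
     "truth": "Frozen Vegetables"},
    {"human": "Condiments", "ml": "Asian Foods", "ml_conf": 0.90, "truth": "Condiments"},
]
results = sweep_thresholds(records, thresholds=[0.5, 0.8])
for threshold, coverage, accuracy in results:
    print(threshold, coverage, accuracy)
```

The trade-off to look for is the usual one: a higher threshold finalizes fewer tasks automatically but makes fewer mistakes on the tasks it does finalize.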

[Figure: ML + Human Dynamic Consensus]

When we combine this hybrid approach with dynamic consensus, every task goes through at least one human attempt that is made independently of the model prediction. If the model prediction and the human attempt reach consensus and pass additional validations, the task is finalized. With this approach, we were able to achieve consensus on a majority of tasks and, in fact, increased accuracy compared to a human-only workflow, because the errors made by our ML models were uncorrelated with the errors made by Taskers. In other words, the model and a human Tasker are rarely incorrect in the same way, which makes agreement between them highly reliable.
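A back-of-the-envelope calculation shows why uncorrelated errors matter. The sketch below assumes that the human and the model are right or wrong independently of each other. The 0.75 model accuracy comes from the classifier described earlier; the 0.9 human accuracy and the `same_error_prob` values are hypothetical numbers chosen for illustration.

```python
def agreement_stats(human_acc, model_acc, same_error_prob):
    """Return (agreement_rate, accuracy_given_agreement).

    same_error_prob: probability that, when both are wrong, they pick the
    same wrong category. Low values mean errors rarely overlap.
    Assumes human and model correctness are independent events.
    """
    agree_and_correct = human_acc * model_acc
    agree_and_wrong = (1.0 - human_acc) * (1.0 - model_acc) * same_error_prob
    agreement_rate = agree_and_correct + agree_and_wrong
    return agreement_rate, agree_and_correct / agreement_rate

# Errors that rarely overlap vs. errors that always land on the same wrong label.
for same_error_prob in (0.05, 1.0):
    rate, accuracy = agreement_stats(0.9, 0.75, same_error_prob)
    print(same_error_prob, round(rate, 4), round(accuracy, 4))
```

Under these assumptions, agreement between a 90%-accurate human and a 75%-accurate model is more accurate than either alone, and the gain shrinks as the two start making the same mistakes.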

[Figure: Error Overlap in Data Labeling]

For the more nuanced or tricky tasks where the model prediction and the human attempt show lower agreement, more judgments are added until there is enough agreement for the task's computed confidence score to exceed the project's target threshold of 97%.
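Putting the pieces together, the hybrid flow might look roughly like the sketch below. This is a simplified illustration, not our pipeline: the scoring model assumes independent judges with uniformly spread errors, the ML prediction is weighted by a hypothetical measured accuracy (`ml_accuracy`), `max_humans` is an invented cap, and the additional validations mentioned above are not modeled. The 0.8 and 0.97 thresholds are the ones quoted in this post.

```python
import math
from collections import defaultdict

ML_MIN_CONFIDENCE = 0.8   # ML predictions below this are not used as a judgment
TARGET_CONFIDENCE = 0.97  # project-level target for the consensus confidence

def consensus(judgments, num_categories):
    """Accuracy-weighted vote; returns (best_label, confidence). Illustrative model only."""
    weights = defaultdict(float)
    for label, accuracy in judgments:
        acc = min(max(accuracy, 0.01), 0.99)
        weights[label] += math.log(acc * (num_categories - 1) / (1.0 - acc))
    total = sum(math.exp(w) for w in weights.values()) + (num_categories - len(weights))
    best = max(weights, key=weights.get)
    return best, math.exp(weights[best]) / total

def label_task(human_attempts, ml_prediction, num_categories, max_humans=5):
    """human_attempts: iterable of (label, historical_accuracy), consumed lazily.
    ml_prediction: (label, model_confidence, ml_accuracy); ml_accuracy is a
    hypothetical accuracy of the model on predictions above the threshold.
    """
    ml_label, ml_conf, ml_accuracy = ml_prediction
    judgments = [(ml_label, ml_accuracy)] if ml_conf >= ML_MIN_CONFIDENCE else []
    humans, label, conf = 0, None, 0.0
    for attempt in human_attempts:
        judgments.append(attempt)
        humans += 1
        label, conf = consensus(judgments, num_categories)
        if conf >= TARGET_CONFIDENCE or humans >= max_humans:
            break
    if humans == 0:
        raise ValueError("every task needs at least one human attempt")
    return {"label": label, "confidence": conf, "human_attempts": humans,
            "needs_review": conf < TARGET_CONFIDENCE}

# Human and model agree: one human attempt is enough.
easy = label_task(iter([("Condiments", 0.85)]), ("Condiments", 0.92, 0.9), 4000)
# Human and model disagree: a second human attempt is requested.
hard = label_task(iter([("Condiments", 0.85), ("Condiments", 0.9)]),
                  ("Asian Foods", 0.9, 0.9), 4000)
print(easy["human_attempts"], hard["human_attempts"], hard["label"])
```

Passing human attempts as a lazy iterable mirrors the workflow: an additional Tasker is only asked when the judgments collected so far do not clear the target.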


This new ML + human consensus approach greatly reduces the proportion of tasks with lower agreement. The approach also allows our most trusted experts to zero in on the most challenging problems across the entire dataset.

While training the classifier model, we noticed that model quality depends on the quality of the dataset used in training. This reinforces the idea that a data-centric approach to ML is critical. Talk to our team today to understand how we can integrate this model into your data pipelines to increase data quality.