Introducing SEAL Showdown: Real People, Real Conversations, Real Rankings

September 22, 2025

Every few weeks, a new large language model (LLM) is released, each claiming to be smarter, safer, and more capable. Adoption has exploded, with people relying on LLMs for everything from personal tasks to critical work applications.

But here’s the catch: the way we evaluate these models hasn’t kept up. Most benchmarks rely on synthetic tests (coding puzzles, math problems) or feedback from a small slice of people – tech enthusiasts fixated on edge cases. They miss the full spectrum of how real people actually use models in their daily lives. This leaves a simple question that is surprisingly hard to answer: which models actually work best for the people who use them?

That’s where SEAL Showdown comes in. Today, we're expanding our SEAL Leaderboards with a new public leaderboard – one built on real preferences, real people, and real-world use across practical categories like country, education level, age, language, and soon profession.

Public Leaderboards Have Fallen Behind

Today’s most popular public leaderboards rely heavily on hobbyist participation. The rankings are based on a narrow group of users and their interests, which can misrepresent how models actually perform for the general public.

The core problem is a lack of context. These leaderboards rely on platforms that typically don't capture detailed information about who is contributing to the rankings and why. By treating diverse users as a monolith and lumping all feedback into one generalized score, they lose critical nuance. This leaves key questions unanswered: How do models perform for different kinds of people — across countries, languages, professions, ages, and education levels?

How SEAL Showdown Is Different

Here’s what sets SEAL Showdown apart from other public leaderboards: 

  1. Representative rankings from a global user base: SEAL Showdown is based on millions of real-world conversations from Scale’s diverse global contributor network, not limited to hobbyists or tech enthusiasts. Rankings are based on preferences from users spanning over 100 countries, 70 languages, and 200 professional domains. 

  2. Granular insights by demographics and markets: Showdown introduces something never before seen in public leaderboards: rich user segmentation. Because rankings are derived from conversations that contributors have on Scale’s Outlier platform, Scale is able to verify each user’s country, education level, profession, language, and age – enabling anyone to see how models perform for people like them.

    Some insights we’ve observed from our data so far:

    • Regional: ChatGPT leads in Europe, while Claude and ChatGPT are tied for #1 across other continents. In Africa and Oceania, Gemini also pulls ahead to join them in the top spot.
    • Language: Gemini performs better with non-English users than with English users.
    • Demographics: ChatGPT leads with users aged 30-50, Claude and ChatGPT are tied among 18-30 year olds, while Gemini rises to tie with Claude and ChatGPT among users 50+.

    For the first time, people can begin to see which models work best for their own background, goals, and use cases – not just what performs best on a generic global average. And for model builders, this offers insights into how performance varies across demographics and domains – insights they can use to guide future improvements.

  3. Trustworthy by design: Scale is pioneering new safeguards to ensure Showdown remains a fair, trustworthy signal of real-world model performance.

    First, we prevent overfitting. Showdown’s data is carefully controlled: Scale will never sell, license, or share data from the same distribution as the live leaderboard that was collected within the past 60 days. This means model developers can’t simply tune their systems to “game” the rankings – to win, they must actually perform well in the wild.

    Second, we ensure votes reflect authentic user preferences, not forced feedback. All of the rankings come from Playground — an app we provide to contributors that gives them free, everyday access to frontier models. While using Playground, contributors may occasionally be asked to compare two responses side by side, but voting is entirely optional. They can opt out of voting altogether, and even when voting is enabled, Playground only prompts intermittently — with the option to skip each time. This way, every vote reflects genuine choice, not obligation.
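To make the segmentation idea concrete, here is a minimal sketch of how optional pairwise votes, tagged with verified contributor attributes, could be rolled up into per-segment rankings. The field names and the simple head-to-head win-rate scoring are illustrative assumptions, not Scale's actual pipeline; the real methodology is described in the SEAL Showdown Technical Report.

```python
# Illustrative sketch only: aggregate optional pairwise votes into per-segment rankings.
# Field names and win-rate scoring are assumptions, not Scale's actual methodology.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Vote:
    winner: str    # model the contributor preferred
    loser: str     # model shown alongside it
    country: str   # verified contributor attributes used for segmentation
    age_band: str
    language: str

def segment_rankings(votes, segment_key):
    """Rank models within each segment by head-to-head win rate."""
    wins = defaultdict(lambda: defaultdict(int))     # segment -> model -> wins
    matches = defaultdict(lambda: defaultdict(int))  # segment -> model -> comparisons
    for v in votes:
        seg = getattr(v, segment_key)
        wins[seg][v.winner] += 1
        for model in (v.winner, v.loser):
            matches[seg][model] += 1
    rankings = {}
    for seg, counts in matches.items():
        scores = {m: wins[seg][m] / counts[m] for m in counts}
        rankings[seg] = sorted(scores, key=scores.get, reverse=True)
    return rankings

# Example: the same vote stream broken down by country and by age band.
votes = [
    Vote("ChatGPT", "Gemini", "DE", "30-50", "de"),
    Vote("Claude", "ChatGPT", "NG", "18-30", "en"),
    Vote("Gemini", "Claude", "NG", "50+", "en"),
]
print(segment_rankings(votes, "country"))
print(segment_rankings(votes, "age_band"))
```

Because every vote carries verified attributes, the same stream of comparisons can be sliced along any dimension – country, language, age, education level, or profession – without collecting separate data for each leaderboard view.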

A New Standard for the AI Era

SEAL Showdown sets a new bar for what model evaluation can be: globally representative, broken down by real-world demographics, and hard to game.

As the AI ecosystem races forward, the world needs benchmarks that reflect reality – not just synthetic tests or niche communities. With SEAL Showdown, Scale is building the foundation for a future where AI is judged by how well it works for all of us.

Want to dive deeper into SEAL Showdown? Explore the full SEAL Showdown Technical Report for methodology, results, and insights.
