Methodology

Learn how we evaluate and rank language models

SEAL Showdown Methodology

SEAL Showdown is a real-world benchmark that ranks large language models using blind human evaluation of real conversations. Our methodology is designed to ensure fair, comprehensive comparisons between models.

Evaluation Framework

Our evaluation is based on side-by-side comparisons where users choose between responses from two different models without knowing which model generated each response.
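As a rough illustration of the blind setup (the helper names below are hypothetical, not from the SEAL Showdown codebase), the key idea is that the two responses are shown in randomized order with their model labels hidden from the rater and kept server-side to score the vote:

```python
import random

def present_pair(response_a, response_b, rng=random):
    """Randomize left/right placement so the rater cannot infer which
    model produced which response.

    Returns the texts to display (in shuffled order) and the matching
    model labels, which are kept server-side to record the winner."""
    pair = [("model_a", response_a), ("model_b", response_b)]
    rng.shuffle(pair)
    shown = [text for _, text in pair]    # what the human rater sees
    labels = [label for label, _ in pair]  # hidden; used after the vote
    return shown, labels
```

After the rater picks a side, the hidden label for that side identifies the winning model for scoring.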

Scoring System

Models are ranked using an Elo-based rating system that accounts for the strength of opponents and provides confidence intervals for each score.
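The exact rating computation is described in the technical report; as a minimal sketch, a standard Elo update after a single blind comparison looks like the following (the K-factor of 32 and the 400-point scale are conventional defaults, not necessarily the values SEAL Showdown uses):

```python
def expected_score(r_a, r_b):
    """Predicted probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """Return updated ratings after one comparison.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    The winner gains rating in proportion to how surprising the win was,
    so beating a stronger opponent moves the scores more."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (outcome - e_a)
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return new_a, new_b
```

For example, when two equally rated models meet, `expected_score` is 0.5, so a win transfers half the K-factor from the loser to the winner.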

Technical Report

For a detailed explanation of our methodology, please download our Technical Report (PDF).