
Methodology

How we compute the Elo-scale Rankings

We use Elo-scale rankings to compare model performance across some of our datasets. Our human evaluators compare the responses of two models to the same prompt and rate which is better across a number of domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena and use the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.

First, some definitions:

Over our $M$ models, we let $A = \{(m, m') : m < m' \text{ and } m, m' \in [M]\}$ denote our comparative data set.

At time $t \in \mathbb{N}$, we serve the human a pair of models $A_t \in A$ and record our evaluator's response $H_t \in \{0, 0.5, 1\}$. An outcome of 1 means that model $m$ is preferred over model $m'$, and 0.5 means that the models were equally preferred.
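To make the notation concrete, here is a small illustrative sketch (in Python) of how this data could be represented; the model count and outcomes are made up for the example:

```python
# Three models indexed 0..2, so [M] = {0, 1, 2}.
M = 3
# A: all ordered pairs (m, m') with m < m'.
A = [(m, m_prime) for m in range(M) for m_prime in range(m + 1, M)]
# -> [(0, 1), (0, 2), (1, 2)]

# Each comparison t pairs a served A_t with an evaluator outcome H_t:
#   1.0 -> the first model in the pair was preferred,
#   0.0 -> the second model was preferred,
#   0.5 -> the two responses were rated equally.
comparisons = [((0, 1), 1.0), ((0, 2), 0.5), ((1, 2), 0.0)]
```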

With Bradley-Terry, we use a logistic relationship to model the probability that model $m$ is preferred over model $m'$:

$$P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}}$$

Where $\xi$ is an $M$-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:

$$s(\hat{P}) = \operatorname*{arg\,min}_{\xi} \; \mathbb{E}^{\mathbb{P}}_{A,H}\left[\, l\!\left(H, \frac{1}{1 + e^{\xi_{A_2} - \xi_{A_1}}}\right)\right]$$

Where $l$ is the binary cross-entropy loss,

$$l(h, p) = -\big(h \log(p) + (1 - h)\log(1 - p)\big)$$
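As a concrete illustration, the win probability and loss above can be written in a few lines of NumPy. This is a minimal sketch rather than our production code; `xi` stands for the vector of BT coefficients:

```python
import numpy as np

def win_probability(xi: np.ndarray, m: int, m_prime: int) -> float:
    """Bradley-Terry probability P(H_t = 1) that model m beats model m'."""
    return 1.0 / (1.0 + np.exp(xi[m_prime] - xi[m]))

def bce_loss(h: float, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy l(h, p); eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(h * np.log(p) + (1.0 - h) * np.log(1.0 - p))
```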

Additionally, we minimize this loss with inverse weighting by $P(A_t)$ to target a score under a uniform distribution over $A$. This inverse weighting isn't strictly necessary, however, as the number of comparisons we collect is very close to equal across model pairs. We solve the following optimization to get our final BT score:

$$s(\hat{P}) = \operatorname*{arg\,min}_{\xi} \sum_{t=1}^{T} \frac{1}{P(A_t)}\, l\!\left(H_t, \frac{1}{1 + e^{\xi_{A_{t,2}} - \xi_{A_{t,1}}}}\right)$$

where $A_t \sim P$. This score is converted to an Elo scale with the simple conversion $1000 + s(\hat{P}) \times 400$, and the models are sorted by this value to get our final ranking.
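Putting the pieces together, here is a minimal sketch of the reweighted fit using `scipy.optimize.minimize`. The function name, the centering of the coefficients, and the toy data are illustrative choices for this sketch, not a description of our exact implementation:

```python
import numpy as np
from collections import Counter
from scipy.optimize import minimize

def fit_bt_scores(comparisons, num_models):
    """Fit BT coefficients by reweighted MLE and convert to the Elo scale.

    comparisons: list of ((m, m_prime), h) with h in {0.0, 0.5, 1.0}.
    """
    pairs = [pair for pair, _ in comparisons]
    outcomes = np.array([h for _, h in comparisons])
    idx_1 = np.array([m for m, _ in pairs])
    idx_2 = np.array([m2 for _, m2 in pairs])

    # Inverse weighting by the empirical pair probability P(A_t), so the
    # objective targets a uniform distribution over pairs in A.
    counts = Counter(pairs)
    weights = np.array([len(pairs) / counts[pair] for pair in pairs])

    def objective(xi):
        p = 1.0 / (1.0 + np.exp(xi[idx_2] - xi[idx_1]))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        loss = -(outcomes * np.log(p) + (1.0 - outcomes) * np.log(1.0 - p))
        return np.sum(weights * loss)

    result = minimize(objective, x0=np.zeros(num_models), method="L-BFGS-B")
    xi_hat = result.x - result.x.mean()   # BT scores are identified only up to a shift
    return 1000 + 400 * xi_hat            # Elo-scale conversion from the text

# Toy data: ((m, m'), H_t) pairs; in practice these come from the human ratings.
comparisons = [((0, 1), 1.0), ((0, 2), 0.5), ((1, 2), 0.0), ((0, 1), 1.0)]
elo = fit_bt_scores(comparisons, num_models=3)
ranking = np.argsort(-elo)  # model indices from highest to lowest Elo-scale score
```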

Confidence Intervals

To enhance our understanding of the reliability of our Elo-scale Bradley-Terry ratings, we estimate confidence intervals using bootstrapping. Bootstrapping is a resampling technique that allows us to assess the variability of our estimates by repeatedly sampling from the data with replacement.

Here’s how we apply bootstrapping to our Elo-scale rating computation:

  1. Generate Bootstrap Samples: We repeatedly sample our dataset with replacement, creating multiple bootstrap samples. Each sample is the same size as the original dataset but contains some repeated observations due to the nature of sampling with replacement.
  2. Compute Elo Ratings for Each Sample: For each bootstrap sample, we compute the Elo-scale ratings using our maximum likelihood estimation method above.
  3. Aggregate Results: After computing the Elo-scale ratings for a large number of bootstrap samples (e.g., 1000 rounds), we aggregate the results to estimate the distribution of Elo ratings for each model.
  4. Estimate Confidence Intervals: From the aggregated bootstrap results, we determine the confidence intervals for each model’s Elo-scale rating. We use the 2.5th percentile and the 97.5th percentile of the bootstrap distribution to form a 95% confidence interval. This interval provides a range in which we expect the true Elo-scale rating to lie with 95% confidence.

This approach ensures that our model rankings are not only based on point estimates but also account for the inherent variability in the data, giving us a more comprehensive view of model performance.
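For illustration, the bootstrap procedure above can be sketched as follows, assuming a fitting function like the `fit_bt_scores` sketch earlier that maps a list of comparisons to Elo-scale scores:

```python
import numpy as np

def bootstrap_elo_intervals(comparisons, num_models, fit_fn, num_rounds=1000, seed=0):
    """Percentile-bootstrap 95% confidence intervals for Elo-scale scores.

    fit_fn maps (comparisons, num_models) -> array of Elo-scale scores,
    e.g. the fit_bt_scores sketch above.
    """
    rng = np.random.default_rng(seed)
    n = len(comparisons)
    samples = np.empty((num_rounds, num_models))
    for r in range(num_rounds):
        # Resample the comparison data with replacement (same size as the original).
        resampled = [comparisons[i] for i in rng.integers(0, n, size=n)]
        samples[r] = fit_fn(resampled, num_models)
    # 2.5th / 97.5th percentiles of the bootstrap distribution per model.
    lower = np.percentile(samples, 2.5, axis=0)
    upper = np.percentile(samples, 97.5, axis=0)
    return lower, upper

# Example usage with the illustrative fit from the previous sketch:
# lower, upper = bootstrap_elo_intervals(comparisons, 3, fit_bt_scores)
```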

Model endpoints we query

| Model Name | Model Version | Endpoint |
| --- | --- | --- |
| GPT-4o (May 2024) | gpt-4o-2024-05-13 | https://api.openai.com/v1/chat/completions |
| GPT-4o (August 2024) | gpt-4o-2024-08-06 | https://api.openai.com/v1/chat/completions |
| ChatGPT-4o-latest (November 2024) | chatgpt-4o-latest | https://api.openai.com/v1/chat/completions |
| GPT-4 Turbo Preview | gpt-4-0125-preview | https://api.openai.com/v1/chat/completions |
| GPT-4o Mini | gpt-4o-mini-2024-07-18 | https://api.openai.com/v1/chat/completions |
| GPT-4 | gpt-4-0613 | https://api.openai.com/v1/chat/completions |
| GPT-4 (November 2023) | gpt-4-1106-preview | https://api.openai.com/v1/chat/completions |
| o1-preview | o1-preview-2024-09-12 | https://api.openai.com/v1/chat/completions |
| o1-mini | o1-mini-2024-09-12 | https://api.openai.com/v1/chat/completions |
| Gemini 1.0 Pro | gemini-1.0-pro-001 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Pro (May 2024) | gemini-1.5-pro-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 |
| Gemini 1.5 Pro (April 2024) | gemini-1.5-pro-preview-0409 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Pro (August 27, 2024) | gemini-1.5-pro-exp-0827 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Pro (November 2024) | gemini-1.5-pro | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Flash | gemini-1.5-flash-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 |
| Gemini 1.5 Flash 002 | gemini-1.5-flash-002 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 |
| Gemini 2.0 Flash (Experimental) | gemini-2.0-flash-exp | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemma 2 27B | gemma-2-27b | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Claude 3 Opus | claude-3-opus-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Claude 3 Sonnet | claude-3-sonnet-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Claude 3.5 Sonnet (June 2024) | claude-3-5-sonnet-20240620 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Claude 3.5 Sonnet (October 2024) | claude-3-5-sonnet-20241022 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Mistral Large | mistral-large-2402 | Self-hosted |
| Mistral Large 2 | mistral-large-2407 | Self-hosted |
| Pixtral 12B (September 2024) | Pixtral-12B-2409 | Self-hosted |
| Pixtral Large (November 2024) | Pixtral-Large-Instruct-2411 | Self-hosted |
| CodeLlama 34B Instruct | codellama-34b-instruct | Self-hosted |
| Llama 3 70B Instruct | llama-3-70b-instruct | Self-hosted |
| Llama 3.1 405B Instruct | llama-3.1-405b-instruct | Self-hosted |
| Llama 3.1 8B Instruct | llama-3.1-8b-instruct | Self-hosted |
| Llama 3.2 90B Vision Instruct | llama-3.2-90B-vision-instruct | Self-hosted |
| Llama 3.2 11B Vision Instruct | llama-3.2-11B-vision-instruct | Self-hosted |
| Llama 3.3 70B Instruct | llama-3.3-70b-instruct | Self-hosted |
| Command R+ | command-r-plus-08-2024 | Queried through Cohere's API |
| Aya 23 35B | aya-23-35B | Queried through Cohere's API |
| DeepSeek V2 Chat | DeepSeek-V2-Chat | Self-hosted |
| Yi 1.5 34B Chat | Yi-1.5-34B-Chat | Self-hosted |
| Qwen 2 72B Instruct | Qwen2-72B-Instruct | Self-hosted |
| Jais Adapted 70B | jais-adapted-70b | Self-hosted |
| Phi 3.5 Vision Instruct | phi-3.5-vision-instruct | Self-hosted |

1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is not high because of the short time gap and because model developers were not aware of the scrape.
2. We skipped gpt-4-turbo-2024-04-09, prioritizing evaluating gpt-4o instead.
3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is not high because of the short time gap and because model developers were not aware of the scrape.
4. Note that this is NOT the newest Aya model; we are actively working on evaluating Aya Expanse.
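For illustration, a minimal request to the OpenAI chat completions endpoint listed above looks like the following. The model and prompt are placeholders, and this generic example of the public API is not a description of our evaluation harness:

```python
import os
import requests

# Standard chat completions request; assumes OPENAI_API_KEY is set in the environment.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-2024-08-06",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```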