
How we compute the Elo-scale Rankings

We use Elo-scale rankings to compare model performance across some of our datasets. Our human evaluators compare the responses of two models to the same prompt and rate which is better across a range of domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena and use the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.

First, some definitions:

Over our $M$ models, we let $A = \{(m, m') : m < m' \text{ and } m, m' \in [M]\}$ denote our comparative data set.

At time $t \in \mathbb{N}$, we serve the human evaluator a pair of models $A_t \in A$ and record their response $H_t \in \{0, 0.5, 1\}$. A 1 means that model $m$ is preferred over model $m'$, a 0 means that model $m'$ is preferred, and a 0.5 means that the models were equally preferred.

With Bradley-Terry, we use a logistic relationship to model the probability that model $m$ is preferred over model $m'$:

$$P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}}$$

Where $\xi$ is an $M$-length vector of "BT" coefficients, one entry per model.
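As a quick illustration with made-up coefficients (not values fit to our data): if $\xi_m = 0.5$ and $\xi_{m'} = 0.2$, then

$$P(H_t = 1) = \frac{1}{1 + e^{0.2 - 0.5}} = \frac{1}{1 + e^{-0.3}} \approx 0.57,$$

so model $m$ would be expected to be preferred about 57% of the time.

We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss: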

$$s(P) = \operatorname*{arg\,min}_{\xi}\, \mathbb{E}^{\mathbb{P}}_{A,H}\left[ l\!\left(H,\ \frac{1}{1 + e^{\xi_{A_2} - \xi_{A_1}}}\right) \right]$$

Where $l$ is the binary cross-entropy loss,

$$l(h, p) = -\big(h \log(p) + (1 - h) \log(1 - p)\big)$$

Additionally, we minimize this loss with inverse weighting by $P(A_t)$, the probability that the pair $A_t$ is served, so that the score targets a uniform distribution over $A$. This inverse weighting isn't strictly necessary in our case, however, as our pairwise comparisons are spread nearly equally across model pairs. We compute our final BT scores from our $T$ comparisons with the formula below:

$$s(\hat{P}) = \operatorname*{arg\,min}_{\xi} \sum_{t=1}^{T} \frac{1}{P(A_t)}\, l\!\left(H_t,\ \frac{1}{1 + e^{\xi_{A_{t,2}} - \xi_{A_{t,1}}}}\right)$$

where $A_t \sim P$. This score is converted to an Elo scale with the simple conversion $1000 + s(\hat{P}) \times 400$ and sorted to get our final ranking.
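To make this concrete, here is a minimal Python sketch of the reweighted fit and the Elo-scale conversion. The input format (first model index, second model index, $H_t$), the empirical estimate of $P(A_t)$ from pair frequencies, the mean-centering of the coefficients, and all names are illustrative assumptions rather than details of our actual pipeline.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bt_elo(comparisons, num_models, eps=1e-12):
    """Fit Bradley-Terry coefficients by reweighted MLE and return Elo-scale scores."""
    a = np.array([c[0] for c in comparisons])                # index of the first model in each pair
    b = np.array([c[1] for c in comparisons])                # index of the second model
    h = np.array([c[2] for c in comparisons], dtype=float)   # evaluator response H_t in {0, 0.5, 1}

    # Inverse weighting by the empirical pair frequency, our stand-in for P(A_t).
    pair_ids = a * num_models + b
    _, inverse, counts = np.unique(pair_ids, return_inverse=True, return_counts=True)
    weights = len(comparisons) / counts[inverse]             # proportional to 1 / P(A_t)

    def loss(xi):
        # Bradley-Terry probability that the first model wins each comparison.
        p = 1.0 / (1.0 + np.exp(xi[b] - xi[a]))
        p = np.clip(p, eps, 1.0 - eps)
        # Reweighted binary cross-entropy, matching the objective above.
        return -np.sum(weights * (h * np.log(p) + (1.0 - h) * np.log(1.0 - p)))

    result = minimize(loss, x0=np.zeros(num_models), method="L-BFGS-B")
    xi = result.x - result.x.mean()                          # pin down the additive degree of freedom
    return 1000 + 400 * xi                                   # Elo-scale conversion from the text

# Tiny usage example with three hypothetical models indexed 0..2.
comparisons = [(0, 1, 1.0), (0, 1, 0.5), (1, 2, 1.0), (0, 2, 1.0), (1, 2, 0.0)]
print(fit_bt_elo(comparisons, num_models=3))
```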

Confidence Intervals

To enhance our understanding of the reliability of our Elo-scale Bradley-Terry ratings, we estimate confidence intervals using bootstrapping. Bootstrapping is a resampling technique that allows us to assess the variability of our estimates by repeatedly sampling from the data with replacement.

Here’s how we apply bootstrapping to our Elo-scale rating computation:

  1. Generate Bootstrap Samples: We repeatedly sample our dataset with replacement, creating multiple bootstrap samples. Each sample is the same size as the original dataset but contains some repeated observations due to the nature of sampling with replacement.
  2. Compute Elo Ratings for Each Sample: For each bootstrap sample, we compute the Elo-scale ratings using our maximum likelihood estimation method above.
  3. Aggregate Results: After computing the Elo-scale ratings for a large number of bootstrap samples (e.g., 1000 rounds), we aggregate the results to estimate the distribution of Elo ratings for each model.
  4. Estimate Confidence Intervals: From the aggregated bootstrap results, we determine the confidence intervals for each model’s Elo-scale rating. We use the 2.5th percentile and the 97.5th percentile of the bootstrap distribution to form a 95% confidence interval. This interval provides a range in which we expect the true Elo-scale rating to lie with 95% confidence.

This approach ensures that our model rankings are not only based on point estimates but also account for the inherent variability in the data, giving us a more comprehensive view of model performance.
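As an illustration of the four steps above, here is a minimal Python sketch that reuses the hypothetical fit_bt_elo helper from the previous snippet. The 1,000 rounds and the 2.5th/97.5th percentiles follow the description above; the function name, the fixed seed, and the assumption that every model appears in each resample are ours.

```python
import numpy as np

def bootstrap_elo_intervals(comparisons, num_models, rounds=1000, seed=0):
    """Return (lower, upper) bounds of a 95% bootstrap confidence interval per model."""
    rng = np.random.default_rng(seed)
    n = len(comparisons)
    ratings = np.empty((rounds, num_models))
    for r in range(rounds):
        # 1. Resample the comparisons with replacement, same size as the original data.
        idx = rng.integers(0, n, size=n)
        sample = [comparisons[i] for i in idx]
        # 2. Refit the Elo-scale ratings on the bootstrap sample
        #    (assumes every model still appears in the resample).
        ratings[r] = fit_bt_elo(sample, num_models)
    # 3-4. Aggregate and take the 2.5th / 97.5th percentiles per model.
    lower = np.percentile(ratings, 2.5, axis=0)
    upper = np.percentile(ratings, 97.5, axis=0)
    return lower, upper
```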

Model endpoints we query

| Model Name | Model Version | Endpoint |
|---|---|---|
| GPT-4o (May 2024) | gpt-4o-2024-05-13 | https://api.openai.com/v1/chat/completions |
| GPT-4o (August 2024) | gpt-4o-2024-08-06 | https://api.openai.com/v1/chat/completions |
| GPT-4 Turbo Preview | gpt-4-0125-preview | https://api.openai.com/v1/chat/completions |
| GPT-4o Mini | gpt-4o-mini-2024-07-18 | https://api.openai.com/v1/chat/completions |
| GPT-4 | gpt-4-0613 | https://api.openai.com/v1/chat/completions |
| o1-preview | o1-preview-2024-09-12 | https://api.openai.com/v1/chat/completions |
| o1-mini | o1-mini-2024-09-12 | https://api.openai.com/v1/chat/completions |
| Gemini 1.0 Pro | gemini-1.0-pro-001 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Pro (May 2024) | gemini-1.5-pro-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 |
| Gemini 1.5 Pro (April 2024) | gemini-1.5-pro-preview-0409 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Gemini 1.5 Flash | gemini-1.5-flash-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 |
| Gemini 1.5 Pro (August 27, 2024) | gemini-1.5-pro-exp-0827 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ |
| Claude 3 Opus | claude-3-opus-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Claude 3 Sonnet | claude-3-sonnet-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20240620 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages |
| Mistral Large | mistral-large-2402 | Self-hosted |
| Mistral Large 2 | mistral-large-2407 | Self-hosted |
| CodeLlama 34B Instruct | codellama-34b-instruct | Self-hosted |
| Llama 3 70B Instruct | llama-3-70b-instruct | Self-hosted |
| Llama 3.1 405B Instruct | llama-3.1-405b-instruct | Self-hosted |
| Llama 3.1 8B Instruct | llama-3.1-8b-instruct | Self-hosted |
| Llama 3.2 90B Vision Instruct | llama-3.2-90B-vision-instruct | Self-hosted |
| Command R+ | command-r-plus-08-2024 | Queried through Cohere's API |

1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is not high because of the short time gap and because model developers were not aware of the scrape.
2. We skipped gpt-4-turbo-2024-04-09, prioritizing the evaluation of gpt-4o instead.
3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is not high because of the short time gap and because model developers were not aware of the scrape.
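For reference, a request against the OpenAI chat completions endpoint listed in the table looks roughly like the sketch below. The prompt, sampling parameters, and the retry and error handling used in our evaluation harness are not shown, and the OPENAI_API_KEY environment variable is assumed to be set.

```python
import os
import requests

# Minimal example call to the chat completions endpoint from the table above.
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-2024-05-13",  # one of the model versions in the table
        "messages": [{"role": "user", "content": "Hello!"}],  # placeholder prompt
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```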