Methodology
How we compute the Elo-scale Rankings
We use Elo-scale rankings to compare model performance across some of our datasets. Our human evaluators compare the responses of two models to the same prompt and rate which is better across a range of domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena and use the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.
First, some definitions:
Over our $M$ models, we let $A = \{(m, m') : m < m',\ m, m' \in [M]\}$ denote our comparative data set.

At time $t$, we serve the human a pair of models $A_t = (m, m') \in A$ and we have our evaluator's response $H_t \in \{0, 0.5, 1\}$. A 1 means that model $m$ is preferred over model $m'$ and a 0.5 means that the models were equally preferred.

With Bradley-Terry, we use a logistic relationship to model the probability that this is true with:

$$P(H_t = 1) = \frac{1}{1 + e^{\xi_{m'} - \xi_m}}$$

where $\xi$ is an $M$-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:

$$\hat{\xi} = \arg\min_{\xi} \sum_{t=1}^{T} \ell\!\left(H_t,\ \frac{1}{1 + e^{\xi_{m'} - \xi_m}}\right)$$

where $\ell$ is the binary cross-entropy loss, $\ell(h, p) = -\big(h \log(p) + (1 - h)\log(1 - p)\big)$, and $T$ is the total number of comparisons.

Additionally, we'll minimize this loss while using inverse weighting by $P(A_t)$, the probability that pair $A_t$ is served, to target a score with a uniform distribution over $A$. This inverse weighting isn't strictly necessary, however, as each pair of models is served at a very nearly equal rate. We solve the below reweighted estimation to get our final BT-score:

$$\hat{\xi} = \arg\min_{\xi} \sum_{t=1}^{T} \frac{1}{P(A_t)}\, \ell\!\left(H_t,\ \frac{1}{1 + e^{\xi_{m'} - \xi_m}}\right)$$

where the indices $(m, m')$ are those of the pair $A_t$. This score is converted to an Elo scale with a simple linear conversion (a fixed scale and offset, following Chatbot Arena) and is sorted to get our final ranking.
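To make the estimation concrete, here is a minimal sketch of the reweighted Bradley-Terry fit and Elo-scale conversion described above. The function name `fit_bt_elo`, the battle-tuple format, and the scale/offset constants (400 and 1000, commonly used in Chatbot Arena-style implementations) are illustrative assumptions, not our production code.

```python
from collections import Counter

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid


def fit_bt_elo(battles, num_models, scale=400.0, base=1000.0):
    """Fit reweighted Bradley-Terry coefficients and map them onto an Elo-like scale.

    battles: iterable of (m, m_prime, h) with integer model indices in [0, num_models)
    and h in {0, 0.5, 1}, where 1 means model m was preferred over m_prime.
    """
    battles = list(battles)
    idx_m = np.array([m for m, _, _ in battles])
    idx_mp = np.array([mp for _, mp, _ in battles])
    h = np.array([y for _, _, y in battles], dtype=float)

    # Inverse weighting by the empirical rate at which each pair was served,
    # i.e. 1 / P(A_t), targeting a uniform distribution over pairs.
    pair_counts = Counter(zip(idx_m.tolist(), idx_mp.tolist()))
    weights = np.array([len(battles) / pair_counts[(m, mp)]
                        for m, mp in zip(idx_m.tolist(), idx_mp.tolist())])

    def loss(xi):
        # P(H_t = 1) = 1 / (1 + exp(xi_{m'} - xi_m)) = sigmoid(xi_m - xi_{m'})
        p = expit(xi[idx_m] - xi[idx_mp])
        eps = 1e-12  # guard against log(0)
        bce = -(h * np.log(p + eps) + (1.0 - h) * np.log(1.0 - p + eps))
        return np.sum(weights * bce)

    xi_hat = minimize(loss, np.zeros(num_models), method="BFGS").x
    xi_hat -= xi_hat.mean()  # BT coefficients are identified only up to an additive constant
    return base + scale * xi_hat  # simple linear map onto an Elo-like scale
```

Because the Elo conversion is a monotone linear map, the particular scale and offset do not affect the resulting ranking, only how readable the numbers are.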
Confidence Intervals
To assess the reliability of our Elo-scale Bradley-Terry ratings, we estimate confidence intervals using bootstrapping, a resampling technique that quantifies the variability of our estimates by repeatedly sampling from the data with replacement.
Here’s how we apply bootstrapping to our Elo-scale rating computation:
- Generate Bootstrap Samples: We repeatedly sample our dataset with replacement, creating multiple bootstrap samples. Each sample is the same size as the original dataset but contains some repeated observations due to the nature of sampling with replacement.
- Compute Elo Ratings for Each Sample: For each bootstrap sample, we compute the Elo-scale ratings using our maximum likelihood estimation method above.
- Aggregate Results: After computing the Elo-scale ratings for a large number of bootstrap samples (e.g., 1000 rounds), we aggregate the results to estimate the distribution of Elo ratings for each model.
- Estimate Confidence Intervals: From the aggregated bootstrap results, we determine the confidence intervals for each model’s Elo-scale rating. We use the 2.5th percentile and the 97.5th percentile of the bootstrap distribution to form a 95% confidence interval. This interval provides a range in which we expect the true Elo-scale rating to lie with 95% confidence.
This approach ensures that our model rankings are not only based on point estimates but also account for the inherent variability in the data, giving us a more comprehensive view of model performance.
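For illustration, the bootstrap procedure above can be sketched as follows. The helper name `bootstrap_elo_ci`, the `fit_fn` argument (standing in for the hypothetical `fit_bt_elo` sketch above), and the 1,000-round default are assumptions for the example rather than our exact implementation.

```python
import numpy as np


def bootstrap_elo_ci(battles, num_models, fit_fn, n_rounds=1000, seed=0):
    """Return per-model (2.5th, 97.5th) percentile Elo bounds, i.e. a 95% CI."""
    rng = np.random.default_rng(seed)
    battles = list(battles)
    n = len(battles)
    ratings = np.empty((n_rounds, num_models))
    for r in range(n_rounds):
        # 1. Resample the comparisons with replacement, same size as the original data.
        sample = [battles[i] for i in rng.integers(0, n, size=n)]
        # 2. Re-fit the Elo-scale ratings on the bootstrap sample.
        ratings[r] = fit_fn(sample, num_models)
    # 3.-4. Aggregate across rounds and take the 2.5th / 97.5th percentiles per model.
    lower = np.percentile(ratings, 2.5, axis=0)
    upper = np.percentile(ratings, 97.5, axis=0)
    return lower, upper


# Example usage with the hypothetical fit_bt_elo sketch above:
# lower, upper = bootstrap_elo_ci(battles, num_models, fit_bt_elo, n_rounds=1000)
```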
Model endpoints we query
1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is low given the short time gap and the fact that model developers were not aware of the scrape.
2. We skipped gpt-4-turbo-2024-04-09, prioritizing the evaluation of gpt-4o instead.
3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is low given the short time gap and the fact that model developers were not aware of the scrape.