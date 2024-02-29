Methodology

How we compute the Elo-scale Rankings

We use Elo-scale rankings to compare model performance across some of our datasets. Our human evaluators compare the responses of two models to the same prompt and rate which is better across multiple domains and capabilities (see the posts for each dataset for more details). From these ratings we determine which model won, lost, or tied. We follow the same method as Chatbot Arena and use the Bradley-Terry model to perform a (reweighted) maximum likelihood estimation on our data points.

First, some definitions:

Over our $M$ models, we let $A=\{(m,m'): m<m', \text{ and } m, m' \in [M]\}$ denote our comparative data set.

At time $t \in \mathbb{N}$, we serve the human a pair of models $A_{t}\in A$ and we have our evaluator’s response $H_{t}\in \{0, 0.5, 1\}$. A 1 means that model $m$ is preferred over model $m'$, a 0 means that model $m'$ is preferred over model $m$, and a 0.5 means that the models were equally preferred.

With Bradley-Terry, we use a logistic relationship to model the probability that model $m$ is preferred over model $m'$:

$$P(H_{t}=1)=\cfrac{1}{1+e^{\xi_{m'}-\xi_{m}}}$$
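As a minimal sketch of the logistic relationship above, the snippet below evaluates the probability for two coefficients. The function name `bt_win_probability` and the example coefficient values are illustrative assumptions, not part of our pipeline.

```python
import math

def bt_win_probability(xi_m: float, xi_m_prime: float) -> float:
    """P(H_t = 1): probability that model m is preferred over model m'."""
    return 1.0 / (1.0 + math.exp(xi_m_prime - xi_m))

# Equal coefficients give an even split.
p_equal = bt_win_probability(0.0, 0.0)

# A coefficient one unit higher for m gives roughly a 73% preference for m.
p_stronger = bt_win_probability(1.0, 0.0)
```

Swapping the two arguments gives the complementary probability, so the model only depends on the difference between the two coefficients.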

where $\xi$ is an $M$-length vector of "BT" coefficients. We then want to estimate the BT coefficients by minimizing the binary cross-entropy loss:

$$s(\hat{P}) = \mathop{\operatorname{arg\,min}}\limits_{\xi}\mathbb{E}^{\mathbb{P}}_{A,H}\bigg[l\bigg(H, \cfrac{1}{1+e^{\xi_{A_{2}}-\xi_{A_{1}}}} \bigg)\bigg]$$

where $l$ is the binary cross-entropy loss,

$$l(h,p)=-(h\log(p)+(1-h)\log(1-p))$$
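The loss above can be sketched in a few lines. The function name `bce_loss` and the `eps` clamp, which only guards against `log(0)`, are assumptions added for illustration.

```python
import math

def bce_loss(h: float, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy l(h, p) for an outcome h in {0, 0.5, 1}."""
    # Clamp p away from 0 and 1 so the logarithms stay finite (an added safeguard).
    p = min(max(p, eps), 1.0 - eps)
    return -(h * math.log(p) + (1.0 - h) * math.log(1.0 - p))

# A tie (h = 0.5) scored against an even prediction costs log(2).
tie_loss = bce_loss(0.5, 0.5)

# A confident correct prediction costs less than a confident wrong one.
good = bce_loss(1.0, 0.9)
bad = bce_loss(1.0, 0.1)
```

Because a tie is encoded as 0.5, it contributes half a win and half a loss to the objective.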

Additionally, we minimize this loss with inverse weighting by $P(A_{t})$ to target a score with a uniform distribution over $A$. This inverse weighting isn’t strictly necessary, however, because each pair of models receives very nearly the same number of comparisons. We solve the following weighted problem to get our final BT score:

$$s(\hat{P}) = \mathop{\operatorname{arg\,min}}\limits_{\xi}\sum^{T}_{t=1}\cfrac{1}{P(A_{t})}l\bigg(H_{t},\cfrac{1}{1+e^{\xi_{A_{t,2}}-\xi_{A_{t,1}}}}\bigg)$$

where $A_{t} \sim P$. This score is converted to an Elo-scale with the simple conversion $1000 + s(\hat{P})\times 400$ and is sorted to get our final ranking.
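The weighted objective and the Elo-scale conversion can be sketched as follows. This is an illustration under stated assumptions rather than our production code: the names `fit_bt` and `to_elo_scale` are hypothetical, $P(A_t)$ is estimated by the empirical frequency of each pair, the optimizer is plain gradient descent with an assumed step size and iteration count, the coefficients are centered at zero because the model only identifies differences, and the toy data is made up.

```python
import math
from collections import Counter

def fit_bt(battles, num_models, lr=0.5, steps=2000):
    """Estimate BT coefficients from (m, m_prime, h) records, h in {0, 0.5, 1}."""
    pair_counts = Counter((m, mp) for m, mp, _ in battles)
    total = len(battles)
    # Inverse weights 1 / P(A_t), with P(A_t) taken as the pair's empirical frequency.
    weights = [total / pair_counts[(m, mp)] for m, mp, _ in battles]
    weight_sum = sum(weights)  # rescaling the loss does not move its minimizer
    xi = [0.0] * num_models
    for _ in range(steps):
        grad = [0.0] * num_models
        for (m, mp, h), w in zip(battles, weights):
            p = 1.0 / (1.0 + math.exp(xi[mp] - xi[m]))
            g = w * (p - h) / weight_sum  # derivative of the weighted loss w.r.t. xi[m]
            grad[m] += g
            grad[mp] -= g
        xi = [x - lr * g for x, g in zip(xi, grad)]
    mean = sum(xi) / num_models
    return [x - mean for x in xi]

def to_elo_scale(xi):
    """Apply the conversion 1000 + 400 * coefficient."""
    return [1000 + 400 * x for x in xi]

# Made-up comparisons between three models, indexed 0, 1, 2.
battles = (
    [(0, 1, 1)] * 7 + [(0, 1, 0)] * 3
    + [(1, 2, 1)] * 6 + [(1, 2, 0)] * 4
    + [(0, 2, 1)] * 8 + [(0, 2, 0.5)] * 2
)
xi = fit_bt(battles, num_models=3)
elo = to_elo_scale(xi)
ranking = sorted(range(3), key=lambda i: elo[i], reverse=True)
```

In this toy data every pair appears equally often, so all weights are equal and the reweighting has no effect, which mirrors the remark that it is not strictly necessary when pair counts are balanced.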

Confidence Intervals

To enhance our understanding of the reliability of our Elo-scale Bradley-Terry ratings, we estimate confidence intervals using bootstrapping. Bootstrapping is a resampling technique that allows us to assess the variability of our estimates by repeatedly sampling from the data with replacement.

Here’s how we apply bootstrapping to our Elo-scale rating computation:



1. Generate Bootstrap Samples: We repeatedly sample our dataset with replacement, creating multiple bootstrap samples. Each sample is the same size as the original dataset but contains some repeated observations due to the nature of sampling with replacement.
2. Compute Elo Ratings for Each Sample: For each bootstrap sample, we compute the Elo-scale ratings using our maximum likelihood estimation method above.
3. Aggregate Results: After computing the Elo-scale ratings for a large number of bootstrap samples (e.g., 1000 rounds), we aggregate the results to estimate the distribution of Elo ratings for each model.
4. Estimate Confidence Intervals: From the aggregated bootstrap results, we determine the confidence intervals for each model’s Elo-scale rating. We use the 2.5th percentile and the 97.5th percentile of the bootstrap distribution to form a 95% confidence interval. This interval provides a range in which we expect the true Elo-scale rating to lie with 95% confidence.
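The four steps above can be sketched as below. All names are hypothetical: `fit_elo_scale` is a compact stand-in for the weighted maximum likelihood fit (gradient descent with an assumed step size and iteration count), `percentile` uses linear interpolation as one possible convention, and the toy data and the reduced number of rounds are chosen only so the example runs quickly.

```python
import math
import random
from collections import Counter

def fit_elo_scale(battles, num_models, lr=0.5, steps=200):
    """Hypothetical stand-in for the weighted Bradley-Terry fit, on the Elo scale."""
    counts = Counter((m, mp) for m, mp, _ in battles)
    weights = [len(battles) / counts[(m, mp)] for m, mp, _ in battles]
    weight_sum = sum(weights)
    xi = [0.0] * num_models
    for _ in range(steps):
        grad = [0.0] * num_models
        for (m, mp, h), w in zip(battles, weights):
            p = 1.0 / (1.0 + math.exp(xi[mp] - xi[m]))
            g = w * (p - h) / weight_sum
            grad[m] += g
            grad[mp] -= g
        xi = [x - lr * g for x, g in zip(xi, grad)]
    return [1000 + 400 * x for x in xi]

def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted values, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def bootstrap_intervals(battles, num_models, rounds=1000, seed=0):
    """Return a (2.5th, 97.5th) percentile interval per model."""
    rng = random.Random(seed)
    ratings = [[] for _ in range(num_models)]
    for _ in range(rounds):
        # Step 1: resample with replacement, same size as the original data.
        resample = rng.choices(battles, k=len(battles))
        # Steps 2 and 3: refit and collect the ratings for each model.
        for model, rating in enumerate(fit_elo_scale(resample, num_models)):
            ratings[model].append(rating)
    # Step 4: read off the 95% interval from each bootstrap distribution.
    return [(percentile(sorted(r), 2.5), percentile(sorted(r), 97.5)) for r in ratings]

# Made-up comparisons between three models, indexed 0, 1, 2.
battles = (
    [(0, 1, 1)] * 7 + [(0, 1, 0)] * 3
    + [(1, 2, 1)] * 6 + [(1, 2, 0)] * 4
    + [(0, 2, 1)] * 8 + [(0, 2, 0.5)] * 2
)
intervals = bootstrap_intervals(battles, num_models=3, rounds=100)
```

With only 30 toy comparisons the resulting intervals are wide; larger datasets narrow them. A fixed seed is used so that repeated runs give the same intervals.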

Model endpoints we query

| Model Name | Model Version | Endpoint | API Docs |
| --- | --- | --- | --- |
| GPT-4o | gpt-4o | https://api.openai.com/v1/chat/completions | https://platform.openai.com/docs/models/overview |
| GPT-4 Turbo Preview | gpt-4-0125-preview | https://api.openai.com/v1/chat/completions | https://platform.openai.com/docs/models/overview |
| Gemini 1.0 Pro | gemini-1.0-pro-001 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ | https://ai.google.dev/gemini-api/docs/models/gemini |
| Gemini 1.5 Pro (May 2024) | gemini-1.5-pro-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 | https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versioning#preview-version |
| Gemini 1.5 Pro (April 2024) | gemini-1.5-pro-preview-0409 | Queried through the SDK, which hits: https://generativelanguage.googleapis.com/v1beta/ | https://ai.google.dev/gemini-api/docs/models/gemini |
| Gemini 1.5 Flash | gemini-1.5-flash-preview-0514 | Queried through the SDK, which hits: https://aiplatform.googleapis.com/$discovery/rest?version=v1beta1 | https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versioning#preview-version |
| Claude 3 Opus | claude-3-opus-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages | https://docs.anthropic.com/claude/docs/models-overview |
| Claude 3 Sonnet | claude-3-sonnet-20240229 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages | https://docs.anthropic.com/claude/docs/models-overview |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20240620 | Queried through the SDK, which hits: https://api.anthropic.com/v1/messages | https://docs.anthropic.com/claude/docs/models-overview |
| Mistral Large | mistral-large-2402 | Self-hosted | |
| CodeLlama 34B Instruct | codellama-34b-instruct | Self-hosted | https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf |
| Llama 3 70B Instruct | llama-3-70b-instruct | Self-hosted | |
| Llama 3.1 405B Instruct | llama-3.1-405b-instruct | Self-hosted | |

The bootstrapped confidence intervals ensure that our model rankings are not based only on point estimates but also account for the inherent variability in the data, giving us a more comprehensive view of model performance.

1. We scraped GPT-4o 29 days after scraping gpt-4-0125-preview. We believe the overfitting risk is not high because the time gap was short and model developers were not aware of the scrape.

2. We skipped gpt-4-turbo-2024-04-09, prioritizing evaluating gpt-4o instead.

3. We scraped gemini-1.5-pro-preview-0514 29 days after scraping gemini-1.5-pro-preview-0409. We believe the overfitting risk is not high because the time gap was short and model developers were not aware of the scrape.