
Instruction Following

Introduction

The Precise Instructions Following Prompts Dataset is composed of 1,054 instruction-following prompts aimed at assessing the ability of AI models to interpret and execute detailed commands, with a focus on precision and specificity.

A popular method for assessing LLMs on instruction-following tasks is the IFEval benchmark, which evaluates LLMs using prompts containing programmatically verifiable instructions. However, the scenarios in this benchmark are limited by the requirement that they be automatically evaluable. Additionally, like other open-source benchmarks, IFEval is prone to overfitting.

To address these limitations, we built the Scale AI Precise Instructions Following Prompts Dataset: a private set of instruction-following prompts intended to be paired with human evaluations. The dataset includes 1,054 instruction-following prompts grouped into 9 categories, including “act as if”, content creation, and brainstorming, and covering real applications and use cases for instruction-following tasks. It was generated by a diverse group of over 40 human annotators and developed through a five-step process to ensure the final prompts test a model’s capability to understand and execute instructions with specificity. The ultimate intent is to run human evaluations on models’ responses to this prompt set.

Dataset Description

The dataset comprises 1,054 prompts designed for single-turn instruction following, intended to evaluate nuanced directive comprehension by the model. It tests the model's ability to execute complex instructions with clarity and specificity.

Constructing this dataset posed challenges in maintaining prompt diversity and specificity, which we addressed with a review process to ensure each prompt's uniqueness. The dataset covers a broad spectrum of 9 categories, ensuring diversity in instruction-following tasks.

Data Sample

Category: Generation - General Text Creation
# of prompts: 385
Description: Tasks involve creating original content such as text messages, recipes, jokes, and essays.

User:
Design a seven-day food plan that is both varied and nutrient-dense, including three vegetarian meals per day. Make sure that every meal has a high protein level and includes a minimum of one fruit or vegetable. Furthermore, make sure that one of the meals has a serving of dairy each day, add whole grains to another, and keep lentils and tomatoes out of every meal. The complexity is increased by forbidding meal recurrence during the week. This complex request calls for diversity, balance, and certain exclusions in addition to meeting a vegetarian, high-protein focus for a satisfying meal.

ifDistributionDark.png
Breakdown of Categories (Percentage)

Prompt Dataset Construction

We assembled a diverse team of annotators with backgrounds in linguistics, cognitive psychology, and specific fields such as education, mathematics, and STEM, who we believe are best equipped for this task. This diversity of backgrounds was meant to test models' ability to understand and execute instructions across different fields.

To construct the dataset, we followed these steps:

  1. Content creation control: The annotators were barred from using public resources or LLMs in order to ensure the production of original human-generated prompts.
  2. Initial attempts: The team of human annotators generated a set of more than 2,500 prompts aimed at covering the 9 categories in the instruction-following scope.
  3. Full Set Review Process: The full initial set underwent multiple review stages, including qualitative analysis, automated grammar checks, and final review by trusted reviewers. This refined the set down to 1.5k prompts.
  4. Final Audit (10% sample): An internal team of independent auditors conducted a final quality review on random samples of tasks; this triggered a final review pass in which any task rated below 4 on the 5-point Likert scale was removed, resulting in 1,054 finalized prompts.

This dataset is designed for research into LLMs' ability to follow detailed instructions, providing a resource for benchmarking and improvement.

Quality Controls for Prompts

Based on the evaluation of prompt effectiveness, we formulated guidelines for the annotators, emphasizing clarity, complexity, and specificity. A high-quality prompt includes distinct, non-repetitive elements that direct the model to a precise outcome, eliminating the possibility of vague or generic responses. Effective prompts define clear goals with specific conditions or constraints, challenging the model to employ deep reasoning and problem-solving skills.

In contrast, less effective prompts lack the specificity or challenge necessary to extend the model’s capabilities beyond basic tasks. Our screening process therefore excludes prompts that are easily solvable via simple internet searches or do not require sophisticated AI responses.

To maintain quality, we implemented a multi-stage review pipeline: each prompt underwent two expert reviews to ensure adherence to the guidelines. An internal team of independent auditors then conducted a final quality review, correcting or discarding low-quality entries.

Evaluation Taxonomy

To capture a nuanced assessment, we created an evaluation taxonomy specific to precise instruction-following tasks. Each model response was evaluated across a set of stand-alone criteria covering each of the use cases, and side-by-side against another model's response to measure preference on a 7-point Likert scale.

ifTableSolo1Dark3.png
ifTableSolo2Dark2.png

The above dimensions are broken down into 12 criteria that are rated with a Yes or No score.

After the stand-alone evaluation, responses are compared side-by-side using a Likert scale. This comparative assessment helps in identifying the preferable model response based on a detailed justification tied to the evaluation criteria.

ifEvalCriteriaDark.png
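To make the taxonomy concrete, one annotator's ratings for a single task could be represented roughly as below. This is a minimal sketch: the class and field names are illustrative placeholders covering only a subset of the 12 criteria, not the internal evaluation schema.

```python
from dataclasses import dataclass

@dataclass
class StandaloneRating:
    """One annotator's stand-alone (point-wise) evaluation of a single response.

    Each criterion is a Yes/No judgment (True = Yes); only a few of the 12
    criteria are listed here, and the names are placeholders.
    """
    prompt_request_coverage: bool  # Prompt Adherence - Prompt Request Coverage
    constraints: bool              # Prompt Adherence - Constraints
    relevance: bool                # no irrelevant content added
    accuracy: bool                 # Honesty/Factuality - no incorrect statements
    writing_quality: bool          # Writing dimension

@dataclass
class SideBySideRating:
    """One annotator's pairwise comparison of responses A and B."""
    preference: int     # 7-point Likert: 1 = strongly prefer A ... 7 = strongly prefer B
    justification: str  # free-text justification tied to the criteria
```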

Evaluation Methodology

Each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,054 prompts described above.
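Below is a rough sketch of how such an evaluation schedule could be generated; the `build_schedule` function, the `comparisons_per_pair` parameter, and the random-sampling details are illustrative assumptions rather than the exact procedure used.

```python
import itertools
import random

def build_schedule(models, prompts, comparisons_per_pair=50, seed=0):
    """Pair every model with every other model `comparisons_per_pair` times,
    assigning a randomly chosen prompt to each pairing (illustrative sketch)."""
    rng = random.Random(seed)
    schedule = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(comparisons_per_pair):
            schedule.append({
                "model_a": model_a,
                "model_b": model_b,
                "prompt": rng.choice(prompts),
            })
    rng.shuffle(schedule)  # avoid long runs of the same model pair for annotators
    return schedule
```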

Each evaluation task consists of the following:

  1. Two models generate responses to the same prompt
  2. Annotators provide a point-wise evaluation of each response
  3. Annotators express their preference between the two responses on a 7-point Likert scale

To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel 3 times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After finalizing the tasks, a team of internal independent auditors randomly selected and reviewed 10% of the tasks for quality control.
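The report does not specify how the three parallel ratings are merged; one plausible approach, sketched below purely for illustration, is a majority vote for the Yes/No criteria and the median for the 7-point Likert preference.

```python
from statistics import median

def aggregate_task(ratings):
    """Combine three annotators' ratings for one task (illustrative only).

    `ratings` is a list of dicts such as
    {"criteria": {"constraints": True, ...}, "preference": 5}.
    """
    criteria_names = ratings[0]["criteria"].keys()
    majority = {
        name: sum(r["criteria"][name] for r in ratings) > len(ratings) / 2
        for name in criteria_names
    }
    preference = median(r["preference"] for r in ratings)
    return {"criteria": majority, "preference": preference}
```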

Evaluation Methodology - Pipeline Design
pipelineDark5.png

Evaluation Insights Summary

Alternative Rankings

During our evaluation process, we collected both pairwise rankings and ratings based on individual model responses. For the main leaderboard, we report aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields, to focus on the models' instruction-following abilities. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.
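As a sketch of this aggregation, with hypothetical column names (`prompt_adherence` and `relevance` are assumed to be 0/1 indicators derived from the Yes/No criteria, and the 0-100 scaling is an assumption):

```python
import pandas as pd

def instruction_following_score(ratings: pd.DataFrame) -> float:
    """Aggregated instruction-following score for one model: the average of
    the "prompt adherence" and "relevance" fields over all rated responses.
    Column names are placeholders for illustration."""
    return 100 * ratings[["prompt_adherence", "relevance"]].mean(axis=1).mean()
```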

In this section, we detail how the rankings would change if we instead used the Bradley-Terry scores on an Elo scale, or only considered the writing or honesty/factuality fields. The Bradley-Terry ranking can be thought of as an overall preference ranking; we then dive into other important dimensions, such as writing and honesty/factuality, to better understand why the Bradley-Terry scores and the pure instruction-following ratings differ.

Bradley-Terry scores Leaderboard


In addition to the instruction-following ratings, we also collected pairwise rankings measuring the annotators’ preference between two model responses on these instruction-following prompts. This ranking also takes into account considerations outside of instruction following, such as factuality and general preferences on style and writing quality. For example, we notice that the Gemini 1.5 Pro (post-I/O) model moves up in this ranking, possibly due to its higher ranking in “writing”, which is not taken into account in the pure instruction-following leaderboard.
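For reference, below is a minimal sketch of how pairwise preferences could be converted into Bradley-Terry scores on an Elo-like scale. The win-count preprocessing, the 0.5 prior, and the 400/1000 scaling constants are assumptions for illustration, not necessarily the fitting procedure used here.

```python
import math

def bradley_terry_elo(wins, iterations=200):
    """Fit Bradley-Terry strengths from pairwise win counts and map them onto
    an Elo-like scale (illustrative sketch; ties and the magnitude of the
    7-point Likert preference are ignored).

    `wins[(a, b)]` is the number of comparisons in which model `a` was
    preferred over model `b`.
    """
    models = {m for pair in wins for m in pair}
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # 0.5 acts as a small prior so a model with zero wins keeps a positive strength
            total_wins = 0.5 + sum(w for (a, _), w in wins.items() if a == m)
            denom = 0.0
            for (a, b), w in wins.items():
                if m in (a, b):
                    other = b if a == m else a
                    denom += w / (strength[m] + strength[other])
            updated[m] = total_wins / denom if denom else strength[m]
        # renormalize so the geometric mean of the strengths stays at 1
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    # 400 * log10(strength), anchored at 1000, puts the scores on an Elo-like scale
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}
```

In practice, the 7-point Likert ratings would first be collapsed into wins and ties (ties could be counted as half a win for each side), and uncertainty would typically be estimated by resampling the comparisons.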

IfEloTableDark2.png

Writing Leaderboard

writingLeaderDark4.png

Instruction following “Honesty” Leaderboard


Similar to the instruction-following ratings, we also collected an "honesty/factuality" rating to indicate whether the model generated any factually incorrect statements in its response.

Although this is not a comprehensive factuality leaderboard, as it only considers factuality within the context of instruction-following queries, we believe it still provides valuable insights into different model behaviors for the community.

Our findings indicate that GPT-4 Turbo Preview (from January) is the most factual model, followed by Claude 3 Opus and Gemini 1.5 Pro (from May). While Claude 3 Opus only ranked #5 in the instruction-following rankings, it demonstrates significantly fewer hallucinations than all other models except GPT-4 Turbo Preview.

honestleaderDark4.png

In addition to the alternative rankings, we also provide detailed performance reports for each model across all our subcategories of ratings. We highlight the top strengths and weaknesses for each model on the leaderboard. For more details, please refer to Appendix A.

Acknowledgments

This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.

Scale AI Team: Ernesto Hernandez*, Mike Lunati*, Dean Lee, Cristina Menghini, Diego Mares, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue

Appendix A - Evaluation insights

Models across all criteria

ifAllCriteriaDark.png
Models Relative Strengths and Weaknesses on Standalone Criteria (models are not ordered)
ifAllCriteriaTableDark.png

  1. The three lowest-rated criteria are Constraints, Prompt Request Coverage, and Accuracy, which are the main drivers for the Instruction Following and Honesty/Factuality dimensions.
    1. The three best-performing models for Constraints are Llama 3 70B Instruct, Claude 3 Opus, and GPT-4o (as shown in the main leaderboard)
    2. The three best-performing models for Prompt Request Coverage are GPT-4o, GPT-4 Turbo Preview, and Llama 3 70B Instruct
    3. For Accuracy, the top three are GPT-4 Turbo Preview, Claude 3 Opus, and Gemini 1.5 Pro (post-I/O)

Instruction following “Prompt Adherence” without “Relevance” Leaderboard

In the main leaderboard, we report the aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.

In this appendix, we also report the alternative ranking if we only consider “prompt adherence” without considering “relevance”. “Adherence” here is defined as the annotator answering Yes to both “Prompt Adherence - Prompt Request Coverage” and “Prompt Adherence - Constraints”. In other words, in this version we do not penalize the model for adding content that is irrelevant to what the prompt is asking.
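As a sketch of this alternative aggregation, again with hypothetical column names: a response counts as adherent only when both Yes/No criteria are answered Yes, and the "relevance" field is deliberately left out.

```python
import pandas as pd

def prompt_adherence_score(ratings: pd.DataFrame) -> float:
    """Share of responses rated Yes on both "Prompt Request Coverage" and
    "Constraints"; relevance is ignored. Column names are placeholders."""
    adherent = ratings["prompt_request_coverage"] & ratings["constraints"]
    return 100 * adherent.mean()
```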

noRelevanceLeaderDark3.png

Constraints and Accuracy per Category breakdown

CONSTRAINTS CRITERIA
Standard deviation analysis

  1. Least spread (indicates consistent model performance across use cases):
    1. Claude 3 Opus
  2. Widest spread (suggests strong performance in certain use cases but poor performance in others):
    1. Llama 3 70B Instruct

ifConstraintHeatDark.png

ACCURACY CRITERIA
Standard deviation analysis

  1. Least spread (indicates consistent model performance across use cases):
    1. GPT-4 Turbo (2024-04-09) & Claude 3 Opus
  2. Widest spread (suggests strong performance in certain use cases but poor performance in others):
    1. CodeLlama 34B Instruct

ifAccuracyHeatDark.png
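The spread analyses above amount to taking, for each model, the standard deviation of its per-category mean rating on a criterion. A minimal sketch, assuming a dataframe of per-response ratings with hypothetical `model`, `category`, and criterion columns:

```python
import pandas as pd

def category_spread(ratings: pd.DataFrame, criterion: str) -> pd.Series:
    """For each model, the standard deviation of its per-category mean rating
    on `criterion` (e.g. "constraints" or "accuracy"). A small value indicates
    consistent performance across use cases; a large value indicates strong
    performance in some categories and weak performance in others."""
    per_category = (
        ratings.groupby(["model", "category"])[criterion]
        .mean()
        .unstack("category")
    )
    return per_category.std(axis=1).sort_values()
```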
Score | 95% Confidence
88.57 | +1.53 / -1.56
87.64 | +1.46 / -1.44
85.55 | +1.85 / -1.85
85.34 | +1.66 / -1.64
84.82 | +1.68 / -1.72
84.51 | +1.89 / -1.91
83.18 | +1.82 / -1.78
82.90 | +2.00 / -2.00
82.36 | +2.04 / -1.96
73.57 | +2.32 / -2.37
66.77 | +2.12 / -2.17
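The report does not state how the 95% confidence intervals are computed; asymmetric bounds like those in the table above are commonly obtained by bootstrapping over prompts, sketched below under that assumption.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, seed=0):
    """Percentile-bootstrap 95% confidence interval for a model's mean score.

    `scores` holds the per-prompt 0-100 scores for one model. Returns
    (mean, upper_offset, lower_offset) so the interval can be reported as
    mean +upper/-lower, matching the table format. This is an assumed
    methodology, not necessarily the one used for the leaderboard.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    point = scores.mean()
    resampled = rng.choice(scores, size=(n_resamples, scores.size), replace=True).mean(axis=1)
    lower, upper = np.percentile(resampled, [2.5, 97.5])
    return point, upper - point, point - lower
```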