
Instruction Following

Introduction

The Precise Instruction Following Prompts Dataset comprises 1,054 instruction-following prompts designed to assess the ability of AI models to interpret and execute detailed commands, with a focus on precision and specificity.

A popular method for assessing LLMs on instruction-following tasks is the IFEval benchmark, which evaluates LLMs using prompts containing programmatically verifiable instructions. However, the scenarios in this benchmark are limited by the requirement that every instruction be automatically evaluable. Additionally, like other open-source benchmarks, IFEval is prone to overfitting.
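
To make concrete what "programmatically verifiable" means here, the sketch below shows a few checks in the spirit of IFEval-style instructions (minimum word counts, required keywords, a fixed number of bullet points). The function names and implementations are illustrative, not IFEval's actual code.

```python
import re

def verify_word_count(response: str, min_words: int) -> bool:
    """Check an instruction like 'answer in at least min_words words'."""
    return len(response.split()) >= min_words

def verify_keyword_present(response: str, keyword: str) -> bool:
    """Check an instruction like 'mention the word keyword at least once'."""
    return re.search(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE) is not None

def verify_bullet_count(response: str, n_bullets: int) -> bool:
    """Check an instruction like 'use exactly n_bullets bullet points'."""
    return sum(line.lstrip().startswith(("-", "*")) for line in response.splitlines()) == n_bullets

# Each check passes or fails mechanically, with no human judgment involved.
response = "- Pack light.\n- Book refundable tickets.\n- Keep digital copies of documents."
print(verify_bullet_count(response, 3))  # True
```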

To address these limitations, we built the Scale AI Precise Instruction Following Prompts Dataset: a set of private instruction-following prompts intended to be paired with human evaluations. The dataset includes 1,054 prompts grouped into 9 categories, including “act as if”, content creation, and brainstorming, and covering real applications and use cases for instruction-following tasks. It was generated by a diverse group of over 40 human annotators and developed through a five-step process to ensure the final prompts test a model’s capability to understand and execute instructions with specificity. The ultimate intent is to run human evaluations on models’ responses to this prompt set.

Dataset Description

The dataset comprises 1,054 prompts designed for single-turn instruction following, intended to evaluate nuanced directive comprehension by the model. It tests the model's ability to execute complex instructions with clarity and specificity.

The construction of this dataset posed challenges in maintaining prompt diversity and specificity, which we addressed through a review process to ensure each prompt's uniqueness. The dataset covers a broad spectrum of 9 categories, ensuring diversity in instruction-following tasks.

Category | Definition | # of prompts
Generation - General Text Creation | Tasks involve creating original content such as text messages, recipes, jokes, and essays. | 385
Generation - Content Creation | Subdivided into poetry, narrative fiction, social media posts, and other. | 232
Generation - "Act as if…" | Tasks require responses from specific personas, enhancing the AI's adaptability and creative output. | 89
Brainstorming - Short answers | Output a short list of 3-5 items, explicitly asking for a concise enumeration of ideas or options. | 97
Brainstorming - Long answers | Require a longer list of 15-20 items, focusing on breadth and inclusivity of ideas or options, with each item being succinct. | 96
Brainstorming - With a format | Prompts specify the output formatting, such as dashes, bullet points, or enumeration, guiding the structure of the response. | 97

Data Sample

Category: Generation - General Text Creation
# of prompts: 385
Definition: Tasks involve creating original content such as text messages, recipes, jokes, and essays.

User prompt:

Design a seven-day food plan that is both varied and nutrient-dense, including three vegetarian meals per day. Make sure that every meal has a high protein level and includes a minimum of one fruit or vegetable. Furthermore, make sure that one of the meals has a serving of dairy each day, add whole grains to another, and keep lentils and tomatoes out of every meal.
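
Prompts like this one illustrate why the dataset is paired with human evaluation rather than automatic checks: only a subset of the constraints can be verified mechanically. The hypothetical helper below checks the surface constraints (seven days, three meals per day, no lentils or tomatoes), while requirements such as "nutrient-dense", "high protein level", or "a serving of dairy" still call for human judgment.

```python
def partial_constraint_check(meal_plan: dict[str, list[str]]) -> dict[str, bool]:
    """Hypothetical helper: check only the mechanically verifiable constraints.

    `meal_plan` maps a day name to its list of meal descriptions.
    Semantic constraints (nutrient density, protein level, a dairy serving,
    whole grains) still require a human rater.
    """
    banned = ("lentil", "tomato")
    all_meals = [meal.lower() for meals in meal_plan.values() for meal in meals]
    return {
        "seven_days": len(meal_plan) == 7,
        "three_meals_per_day": all(len(meals) == 3 for meals in meal_plan.values()),
        "no_banned_ingredients": not any(b in meal for meal in all_meals for b in banned),
    }
```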

Breakdown of Categories (Percentage)

Prompt Dataset Construction

We assembled a diverse team of annotators with backgrounds in linguistics, cognitive psychology, and specific fields such as education and STEM, who we believe are best equipped for this task. Their varied backgrounds were meant to test models' ability to understand and execute instructions across different fields.

To construct the dataset, we followed these steps:

  1. Content creation control: Annotators were barred from using public resources or LLMs in order to ensure the production of original, human-generated prompts.
  2. Initial attempts: The team of human annotators generated a set of more than 2,500 prompts covering the 9 instruction-following categories.
  3. Full set review process: The full initial set underwent multiple review stages, including qualitative analysis, automated grammar checks, and a final review by trusted reviewers. This refined the set down to roughly 1,500 prompts.
  4. Final audit (10% sample): An internal team of independent auditors conducted a final quality review on random samples of tasks. Any task that did not receive a Likert rating of 4 or 5 was removed, resulting in 1,054 finalized prompts.

This dataset is designed for research into LLMs' ability to follow detailed instructions, providing a resource for benchmarking and improvement.

Quality Controls for Prompts

Based on the evaluation of prompt effectiveness, we formulated guidelines for the annotators, emphasizing clarity, complexity, and specificity. A high-quality prompt includes distinct, non-repetitive elements that direct the model to a precise outcome, eliminating the possibility of vague or generic responses. Effective prompts define clear goals with specific conditions or constraints, challenging the model to employ deep reasoning and problem-solving skills.

In contrast, less effective prompts lack the specificity or challenge necessary to extend the model’s capabilities beyond basic tasks. Our screening process therefore excludes prompts that are easily solvable via simple internet searches or do not require sophisticated AI responses.

To maintain quality, we implemented a multi-stage review pipeline: each prompt underwent two expert reviews to ensure adherence to the prompt-writing guidelines. An internal team of independent auditors then conducted a final quality review, correcting or discarding low-quality entries.

Evaluation Taxonomy

To capture a nuanced assessment, we created an evaluation taxonomy specific to precise instruction-following tasks. Each model response was evaluated across a set of stand-alone criteria covering each of the use cases, and side-by-side with another model's response to measure preference ranking on a 7-point Likert scale.

These dimensions are broken down into 12 criteria, each rated with a Yes or No score.

After the stand-alone evaluation, responses are compared side-by-side using a Likert scale. This comparative assessment helps in identifying the preferable model response based on a detailed justification tied to the evaluation criteria.
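
As an illustration, one way such an evaluation record could be represented is sketched below. The field names are drawn from criteria mentioned in this report (main request fulfillment, constraint fulfillment, relevance, claim factuality, writing); the full list of 12 criteria is not enumerated on this page, so the exact schema is an assumption.

```python
from dataclasses import dataclass

@dataclass
class StandaloneRatings:
    """Yes/No criteria for one model response (illustrative subset of the 12)."""
    main_request_fulfillment: bool
    constraint_fulfillment: bool
    relevance: bool          # no irrelevant content added
    claim_factuality: bool   # honesty/factuality
    writing_quality: bool

@dataclass
class EvaluationTask:
    """One annotator's judgment on a prompt and a pair of model responses."""
    prompt_id: str
    model_a: str
    model_b: str
    ratings_a: StandaloneRatings
    ratings_b: StandaloneRatings
    preference: int  # 7-point Likert: -3 (strongly prefer A) ... +3 (strongly prefer B)
```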

Evaluation Methodology

Each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,054 prompts described above.

Each evaluation task consists of the following:

  1. Two models generate responses to a prompt.
  2. Annotators provide a point-wise evaluation of each response.
  3. Annotators express their preference between the two responses on a 7-point Likert scale.

To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel 3 times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After the tasks were finalized, a team of internal independent auditors randomly selected and reviewed 10% of them for quality control.
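
A minimal sketch of how such a pairing schedule could be generated is shown below, assuming 50 comparisons per model pair, one randomly drawn prompt per comparison, and 3 annotators per task; the actual sampling and assignment logic used in the evaluation is not specified here.

```python
import itertools
import random

def build_pairings(models: list[str], prompts: list[str],
                   comparisons_per_pair: int = 50,
                   annotators_per_task: int = 3,
                   seed: int = 0) -> list[dict]:
    """Assign a randomly chosen prompt to every model-vs-model comparison."""
    rng = random.Random(seed)
    tasks = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(comparisons_per_pair):
            tasks.append({
                "prompt": rng.choice(prompts),
                "model_a": model_a,
                "model_b": model_b,
                "n_annotators": annotators_per_task,  # executed in parallel by different raters
            })
    return tasks
```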

Evaluation Methodology - Pipeline Design

Evaluation Insights Summary

Alternative Rankings

During our evaluation process, we collected both pairwise rankings and ratings based on individual model responses. For the main leaderboard, we report aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields, to focus on the models' instruction-following abilities. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.
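
As a sketch, the aggregation described above could be computed as follows, treating each Yes/No rating as 1/0 and reporting the mean on a 0-100 scale (the 0-100 scaling and field names are assumptions made for illustration).

```python
def instruction_following_score(ratings: list[dict]) -> float:
    """Average of the prompt-adherence and relevance fields, scaled to 0-100.

    Each rating dict holds the Yes/No fields as booleans, e.g.
    {"prompt_adherence": True, "relevance": False}.
    """
    per_response = [(r["prompt_adherence"] + r["relevance"]) / 2 for r in ratings]
    return 100 * sum(per_response) / len(per_response)
```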

In this section, we detail how the rankings would change if we used the Bradley-Terry scores on an Elo scale, or only considered the writing or honesty/factuality fields. The Bradley-Terry ranking can be thought of as an overall preference ranking; we then dive into other important dimensions, such as writing and honesty/factuality, to better understand why the Bradley-Terry scores and pure instruction-following ability differ.

Bradley-Terry scores Leaderboard


In addition to instruction-following ratings, we also collected pairwise rankings measuring annotators’ preference between two model responses to these instruction-following prompts. This ranking also takes into account considerations beyond instruction following, such as factuality and general preference for style and writing quality. For example, we notice that Gemini 1.5 Pro (August 27, 2024) moves up in the rankings here, possibly due to its higher ranking in “writing”, which is not taken into account in the pure instruction-following leaderboard. o1-preview ranks first, while Llama 3.2 90B Vision Instruct ranks tenth.
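
For readers unfamiliar with Bradley-Terry scores, the sketch below shows one common way to fit them from pairwise win counts and map them onto an Elo-like scale. It is illustrative only: this report does not specify how the 7-point Likert preferences are converted to wins, which fitting procedure is used, or how the scale is anchored.

```python
import math
from collections import defaultdict

def bradley_terry_elo(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from pairwise win counts and map them to an Elo-like scale.

    `wins[(a, b)]` is how many comparisons model `a` won against model `b`
    (ties omitted here). Uses the standard minorization-maximization update.
    """
    models = sorted({m for pair in wins for m in pair})
    n_games = defaultdict(int)   # comparisons per unordered pair
    n_wins = defaultdict(int)    # total wins per model
    for (a, b), w in wins.items():
        n_games[frozenset((a, b))] += w
        n_wins[a] += w

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = sum(
                n_games[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m
            )
            # Small floor keeps a winless model from collapsing to zero strength.
            updated[m] = max(n_wins[m], 1e-6) / denom if denom else strength[m]
        mean = sum(updated.values()) / len(updated)  # normalize for identifiability
        strength = {m: s / mean for m, s in updated.items()}

    # Map to an Elo-like scale, anchored so the average model sits near 1000.
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}
```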

Writing Leaderboard

Instruction following “Honesty” Leaderboard


Similar to the instruction-following ratings, we also collected an "honesty/factuality" rating indicating whether the model generated any non-factual statements in its response.

Although this is not a comprehensive factuality leaderboard, as it only considers factuality within the context of instruction-following queries, we believe it still provides valuable insights into different model behaviors for the community.

Our findings indicate that GPT-4 Turbo Preview is the most factual model, followed by Gemini 1.5 Pro (May 2024) and o1-preview.

In addition to the alternative rankings, we also provide detailed performance reports for each model across all our subcategories of ratings. We highlight the top strengths and weaknesses for each model on the leaderboard. For more details, please refer to Appendix A.

Acknowledgments

This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the development and refinement of the dataset and the verification methodology.

Scale AI Team: Ernesto Hernandez*, Mike Lunati*, Dean Lee, Cristina Menghini, Diego Mares, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue, Darwin Hsu, David Guevara

Appendix A - Evaluation insights

Models across all criteria

Models' Relative Strengths and Weaknesses on Standalone Criteria (models are not ordered)

  1. The three lowest-rated criteria are Constraint Fulfillment, Claim Factuality, and Main Request Fulfillment. Constraint Fulfillment and Main Request Fulfillment are the main drivers of Instruction Following, while Claim Factuality is the main component of the Factual Accuracy dimension.
    1. When it comes to Constraint Fulfillment, Claude 3.5 Sonnet is the top performer, alongside Llama 3 70B Instruct and GPT-4o (May 2024).
    2. For Claim Factuality, the top three models are GPT-4 Turbo Preview, Gemini 1.5 Pro (May 2024), and o1-preview.
    3. The three best-performing models for Main Request Fulfillment are o1-preview, followed by Llama 3.2 90B Vision Instruct and Gemini 1.5 Pro (August 27, 2024).

Instruction following “Prompt Adherence” without “Relevance” Leaderboard

In the main leaderboard, we report the aggregated instruction-following scores, averaging the “prompt adherence” and “relevance” fields. We consider a response as following the instructions if it “answers what the prompt requests without violating explicit constraints while not adding irrelevant content”.

In this appendix, we also report the alternative ranking obtained if we only consider "prompt adherence" without considering "relevance". "Adherence" here is defined as the annotator answering Yes to both "Prompt Adherence - Main Request Fulfillment" and "Prompt Adherence - Constraint Fulfillment". In other words, in this version we do not penalize the model for adding content that is irrelevant to what the prompt asks.
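
A minimal sketch of this adherence-only variant, assuming the same 1/0 encoding of the Yes/No fields used above:

```python
def prompt_adherence_score(ratings: list[dict]) -> float:
    """Share of responses with both adherence criteria rated Yes, scaled to 0-100.

    Relevance is intentionally ignored in this variant.
    """
    adherent = [r["main_request_fulfillment"] and r["constraint_fulfillment"] for r in ratings]
    return 100 * sum(adherent) / len(adherent)
```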

Constraints and Accuracy per Category Breakdown

CONSTRAINTS CRITERIA
Standard deviation analysis

  1. Least spread (indicates consistent model performance across use cases):
    1. GPT-4o (August 2024)
  2. Widest spread (suggests strong performance in certain use cases but poor performance in others):
    1. Llama 3 70B Instruct

ACCURACY CRITERIA
Standard deviation analysis

  1. Least spread (indicates consistent model performance across use cases):
    1. GPT-4 Turbo Preview
  2. Widest spread (suggests strong performance in certain use cases but poor performance in others):
    1. Llama 3.2 90B Vision Instruct (the spread computation is sketched below)
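
One way this spread could be computed: take each model's mean score per category, then the standard deviation of those per-category means. The record format and aggregation details below are assumptions; the report does not specify them.

```python
import statistics
from collections import defaultdict

def category_spread(records: list[dict]) -> dict[str, float]:
    """Standard deviation of each model's mean score across categories.

    Each record looks like {"model": ..., "category": ..., "score": 0 or 1}.
    A small value indicates consistent performance across use cases; a large
    value indicates strong categories mixed with weak ones.
    """
    by_model_cat = defaultdict(list)
    for r in records:
        by_model_cat[(r["model"], r["category"])].append(r["score"])

    per_model_means = defaultdict(list)
    for (model, _category), scores in by_model_cat.items():
        per_model_means[model].append(sum(scores) / len(scores))

    return {m: statistics.pstdev(means) for m, means in per_model_means.items()}
```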

Model | Score | 95% Confidence
 | 87.32 | +1.71/-1.71
 | 87.09 | +1.51/-1.52
 | 86.01 | +1.54/-1.53
 | 85.29 | +1.61/-1.61
 | 85.09 | +1.83/-1.83
 | 84.63 | +1.81/-1.82
 | 83.87 | +1.42/-1.43
 | 83.72 | +1.88/-1.88
 | 81.85 | +1.96/-1.96
 | 81.32 | +1.75/-1.75
 | 80.77 | +1.84/-1.83
 | 80.49 | +1.72/-1.72
 | 80.03 | +1.57/-1.58
 | 78.52 | +2.33/-2.32
 | 78.24 | +2.19/-2.19
 | 77.25 | +1.96/-1.97
 | 67.97 | +2.61/-2.62
 | 57.69 | +2.58/-2.57