Introduction

The Scale AI Coding Prompts Set comprises 1,000 prompts spanning a diverse array of programming languages, disciplines, and task types. The dataset covers a broad spectrum of software engineering work, ranging from debugging to code optimization and from documentation generation to understanding complex code bases.

While the use of LLMs for coding applications has grown, few tools or benchmarks are available to compare different models on a like-for-like basis. The most well-known include:

  1. HumanEval Dataset: A set of 164 handwritten programming problems assessing language comprehension, algorithms, and simple mathematics.
  2. Pass@k Metric: Defined as the probability that at least one of the top k generated code samples for a problem passes unit testing; it evaluates the functional correctness of generated code samples (a sketch of the standard estimator appears after this list).
  3. MBPP: The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers, and is designed to measure the ability of LLMs to synthesize short Python programs from natural language descriptions.
  4. SWE-Bench (Software Engineering Benchmark): A benchmark testing whether LLMs can solve real-world issues sourced from GitHub issues and pull requests.
  5. LiveCodeBench: A collection of programming puzzles from recent competitive coding competitions designed to benchmark advanced reasoning skills while mitigating effects of data contamination.
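
For reference on the Pass@k metric above, the standard unbiased estimator from the HumanEval paper generates n samples per problem, counts the c samples that pass the unit tests, and computes 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal Python sketch (the function name and example numbers are illustrative):

Python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass."""
    if n - c < k:
        return 1.0  # every size-k subset of the n samples contains a passing one
    # Numerically stable form of 1 - C(n - c, k) / C(n, k)
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

# Example: 200 samples, 30 passing -> pass@1 = 0.15; pass@10 is considerably higher.
print(pass_at_k(200, 30, 1), pass_at_k(200, 30, 10))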

While these evaluation benchmarks were useful when they were first released, their value has declined as models have started to overfit them. Multiple coding models fine-tuned from GPT-4 pre-trained models score above 85 on the Pass@1 metric. Additionally, the methodology by which models are evaluated against these benchmarks is often non-standardized, undermining a core requirement for comparing scores across tests or over time.

The goal of the Scale AI coding evaluation is to establish a uniform framework for evaluating LLMs’ coding capabilities.

Prompts Dataset Description

The Scale Coding Evaluation provides a standardized assessment framework for LLMs, enabling comparisons across models and identifying their strengths and weaknesses. It currently encompasses a set of use cases across the most requested coding languages.

Data Sample

(1 of 10)

Use Case: Generation
Description: Create new code from a set of specifications or descriptions given in natural language.
# of Prompts: 207 (20.7%)

User

I want you to make a program in Python. This program will predict how long a robot vacuum takes to clean a room. The inputs to the program are the size of the room in square meters and battery level as a percentage. The base cleaning time is 30 minutes for a 20-square-meter room. But it takes two more minutes for each extra square meter over 20. So, a bigger room means a longer cleaning time. The battery level also affects the cleaning time. If the battery isn't at 100%, the vacuum works slower. For every 1%, the battery is below 100%, and the cleaning takes 0.5% longer. For example, a room size of 50 square meters and a battery of 75% are used. Explain how to calculate the predicted cleaning time for those numbers.
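
For illustration, a minimal Python sketch of the calculation this sample asks for (the function name is ours, not part of the dataset): the base time is 30 minutes plus 2 minutes per square meter above 20, and the result grows by 0.5% for every percentage point of battery below 100%. For 50 square meters at 75% battery: (30 + 2 × 30) × (1 + 0.005 × 25) = 101.25 minutes.

Python
def predicted_cleaning_time(room_sqm: float, battery_pct: float) -> float:
    """Predicted robot-vacuum cleaning time in minutes."""
    base_time = 30 + 2 * max(0, room_sqm - 20)   # 30 min for 20 m², +2 min per extra m²
    slowdown = 1 + 0.005 * (100 - battery_pct)   # +0.5% per 1% of battery below 100%
    return base_time * slowdown

print(predicted_cleaning_time(50, 75))  # 101.25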

Use Cases Distribution
[Figure: distribution of prompts by use case]
Coding Languages Distribution
[Figure: distribution of prompts by coding language]

The dataset was created by a group of coding annotators screened for the project on the basis of software development, programming, and data science experience and qualifications, ensuring coverage of all required coding languages.

Dataset construction steps:

  1. Instructions creation. We wrote an instruction document guiding human contributors on how to generate the initial set of prompt-response pairs.
  2. Controlled access. Contributors were prohibited from using any publicly available code or questions, including those from StackOverflow or public GitHub repositories. This prevents possible eval set contamination, ensuring the creation of novel problems the models haven’t seen in their training sets.
  3. Initial attempts. Contributors created an initial set of prompt-response pairs ("initial attempts") following a predetermined distribution by coding language and programming task.
  4. Quality Control. We designed a 2-stage pipeline:
    1. Based on initial attempts' quality, the operators identified a set of trusted reviewers.
    2. Trusted reviewers performed multiple review rounds to evaluate each prompt-response pair for accuracy, coherence, and relevance.
    3. In addition to standard reviews, code execution tests on the model answers and independent audits were performed. This filtered the initial set of 2.5k prompts down to 1.5k reviewed prompt-response pairs.
  5. Independent final audit. The team tasked a set of internal independent auditors with a final review of the quality of the prompts and answers. Low-quality prompts or answers were filtered out, reducing the set to 1,000 quality-validated prompt-response pairs.

Evaluation Taxonomy

To capture a nuanced assessment, we created an evaluation taxonomy specific to coding tasks. Each model response was evaluated across a set of standalone criteria covering each of the use cases, and side-by-side with another model response to measure preference on a 7-point Likert scale (a sketch of how a single rating record might be represented follows this list):

  1. Prompt Adherence: Does the model strictly adhere to the given prompt and comprehend all its requirements?
    1. Ratings: Yes, No
  2. Correctness:
    1. To what extent are the claims in the response truthful and correct? Verifying Correctness often requires external research.
    2. If code was present in the response, does it execute and produce the correct output?
      1. Evaluators are encouraged to use any means possible to test code (e.g., writing simple programs to test functions and code snippets). However, Code Correctness may not be measured if, for example, the code only functions when embedded inside a large, complex program that is not provided, or if it requires an external file or API dependency that is not provided.
    3. Ratings: Yes, No
  3. Performance / Efficiency: Does the code execute without any performance concerns?
    1. Ratings: Yes, No
  4. Readability / Documentation: Is the written explanation well-structured and visually organized? Does the response include necessary documentation aiding in code understanding? Is the code readable, employing proper formatting and mnemonic variable and function names?
    1. Ratings: Yes, No
  5. Overall Side-by-Side: Considering the dimensions above relevant to the specific task, how does the overall quality of the two model responses compare?
    1. Ratings: Score from 1 to 7, where 1 signifies that model 1's response was much better than model 2's response and 7 signifies the opposite.
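
To make the taxonomy concrete, here is a minimal sketch of how a single rating record could be represented; the class and field names are ours and do not reflect the actual evaluation tooling's schema:

Python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodingEvalRecord:
    """One annotator's rating of a pair of model responses (illustrative schema only)."""
    prompt_id: str
    model_a: str
    model_b: str
    # Standalone criteria, one boolean per response (True = "Yes", False = "No")
    prompt_adherence: tuple[bool, bool]
    correctness: tuple[bool, bool]
    performance_efficiency: tuple[bool, bool]
    readability_documentation: tuple[bool, bool]
    side_by_side: int                    # 1 = model A much better ... 7 = model B much better
    justification: Optional[str] = None  # required whenever a "No" is selected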

Expert raters were required to provide detailed open-text justifications for (1) any model issues identified when they selected "No" on any of the above dimensions, and (2) the side-by-side rating. Additionally, for all responses that involve a nontrivial code snippet, we asked the annotators to compile and run the code in our tooling in order to catch compilation and runtime issues.

Evaluation Methodology

In our evaluation, each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,000 prompts described above.
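
A minimal sketch of how such a comparison schedule could be drawn up (the function and constant names are illustrative, not the production pipeline):

Python
import itertools
import random

def build_comparison_schedule(models, prompts, per_pair=50, seed=0):
    """For every unordered model pair, sample the given number of prompts at random."""
    rng = random.Random(seed)
    return [
        (model_a, model_b, rng.choice(prompts))
        for model_a, model_b in itertools.combinations(models, 2)
        for _ in range(per_pair)
    ]

# Example: 11 models and 1,000 prompts -> C(11, 2) * 50 = 2,750 comparison tasks.
tasks = build_comparison_schedule([f"model_{i}" for i in range(11)], list(range(1000)))
print(len(tasks))  # 2750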

Each evaluation task consists of the following:

  1. Two models generate responses for the same prompt
  2. Annotators provide a point-wise evaluation of each response
  3. Annotators express their preference between the two responses on a 7-point Likert scale

To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel three times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After the tasks were finalized, a team of internal independent auditors randomly selected and reviewed 10% of them for quality control.

[Figure: Evaluation Methodology - Pipeline Design]

Beyond producing an overall ranking, this evaluation methodology enables slicing of the evaluation data by programming language and use case, highlighting models' strengths and weaknesses across different areas and helping answer questions such as: how does a model perform compared to the reference model on SQL, Java, HTML/CSS, and C++ prompts? How competitive is a particular model in specific functions, from less code-intensive tasks like Understanding and Recommendations to complex scenarios like Translation?
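
A minimal sketch of this kind of slicing, assuming the finished ratings are exported to a pandas DataFrame with hypothetical column names (model, language, use_case, correct):

Python
import pandas as pd

# Hypothetical export of finished ratings; column names are illustrative.
ratings = pd.DataFrame({
    "model":    ["model_a", "model_a", "model_b", "model_b"],
    "language": ["SQL", "C++", "SQL", "C++"],
    "use_case": ["Generation", "Translation", "Generation", "Translation"],
    "correct":  [True, False, True, True],
})

# Correctness rate per model, sliced by language and use case.
slices = (
    ratings.groupby(["model", "language", "use_case"])["correct"]
           .mean()
           .rename("correctness_rate")
           .reset_index()
)
print(slices)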

Evaluation Insights Summary

  1. Models typically perform well in "Commenting" tasks and "Understanding" tasks, but they often face difficulties with "Translation" tasks and "Generation" tasks.
  2. “Correctness / Functionality” and “Readability / Documentation” are the two primary sources of error across models.
  3. Model version comparison:
    1. GPT (gpt-4-0125-preview, gpt-4o-2024-05-13):
      1. The two GPT models demonstrate the most consistent performance across use cases.
      2. Although both GPT-4 models are top-ranked on the leaderboard, the newer model (gpt-4o-2024-05-13) tends to exhibit more readability issues than gpt-4-0125-preview, occasionally repeating code from the prompt unnecessarily, which can result in more verbose responses.
    2. Gemini (gemini-1.5-pro-preview-0514, gemini-1.5-flash-preview-0514, gemini-1.5-pro-preview-0409):
      1. All three Gemini models excelled in the Recommendation task, ranking #1, #2, and #3 respectively, but they struggled more with testing tasks.
      2. The new Gemini model (gemini-1.5-pro-preview-0514) shows significant improvements over the previous version (gemini-1.5-pro-preview-0409), especially on “Correctness / Functionality” and “Readability / Documentation”.
    3. Claude (Claude-3-opus-20240229, claude-3-sonnet-20240229):
      1. claude-3-opus-20240229 generally outperforms claude-3-sonnet-20240229, particularly in making fewer errors in the "Correctness/Functionality" category, except for certain Translation tasks.

Top 6 model insight highlights:

[Figure: top 6 model insight highlights]

* The 4 evaluation dimensions are Prompt Adherence / Understanding, Correctness / Functionality, Performance / Efficiency, and Readability / Documentation.
* Note: The strengths and weaknesses mentioned are relative to other models.

Please refer to Appendix B for more detailed analysis beyond this high level summary.

Acknowledgments

We extend our deepest gratitude to the dedicated team of annotators, operators and researchers who made this project possible.

Scale AI team: Dean Lee*, Mike Lunati, Cristina Menghini, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue

Appendix A - Prompt Diversity

Controlling command diversity is essential for model evaluation because it exercises the versatility and comprehensiveness of the target models in handling a wide range of instructions or directives. A model evaluated against high command diversity must effectively understand and respond to various types of commands, including different task-specific instructions, requests, prompts for information, or actions to be performed. During prompt collection, frequent combinations like “create-program” were caught by automated checks and discouraged in favor of different wordings with similar meanings. To analyze command diversity for coding prompts, all code is removed so that only the natural language remains.
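
A minimal sketch of the preprocessing described above, assuming prompts are plain text with fenced code blocks and approximating each prompt's "command" by the first two words of the code-free text (the regex and helper name are illustrative):

Python
import re
from collections import Counter

CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)  # drop fenced code, keep natural language

def command_of(prompt: str) -> str:
    """Crude command label: first two words of the prompt once code is stripped."""
    text = CODE_BLOCK.sub(" ", prompt)
    words = re.findall(r"[A-Za-z]+", text.lower())
    return "-".join(words[:2])

prompts = [
    "Create a program that predicts cleaning time.",
    "Optimize the following function.\n```python\ndef f(x):\n    return x\n```",
]
print(Counter(command_of(p) for p in prompts))  # Counter({'create-a': 1, 'optimize-the': 1})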

[Figure: command diversity distribution (pie chart)]

Appendix B - Evaluation Insights

Model performance overview

Models typically perform well in "Commenting" tasks (58% overall correctness) and "Understanding" tasks (51% overall correctness), but they often face difficulties with "Translation" tasks and "Generation" tasks (both with 30% overall correctness). (See "Data Samples" for each use case)

“Correctness / Functionality” and “Readability / Documentation” are the two primary sources of error across models.

[Figure: correctness heatmap by model and use case]

Performance consistency analysis (Coefficient of variation)

  1. Lowest CV (indicates consistent model performance across use cases):
    1. gpt-4-0125-preview
    2. gpt-4o-2024-05-13
  2. Highest CV (suggests strong performance in certain use cases but poor performance in others):
    1. codellama-34b-instruct
    2. gemini-1.0-pro-001
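
For reference, the coefficient of variation here is the standard deviation of a model's per-use-case scores divided by their mean; a minimal sketch with illustrative numbers (not taken from the leaderboard):

Python
import statistics

def coefficient_of_variation(scores_by_use_case: dict) -> float:
    """CV = population standard deviation / mean of per-use-case scores."""
    values = list(scores_by_use_case.values())
    return statistics.pstdev(values) / statistics.mean(values)

consistent = {"Generation": 0.50, "Commenting": 0.55, "Translation": 0.48}
uneven     = {"Generation": 0.20, "Commenting": 0.60, "Translation": 0.25}
print(coefficient_of_variation(consistent))  # ~0.06: consistent across use cases
print(coefficient_of_variation(uneven))      # ~0.51: strong in some use cases, weak in others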

[Figure: performance heatmap by model and use case]

Model version comparison

gpt-4-0125-preview vs. gpt-4o-2024-05-13


Although both GPT-4 models are top-ranked on the leaderboard, the newer model (gpt-4o-2024-05-13) tends to exhibit more readability issues than gpt-4-0125-preview, occasionally repeating code from the prompt unnecessarily, which can result in more verbose responses.

The most significant performance decline for gpt-4o-2024-05-13 is observed in Translation tasks, especially in terms of “Prompt adherence / Understanding” (error rate increased from 6% to 16%) and “Correctness / Functionality” (error rate increased from 14% to 32%).

[Figure: gpt-4-0125-preview vs. gpt-4o-2024-05-13 comparison heatmap]

Example
The example below demonstrates why gpt-4-0125-preview is preferred over gpt-4o-2024-05-13, pointing out the correctness and readability issues of gpt-4o-2024-05-13.

Justification When GPT 4 0125 Preview Wins
"Gpt-4-0125-preview outperforms gpt-4o-2024-05-13 because it adheres perfectly to the prompt instructions and produces error-free text and code. In contrast, while gpt-4o-2024-05-13 also adheres to the prompt, it has issues with correctness and readability. Specifically, gpt-4o-2024-05-13 inaccurately states that student "Hannah" qualifies for "Leadership" with a 3.5 GPA, conflicting with the criteria in the CTE, which requires a 3.6 GPA or higher. Furthermore, the gpt-4o-2024-05-13 model repeats the prompted SQL code in its output unnecessarily, making it verbose and reducing readability. Similarly, the explanation of the sample data was redundant. The inserted data is descriptive and self-explanatory.”

Example

(1 of 3)

Prompt

The following SQL query is produced to assign free training to students. The eligibility of students to enroll in each training is determined based on their GPA and department. Students who enrolled in Technology and Engineering with a GPA greater than or equal to 3.5 are eligible to enroll in 'Basic Java Tutorial' training, while "Science" students are eligible for 'Space Science" if their GPA is at least 3.5. Lastly 'Management' students are eligible for "Leadership" Training. Can you please generate sample data for both tables for testing the query?

SQL
CREATE TABLE Students(
    StudentID INT PRIMARY KEY,
    Name VARCHAR(255),
    Department VARCHAR(255),
    GPA DECIMAL(3,2)
);

CREATE TABLE Trainings(
    TrainingID INT PRIMARY KEY,
    TrainingName VARCHAR(255)
);

WITH Candidates AS(
    SELECT
        s.StudentID,
        s.Name,
        s.Department,
        s.GPA,
        CASE
            WHEN s.GPA >= 3.5 AND s.Department IN ('Engineering', 'Technology') THEN 'Basic Java Tutorial'
            WHEN s.GPA >= 3.5 AND s.Department = 'Science' THEN 'Space Science'
            WHEN s.GPA >= 3.6 AND s.Department = 'Management' THEN 'Leadership'
            ELSE 'Not Eligible for Training'
        END AS TrainingName
    FROM Students s
),

Ranking AS(
    SELECT *, RANK() OVER(ORDER BY GPA DESC) Ranking
    FROM Candidates
)
SELECT *
FROM Ranking
ORDER BY Ranking;

gemini-1.5-pro-preview-0409 vs. gemini-1.5-pro-preview-0514


The new Gemini model (gemini-1.5-pro-preview-0514) shows significant improvements over the previous version (gemini-1.5-pro-preview-0409) across all use cases except for recommendation, where the previous model excels. Overall correctness increased from 39% to 52%.

In terms of error dimensions, the new model improved most in “Correctness / Functionality” (error rate dropped from 35% to 25%) and “Readability / Documentation” (error rate dropped from 30% to 24%).

[Figures: Gemini model comparison table and heatmap]

Example
The example below demonstrates why gemini-1.5-pro-preview-0514 is preferred over gemini-1.5-pro-preview-0409, pointing out the correctness and readability issues of gemini-1.5-pro-preview-0409.

Justification When Gemini-1.5-pro-preview-0514 Wins
“Gemini-1.5-Pro-Preview-0409 has a minor issue with the CSS of the .nav element. The original parent .nav styling was wrongly removed. This removal adversely affects the style, particularly the margins and padding, leading to an unclear display and improper styling from the user's perspective. The headers are hard to distinguish as the white background overlaps. Additionally, the website is not responsive and does not resize correctly with changes in the viewport. The issue is the fixed width of the .container class, which is set to 960px. Changing this to max-width: 960px, as seen in R2, would make the website responsive. Lastly, the CSS code is missing comments, which reduces readability.

In contrast, Gemini-1.5-Pro-Preview-0514 offers a more structured approach. It enhances accessibility by including the lang attribute and demonstrates responsive design by effectively using Flexbox. While Response 1 also used Flexbox, it was wrongly implemented. The adjustments to padding and margins contribute to a polished visual appearance, reflecting a user interface design that is particularly suitable for modern contexts. Its code is well-organized, with clear explanations and commented lines indicating edits, enhancing readability.”

Example

(1 of 3)

Prompt

The below code includes inline CSS styles, non-semantic HTML markup, redundant CSS properties, and inefficient layout techniques. It's not optimized for performance, maintainability, or scalability, but it serves as an example of how code might look if not carefully designed and structured.

Provide an optimized code for the below HTML code:

HTML
<!DOCTYPE html>
<html>
<head>
    <title> HTML and CSS</title>
    <style>
        /* CSS code */
        body {
            background-color: #ffffff;
            font-family: Arial, sans-serif;
        }

        .container {
            width: 960px;
            margin: 0 auto;
            padding: 20px;
        }

        .header {
            background-color: #333333;
            color: #ffffff;
            padding: 20px;
        }

        .nav {
            background-color: #cccccc;
            padding: 10px;
            margin-top: 20px;
        }

        .nav ul {
            list-style-type: none;
            padding: 0;
            margin: 0;
        }

        .nav ul li {
            display: inline;
            margin-right: 10px;
        }

        .nav ul li a {
            text-decoration: none;
            color: #333333;
            padding: 5px 10px;
            background-color: #ffffff;
        }

        .nav ul li a:hover {
            background-color: #666666;
            color: #ffffff;
        }

        .sidebar {
            float: left;
            width: 250px;
            background-color: #cccccc;
            padding: 20px;
            margin-top: 20px;
        }

        .content {
            float: left;
            width: 660px;
            background-color: #eeeeee;
            padding: 20px;
            margin-top: 20px;
        }

        .footer {
            clear: both;
            background-color: #333333;
            color: #ffffff;
            padding: 20px;
            text-align: center;
        }
    </style>
</head>
<body>
    <!-- HTML code -->
    <div class="container">
        <div class="header">
            <h1> Website</h1>
        </div>
        <div class="nav">
            <ul>
                <li><a href="#">Home</a></li>
                <li><a href="#">About</a></li>
                <li><a href="#">Services</a></li>
                <li><a href="#">Contact</a></li>
            </ul>
        </div>
        <div class="sidebar">
            <h2>Sidebar</h2>
            <p>This is the sidebar content.</p>
        </div>
        <div class="content">
            <h2>Main Content</h2>
            <p>This is the main content area.</p>
        </div>
        <div class="footer">
            <p> Website</p>
        </div>
    </div>
</body>
</html>

Claude-3-opus-20240229 vs. claude-3-sonnet-20240229


claude-3-opus-20240229 generally outperforms claude-3-sonnet-20240229, particularly by making fewer errors in the "Correctness / Functionality" category (error rate of 27% vs. 36% for claude-3-sonnet-20240229), except for certain Translation tasks.

[Figures: Claude model comparison table and heatmap]

Example
The example below demonstrates why claude-3-opus-20240229 is preferred over claude-3-sonnet-20240229, pointing out the correctness issues of claude-3-sonnet-20240229.

Justification When Claude-3-opus-20240229 Wins
“Opus is the better model because it produces the correct output and adheres to the prompt instructions, which specify iterating until 10. In contrast, Sonnet stops after the first iteration. Opus uses a CTE to generate numbers from 2 to 9. It employs a 'NOT EXISTS' subquery to check for any divisors smaller than the current number that divide it evenly, effectively selecting prime numbers from the 'prime_numbers' CTE and renaming the column to 'prime_number'.”

Example

(1 of 3)

Prompt

Create a sql query which has the functionality of loops like other programming languages, Do 10 iterations of the loop and print the prime numbers till 10.

Specific model insight

Model strength examples

[Table: model strength examples]

Model weakness examples

[Table: model weakness examples]

Win / Loss visualizations

[Figures: win / loss comparison and model score chart]
Model   Score   95% Confidence
        1155    +21/-24
        1144    +31/-32
        1112    +27/-32
        1071    +28/-25
        1057    +25/-30
        1010    +25/-26
        1010    +22/-24
         997    +26/-25
         930    +23/-28
         804    +38/-29
         716    +24/-30