Coding
Introduction
The Scale Coding Evaluation consists of 1,000 prompts covering various programming languages, fields, and tasks. This comprehensive dataset spans a wide range of software engineering challenges, including debugging, code optimization, documentation creation, and analysis of complex codebases.
While the use of LLMs for coding applications has grown, there are few tools or benchmarks available for comparing different models. The most well-known benchmarks include:
- HumanEval Dataset: 164 handcrafted programming problems testing language comprehension, algorithms, and simple mathematics.
- Pass@k Metric: Assesses the likelihood that at least one of the top k generated code samples for a problem passes unit tests, evaluating functional correctness (the standard estimator is sketched after this list).
- MBPP: The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks designed to be solvable by entry-level programmers, measuring the ability of LLMs to synthesize short Python programs from natural language descriptions.
- SWE-Bench (Software Engineering Benchmark): Tests LLMs' capability to address real-world issues sourced from GitHub issues and pull requests.
- LiveCodeBench: A collection of programming puzzles from recent competitive coding competitions designed to benchmark advanced reasoning skills while mitigating effects of data contamination.
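For context on the Pass@k metric, the commonly used unbiased estimator (introduced alongside HumanEval) draws n samples per problem, counts the c samples that pass the unit tests, and estimates the probability that at least one of k randomly drawn samples is correct. A minimal Python sketch, with arbitrary example numbers:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with arbitrary numbers: 200 samples per problem, 30 passing.
print(pass_at_k(n=200, c=30, k=1))   # 0.15 (= c/n)
print(pass_at_k(n=200, c=30, k=10))  # higher, since any of 10 tries may pass
```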
While these evaluation benchmarks were useful when they were first released, their value has diminished as models began to overfit them. Multiple coding models fine-tuned from GPT-4 pre-trained models now score above 85 on the Pass@1 metric. Additionally, the methodology by which models are evaluated against these benchmarks is often non-standardized, lacking a core requirement for comparing scores across tests or over time.
The goal of the Scale AI coding evaluation is to establish a uniform framework for evaluating LLMs’ coding capabilities.
Prompts Dataset Description
The Scale Coding Evaluation provides a standardized assessment framework for LLMs, enabling comparisons across models and identifying their strengths and weaknesses. It currently encompasses a set of use cases across the most requested coding languages.
Use Case | Definition | # of prompts |
---|---|---|
Generation | Create new code from a set of specifications or descriptions given in natural language. | 207 (20.7%) |
Fixing | Identify and correct errors in existing code. For example, debugging, resolving syntax errors and fixing logical mistakes. | 153 (15.3%) |
Understanding | Explain, interpret or clarify existing code. For example, elucidating how certain code segments work, breaking down complex algorithms. | 144 (14.4%) |
Modification | Make changes or adjustments to existing code to meet new requirements or conditions. For example, altering functionality, updating or enhancing features. | 102 (10.2%) |
Optimization | Improve the performance of existing code. For example, enhancing efficiency, reducing resource consumption (like memory or processing time). | 99 (9.9%) |
Learning | Assist with learning or understanding programming concepts, languages or tools. For example, guidance on best practices, explanation of programming concepts. | 96 (9.6%) |
Translation | Convert code from one programming language to another with code structures, styles and idioms adapted to the best practices of the target language. | 52 (5.2%) |
Recommendations | Provide suggestions or advice on coding practices, tools, libraries or frameworks. | 50 (5.0%) |
Commenting | Add or improve comments in existing code. | 49 (4.9%) |
Testing | Develop, enhance or fix tests for existing code. | 48 (4.8%) |
Data Sample
Use Case: Generation
Definition: Create new code from a set of specifications or descriptions given in natural language.
# of Prompts: 207 (20.7%)
User
I want you to make a program in Python. This program will predict how long a robot vacuum takes to clean a room. The inputs to the program are the size of the room in square meters and battery level as a percentage. The base cleaning time is 30 minutes for a 20-square-meter room. But it takes two more minutes for each extra square meter over 20. So, a bigger room means a longer cleaning time. The battery level also affects the cleaning time. If the battery isn't at 100%, the vacuum works slower. For every 1% the battery is below 100%, the cleaning takes 0.5% longer. For example, a room size of 50 square meters and a battery of 75% are used. Explain how to calculate the predicted cleaning time for those numbers.
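This sample prompt expects a short Python program plus a worked explanation of the arithmetic. Purely for illustration (a hypothetical sketch, not the reference response shipped with the dataset), the calculation it describes could look like:

```python
def predict_cleaning_time(room_sqm: float, battery_pct: float) -> float:
    """Predicted cleaning time in minutes.

    Base time is 30 minutes for a 20-square-meter room, plus 2 minutes
    for each square meter above 20. Every 1% of battery below 100%
    makes cleaning take 0.5% longer.
    """
    base_minutes = 30 + 2 * max(0.0, room_sqm - 20)
    battery_penalty = 0.005 * (100 - battery_pct)  # 0.5% per missing percent
    return base_minutes * (1 + battery_penalty)

# Worked example from the prompt: 50 square meters at 75% battery.
# Base: 30 + 2 * (50 - 20) = 90 minutes.
# Penalty: (100 - 75) * 0.5% = 12.5%, so 90 * 1.125 = 101.25 minutes.
print(predict_cleaning_time(50, 75))  # 101.25
```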
The dataset was created by a group of coding annotators screened and selected for the project on the basis of software development, programming, and data science experience and qualifications, ensuring coverage of all required coding languages.
Dataset construction steps:
- Instructions creation. We generated an instruction document to guide human contributors on how to generate the initial set of prompt-response pairs.
- Controlled access. Contributors were prohibited from using any publicly available code or questions, including those from StackOverflow or public GitHub repositories. This prevents possible eval set contamination, ensuring the creation of novel problems the models haven’t seen in their training sets.
- Initial attempts. Contributors were guided to create initial attempts, i.e., an initial set of prompt-response pairs, following a predetermined distribution by coding language and programming task.
- Quality Control. We designed a 2-stage pipeline:
- Based on initial attempts’ quality, the operators identified a set of trusted reviewers.
- Trusted reviewers performed multiple review rounds to evaluate the prompt-response pairs’ accuracy, coherence, and relevance.
- In addition to standard reviews, code execution tests on the model answers and independent audits were performed. We filtered the initial set of 2.5k prompts down to 1.5k reviewed prompt-response pairs.
- Independent final audit. The team tasked a set of internal independent auditors with a final review of the quality of the prompts and answers. Low-quality prompts or answers were filtered out. This final screen reduced the set to 1,000 quality-validated prompt-response pairs.
Evaluation Taxonomy
To capture a nuanced assessment, we created an evaluation taxonomy specific to coding tasks. Each model response was evaluated across a set of standalone criteria covering each of the use cases, and side-by-side with another model’s response to measure preference on a 7-point Likert scale:
- Prompt Adherence: Does the model strictly adhere to the given prompt and comprehend all its requirements?
- Ratings: Yes, No
- Correctness:
- To what extent are the claims in the response truthful and correct? Verifying Correctness often requires external research.
- If code was present in the response, does it execute and produce the correct output?
- Evaluators are encouraged to use any means possible to test code (e.g., writing simple programs to test functions and code snippets; an illustrative example of such a check follows this list). However, Code Correctness may not be measured if, for example, the code only functions when embedded inside a large, complex program that is not provided, or if it requires an external file/API dependency that is not provided.
- Ratings: Yes, No
- Performance / Efficiency: Does the code execute without any performance concerns?
- Ratings: Yes, No
- Readability / Documentation: Is the written explanation well-structured and visually organized? Does the response include necessary documentation aiding in code understanding? Is the code readable, employing proper formatting and mnemonic variable and function names?
- Ratings: Yes, No
- Overall Side-by-Side: Considering the dimensions above relevant to the specific task, how does the overall quality of the two model responses compare?
- Ratings: Score between 1 and 7, where 1 signifies that model 1’s response was much better than model 2’s response and 7 signifies the opposite.
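To illustrate the kind of throwaway check mentioned under Correctness, an evaluator might paste a function from a model response into a scratch script and assert on a few inputs. Both the snippet under test and the expected values below are hypothetical, not drawn from the evaluation set:

```python
# Hypothetical snippet copied from a model response under review.
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two already-sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

# Quick checks an evaluator might run before rating Correctness.
assert merge_sorted([], []) == []
assert merge_sorted([1, 3], [2]) == [1, 2, 3]
assert merge_sorted([1, 1], [1]) == [1, 1, 1]
print("all checks passed")
```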
Expert raters were required to provide further detailed justifications regarding (1) any model issues identified if they selected “No” in any of the above dimensions, and (2) the side-by-side rating, in an open-text format. Additionally, for all the responses that involve a nontrivial code snippet, we ask the annotators to compile and run the code in our tooling in order to fully catch compilation and runtime issues.
Evaluation Methodology
In our evaluation, each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,000 prompts described above.
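As a rough sketch of this pairing scheme (the model names, random seed, and prompt-sampling details below are illustrative assumptions, not the production pipeline), the tasks could be generated as follows:

```python
import itertools
import random

def build_tasks(models, prompts, pairings_per_pair=50, seed=0):
    """Pair every two models at least `pairings_per_pair` times,
    assigning a randomly chosen prompt to each pairing."""
    rng = random.Random(seed)
    tasks = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(pairings_per_pair):
            tasks.append((model_a, model_b, rng.choice(prompts)))
    return tasks

# Placeholder model names and prompts.
tasks = build_tasks(["model_a", "model_b", "model_c"],
                    [f"prompt_{i}" for i in range(1000)])
print(len(tasks))  # 3 model pairs x 50 pairings = 150 tasks
```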
Each evaluation task consists of the following:
- Two models generate the responses for a prompt
- Annotators provide a point-wise evaluation of each response
- Annotators express their preference between the two responses on a 7-point Likert scale
To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel 3 times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After the tasks were finalized, a team of internal independent auditors randomly selected and reviewed 10% of the tasks for quality control.
Beyond producing an overall ranking, this evaluation methodology enables slicing the evaluation data by programming language and use case, highlighting models’ strengths and weaknesses across different areas and helping answer questions like: how does a model perform compared to the reference model on SQL, Java, HTML/CSS, and C++ prompts? How competitive is a particular model in specific functions, from less code-intensive tasks like Understanding and Recommendations to complex scenarios like Translation?
Evaluation Insights Summary
- Most models typically perform well in "Commenting" and "Understanding" tasks, but they often face difficulties with "Testing" and "Generation" tasks. However, the newly released model o1-preview shows a remarkable improvement in testing, with an overall correctness of 84.2%.
- The evaluation dimension “Correctness / Functionality” is the primary source of error across models, followed by “Prompt Adherence”.
- Model version comparison:
- o1 (o1-preview, o1-mini):
- The new models excel in all dimensions and use cases due to superior prompt adherence and correctness.
- o1-mini surpasses the GPT-4 series in code generation with an impressive 80.8% overall correctness. GPT-4 models historically excelled in this use case and o1-mini continues this trend.
- Both models significantly improved in testing, a use case that also challenged the GPT-4 models. Specifically, o1-preview ranks #1 with an overall correctness of 84.2%, the highest across all models, while o1-mini ranks #2. However, both models struggle comparatively with commenting, ranking #8 and #15, respectively.
- GPT (GPT-4 Turbo Preview, GPT-4o (May, 2024), GPT-4o (August, 2024)):
- The three GPT-4 models demonstrate consistent performance across use cases, especially in generation and learning, but struggle with testing. The latest GPT model, GPT-4o (August 2024), ranks #4 and has only one use case, Testing, where its overall correctness rate is below 50%.
- Mistral (Mistral Large, Mistral Large 2):
- Mistral Large 2 greatly outperforms Mistral Large. It has improved remarkably on all dimensions, now ranking #1 in the Recommendations use case and #8 in overall preference. Mistral Large remains among the lowest-performing models, ranking #15 in overall correctness.
- The most notable improvement is in the ‘Translation’ use case, where overall correctness increased from 27.5% to 74.26%. Within this use case, the error rates in the ‘Correctness / Functionality’ and ‘Readability / Documentation’ dimensions showed the largest deltas.
- Gemini (Gemini 1.5 Pro (August 27 2024), Gemini 1.5 Pro (May, 2024), Gemini 1.5 Flash, Gemini 1.5 Pro (April, 2024)):
- The new Gemini model (Gemini 1.5 Pro (August 27 2024)) shows significant improvements over the previous version (Gemini 1.5 Pro (May 2024)), especially on “Correctness / Functionality”.
- The latest model ranks #1 in commenting with an overall correctness of 87.1%, further improving the high overall correctness (80%) of the version released in May.
- Claude (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet):
- Claude 3.5 Sonnet has significantly improved over Claude 3 Opus and Claude 3 Sonnet, securing the #3 spot on the leaderboard (compared to #12 and #15 for Claude 3 Opus and Claude 3 Sonnet). Its most notable enhancement is in prompt adherence, surpassing Opus and the previous Sonnet model, as well as the GPT-4 and Gemini models.
- Claude 3 Opus generally outperforms Claude 3 Sonnet, particularly by making fewer errors in the "Correctness / Functionality" dimension, except on certain Translation tasks.
- Llama (Llama 3.2 90B Vision Instruct, Llama 3.1 405B Instruct, Llama 3 70B Instruct):
- The latest Llama models show great improvement, especially in commenting tasks. Llama 3.2 90B Vision Instruct ranks #3, Llama 3.1 405B Instruct ranks #2, and Llama 3 70B Instruct ranks #14 for commenting.
- Both Llama 3.1 405B Instruct and Llama 3.2 90B Vision Instruct have demonstrated improvements across all four dimensions compared to the previous Llama 3 70B Instruct model. The most notable enhancement is in prompt adherence. Llama 3.2 90B Vision Instruct outperforms the other Llama models in both the Understanding and Fixing use cases.
Note: The four evaluation dimensions are Prompt Adherence / Understanding, Correctness / Functionality, Performance / Efficiency, and Readability / Documentation. The strengths and weaknesses mentioned above are relative to other models.
Please refer to Appendix B for more detailed analysis beyond this high-level summary.
Acknowledgments
We extend our deepest gratitude to the dedicated team of annotators, operators and researchers who made this project possible.
Scale AI team: Dean Lee*, Edwin Pan, Johannes Mols, Mike Lunati, Antony Tokarr, Cristina Menghini, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue
Appendix A - Prompt Diversity
Controlling command diversity is essential for model evaluation because it exercises the target models’ versatility and comprehensiveness in handling a wide range of instructions or directives. A model with high command diversity can effectively understand and respond to various types of commands, including different task-specific instructions, requests, prompts for information, or actions to be performed. During prompt collection, frequent combinations like “create-program” were caught by automated checks and discouraged in favor of different wordings with similar meanings. To analyze command diversity for coding prompts, all code is removed so that only the natural language remains.
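As a rough illustration of this preprocessing (the regular expressions and the first-word heuristic below are assumptions for exposition, not the production checks), code can be stripped from a prompt before counting command words:

```python
import re
from collections import Counter

CODE_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)  # fenced code blocks
INLINE_CODE = re.compile(r"`[^`]+`")                # inline code spans

def natural_language_only(prompt: str) -> str:
    """Strip code so only the natural-language instruction remains."""
    return INLINE_CODE.sub(" ", CODE_BLOCK.sub(" ", prompt))

def leading_word_counts(prompts):
    """Count each prompt's first word as a rough proxy for its command verb."""
    counts = Counter()
    for prompt in prompts:
        words = natural_language_only(prompt).lower().split()
        if words:
            counts[words[0]] += 1
    return counts

print(leading_word_counts([
    "Create a program that parses `config.yaml`.",
    "Write a function to reverse a linked list.",
    "Create a REST endpoint in Flask.",
]))  # Counter({'create': 2, 'write': 1})
```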
Appendix B - Evaluation Insights
Model performance overview
“Correctness / Functionality” is the primary source of error across models.
Performance consistency analysis (Coefficient of variation)
- Lowest CV (indicates consistent model performance across use cases):
- o1-mini
- o1-preview
- Highest CV (suggests strong performance in certain use cases but poor performance in others):
- CodeLlama 34B Instruct
- Gemini 1.0 Pro
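The coefficient of variation is the standard deviation divided by the mean; applied to a model’s per-use-case scores (assumed here to be overall correctness rates), a low value indicates even performance across use cases. A minimal sketch with made-up numbers:

```python
import statistics

def coefficient_of_variation(scores):
    """CV = standard deviation of per-use-case scores divided by their mean."""
    return statistics.pstdev(scores) / statistics.mean(scores)

# Hypothetical per-use-case correctness rates for two models.
consistent = [0.82, 0.80, 0.78, 0.84, 0.81]
uneven = [0.90, 0.45, 0.70, 0.30, 0.85]
print(round(coefficient_of_variation(consistent), 3))  # 0.025 -> low CV
print(round(coefficient_of_variation(uneven), 3))      # 0.361 -> high CV
```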
Model | Score | 95% Confidence |
---|---|---|
1st | 1265 | +40/-32 |
2nd | 1195 | +32/-32 |
3rd | 1115 | +24/-24 |
4th | 1086 | +28/-31 |
5th | 1076 | +26/-26 |
6th | 1074 | +22/-23 |
7th | 1073 | +29/-29 |
8th | 1072 | +28/-27 |
9th | 1062 | +25/-25 |
10th | 1022 | +27/-24 |
11th | 1020 | +30/-34 |
12th | 995 | +22/-23 |
13th | 972 | +27/-25 |
14th | 931 | +27/-30 |
15th | 916 | +27/-29 |
16th | 912 | +24/-25 |
17th | 852 | +28/-28 |
18th | 726 | +33/-33 |
19th | 636 | +37/-39 |