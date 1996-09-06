Convert code from one programming language to another with code structures, styles and idioms adapted to the best practices of the target language.

Assist with learning or understanding programming concepts, languages or tools. For example, guidance on best practices, explanation of programming concepts.

Make changes or adjustments to existing code to meet new requirements or conditions. For example, altering functionality, updating or enhancing features.

The Scale Coding Evaluation provides a standardized assessment framework for LLMs, enabling comparisons across models and identifying their strengths and weaknesses. It currently encompasses a set of use cases across the most requested coding languages.

While these evaluation benchmarks were useful when they first came out, their value has diminished as models have started to overfit them. Multiple coding models fine-tuned from GPT-4 pre-trained models are scoring above 85 on the Pass@1 metric. Additionally, the methodology by which models are evaluated against these benchmarks is often non-standardized, so it lacks a core requirement for comparing scores across tests or over time. The goal of the Scale AI coding evaluation is to establish a uniform framework for evaluating LLMs’ coding capabilities.

The Scale AI Coding Prompts Set comprises 1,000 prompts spanning a diverse array of programming languages, disciplines, and programming tasks. This dataset encompasses a broad spectrum of software engineering tasks, ranging from debugging to code optimization and from documentation generation to understanding complex code bases. While the use of LLMs for coding applications has grown, there are few tools or benchmarks available to compare different models on a like-for-like basis. The most well-known ones include:

I want you to make a program in Python. This program will predict how long a robot vacuum takes to clean a room. The inputs to the program are the size of the room in square meters and battery level as a percentage. The base cleaning time is 30 minutes for a 20-square-meter room. But it takes two more minutes for each extra square meter over 20. So, a bigger room means a longer cleaning time. The battery level also affects the cleaning time. If the battery isn't at 100%, the vacuum works slower. For every 1%, the battery is below 100%, and the cleaning takes 0.5% longer. For example, a room size of 50 square meters and a battery of 75% are used. Explain how to calculate the predicted cleaning time for those numbers.
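To make the arithmetic in this sample prompt concrete, here is a minimal Python sketch of one way a response might compute the answer. The function name `predict_cleaning_time` and its parameter names are assumptions made for illustration, and the sketch assumes that rooms of 20 square meters or less take the base 30 minutes, which the prompt does not state explicitly.

```python
def predict_cleaning_time(room_size_m2: float, battery_percent: float) -> float:
    """Return the predicted cleaning time in minutes (hypothetical helper)."""
    base_minutes = 30.0           # base time for a 20-square-meter room
    base_size_m2 = 20.0
    extra_minutes_per_m2 = 2.0    # two more minutes per square meter over 20
    slowdown_per_percent = 0.005  # 0.5% longer per 1% of battery below 100%

    # Size adjustment: only the area above 20 square meters adds time.
    extra_area = max(0.0, room_size_m2 - base_size_m2)
    size_adjusted = base_minutes + extra_minutes_per_m2 * extra_area

    # Battery adjustment: scale the size-adjusted time by the battery deficit.
    battery_deficit = max(0.0, 100.0 - battery_percent)
    return size_adjusted * (1.0 + slowdown_per_percent * battery_deficit)


minutes = predict_cleaning_time(50, 75)
print(f"{minutes:.2f} minutes")
```

For the numbers in the prompt, the room is 30 square meters over the base, so the size-adjusted time is 30 + 2 × 30 = 90 minutes. The battery is 25% below full, which adds 12.5%, giving 90 × 1.125 = 101.25 minutes.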

Use Cases Distribution

Coding Languages Distribution

The dataset was created by a group of coding annotators screened and selected for the project. They were selected on the basis of software development / programming / data science experience and qualifications, ensuring coverage of all required coding languages.



Dataset construction steps:

1. Instructions creation. We wrote an instruction document to guide human contributors on how to generate the initial set of prompt-response pairs.
2. Controlled access. Contributors were prohibited from using any publicly available code or questions, including those from StackOverflow or public GitHub repositories. This prevents possible eval set contamination, ensuring the creation of novel problems the models haven’t seen in their training sets.
3. Initial attempts. Contributors were guided to create initial attempts, an initial prompt-response set, following a predetermined distribution by coding language and programming task.
4. Quality control. We designed a 2-stage pipeline. First, based on the quality of the initial attempts, the operators identified a set of trusted reviewers. Second, the trusted reviewers performed multiple review rounds to evaluate the accuracy, coherence, and relevance of the prompt-response pairs. In addition to standard reviews, code execution tests on the model answers and independent audits were performed. We filtered down the initial set of 2.5k prompts to 1.5k reviewed prompt-response pairs.
5. Independent final audit. The team commissioned a set of internal independent auditors to perform a final review of the quality of the prompts and answers. Low-quality prompts or answers were filtered out. This final screen reduced the set to 1,000 quality-validated prompt-response pairs.

Evaluation Taxonomy

To capture a nuanced assessment, we created an evaluation taxonomy specific to coding tasks. Each model response was evaluated across a set of standalone criteria, covering each of the use cases, and side-by-side with another model response to measure preference ranking on a 7-point Likert scale:

Prompt Adherence: Does the model strictly adhere to the given prompt and comprehend all its requirements? Ratings: Yes, No

Correctness: To what extent are the claims in the response truthful and correct? Verifying Correctness often requires external research. If code was present in the response, does it execute and produce the correct output? Evaluators are encouraged to use any means possible to test code (e.g., writing simple programs to test functions and code snippets). However, Code Correctness may not be measured if, for example, the code only functions when embedded inside a large, complex program that is not provided, or if it requires an external file/API dependency that is not provided. Ratings: Yes, No

Performance / Efficiency: Does the code execute without any performance concerns? Ratings: Yes, No

Readability / Documentation: Is the written explanation well-structured and visually organized? Does the response include necessary documentation aiding in code understanding? Is the code readable, employing proper formatting and mnemonic variable and function names? Ratings: Yes, No

Overall Side-by-Side: Considering the dimensions above relevant to the specific task, how does the overall quality of the two model responses compare? Ratings: Score from 1 to 7, where 1 signifies that model 1’s response was much better than model 2’s response and 7 signifies the opposite.

Evaluation Methodology

Expert raters were required to provide detailed open-text justifications for (1) any model issues identified if they selected “No” in any of the above dimensions, and (2) the side-by-side rating. Additionally, for all responses that involved a nontrivial code snippet, we asked the annotators to compile and run the code in our tooling in order to fully catch compilation and runtime issues.

In our evaluation, each model is paired with every other model at least 50 times, and each pairing receives a randomly chosen prompt from the set of 1,000 prompts described above.
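The pairing design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_schedule`, the placeholder model names, uniform sampling of prompts with replacement, and the random left/right ordering of the two models are all hypothetical choices, since the text only says that each pairing receives a randomly chosen prompt.

```python
import itertools
import random


def build_schedule(models, prompt_ids, rounds_per_pair=50, seed=0):
    """Pair every model with every other model `rounds_per_pair` times.

    Each comparison gets one prompt drawn uniformly at random (an assumption).
    """
    rng = random.Random(seed)
    schedule = []
    for model_a, model_b in itertools.combinations(models, 2):
        for _ in range(rounds_per_pair):
            pair = [model_a, model_b]
            rng.shuffle(pair)  # randomize which response is shown first (assumption)
            schedule.append((pair[0], pair[1], rng.choice(prompt_ids)))
    return schedule


models = ["model-a", "model-b", "model-c"]  # placeholder names
schedule = build_schedule(models, prompt_ids=list(range(1000)))
print(len(schedule))  # 3 pairs x 50 comparisons = 150
```

With n models there are n(n-1)/2 unordered pairs, so the number of comparisons grows quadratically with the number of models on the leaderboard.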



Each evaluation task consists of the following:

1. Two models generate responses for a prompt.
2. Annotators provide a point-wise evaluation of each response.
3. Annotators express their preference between the two responses on a 7-point Likert scale.
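One way to picture the record produced by a single evaluation task is the following Python sketch. The class names, field names, and dimension keys are hypothetical and do not describe the actual tooling; the sketch only encodes rules stated in the text: Yes/No ratings per dimension (with `None` standing in for a dimension that could not be measured), a written justification whenever a dimension is rated "No", and a side-by-side score from 1 to 7.

```python
from dataclasses import dataclass

# The four standalone dimensions named in the taxonomy (keys are illustrative).
DIMENSIONS = ("prompt_adherence", "correctness", "performance", "readability")


@dataclass
class PointwiseRating:
    """Yes/No ratings for one model response."""
    ratings: dict            # dimension -> True ("Yes"), False ("No"), or None (not measured)
    justification: str = ""  # required whenever any dimension is rated "No"

    def __post_init__(self):
        if set(self.ratings) != set(DIMENSIONS):
            raise ValueError("ratings must cover exactly the four dimensions")
        has_no = any(value is False for value in self.ratings.values())
        if has_no and not self.justification.strip():
            raise ValueError("a 'No' rating needs a written justification")


@dataclass
class EvaluationTask:
    """One prompt, two responses, and one annotator's judgments."""
    prompt_id: int
    rating_model_1: PointwiseRating
    rating_model_2: PointwiseRating
    preference: int  # 1 = model 1 much better ... 7 = model 2 much better
    preference_justification: str

    def __post_init__(self):
        if not 1 <= self.preference <= 7:
            raise ValueError("preference must be on the 7-point scale")


all_yes = PointwiseRating({name: True for name in DIMENSIONS})
flawed = PointwiseRating(
    {**{name: True for name in DIMENSIONS}, "correctness": False},
    justification="The code raises an exception on empty input.",
)
task = EvaluationTask(17, all_yes, flawed, preference=2,
                      preference_justification="Response 1 runs correctly.")
print(task.preference)
```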

Evaluation Methodology - Pipeline Design

To ensure thoroughness and reliability in the evaluation process, each task was executed in parallel 3 times by different human annotators. The ratings were then reviewed in two stages: an initial review layer and a final review layer. The figure below provides an overview of the evaluation pipeline design. After finalizing the tasks, a team of internal independent auditors randomly selected and reviewed 10% of the tasks for quality control.

Beyond producing an overall ranking, this evaluation methodology enables slicing the evaluation data by programming language and use case to highlight models’ strengths and weaknesses across different areas. It helps answer questions like: How does a model perform compared to the reference model on SQL, Java, HTML/CSS, and C++ prompts? How competitive is a particular model in specific functions, from less code-intensive tasks like Understanding and Recommendations to complex scenarios like Translation?

Evaluation Insights Summary

Models typically perform well in "Commenting" tasks and "Understanding" tasks, but they often face difficulties with "Translation" tasks and "Generation" tasks.

“Correctness / Functionality” and “Readability / Documentation” are the two primary sources of error across models.

Model version comparison:

GPT (gpt-4-0125-preview, gpt-4o-2024-05-13): The two GPT models demonstrate the most consistent performance across use cases. Although both GPT-4 models are top-ranked on the leaderboard, the newer model (gpt-4o-2024-05-13) tends to exhibit more readability issues than gpt-4-0125-preview, occasionally repeating code from the prompt unnecessarily, which could result in more verbose responses.

Gemini (gemini-1.5-pro-preview-0514, gemini-1.5-flash-preview-0514, gemini-1.5-pro-preview-0409): All three Gemini models excelled in the Recommendation task, ranking #1, #2, and #3 respectively, but they struggled more with testing tasks. The new Gemini model (gemini-1.5-pro-preview-0514) shows significant improvements over the previous version (gemini-1.5-pro-preview-0409), especially on “Correctness / Functionality” and “Readability / Documentation”.

Claude (claude-3-5-sonnet-20240620, claude-3-opus-20240229, claude-3-sonnet-20240229): Claude 3.5 Sonnet has significantly improved over Claude 3 Opus and Claude 3 Sonnet, securing the #1 spot in the leaderboard (compared to #5 and #9 for Opus and Claude 3 Sonnet). Its most notable enhancement is in prompt adherence, surpassing Opus, the previous Sonnet model, as well as the GPT and Gemini models. claude-3-opus-20240229 generally outperforms claude-3-sonnet-20240229, particularly in making fewer errors in the "Correctness/Functionality" category, except for certain Translation tasks.

Llama (llama-3.1-405b-instruct, llama-3-70b-instruct): Llama 3.1 405B Instruct has shown marked improvements across four dimensions compared to the previous Llama 3 70B model. The most notable enhancement is in prompt adherence, particularly in areas like comment modification and testing, where it has demonstrated superior performance.

Top 8 model insight highlights:

* The 4 evaluation dimensions are Prompt Adherence / Understanding, Correctness / Functionality, Performance / Efficiency, and Readability / Documentation.

* Note: The strengths and weaknesses mentioned are relative to other models.



Please refer to Appendix B for more detailed analysis beyond this high-level summary.

Acknowledgments

We extend our deepest gratitude to the dedicated team of annotators, operators and researchers who made this project possible.



Scale AI team: Dean Lee*, Mike Lunati, Antony Tokarr, Cristina Menghini, Daniel Berrios, William Qian, Kenneth Murphy, Summer Yue

Appendix A - Prompt Diversity

Controlling command diversity is essential for model evaluation because doing so exercises the versatility and comprehensiveness of the target models in handling a wide range of instructions or directives. A model with high command diversity can effectively understand and respond to various types of commands, including different task-specific instructions, requests, prompts for information, or actions to be performed. During prompt collection, frequent combinations like “create-program” were caught by automated checks and discouraged in favor of different wordings with similar meanings. To analyze command diversity for coding prompts, all code is removed such that only the natural language remains.

Appendix B - Evaluation Insights

Model performance overview

Models typically perform well in "Commenting" tasks (58% overall correctness) and "Understanding" tasks (51% overall correctness), but they often face difficulties with "Translation" tasks and "Generation" tasks (both with 30% overall correctness). (See "Data Samples" for each use case)



“Correctness / Functionality” and “Readability / Documentation” are the two primary sources of error across models.

Performance consistency analysis (Coefficient of variation)

Lowest CV (indicates consistent model performance across use cases): gpt-4-0125-preview, gpt-4o-2024-05-13

Highest CV (suggests strong performance in certain use cases but poor performance in others): codellama-34b-instruct, gemini-1.0-pro-001
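For readers unfamiliar with the metric, the coefficient of variation is the standard deviation divided by the mean. The Python sketch below shows the calculation on invented numbers. The per-use-case scores are made up purely for illustration and are not results from this evaluation, and the choice of population (rather than sample) standard deviation is an assumption, since the text does not say which was used.

```python
import statistics


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(scores) / mean


# Invented per-use-case correctness rates, for illustration only.
steady_model = [0.50, 0.52, 0.48, 0.50]
uneven_model = [0.80, 0.20, 0.60, 0.40]

print(round(coefficient_of_variation(steady_model), 3))
print(round(coefficient_of_variation(uneven_model), 3))
```

Both invented models have the same mean score, but the second one has a much higher CV because its scores swing widely between use cases.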

Model version comparison

gpt-4-0125-preview vs. gpt-4o-2024-05-13

Although both GPT-4 models are top-ranked on the leaderboard, the newer model (gpt-4o-2024-05-13) tends to exhibit more readability issues than gpt-4-0125-preview, occasionally repeating code from the prompt unnecessarily, which could result in more verbose responses. The most significant performance decline in gpt-4o-2024-05-13 is observed in Translation tasks, especially in terms of “Prompt adherence / Understanding” (error rate increased from 6% to 16%) and “Correctness / Functionality” (error rate increased from 14% to 32%).

Example

The example below demonstrates why gpt-4-0125-preview is preferred over gpt-4o-2024-05-13, pointing out the correctness and readability issues of gpt-4o-2024-05-13.



Justification When GPT 4 0125 Preview Wins

"Gpt-4-0125-preview outperforms gpt-4o-2024-05-13 because it adheres perfectly to the prompt instructions and produces error-free text and code. In contrast, while gpt-4o-2024-05-13 also adheres to the prompt, it has issues with correctness and readability. Specifically, gpt-4o-2024-05-13 inaccurately states that student "Hannah" qualifies for "Leadership" with a 3.5 GPA, conflicting with the criteria in the CTE, which requires a 3.6 GPA or higher. Furthermore, the gpt-4o-2024-05-13 model repeats the prompted SQL code in its output unnecessarily, making it verbose and reducing readability. Similarly, the explanation of the sample data was redundant. The inserted data is descriptive and self-explanatory.”

Prompt

The following SQL query is produced to assign free training to students. The eligibility of students to enroll in each training is determined based on their GPA and department. Students who enrolled in Technology and Engineering with a GPA greater than or equal to 3.5 are eligible to enroll in 'Basic Java Tutorial' training, while "Science" students are eligible for 'Space Science" if their GPA is at least 3.5. Lastly 'Management' students are eligible for "Leadership" Training. Can you please generate sample data for both tables for testing the query?

SQL

CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    Name VARCHAR(255),
    Department VARCHAR(255),
    GPA DECIMAL(3, 2)
);

CREATE TABLE Trainings (
    TrainingID INT PRIMARY KEY,
    TrainingName VARCHAR(255)
);

WITH Candidates AS (
    SELECT
        s.StudentID,
        s.Name,
        s.Department,
        s.GPA,
        CASE
            WHEN s.GPA >= 3.5 AND s.Department IN ('Engineering', 'Technology') THEN 'Basic Java Tutorial'
            WHEN S.GPA >= 3.5 AND s.Department = 'Science' THEN 'Space Science'
            WHEN s.GPA >= 3.6 AND s.Department = 'Management' THEN 'Leadership'
            ELSE 'Not Eligible for Training'
        END AS TrainingName
    FROM Students s
),

Ranking AS (
    SELECT *, RANK() OVER (ORDER BY GPA DESC) Ranking
    FROM Candidates
)
SELECT *
FROM Ranking
ORDER BY Ranking;
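The correctness issue raised in the justification comes down to the GPA thresholds in the CASE expression. The Python sketch below checks them with the standard-library `sqlite3` module. It is an illustration only: the sample rows are invented (only Hannah's department and GPA come from the justification; "Ivan" and "Jia" are hypothetical), the `Trainings` table and the `RANK()` step are left out for brevity, and SQLite is an assumed engine, since the prompt does not name one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (StudentID INT PRIMARY KEY, Name VARCHAR(255), "
    "Department VARCHAR(255), GPA DECIMAL(3, 2))"
)
# Invented sample rows; only Hannah's department and GPA come from the justification.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [
        (1, "Hannah", "Management", 3.5),
        (2, "Ivan", "Management", 3.6),
        (3, "Jia", "Science", 3.5),
    ],
)

# Same CASE thresholds as the query in the prompt.
ELIGIBILITY_QUERY = """
SELECT Name,
       CASE
           WHEN GPA >= 3.5 AND Department IN ('Engineering', 'Technology') THEN 'Basic Java Tutorial'
           WHEN GPA >= 3.5 AND Department = 'Science' THEN 'Space Science'
           WHEN GPA >= 3.6 AND Department = 'Management' THEN 'Leadership'
           ELSE 'Not Eligible for Training'
       END AS TrainingName
FROM Students
ORDER BY StudentID
"""
eligibility = dict(conn.execute(ELIGIBILITY_QUERY).fetchall())
print(eligibility["Hannah"])  # Not Eligible for Training
```

A Management student with a 3.5 GPA falls through to the ELSE branch, which is why describing "Hannah" as qualifying for "Leadership" counts as a correctness error.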

gemini-1.5-pro-preview-0409 vs. gemini-1.5-pro-preview-0514

The new Gemini model (gemini-1.5-pro-preview-0514) shows significant improvements over the previous version (gemini-1.5-pro-preview-0409) across all use cases except for recommendation, where the previous model excels. Overall correctness increased from 39% to 52%. In terms of error dimensions, the new model improved most in “Correctness / Functionality” (error rate dropped from 35% to 25%) and “Readability / Documentation” (error rate dropped from 30% to 24%).

Example

The example below demonstrates why gemini-1.5-pro-preview-0514 is preferred over gemini-1.5-pro-preview-0409, pointing out the correctness and readability issues of gemini-1.5-pro-preview-0409.



Justification When Gemini-1.5-pro-preview-0514 Wins

“Gemini-1.5-Pro-Preview-0409 has a minor issue with the CSS of the .nav element. The original parent .nav styling was wrongly removed. This removal adversely affects the style, particularly the margins and padding, leading to an unclear display and improper styling from the user's perspective. The headers are hard to distinguish as the white background overlaps. Additionally, the website is not responsive and does not resize correctly with changes in the viewport. The issue is the fixed width of the .container class, which is set to 960px. Changing this to max-width: 960px, as seen in R2, would make the website responsive. Lastly, the CSS code is missing comments, which reduces readability.



In contrast, Gemini-1.5-Pro-Preview-0514 offers a more structured approach. It enhances accessibility by including the lang attribute and demonstrates responsive design by effectively using Flexbox. While Response 1 also used Flexbox, it was wrongly implemented. The adjustments to padding and margins contribute to a polished visual appearance, reflecting a user interface design that is particularly suitable for modern contexts. Its code is well-organized, with clear explanations and commented lines indicating edits, enhancing readability.”

Prompt

The below code includes inline CSS styles, non-semantic HTML markup, redundant CSS properties, and inefficient layout techniques. It's not optimized for performance, maintainability, or scalability, but it serves as an example of how code might look if not carefully designed and structured. Provide an optimized code for the below HTML code:

HTML

<!DOCTYPE html>
<html>
<head>
    <title>HTML and CSS</title>
    <style>
        /* CSS code */
        body {
            background-color: #ffffff;
            font-family: Arial, sans-serif;
        }

        .container {
            width: 960px;
            margin: 0 auto;
            padding: 20px;
        }

        .header {
            background-color: #333333;
            color: #ffffff;
            padding: 20px;
        }

        .nav {
            background-color: #cccccc;
            padding: 10px;
            margin-top: 20px;
        }

        .nav ul {
            list-style-type: none;
            padding: 0;
            margin: 0;
        }

        .nav ul li {
            display: inline;
            margin-right: 10px;
        }

        .nav ul li a {
            text-decoration: none;
            color: #333333;
            padding: 5px 10px;
            background-color: #ffffff;
        }

        .nav ul li a:hover {
            background-color: #666666;
            color: #ffffff;
        }

        .sidebar {
            float: left;
            width: 250px;
            background-color: #cccccc;
            padding: 20px;
            margin-top: 20px;
        }

        .content {
            float: left;
            width: 660px;
            background-color: #eeeeee;
            padding: 20px;
            margin-top: 20px;
        }

        .footer {
            clear: both;
            background-color: #333333;
            color: #ffffff;
            padding: 20px;
            text-align: center;
        }
    </style>
</head>
<body>
    <!-- HTML code -->
    <div class="container">
        <div class="header">
            <h1>Website</h1>
        </div>
        <div class="nav">
            <ul>
                <li><a href="#">Home</a></li>
                <li><a href="#">About</a></li>
                <li><a href="#">Services</a></li>
                <li><a href="#">Contact</a></li>
            </ul>
        </div>
        <div class="sidebar">
            <h2>Sidebar</h2>
            <p>This is the sidebar content.</p>
        </div>
        <div class="content">
            <h2>Main Content</h2>
            <p>This is the main content area.</p>
        </div>
        <div class="footer">
            <p>Website</p>
        </div>
    </div>
</body>
</html>

claude-3-opus-20240229 vs. claude-3-sonnet-20240229

claude-3-opus-20240229 generally outperforms claude-3-sonnet-20240229, particularly in making fewer errors in the “Correctness / Functionality” category (error rate dropped from 36% to 27%), except for certain Translation tasks.

Example

The example below demonstrates why claude-3-opus-20240229 is preferred over claude-3-sonnet-20240229, pointing out the correctness issues of claude-3-sonnet-20240229.



Justification When Claude-3-opus-20240229 Wins

“Opus is the better model because it produces the correct output and adheres to the prompt instructions, which specify iterating until 10. In contrast, Sonnet stops after the first iteration. Opus uses a CTE to generate numbers from 2 to 9. It employs a 'NOT EXISTS' subquery to check for any divisors smaller than the current number that divide it evenly, effectively selecting prime numbers from the 'prime_numbers' CTE and renaming the column to 'prime_number'.”

Prompt

Create a sql query which has the functionality of loops like other programming languages, Do 10 iterations of the loop and print the prime numbers till 10.
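The technique described in the justification above, a CTE that generates candidate numbers combined with a NOT EXISTS subquery that rules out numbers with a smaller divisor, can be sketched as follows. This is not the Opus response, which is not reproduced here. The CTE name `numbers`, the range of 2 to 10, and the use of SQLite through Python's standard-library `sqlite3` module are all assumptions made for illustration.

```python
import sqlite3

# A recursive CTE plays the role of the loop: each step adds the next number.
PRIME_QUERY = """
WITH RECURSIVE numbers(n) AS (
    SELECT 2
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 10
)
SELECT candidate.n AS prime_number
FROM numbers AS candidate
WHERE NOT EXISTS (
    -- A candidate is rejected if any smaller generated number divides it evenly.
    SELECT 1
    FROM numbers AS divisor
    WHERE divisor.n < candidate.n
      AND candidate.n % divisor.n = 0
)
ORDER BY candidate.n
"""

conn = sqlite3.connect(":memory:")
primes = [row[0] for row in conn.execute(PRIME_QUERY)]
print(primes)  # [2, 3, 5, 7]
```

Because the generated numbers start at 2, the divisor check never considers 1, so no special case is needed for it.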

Claude 3.5 Sonnet Examples

Example #1 - Python Optimization, Claude 3.5 Sonnet vs. gpt-4o

Language: Python

Use case: Optimization

Why did Sonnet win? Sonnet successfully manages overlapping reservations, optimizes table usage by assigning the smallest available table, and considers varying table capacities as required by the prompt. In contrast, GPT-4o struggles with handling overlapping reservations, a critical requirement specified in the prompt, resulting in poor prompt adherence.

Details: Claude 3.5 Sonnet's implementation includes a table class with a defined number of tables in the tables array and a method to find the next available table. It ensures proper handling of appointment times by enforcing a 2-hour gap after each booking, enhancing the reliability of the booking system. While lacking inline comments, Sonnet's approach is robust and effective. Conversely, GPT-4o's response has both prompt adherence and correctness issues. It creates a table size array that lacks specific table numbers. Therefore, reservations are not assigned to an actual table number but only to a table size. Moreover, it incorrectly verifies table availability, allowing consecutive bookings as long as the dates and times are not an exact match and only minutes apart. These issues indicate a significant lack of adherence to the prompt requirements, compromising the functionality and reliability of the booking system compared to Sonnet 3.5.

Prompt

I developed a system for managing restaurant reservations. My version allows customers to book tables specifying the date, time, and number of guests but it's not efficiently handling overlapping reservations or optimizing the usage of tables for table sizes varying from 2, 4, 6, 8 to accommodate as many reservations as possible and it does not account for varying table capacities. I need you to optimize my code by adding a new function "find_available_table".

Python

class ReservationSystem:
    def __init__(self):
        self.reservations = []
    def add_reservation(self, date, time, num_guests):
        for reservation in self.reservations:
            if reservation[0] == date and reservation[1] == time:
                print("Reservation conflict! Cannot book.")
                return False
        self.reservations.append((date, time, num_guests))
        print("Reservation added successfully.")
        return True
system = ReservationSystem()
system.add_reservation("2023-03-15", "19:00", 4)
system.add_reservation("2023-03-15", "19:00", 2)
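To illustrate the behaviors the evaluation rewarded (numbered tables, smallest-fit assignment, and overlap detection over a 2-hour window), here is a minimal Python sketch. It is not either model's actual response. The fixed 2-hour seating length follows the description of the winning response, while the table layout, the return values, and the internal data structures are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

BOOKING_LENGTH = timedelta(hours=2)  # assumed fixed seating duration


class ReservationSystem:
    def __init__(self, table_capacities):
        # table number -> capacity, e.g. {1: 2, 2: 4}
        self.tables = dict(table_capacities)
        self.reservations = []  # (table_number, start, num_guests)

    def find_available_table(self, start, num_guests):
        """Return the smallest free table that fits the party, or None."""
        candidates = sorted(
            (capacity, number)
            for number, capacity in self.tables.items()
            if capacity >= num_guests
        )
        for _, number in candidates:
            # Two 2-hour bookings overlap when their starts are under 2 hours apart.
            if all(
                table != number or abs(start - booked_start) >= BOOKING_LENGTH
                for table, booked_start, _ in self.reservations
            ):
                return number
        return None

    def add_reservation(self, date, time, num_guests):
        """Book a table and return its number, or None if nothing is free."""
        start = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
        table = self.find_available_table(start, num_guests)
        if table is not None:
            self.reservations.append((table, start, num_guests))
        return table


system = ReservationSystem({1: 2, 2: 4, 3: 4, 4: 8})  # hypothetical layout
print(system.add_reservation("2023-03-15", "19:00", 4))  # 2
print(system.add_reservation("2023-03-15", "19:00", 2))  # 1
print(system.add_reservation("2023-03-15", "20:00", 4))  # 3
```

The third booking lands on table 3 because table 2 is still occupied from the 19:00 booking, which is the overlapping case the original exact-match check misses.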

Example #2 - Python Testing, Claude 3.5 Sonnet vs. Claude 3 Opus

Language: Python

Use case: Testing

Why did Sonnet lose? Sonnet faces an issue with sorting indices, resulting in a failed test case. In contrast, Claude 3 Opus utilizes direct assertions to validate function outputs, offering a simpler and more concise approach than using unittest.TestCase assertions.

Details: Sonnet encounters an error in its handling of index sorting in get_ranking_indices(). The function find_min_index() used within get_ranking_indices() returns indices in the order they are found, which may not align with the sorted order of values in ranking_array. This discrepancy causes test case 5 to fail, where the expected output [5, 7, 4, 0, 6, 1, 3, 2] differs from the actual output [5, 7, 4, 2, 0, 6, 1, 3]. Opus directly uses assertions to validate the function's outputs, which is simpler and more concise compared to using unittest.TestCase assertions. This direct approach ensures straightforward and effective testing, focusing directly on function behavior and correctness.

Prompt

The "find_min_index()" method is responsible for finding the index of the minimum value in a given list. The "get_ranking_indices()" method utilizes the "find_min_index()" method to arrange the indices of the elements in the list from the index of the minimum element to the index of the maximum element. Could you generate some test cases for this code?

Python

def find_min_index(array, visited_indices):
    '''
    This function finds the index of the minimum value in the array that has not been visited.

    Parameters:
    - array (list): The input array to search for the minimum value.
    - visited_indices (dict): A dictionary to keep track of visited indices.

    Returns:
    - int or None: The index of the minimum value, or None if no unvisited minimum value is found.
    '''

    min = None
    min_idx = None

    for i in range(len(array)):
        if min is None and i not in visited_indices:
            min = array[i]
            min_idx = i
        elif i not in visited_indices and array[i] < min:
            min = array[i]
            min_idx = i

    return min_idx


def get_ranking_indices(ranking_array):
    '''
    Get the indices of the elements in the ranking_array in ascending order.

    Parameters:
    - ranking_array (list): The array containing ranking values.

    Returns:
    - list: The list of indices corresponding to the elements in the ranking_array in ascending order.
    '''
    ranking_indices = []
    visited_indices = {}

    while len(ranking_indices) != len(ranking_array):
        min_idx = find_min_index(ranking_array, visited_indices)

        ranking_indices.append(min_idx)
        visited_indices[min_idx] = True

    return ranking_indices
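As a rough illustration of the direct-assertion style credited to Opus in the analysis above, the Python sketch below tests the two functions with plain `assert` statements. These are not the test cases from either model's response; the inputs and expected values were chosen and worked out by hand for this illustration. The two functions are condensed copies of the prompt's code (with `min` renamed to `min_value` to avoid shadowing the built-in) so that the block runs on its own.

```python
def find_min_index(array, visited_indices):
    """Condensed copy of the prompt's function: first unvisited minimum."""
    min_value = None
    min_idx = None
    for i in range(len(array)):
        if i in visited_indices:
            continue
        if min_value is None or array[i] < min_value:
            min_value = array[i]
            min_idx = i
    return min_idx


def get_ranking_indices(ranking_array):
    """Condensed copy of the prompt's function: indices in ascending value order."""
    ranking_indices = []
    visited_indices = {}
    while len(ranking_indices) != len(ranking_array):
        min_idx = find_min_index(ranking_array, visited_indices)
        ranking_indices.append(min_idx)
        visited_indices[min_idx] = True
    return ranking_indices


# Edge cases: empty and single-element input.
assert get_ranking_indices([]) == []
assert get_ranking_indices([42]) == [0]

# Distinct values, including negatives.
assert get_ranking_indices([3, 1, 2]) == [1, 2, 0]
assert get_ranking_indices([-1, -5, 0]) == [1, 0, 2]

# Ties: equal values keep the order in which their indices appear.
assert get_ranking_indices([2, 1, 2, 1]) == [1, 3, 0, 2]

# find_min_index skips visited indices and returns None when all are visited.
assert find_min_index([4, 2, 7], {1: True}) == 0
assert find_min_index([4, 2], {0: True, 1: True}) is None

# Cross-check against Python's stable sort on the same data.
data = [5, 3, 9, 3, 1]
assert get_ranking_indices(data) == sorted(range(len(data)), key=data.__getitem__)
print("all assertions passed")
```

The tie case is worth including because the strict `<` comparison means equal values are returned in index order, which is exactly the kind of ordering detail a wrong expected value can get confused about.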

Win / Loss visualizations