
Agentic Tool Use

Introduction

The ability of agentic LLMs to chain multiple tool calls together compositionally to solve tasks remains an open problem. The ToolComp benchmark comprises 485 meticulously crafted prompts and final answers designed to evaluate how well AI models handle tasks that require dependent tool usage. The benchmark is distinguished by prompts that require composing multiple tools together, golden answer chains, and process supervision labels, yielding a more thorough and accurate evaluation of a model's reasoning and tool-calling abilities. We split ToolComp into two subsets: ToolComp-Enterprise, which tests usage of 11 tools, and ToolComp-Chat, which tests usage of two common chatbot tools (Python Interpreter and Google Search).

In comparison to other benchmarks in the field, such as ToolBench, API Bench, and API-Bank, ToolComp critically combines compositional tool use with human-verified final answers, which can be evaluated automatically. Existing benchmarks either lack dependent tool usage or human-verified final answers, or rely on artificial tools with limited outputs, failing to provide scenarios that truly mirror the sequential, interdependent tool use required in real-world, enterprise settings. This gap limits the granular feedback needed to localize errors, make incremental improvements, and enhance AI capabilities in real-world applications, where complex, step-by-step reasoning and the execution of multiple dependent tools are essential to arrive at a correct final answer. A summary of the contributions and metadata of existing tool-use benchmarks is provided in Table 1.

To bridge this gap and better align the needs of AI applications with the capabilities of benchmarks, we introduce ToolComp, a tool-use benchmark designed to meet the evolving demands of agentic model makers seeking to rigorously test and scale their models in practical, dynamic environments.

Table 1. The contributions and metadata of popular benchmarks in Tool Use. Our work, ToolComp, is shown in the first column. *Although APIBench technically uses only 3 tools, its novelty comes from having hundreds of possible APIs to use in Python.

Dataset Description

ToolComp consists of 485 prompts and labels that exercise dependent tool calling, as shown below. We define dependent tool calling as the need to call multiple tools in sequence such that the output of a previous tool informs the input of a subsequent tool. Note that the `action_input` for the `finish` action is the final answer.

Data Sample


Category:

Geometry


User:

Suppose you have a 2 inch x 2 inch x 2 inch box. How many of these boxes can you fit into the biggest U-Haul truck (biggest truck as of 2024)?

Assistant:

Step 0:
1. Use the google_search tool to find the name of the biggest uhaul truck and find its dimensions
2. Use python_interpreter to calculate the total number of boxes that would fit given the dimensions
3. Return the answer

Step 1:
thought: I need to find the dimensions of the biggest U-Haul truck.
action: google_search
action_input: {"query": "biggest U-Haul truck"}
observation:

{
  "error": "",
  "result": [
    {
      "source": "uhaul.com",
      "date": null,
      "title": "26ft Moving Truck Rental",
      "snippet": "The largest moving truck available at U-Haul, the 26ft moving truck is ideal for moving 5-8 room homes and designed for easy loading and comfortable ...",
      "highlight": "the 26ft moving truck"
    },
    {
      "source": "uhaul.com",
      "date": "Jun 14, 2023",
      "title": "Moving with a 26' U-Haul Truck - Moving Insider",
      "snippet": "The 26′ U-Haul truck is our largest rental moving truck for a good reason. It has: Inside dimensions of 26'5″x 7'8″x 8'3″ (LxWxH); Door ...",
      "highlight": "The 26′ U-Haul truck"
    },
    {
      "source": "hemmings.com",
      "date": "May 23, 2023",
      "title": "Driving a 26-Foot U-Haul Truck Is a Dumb Thing to Do",
      "snippet": "... largest truck that U-Haul rents out. It was a fine beast, a Ford F-650 Super Duty chassis with a Triton 6.8L gasoline V-10 under the hood ...",
      "highlight": "a Ford F-650 Super Duty chassis with a Triton 6.8L gasoline V-10 under the hood"
    },
    {
      "source": "uhaul.com",
      "date": null,
      "title": "U-Haul Moving Truck Sizes",
      "snippet": "View and compare all available U-Haul moving trucks with rates starting as low as $19.95, plus mileage. Truck options range from pickup trucks, cargo vans, ...",
      "highlight": "U | Haul moving trucks | Truck | pickup trucks"
    },
    {
      "source": "uhaul.com",
      "date": null,
      "title": "U-Haul 26' F650 Moving Truck: Overhead Clearance",
      "snippet": "U-Haul 26' F650 Moving Truck: Overhead Clearance · More videos on YouTube · Customer Service · Rentals · Products & Services · Company Info · Partner With Us ...",
      "highlight": "U | Haul | Moving Truck"
    },
    {
      "source": "10+ reactions  ·  2 years ago",
      "date": null,
      "title": "What Fits in a 26' Moving Truck (1).mp4 | box truck, video ...",
      "snippet": "Our 26ft moving truck is the largest truck available. ... video below! Have an upcoming move? Head to http://uhaulco. com/kx9F50FSLG8 to get ...",
      "highlight": "moving truck | largest truck"
    },
    {
      "source": "41.8K+ views  ·  1 year ago",
      "date": null,
      "title": "26' Uhaul Truck Review - What You Should Know",
      "snippet": "These are the things I wish someone would have told me before I rented the 26 foot uhaul truck. https://amzn.to/3E8NaKD - My Packing Supply ...",
      "highlight": "uhaul truck"
    },
    {
      "source": "uhaul.com",
      "date": null,
      "title": "6x12 Cargo Trailer Rental",
      "snippet": "The 6x12 cargo trailer is our largest moving trailer rental. It comes equipped with smooth floors, rub rails, multiple tie-downs and can carry almost 2500 ...",
      "highlight": "The 6x12 cargo trailer"
    },
    {
      "source": "uhaul.com",
      "date": "May 11, 2013",
      "title": "\"My First Time Driving a 26 Foot Moving Truck\"",
      "snippet": "The first driving a 26 foot moving truck can be an easy and fun one, if you just remember a few simple tips.",
      "highlight": "26 foot moving"
    }
  ]
}


Step 2:

thought: I now have the dimensions of the biggest U-Haul truck, so I can use Python to calculate how many boxes can fit inside.
action: python_interpreter
action_input:
{
  "code": "import math\n    box_length = 2\n    box_width = 2\n    box_height = 2\n    # convert truck dim into inches\n    truck_length = (26 * 12) + 5\n    truck_width = (7 * 12) + 8\n    truck_height = (8 * 12) + 3\n    num_boxes = math.floor(truck_length // box_length) * math.floor(truck_width // box_width) * math.floor(truck_height // box_height)\n    print(f'Total number of boxes: {num_boxes}')"
}

observation:
{
  "result": "Total number of boxes: 356132\n",
  "error": null
}


Step 3:

thought: I have the total number of boxes that can fit inside the truck, so I should return that as the answer.
action: finish
action_input: {"answer": "[356132]"}
observation: 356132

Figure 1. Examples from ToolComp of prompts and their corresponding correct chains of dependent tool calls that reach the correct answer. The Assistant is asked to generate a high-level action plan for reaching the final answer and then to proceed step by step in the ReAct format, making a sequence of dependent tool calls until it arrives at the final answer. Additionally, each of the model-generated sub-steps (Thought, Action, and Action Input) is annotated with process supervision labels. If a step was corrected, the original incorrect step is marked as a bad step; all other steps, including steps the annotator marked as correct and the corrected versions of incorrect steps, are marked as correct.
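As a quick sanity check on the chain above, the Step 2 arithmetic can be reproduced standalone. This is a minimal sketch using the truck dimensions from the search snippet (26'5" x 7'8" x 8'3", LxWxH); it ignores real-world packing constraints:

```python
# Reproduce the Step 2 calculation from the example above:
# how many 2 x 2 x 2 inch boxes fit in a 26'5" x 7'8" x 8'3" cargo area.
box_edge = 2  # inches

# Convert the truck's interior dimensions from feet/inches to inches.
truck_length = 26 * 12 + 5   # 317 in
truck_width = 7 * 12 + 8     # 92 in
truck_height = 8 * 12 + 3    # 99 in

num_boxes = (truck_length // box_edge) * (truck_width // box_edge) * (truck_height // box_edge)
print(f"Total number of boxes: {num_boxes}")  # 356132, matching the final answer
```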


In creating the benchmark, we developed two subsets, ToolComp-Enterprise and ToolComp-Chat. ToolComp-Enterprise contains 11 tools and aims to emulate environments, such as enterprise deployments, in which LLM agents must correctly compose a larger number of expressive APIs. The second subset, ToolComp-Chat, is designed to test general-purpose chatbots with the minimally sufficient set of tools for information retrieval and processing tasks, namely Google Search and Python Interpreter, since web search and Python execution are the standard tools offered by major chatbot providers. We only allow the respective tools for each subset during prompt generation, labeling, and evaluation.

ToolComp-Enterprise contains 287 examples, while ToolComp-Chat contains 198 examples. Each of the 11 tools is described below in Figure 3.

Tools


Category:

Date

Name:

date

Description:

Returns the current date (e.g. January 1, 2024)

Input Example:

{}

Output Example:

{
  "result": "Today is Friday, February 16, 2024.",
  "error": ""
}

Parameters:

[]

Figure 3. The figure above shows all 11 of our tools, including their descriptions and their input and output arguments.
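For illustration, a tool entry such as the `date` tool above could be represented along the following lines. This is a hypothetical sketch mirroring the fields shown in Figure 3, not the exact schema used internally:

```python
# Hypothetical tool-spec record mirroring the fields displayed in Figure 3.
date_tool = {
    "name": "date",
    "description": "Returns the current date (e.g. January 1, 2024)",
    "parameters": [],  # the date tool takes no arguments
    "input_example": {},
    "output_example": {
        "result": "Today is Friday, February 16, 2024.",
        "error": "",
    },
}
```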

Prompt Creation

We used a hybrid synthetic approach: Seed prompts generated intermediary prompts, which were then edited and improved by humans in the loop. The overall process is described in more detail below.

Step 1: Develop In-Context Examples. We crafted high-quality in-context examples with corresponding reasoning, which we call ‘processes’, to guide prompt generation. An example is illustrated in Figure 2 of Appendix C.

Step 2: Generate Initial Prompts. Using the in-context examples, we generated synthetic prompts, ensuring diversity by selecting random subsets of the in-context examples; each generation used a distinct subset of in-context prompts and a random sample of tools from the set of available tools. The Seed prompt used in this step is shown in Appendix A, and a sketch of this generation loop follows Step 3 below.

Step 3: Human Refinement. Annotators reviewed the prompts to resolve any issues related to complexity, clarity, and ambiguity. We gave clear instructions on ambiguity (only one possible correct answer) and complexity (requires two or more tool calls to answer), instructing our annotators to ensure each prompt has only one correct answer, is complex and challenging, and requires the use of tools.
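A minimal sketch of how Steps 1 and 2 fit together is shown below. The sampling sizes and the `generate` callable are illustrative assumptions rather than the exact pipeline; the template placeholders correspond to the Seed prompt in Appendix A.

```python
import random

def generate_candidate_prompts(ic_examples, tool_set, operations, seed_template, generate, n=100):
    """Generate synthetic candidate prompts for later human refinement (Step 3).

    ic_examples   -- (prompt, process) pairs from Step 1
    tool_set      -- allowed tools for the subset (11-tool or 2-tool)
    operations    -- allowed operations referenced by the Seed prompt
    seed_template -- the Seed prompt from Appendix A with {grade}/{examples}/{operations}/{tools} slots
    generate      -- any text-generation callable wrapping an LLM (assumed)
    """
    candidates = []
    for _ in range(n):
        # Diversity: each call sees a random subset of IC examples and tools.
        examples = random.sample(ic_examples, k=min(2, len(ic_examples)))
        tools = random.sample(tool_set, k=random.randint(2, min(4, len(tool_set))))
        seed = seed_template.format(grade="college degree", examples=examples,
                                    operations=operations, tools=tools)
        candidates.append(generate(seed))
    return candidates
```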

Figure 4. The topics our prompts address. Many prompts require arithmetic operations and mathematical reasoning, alongside a roughly uniform distribution over disciplines such as Geography, Finance, History, Physics, Chemistry, Astronomy, and Architecture. The topics are not mutually exclusive, since many prompts span multiple domains and require multiple tools, multiple sources of knowledge, and diverse forms of reasoning.

Label Creation

To create the process supervision labels as well as the final answer for each prompt, we used a hybrid human-AI approach. We start by prompting an LLM to outline a plan for which tools to call, and in what order, to reach the final answer. We then append this plan, which we call the Action Plan, to the sequence before using the LLM to formulate tool calls, giving it a high-level plan that it can try to execute. We then use the LLM to call tools in the ReAct format, which we chose because it is the de facto standard for tool use and agentic workflows, combining reasoning with tool calls.

We asked the annotators to rate each step as Correct, meaning it is a reasonable step toward the final answer, or Incorrect, meaning the step is nonsensical, wrong, or not a reasonable way to reach the final answer. The Action Plan and each of the ReAct steps must be marked Correct or Incorrect by the annotator. If the annotator marks a step as Correct, the model proceeds to generate the next step. If the annotator marks a step as Incorrect, they edit the step to be correct, and the model is then prompted to generate the next step with the human-edited step in its context. This repeats until the Finish action is chosen by the LLM and marked Correct by the annotator, or until the annotator corrects an Action step to Finish because enough information has been gathered to answer the question. The overall flow is shown below in Figure 5 and sketched in code at the end of this section:

Figure 5. The overall diagram of the human-AI collaborative process employed to create the process supervision labels and the final answer for each prompt.

After this process, our final dataset includes one valid step-by-step chain of tool calls that successfully reaches the final answer, along with any incorrect sub-steps generated along the way. This yields process supervision labels with correct and incorrect steps for each prompt.

Note that for each tool subset, we only allow the LLM and the annotators to use the allowed tools for that subset.
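To make the annotation loop in Figure 5 concrete, here is a pseudocode-style sketch. The callables `model_next_step`, `annotator_review`, and `execute_tool` are hypothetical stand-ins for the LLM policy, the human annotator, and the tool sandbox:

```python
def annotate_chain(prompt, allowed_tools, model_next_step, annotator_review, execute_tool):
    """Collect process supervision labels and a final answer for one prompt.

    Every generated step is reviewed by an annotator: accepted steps are labeled
    correct; rejected steps are kept as incorrect (bad) steps and replaced by the
    annotator's correction, which is fed back into the model's context.
    """
    history, labels = [], []

    # The Action Plan is generated and reviewed first.
    plan = model_next_step(prompt, history, allowed_tools, kind="action_plan")
    plan, plan_labels = annotator_review(plan)
    history.append(plan)
    labels.extend(plan_labels)

    # ReAct steps follow until a correct 'finish' action is reached.
    while True:
        step = model_next_step(prompt, history, allowed_tools, kind="react_step")
        step, step_labels = annotator_review(step)  # edited in place if incorrect
        labels.extend(step_labels)
        if step["action"] == "finish":
            return step["action_input"], labels
        step["observation"] = execute_tool(step["action"], step["action_input"])
        history.append(step)
```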

Benchmark Metadata

Complexity. We define complexity as the number of tools required to solve a query/prompt. We show the complexity distribution, i.e. the number of tool calls, of ToolComp prompts in Figure 6.

Figure 6. About 85% of prompts in ToolComp require at least three tools to solve, indicating a substantial amount of complexity and difficulty, and ~20% of prompts require seven or more tool calls. An agent evaluated on this benchmark therefore needs a long context window, sophisticated reasoning over that context, and advanced tool-calling capabilities in order to process long tool chains, formulate a high-level plan, and understand the output of each tool call before proceeding to the next step.

Answer Character Length. One simple way to quantify the structure of the final answers is to count their character lengths. We show the distribution of character lengths of ToolComp’s final answers in Figure 7. To keep evaluation unambiguous, we favor shorter answers such as integers and short strings that are easy to verify.

Figure 7. Because ToolComp’s answers need to be easily verifiable, most prompts are designed to have numbers or short strings as answers. However, some prompts still require long structured outputs such as dictionaries, tuples, and lists; these test the agent’s ability to follow complex queries that involve returning long outputs such as lists or dictionaries of city names, temperatures, altitudes, etc.

Answer Types. Figure 8 shows the distribution of the primitive data types in our final answers: number, text, and date. We care most about evaluating compositional tool use and reasoning rather than aesthetic output structuring and formatting, which is why the benchmark's labels are predominantly numeric while still containing a significant fraction of string outputs. In many cases, strings and names are intermediary outputs, but we most often ask for numerical final answers so they are easier to verify unambiguously.

Figure 8. The figure above shows the distribution of various data types in our benchmark.

Evaluation Metric

We have two metrics to evaluate the correctness of a model’s final answers: LLM Grading and Exact Match. For our main leaderboard, we use LLM Grading to ensure that we only check whether the answer is correct without penalizing formatting issues. Our Exact Match evaluation methodology is shown below in Appendix D: Exact Match.

LLM Grading

By using LLM grading against ground-truth answers, we opt to be lenient about exact formatting and focus on assessing the model's tool-use capabilities. We intentionally choose not to focus on final-answer formatting given that (1) there are existing benchmarks that assess formatting ability (e.g. FOFO) and (2) our final answers are quite complex, containing multiple elements, lists which may or may not be sorted, and dictionaries. The LLM judge is shown the prompt, the ground-truth answer, and the model's answer, and is asked to classify the answer as Incorrect, Correct with Bad Formatting, or Correct. We use GPT-4 Turbo as the judge for all of our models; the prompt used is shown in Appendix B: LLM Grading Prompt. We consider both Correct and Correct with Bad Formatting as a win (accurate) and Incorrect as a loss (inaccurate).
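A minimal sketch of this grading step, assuming an OpenAI-style client and the judge prompt from Appendix B; parsing the judge's "Final Grade:" line is our assumption about how the output is consumed:

```python
from openai import OpenAI

client = OpenAI()

def llm_grade(question, correct_answer, student_answer, judge_prompt):
    """Return True if the judge grades the answer CORRECT or CORRECT BUT BAD FORMATTING."""
    user_msg = (
        f"Question: {question}\n"
        f"Student Answer: {student_answer}\n"
        f"Correct Answer: {correct_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": judge_prompt},  # Appendix B prompt
            {"role": "user", "content": user_msg},
        ],
    )
    text = response.choices[0].message.content
    grade = text.split("Final Grade:")[-1].split("[ENDOFGRADE]")[0].strip()
    return grade != "INCORRECT"  # both CORRECT variants count as a win
```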

Evaluation Results

We evaluate models both on reaching the final answer and on predicting our process supervision labels. Since evaluating final answers tests actual tool-calling capabilities, we show these results in our main leaderboard.

We acknowledge that each LLM was trained with a specific format for generating the arguments to, and processing the outputs of, arbitrary functions/tools. Accordingly, we use native function calling for each model to give it a fair shot at performing its best on this benchmark.

For the avoidance of doubt, the evaluated model is what generates the high level action plan, performs native function calling, and derives the final answer. The grading model (GPT-4 Turbo in our case), is then used to evaluate the model’s output by comparing it against the ground truth answer.

As mentioned previously, we have 2 subsets to evaluate these models on:

  1. The first we call the “11-tool” or “ToolComp-Enterprise” subset, where the agent has access to all 11 tools shown in Table 2. We call this the “ToolComp-Enterprise” subset because the agents must choose from a larger set of more specialized, real tools, as would be common in enterprise LLM applications.
  2. The second we call the “2 tool” or “ToolComp-Chat” subset, where the agents only have access to two tools: Google Search and Python Interpreter. We call this the ToolComp-Chat subset because leading chatbots are usually natively endowed with these two tools. In this setting, the LLM is tested on formulating search queries to find and retrieve relevant information as well as using that information in the Python Interpreter to do calculations, use symbolic solvers, process & manipulate data, write code, etc.

Final Answer Evaluation

The overall rankings shown in the leaderboard table and Figure 9 combine these settings and show the accuracy and 95% CI for each model in our leaderboard. We also show exact match evaluation numbers in Appendix D: Table 6 to ensure that our LLM Judge (GPT-4 Turbo) isn’t biased in favor of outputs from the same model family.

Figure 9. The accuracy of various models with their native function calling format and LLM Grading when using the respective tool setting for each prompt. This is the combined result of both subsets (both the 2 tool and 11 tool settings).

Figures 10 and 11 illustrate the accuracy breakdown for each subset. Notably, GPT-4o (Aug 2024) and Claude 3.5 Sonnet (Jun 2024) lead in scenarios where only Python and Google Search are permitted, showing their proficiency with these tools. Conversely, o1-preview, GPT-4o (May 2024), and GPT-4 Turbo (Apr 2024) demonstrate greater versatility with new and unfamiliar tools, as they top the leaderboard on ToolComp-Enterprise. Note that these differences are still well within the confidence intervals.

Figure 10. The accuracy of various models with their native function calling format and LLM Grading in the 2-tool subset, ToolComp-Chat.

Figure 11. The accuracy of various models with their native function calling format and LLM Grading on the 11-tool subset, ToolComp-Enterprise.

Process Supervision Evals

We further evaluate these models using our process supervision labels, aiming to assess each model's effectiveness as a pairwise judge in selecting the human-corrected step over the step generated by the original policy used during annotation. To mitigate position bias, we swap the order of the human-corrected and model-generated steps and conduct two separate predictions for each arrangement. Additionally, models are permitted to indicate a tie. If a model designates a tie at least once, or consistently predicts the same position (before and after swapping) for a given data sample, we classify the outcome as a tie. Mirroring the methodology used in RewardBench, we score losses as 0, ties as 0.5, and wins as 1. We show the results below in Table 2.
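A sketch of this scoring scheme is shown below; the `judge` callable, which returns 'A', 'B', or 'tie' for an ordered pair of candidate steps, is a hypothetical stand-in for the model under evaluation:

```python
def score_pairwise_judgment(judge, context, human_step, model_step):
    """Score one process-supervision comparison with position-swap debiasing.

    Returns 1.0 for a win (human-corrected step preferred in both orderings),
    0.5 for a tie (any explicit tie, or the same position picked both times),
    and 0.0 for a loss (the original model step preferred in both orderings).
    """
    first = judge(context, option_a=human_step, option_b=model_step)
    second = judge(context, option_a=model_step, option_b=human_step)  # positions swapped
    if "tie" in (first, second):
        return 0.5
    if first == "A" and second == "B":
        return 1.0  # human-corrected step wins regardless of position
    if first == "B" and second == "A":
        return 0.0  # original model step wins regardless of position
    return 0.5      # same position chosen both times -> position bias -> tie
```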

Table 2. Accuracy and the 95% CIs of all of our models on the process supervision labels in ToolComp. The third and fourth columns show the Total Accuracy and the 95% CI respectively. The fifth and sixth columns show the breakdown between Accuracy at predicting the human corrected ReAct Steps and the Accuracy at predicting the human corrected action plan respectively.

Error Analysis

In order to better understand the reasons behind each model’s failures, we devised an error taxonomy and use GPT-4 Turbo to categorize the reason behind each failure. We inspected the individual failure cases predicted by GPT-4 Turbo and found its categorizations to be reasonably accurate. The different categories and their definitions are shown below in Table 3.

Table 3. The error categories in our taxonomy and their definitions.

Figure 12 shows a heatmap with the counts for each model and each category in our taxonomy. The main takeaway is that Gemini 1.5 Pro (May 2024) and Command R Plus frequently terminate unexpectedly; upon inspection, both models tend to terminate early in many scenarios rather than continuing to call tools to progress towards a final answer. Additionally, all models often provide outputs with missing information, indicating a widespread issue with precise instruction following, especially for prompts requiring detailed information and intermediary outputs.

Figure 12. The breakdown of the various error categories in our taxonomy for each model (on ToolComp-Enterprise). In particular, Command R Plus tends to call tools incorrectly and to terminate too early before reaching a final answer, which leaves it without a final answer for a majority of the examples.

Appendix

A. Seed Prompt

Please write a prompt that requires the use of all of the required operations and tools. Please be as creative as possible. The prompt should be hard enough so that only someone with a {grade} can solve it. Please do not specify the tools and operations to be used in the prompt. We want the assistant to figure out on it's own what tools to call so it should not be specified in the prompt itself. No phrases like "Use the ... tool" should be in the written prompt. The prompt should be a couple sentences. Make sure the prompt has only one possible answer that is concrete and easily verifiable. Make sure the prompt is very clear and isn't ambiguous in any regard. Make sure the answer is not in the prompt. Place [STOP] at the end of the prompt. Ensure the prompt requires implicit python execution and access to information on the internet (factual lookup) through Google Search in order to be answered. All prompts should have only one possible final answer, so avoid open ended statements like "Generate a report", "Develop a model", "Explain the implications" etc. The answer should be verifiable and concrete. Be specific about the all asks including, what exactly you want outputted, what format you want the answer in (such as lists, numbers, strings and any potential sorting you may require). Also be specific on what units you want it returned in.<|eot_id|><|start_header_id|>user<|end_header_id|><|eot_id|><|start_header_id|>user<|end_header_id|>

Before generating the prompt, come up with a high level process for generating a prompt that requires the use of dependent tool calls. The process should be written in the perspective of a person who is trying to generate a prompt that requires the use of dependent tool calls, that only has one possible answer with a well defined final answer format so that it's easy to verify through an exact string match. There should be no ambiguity in terms of what the prompt is asking for, including units, formatting, and what data to get.

Examples: {examples}

[BEGIN ALLOWED OPERATIONS] {operations} [END ALLOWED OPERATIONS]

[BEGIN REQUIRED TOOLS] {tools} [END REQUIRED TOOLS]<|eot_id|><|start_header_id|>assistant <|end_header_id|>

Be sure to make the prompt very creative, unique and different than any of the examples provided

Process: """



B. LLM Grading Prompt

You are an expert test grader. You have been given a student answer (‘Student Answer:’) to grade. You have also been the correct answer (‘Correct Answer:’) and the original question (‘Question:’). Each correct answer is a list of strings.

We want to make sure the values in the student answer are returned in the order that they are asked in ‘Question: ’.

For example,

if ‘Question:’ asks for a city name, its temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius, and ‘Correct Answer:’ is ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers']] we would want ‘Student Answer:’ can be ['San Francisco', 78, ['Los Angeles Lakers', 'Golden State Warriors']].

Examples: Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: ['San Francisco', 74, ['Los Angeles Lakers', 'Golden State Warriors']] Reasoning: The Student Answer is correct because it identifies the same city , the temperature is within 10% of the Correct Answer and the same team names are present in the list. Final Grade: CORRECT

Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: The city name is San Francisco, its temperature is 80 degrees and the Los Angeles Lakers and the Golden State Warriors are two NBA teams whose home stadium is within a 400 mile radius Reasoning: Although the Student Answer is correct (identifies the same city , the temperature is within 10% of the Correct Answer and the same team names are present), it’s not formatted the same and contains extra text and natural language. Final Grade: CORRECT BUT BAD FORMATTING

Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: ['San Francisco', -15, ['Los Angeles Lakers', 'Golden State Warriors']] Reasoning: The Student Answer is incorrect because although identifies the same city and the same team names are present in the list, the temperature is well outside of 10% of the Correct Answer. Final Grade: INCORRECT

If ‘Question:’ asks for a special sorting of the list, make sure ‘Student Answer:’ is sorted the same as ‘Correct Answer:’. So if ‘Question:’ instead asked to list the names of all the NBA teams whose home stadium is within a 400 mile radius in alphabetical order, we would want ‘Student Answer:’ to contain ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers']].

Examples:

Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius in alphabetical order Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: ['San Francisco', 75, ['Golden State Warriors’, ‘Los Angeles Lakers’]] Reasoning: The Student Answer is correct because it identifies the same city , the temperature is within 10% of the Correct Answer and the same team names are present in the list and sorted the same given the Question asks for a special sorting (alphabetical order). Final Grade: CORRECT

Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius in alphabetical order Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: The city name is San Francisco, its temperature is 80 degrees and the Golden State Warriors and the Los Angeles Lakers are two NBA teams whose home stadium is within a 400 mile radius (in alphabetical order) Reasoning: Although the Student Answer is correct (identifies the same city , the temperature is within 10% of the Correct Answer and the same team names are present in the same order), it’s not formatted the same and contains extra text and natural language. Final Grade: CORRECT BUT BAD FORMATTING

Question: Find the name of the city known for its famous tourist attraction Alcatraz, also give it’s current temperature and a list of names of all the NBA teams whose home stadium is within a 400 mile radius in alphabetical order Correct Answer: ['San Francisco', 78, ['Golden State Warriors’, 'Los Angeles Lakers'] Student Answer: ['San Francisco', 79, ['Los Angeles Lakers', 'Golden State Warriors']] Reasoning: The Student Answer is incorrect because although it identifies the same city and the temperature is within of 10% of the Correct Answer, the list of teams name is in a differently ordered even though ‘Question:’ explicitly asks for a special sorting. Final Grade: INCORRECT

Ideally, the student answer, should only output the final answer with no additional text or natural language.

The possible grades are:

INCORRECT: ’Student Answer:’ is different from ‘Correct Answer:’

  • numbers are completely different
  • lists are completely different
  • Question:’ asks for special sorting of a list but the list in ’Student Answer:’ is sorted differently than ‘Correct Answer:’
  • strings are completely different or information present in the string is completely different

CORRECT BUT BAD FORMATTING: ’Student Answer:’ has the same info as ‘Correct Answer:’ but is formatted differently..

  • ’Student Answer:’ includes natural language or additional text
  • numbers are formatted differently but they are close to one another (’Student Answer:’ is within 10% of Correct Answer:’, ’Student Answer:’)
  • lists are wrapped differently than the correct answer but contains the same information and sorted the same way as ‘Correct Answer:’ if asked “Question:’ asks for a special sorting
  • Strings are the same but may be formatted differently

CORRECT: The student answer has the same info as ‘Correct Answer:’ and is also formatted the same as ‘Correct Answer:’

  • numbers are close to one another (’Student Answer:’ is within 10% of the correct answer)
  • if ‘Question:’ asks for a special sorting of the list the ’Student Answer:’ list is sort the same as ‘Correct Answer:’
  • lists are wrapped the same
  • Strings are identical

Remember you are assuming the correct answer provided is correct, your job is is only to compare the correct answer to the student answer and grade it based on the above criteria. Do not try to determine the correct answer yourself. Make sure to include a reasoning and final grade in the format:

Reasoning: <reasoning> Final Grade: <INCORRECT/CORRECT BUT BAD FORMATTING/CORRECT>[ENDOFGRADE]

Now do this for the following user provided question, student answer and correct answer.



C. IC Example

in_context_example = """I wanna know if eating meat is correlated with heart issues, find the annual per capita consumption of meat in (kg/person) and also the per capita heart attack rates (in heart attacks/person) for every country. Then run a linear regression with y as heart attack rates and x as meat consumption, return the Pearson’s correlation as well as the slope of the fit line."""

process = """I will first start by creating a prompt that requires the use of google search. I want to make this prompt about investigating whether the amount of meat you consume is correlated to heart disease. In order to make sure there is only one possible answer, I will ask to find the per capita consumption of meat (in kg/person) and heart attacks rates (heart attacks per person) in all countries. This standardizes the actual data that needs to be pulled and specifies the units to ensure there is only one possible answer. I will then ask for a linear regression using that data since it requires a python interpreter. Since linear regression is deterministic when the data is fixed and the data required to fit the linear regression is well defined, I can ask to output it's parameters and ensure there is only one possible answer that can be returned. This ensures that the good prompt is clear, unambiguous and has an answer that is easy to verify through an exact string match while also requiring a chain of dependent tool calls (google search call, then python interpreter call) to solve."""

A single IC example and its process/reasoning. Both are used in the synthetic prompt generation process.

D. Exact Match

This paradigm aims to assess both the tool-use capabilities and the instruction/format-following capabilities of the model. Formatting is particularly important when we want to use the LLM to automate a backend process. This paradigm programmatically evaluates unsorted lists (e.g. the prompt asks for a list of all states in the US), sorted lists (e.g. the prompt asks for a list of all states in the US in alphabetical order), numbers (e.g. the prompt asks for the area of Texas in square miles), and strings (e.g. the prompt asks for the name of the football team that won the Super Bowl in 2016). A sketch of these checks follows the list below.

  1. Unsorted lists are sorted and exact matched (a set match gets rid of duplicates)
  2. Sorted lists are exact matched
  3. Numbers are checked against a tolerance parameter (the tolerance accounts for variance among different sources online)
  4. Strings are stripped, lowercased, and exact matched
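A minimal sketch of these four checks, where the numeric tolerance value and the exact normalization details are assumptions:

```python
def exact_match(pred, gold, sorted_list=False, rel_tol=0.05):
    """Programmatic final-answer check mirroring the four rules above."""
    if isinstance(gold, list):
        if sorted_list:  # rule 2: order must match element by element
            return len(pred) == len(gold) and all(
                exact_match(p, g, rel_tol=rel_tol) for p, g in zip(pred, gold))
        # rule 1: unsorted lists compared as sets (ignores order and duplicates)
        return {str(x).strip().lower() for x in pred} == {str(x).strip().lower() for x in gold}
    if isinstance(gold, (int, float)):
        # rule 3: numbers must fall within a relative tolerance
        try:
            return abs(float(pred) - float(gold)) <= rel_tol * abs(float(gold))
        except (TypeError, ValueError):
            return False
    # rule 4: strings are stripped, lowercased, and exact matched
    return str(pred).strip().lower() == str(gold).strip().lower()
```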

Table 6 below shows the exact match accuracy of our different models. The overall ranking stays the same, with some minor differences due to a discrepancy between tool-use capabilities and instruction-following/formatting capabilities. In particular, we notice that Command R Plus consistently gives the final answer in natural language rather than in the format prescribed in the prompt. We also notice that Gemini 1.5 Pro (August 27, 2024) ranks better with exact match because it is much better at outputting the format prescribed in the prompt. o1-preview, meanwhile, tends to wrap the final answer differently than intended and adds intermediate outputs that weren’t asked for in the prompt.
| Model | Score | Chat Score | Enterprise Score |
| --- | --- | --- | --- |
|  | 61.83 ± 4.34 | 55.1 ± 6.96 | 66.43 ± 5.47 |
|  | 58.68 ± 4.39 | 56.85 ± 6.92 | 59.93 ± 5.67 |
|  | 58.44 ± 4.38 | 49.5 ± 6.96 | 64.58 ± 5.52 |
|  | 58.03 ± 4.39 | 56.06 ± 6.91 | 59.38 ± 5.67 |
|  | 57.61 ± 4.39 | 53.03 ± 6.95 | 60.76 ± 5.64 |
|  | 56.61 ± 4.41 | 51.27 ± 6.98 | 60.28 ± 5.66 |
|  | 51.03 ± 4.44 | 48.49 ± 6.96 | 52.78 ± 5.77 |
|  | 48.56 ± 4.44 | 40.4 ± 6.84 | 54.17 ± 5.78 |
|  | 46.30 ± 4.43 | 40.4 ± 6.84 | 50.35 ± 5.78 |
|  | 46.19 ± 4.44 | 40.1 ± 6.84 | 50.35 ± 5.78 |
|  | 45.89 ± 4.43 | 37.88 ± 6.78 | 51.39 ± 5.77 |
|  | 44.03 ± 4.41 | 32.83 ± 6.54 | 51.74 ± 5.77 |
|  | 38.43 ± 4.34 | 35.5 ± 6.568 | 40.42 ± 5.68 |
|  | 35.74 ± 4.27 | 33.5 ± 6.59 | 37.23 ± 5.6 |
|  | 26.13 ± 3.91 | 20.2 ± 5.59 | 30.21 ± 5.3 |
|  | 12.81 ± 2.98 | 6.09 ± 3.34 | 17.42 ± 4.39 |

Table 6. Exact match accuracy and 95% CIs for each model, overall (Score) and on the ToolComp-Chat (Chat Score) and ToolComp-Enterprise (Enterprise Score) subsets.