Evaluating Performance of LLM Agents

LLM Agents

Large language model (LLM) agents are a prompting strategy for LLMs in which the LLM controls the execution flow and can invoke tools to accomplish its objective. Just as chained prompts extend the capabilities of single prompts (e.g., by adding fact-checking to a summarization prompt), agents extend the capabilities of fixed prompt sequences. To do so, the agent prompt must include instructions telling the LLM how to format its response such that the response can be interpreted as an action.

Take, for example, the ZeroShotAgent from LangChain. It prompts the LLM with the following template:

"""
Answer the following questions as best you can. You have access to the following tools:
{tool_name}: {tool_description}

Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Question: {input}
Thought:{agent_scratchpad}
"""

In this scheme, all tools must accept a single-string input and return a single-string output. The LLM output is interpreted as an action, and the LLM is called repeatedly until it responds with a final answer, which is returned to the user.

Suppose a user asks the agent “What is the next leap year raised to the 0.43 power?” and that the agent has access to two tools, “Search the Web” and “Run Python Script.” The final prompt would look like the following, where the LLM was called three times (twice to choose a tool and once to give the final answer), each time with the full prompt so far.

"""
Answer the following questions as best you can. You have access to the following tools:

search_the_web: search_the_web(query: str) -> str - Search online for websites matching your query
run_python_script: run_python_script(script: str) -> str - Run a python script and return stdout. If you want to see the output of a value, you should print it out with 'print(...)'. 

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [search_the_web, run_python_script]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: What is the next leap year raised to the 0.43 power?
Thought: I can search the web for a list of leap years and find the most recent one. Then I can raise it to the power of 0.43.
Action: search_the_web
Action Input: "next leap year"
Observation: Tool output was: "The next one is February 29, 2024. Leap Year Rules: How to Calculate Leap Years In our modern-day Gregorian calendar, three criteria must be taken into account to identify leap years: According to these rules, the years 2000 and 2400 are leap years, while 1800, 1900, 2100, 2200, 2300, and 2500 are not leap years. Why Do We Have Leap Years?"
Thought: The most recent leap year is 2024. Now I can raise it to the power of 0.43.
Action: run_python_script
Action Input: "print(2024 ** 0.43)"
Observation: Tool output was: 26.404048299525233
Thought: The most recent leap year raised to the power of 0.43 is approximately 26.404048299525233
Final Answer: 26.404048299525233
"""

Notice how the agent chose to first use the search tool, then invoke the Python tool, and, finally, provide a final answer based on the tool outputs. Perfect! The power of agents is both in the tool use and in the conditional flow. Imagine we had asked the agent simply for the next leap year and not the year raised to the 0.43 power. The agent would not have needed to use the Python tool and could have responded with our answer immediately after using the web search tool.
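The loop described above can be sketched in a few lines. This is a minimal illustration, not LangChain's implementation: parse_step, run_agent, and make_scripted_llm are hypothetical names, the LLM is replaced by a stub that replays canned responses, and the tools return canned strings rather than searching the web or executing code.

```python
import re

# Hypothetical helper names; this is not LangChain's implementation.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", re.DOTALL)


def parse_step(text):
    """Interpret one LLM response as a final answer or a tool call."""
    if "Final Answer:" in text:
        return ("final", text.split("Final Answer:", 1)[1].strip(), None)
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError(f"Could not parse LLM output: {text!r}")
    tool_input = match.group(2).strip().strip('"')
    return ("action", match.group(1).strip(), tool_input)


def run_agent(llm, tools, prompt, max_steps=5):
    """Call the LLM until it gives a final answer, running tools in between."""
    scratchpad = ""
    for _ in range(max_steps):
        output = llm(prompt + scratchpad)
        kind, value, tool_input = parse_step(output)
        if kind == "final":
            return value
        observation = tools[value](tool_input)  # string in, string out
        scratchpad += f"{output}\nObservation: {observation}\nThought:"
    raise RuntimeError("Agent did not finish within max_steps")


def make_scripted_llm(responses):
    """Stand-in for a real LLM: replays canned responses, records prompts."""
    prompts_seen = []
    remaining = list(responses)

    def llm(prompt):
        prompts_seen.append(prompt)
        return remaining.pop(0)

    return llm, prompts_seen


llm, prompts_seen = make_scripted_llm([
    ' I should search for the next leap year.\n'
    'Action: search_the_web\nAction Input: "next leap year"',
    ' The next leap year is 2024.\n'
    'Action: run_python_script\nAction Input: "print(2024 ** 0.43)"',
    " I now know the final answer\nFinal Answer: 26.404048299525233",
])
tools = {
    # Canned outputs taken from the example above; nothing is executed.
    "search_the_web": lambda query: "The next one is February 29, 2024.",
    "run_python_script": lambda script: "26.404048299525233",
}
answer = run_agent(
    llm,
    tools,
    "Question: What is the next leap year raised to the 0.43 power?\nThought:",
)
print(answer, len(prompts_seen))
```

The stub is called three times: two tool calls and one final answer. With a real model you would typically also stop generation at "Observation:" so the model cannot invent tool outputs.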

The Design Space

The example above is simple, and this particular prompting strategy is better for demonstration than production. For one thing, the ZeroShotAgent was made to work with a traditional LLM but doesn’t work well with chat-style LLMs, where there is a back-and-forth interaction between the user and the LLM. Also, if you want to provide a chat interface for interacting with the agent, you need to decide how to handle the chat history: both where to put it in the prompt and how to disambiguate the agent’s conversation with the user from its interactions with tools. Finally, if you want to move beyond a strict string-to-string interface for tools, you have to decide how to handle structured action inputs.

If you want to use powerful chat-based models such as ChatGPT, GPT-4, or Anthropic’s Claude, you need chat-aware prompts. Strategies that are helpful with traditional LLMs, such as providing the correct opening tokens (“Thought:” in the example above, or an open curly brace for use with JSON action formats), don’t mix well with the chat-style models. Additionally, these models tend to be conversational and add dialogue to their responses (e.g., “Sure, I can help you with that…”). Therefore, a robust agent implementation needs to be able to parse the LLM output even in the presence of this small talk. As powerful as those models are, one drawback is the loss of token-level control of responses. With traditional LLMs, tools such as JsonFormer and Microsoft’s Guidance package can be used to guarantee your output is properly formatted, notably simplifying the necessary parsing and error handling.
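One way to tolerate small talk around a JSON action is to scan the response for the first substring that decodes as a JSON object. The sketch below uses only Python's standard json module. The function name extract_first_json_object and the "action" and "action_input" keys in the sample are illustrative assumptions, not a specific library's parser.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, ignoring other prose."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses a JSON value starting at `index` and ignores
            # whatever text follows it.
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; keep scanning
        if isinstance(value, dict):
            return value
    raise ValueError("No JSON object found in LLM output")


chatty_output = (
    "Sure, I can help you with that! {not json}\n"
    '{"action": "search_the_web", "action_input": "next leap year"}\n'
    "Let me know if you need anything else."
)
action = extract_first_json_object(chatty_output)
print(action["action"], "|", action["action_input"])
```

A parser like this accepts more than the prompt asks for, which is a deliberate trade-off: it reduces failures caused by formatting noise, at the cost of hiding cases where the model ignores the format instructions.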

Another key design decision lies in the expected LLM output format. You must provide instructions and/or examples in the prompt such that the LLM reliably outputs text in the expected format, then you must robustly parse that output so you can properly interpret the LLM’s decision. In addition to the simple format used by ZeroShotAgent above, one common choice is to use JSON. Although less common, alternatives such as YAML and TOML could work, too. LangChain’s ChatAgent and ConversationalChatAgent both use markdown code blocks containing JSON as their action format, instructing the model to use the following format.

"""
```json
{{
    "action": string, \\ The action to take. Must be one of {tool_names}
    "action_input": string \\ The input to the action
}}
```
"""

Evaluating Performance

With all these considerations that go into the agent implementation and prompt engineering, and the ability to mix agents and LLMs, how can you meaningfully compare your options? How can you be confident a prompt change improves performance overall? To address this challenge, Scale engineers implemented a lightweight framework for running agents through unit-test-like scenarios designed to test the agents’ performance, especially their decision-making ability and competence with structured outputs. These scenarios are flexibly defined, using assertions to verify the proper agent output and intermediate steps. And, importantly, the framework allows us to easily dig into failure modes to understand how agents fail, and when tests need to be updated to accommodate unexpected, but acceptable, agent behavior.

Our framework builds on the LangChain abstractions of LLM, Agent, and AgentExecutor. Additionally, we define a scenario to be an asynchronous function that exercises the agent and LLM (and optionally the AgentExecutor) and raises an AssertionError if the behavior is unacceptable. We use CallbackHandlers to provide verbose output when desired and make assertions about the agent’s intermediate steps depending on the scenario.
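To make the idea of a scenario concrete, here is a minimal sketch of an asynchronous scenario and a runner that repeats it to estimate a success rate. It does not reproduce our framework or use LangChain: the agent is modeled as a plain async function from question to answer, and toy_addition, success_rate, and the two stub agents are hypothetical names made up for illustration.

```python
import asyncio


async def toy_addition(agent):
    """A scenario: exercise the agent and assert on its final answer."""
    answer = await agent("What is 2 + 2?")
    assert "4" in answer, f"unexpected answer: {answer!r}"


async def success_rate(scenario, agent, runs=20):
    """Run a scenario several times and return the fraction that passed."""

    async def attempt():
        try:
            await scenario(agent)
            return True
        except AssertionError:
            return False

    results = await asyncio.gather(*(attempt() for _ in range(runs)))
    return sum(results) / runs


async def reliable_agent(question):
    # Stub agent that always answers correctly.
    return "The answer is 4."


def make_flaky_agent():
    # Stub agent that answers correctly on every second call.
    calls = 0

    async def agent(question):
        nonlocal calls
        calls += 1
        return "4" if calls % 2 == 0 else "I am not sure."

    return agent


reliable_rate = asyncio.run(success_rate(toy_addition, reliable_agent))
flaky_rate = asyncio.run(success_rate(toy_addition, make_flaky_agent()))
print(reliable_rate, flaky_rate)
```

Only AssertionError counts as a test failure here. Other exceptions propagate, which keeps bugs in the scenario itself from being mistaken for agent failures.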

Example Results

To demonstrate, we tested ConversationalChatAgent powered by four different LLMs (Claude-1.3, GPT-3, GPT-3.5, and GPT-4) using a selection of Q&A-based scenarios with known answers. We chose ten scenarios of moderate complexity and ran each twenty times. The success rates are shown in the table below. A brief description of each test can be found in the appendix.

	ConversationalChatAgent
	Claude	GPT-3	GPT-3.5	GPT-4
american_history	100%	100%	100%	100%
binomial	0%	0%	0%	30%
endothermic	100%	0%	0%	100%
klarna	0%	0%	0%	90%
movies	0%	0%	100%	35%
simple_math	100%	0%	100%	100%
three_n_plus_one	0%	0%	0%	85%
time_in_canberra	70%	45%	100%	95%
velocity	0%	100%	45%	100%
world_cup	100%	80%	100%	80%
Total	47%	33%	55%	82%

It is not surprising that on average GPT-4 outperforms GPT-3.5, but we might have expected Claude to score better (based on the LMSYS chatbot arena leaderboard, for example). However, some investigation shows that ConversationalChatAgent was likely developed for optimal performance with GPT-3.5. For example, the opening line of ConversationalChatAgent’s prompt says “Assistant is a large language model trained by OpenAI.”

To better understand when and why these models fail, let’s take a closer look at the three_n_plus_one scenario. In this scenario, we ask the agent "What is the fourth number in the 3n+1 sequence whose first number is 43?" We expect the model to infer that we mean the sequence associated with the Collatz conjecture and we provide it with a Python REPL in which to run a script that prints the answer. That result is provided to the agent as the tool output and the agent must make a subsequent LLM execution in order to format its response to the user. GPT-4 was able to solve this problem, responding, in one example, with the following generated text:

"""
I will calculate the fourth number in the 3n+1 sequence whose first number is 43.

```python
def collatz_sequence(n):
    sequence = [n]
    while n != 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
        sequence.append(n)
    return sequence

sequence = collatz_sequence(43)
fourth_number = sequence[3]
fourth_number
```

Now, I will run the Python script to find the fourth number in the sequence.

```json
{
    "action": "run_python_script",
    "action_input": "def collatz_sequence(n):\n    sequence = [n]\n    while n != 1:\n        if n % 2 == 0:\n            n = n // 2\n        else:\n            n = 3 * n + 1\n        sequence.append(n)\n    return sequence\n\nsequence = collatz_sequence(43)\nfourth_number = sequence[3]\nprint(fourth_number)"
}
```
"""

It generated a correct Python code block outside of the JSON response, then repeated that code within the JSON, with properly escaped newline characters.
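As a quick check of the expected answer, the first few terms of the sequence can be computed directly. The helper name collatz_prefix is ours, chosen for illustration.

```python
def collatz_prefix(start, length):
    """Return the first `length` terms of the 3n+1 sequence from `start`."""
    sequence = [start]
    while len(sequence) < length:
        n = sequence[-1]
        sequence.append(n // 2 if n % 2 == 0 else 3 * n + 1)
    return sequence


prefix = collatz_prefix(43, 5)
print(prefix)     # [43, 130, 65, 196, 98]
print(prefix[3])  # fourth number: 196
```

Because the starting value counts as the first number, reaching the fourth number takes three applications of the rule. Applying it four times lands on the fifth number, 98.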

GPT-3.5 responded similarly but with an off-by-one error causing it to output the fifth number in the sequence, 98.

"""
{
    "action": "run_python_script",
    "action_input": "def sequence(n):\n    if n % 2 == 0:\n        return n // 2\n    else:\n        return 3 * n + 1\n\nnum = 43\nfor i in range(1, 5):\n    num = sequence(num)\nprint(num)"
}
"""

Claude also failed this test on every attempt, but a look at the output shows it was close to getting it right. Claude responded with only the JSON code block, as instructed, and followed the desired format. It even provided a Python script that would have produced the correct result! But it failed to escape the newline characters, leaving the JSON improperly formatted.

"""
```json
{
    "action": "run_python_script",
    "action_input": "n = 43
sequence = [n]
while n != 1:
    if n % 2 == 0:
        n = n / 2
    else:
        n = 3*n + 1
    sequence.append(n)
print(sequence[3])"
}
```
"""

This challenge originates in the choice to use JSON as the action format. In contrast, YAML and TOML both have native support for multiline strings. On the other hand, it would be easy to assist Claude by escaping the newline characters within the string on its behalf. So, you might say this was a failing of the ConversationalChatAgent as much as it was a failing of Claude.
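One lightweight way to help on the parsing side, shown here as an alternative to rewriting the string, is to fall back to a lenient parse. Python's standard json module accepts raw control characters such as newlines inside strings when strict=False is passed. The function name parse_lenient and the shortened sample payload are illustrative assumptions.

```python
import json


def parse_lenient(text):
    """Parse JSON, tolerating unescaped newlines inside string values."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # strict=False allows control characters (including "\n") in strings.
        return json.loads(text, strict=False)


# The action_input value below contains a real newline, not the two
# characters backslash and n, so strict parsing rejects it.
raw = '{\n    "action": "run_python_script",\n    "action_input": "n = 43\nprint(n)"\n}'

action = parse_lenient(raw)
print(repr(action["action_input"]))
```

The fallback only runs when strict parsing fails, so well-formed responses are handled exactly as before.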

We see a similar case of shared blame in the klarna scenario, taken from the LangChain documentation for plugins. In this scenario, the agent must invoke the “KlarnaProducts” tool, which takes no inputs, to see the OpenAPI spec for interacting with Klarna’s API. The ConversationalChatAgent prompt is not clear on the expected format for tools with no input. While GPT-3.5 and GPT-4 both choose to provide a meaningless input, which is ignored, Claude makes the seemingly reasonable choice to omit the “action_input” key altogether, causing an error in the action parsing step.
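A parser can accommodate this choice by treating a missing “action_input” as an empty string. The sketch below is a hypothetical normalization step, not the ConversationalChatAgent parser. The function name normalize_action is an assumption.

```python
import json


def normalize_action(action):
    """Fill in an empty input when the model omits "action_input"."""
    if "action" not in action:
        raise ValueError("LLM output is missing the 'action' key")
    return {
        "action": action["action"],
        # Tools that take no input still receive a string.
        "action_input": action.get("action_input", ""),
    }


parsed = json.loads('{"action": "KlarnaProducts"}')
print(normalize_action(parsed))
```

Whether to accept the omission silently or to reject it and ask the model to retry is a design choice. Accepting it favors robustness across models.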

And lastly, in the binomial scenario, in which no model passed more than 30% of its runs, the agent is given the following instructions: "An unbiased coin is tossed 10 times. Find the probability that there are fewer than 3 heads tossed." Again, Claude fails to format the Python script properly within the JSON code block. GPT-3.5 and GPT-4 generate valid scripts within valid JSON, but in their failing runs they compute the probability of three or fewer heads rather than strictly fewer than three.
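The gap between the two readings is easy to see numerically. The following sketch computes both probabilities with the standard library. The helper name prob_at_most is ours.

```python
from math import comb


def prob_at_most(k_max, n, p=0.5):
    """P(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1)
    )


fewer_than_three = prob_at_most(2, 10)  # 0, 1, or 2 heads: 56/1024
three_or_fewer = prob_at_most(3, 10)    # 0, 1, 2, or 3 heads: 176/1024
print(fewer_than_three, three_or_fewer)
```

The expected answer is 56/1024, about 0.055, while the mistaken reading gives 176/1024, about 0.172, so a scenario assertion with even a loose tolerance separates the two.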

Building Better Agents

When developing our LLM assistant, Scale engineers incorporated lessons learned from tests like those above in order to write more challenging scenarios and to develop agents that could successfully complete them. Additionally, we used this evaluation framework to encourage strong agent performance across multiple LLMs. The result is a specialized internal agent, used in Scale’s Donovan product, that can perform complex tool-based tasks with high rates of success.

To demonstrate, we evaluated our Donovan Agent using tasks such as Plotly chart generation, SQL querying, map generation, and answer generation with proper citations (see Appendix: Donovan Tests). We ran each test twenty times, and the success rates are shown in the table below. Some of the tests, such as top_5_descending, are intentionally difficult due to the multiple steps required. In that particular scenario, the agent must infer the user’s intent from the provided conversation history, look up the correct SQL schema, generate the necessary SQL query, and then create a Plotly visualization of the results.

	Donovan Agent
	Claude	GPT-3	GPT-3.5	GPT-4
hello_plugins	100%	0%	20%	100%
kasserine_map	100%	100%	100%	100%
los_chapitos	95%	5%	100%	100%
plotly	100%	100%	100%	100%
top_5_descending	0%	45%	80%	0%
total_gain	100%	10%	100%	100%
Total	82%	43%	83%	83%

We can see from the evaluations across these different tasks that custom and careful agent construction has leveled the playing field across Claude, GPT-3.5, and GPT-4.