
Toolchaining: The Problem No One is Talking About

At Scale, our Enterprise research team focuses on testing novel approaches against real-life scenarios, grounded in real customer data, so that we can advance capabilities that are actually applicable and useful in scaled, enterprise contexts.

We take findings from experiments like the one we share in today’s blog and codify them into our MLE tool suites so our engineers can immediately put the results into practice and meaningfully improve delivery for our enterprise clients. In the case of the following experiment, we developed a PlanByPythonExecutor tool that is already being used in production.

Intro

Imagine you’re an edtech company looking to train your model on a set of question-and-answer pairs from previous students across multiple domains like Math, English, and History. The issue? You’re parsing Q&A data from all kinds of legacy sources: websites, textbooks, and more, leaving you with unstructured HTML, XML, and free text. Your model can’t parse the Q&A pairs. It needs a cleaned version of this dataset.

Converting content and datasets from human-legible to LLM-legible form, and vice versa, is an open and interesting topic with a lot of potential for discovery. In fact, Andrej Karpathy recently discussed the challenges of the "LLMification" of human-readable education content on X.

A sample snippet of your messy, unparsed Q&A data set

So what do you do? You’re most likely shouting: use an LLM to clean up the files! There are two ways to do this:

  1. Ask an LLM to read each question-answer pair and rewrite them in a clean way

  2. Ask an LLM to write a script to parse all the question-answer pairs at once using a standard library like BeautifulSoup (a rough sketch of such a script follows this list)
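
To make option 2 concrete, here is a rough sketch of the kind of script an LLM might produce. The HTML layout (a div.question followed by a div.answer) and the file names are purely hypothetical; real legacy files rarely share a single schema, which is exactly what makes this approach tricky.

import csv
from bs4 import BeautifulSoup

def parse_qa_pairs(html_text: str) -> list[dict]:
    """Extract question/answer pairs from one hypothetical HTML layout."""
    soup = BeautifulSoup(html_text, "html.parser")
    pairs = []
    for block in soup.select("div.question"):
        answer = block.find_next_sibling("div", class_="answer")
        if answer is None:
            continue  # edge case: a question with no paired answer
        pairs.append({
            "question": block.get_text(" ", strip=True),
            "answer": answer.get_text(" ", strip=True),
        })
    return pairs

if __name__ == "__main__":
    # Illustrative file names only.
    with open("legacy_dump.html", encoding="utf-8") as f:
        rows = parse_qa_pairs(f.read())
    with open("qa_pairs.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "answer"])
        writer.writeheader()
        writer.writerows(rows)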

With option 1, we found that LLMs are prone to getting distracted by large amounts of context, missing key details, failing to preserve tables, misinterpreting tags, and other issues.

In this blog, we mostly explore option 2, which introduces its own set of challenges: learning how to handle edge cases, writing the proper formatting code, letting the user equip the LLM with helpful utilities, and more. We’ll go into detail about a method we developed for using LLMs to call multiple tools to solve this problem.

The key insight is that we didn't need to rely on the LLM to actually clean the data; we just needed it to come up with a plan for how to clean the data. The plan can then be executed, and the feedback circulated back to the LLM in the form of formatted examples, failures, etc., so that the LLM can refine the plan.

We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data led to extremely poor results, if it could do it at all. Instead, when we gave the LLM access to a python sandbox pre-loaded with these tools and asked it to develop the aforementioned plan, the output was significantly improved. For the remainder of this blog, we dive into why this happened, how we set up our experiment, and what the findings mean for you. 

So what’s the problem with the standard approach to toolchaining?

You might be thinking, “Surely this data formatting task is quite simple for an LLM?” We actually found that it was anything but!

A reasonable first attempt at this task would be to provide the LLM with a set of tools (e.g., a collection of formatters, utilities to read the data, etc.) and let it figure out how it wants to use them. However, we found this approach breaks down for anything more interesting than a toy dataset. The reason is that, in this scenario, the LLM is responsible for orchestrating the tools. As a result, all tool responses are routed back to the LLM, which then has to provide the output of one tool call as input to the next.

While flexible, this method often becomes costly and error-prone. As outputs move from tool to tool, the LLM can unintentionally rewrite inputs, such as injecting random strings, altering formats, or losing critical information. In turn, this creates downstream errors, bloats the context window, and is something the LLM simply does not need to do.

Consider the following example: let's say the output of a tool call is a complex, nested, 20,000-character HTML string. Asking the LLM to reproduce it exactly as input for the formatter is fraught with danger: even a single missing character or tag would break the formatter downstream! Interestingly, in their new gpt-oss release, OpenAI tacitly acknowledges this challenge of tool-call chaining. To train this suite of models, they use the Harmony response format, which provides a routing mechanism that allows tool call outputs to be redirected to receivers other than the agent that called the tool, although they don’t show any examples of doing this yet. This future-proofs the message format for scenarios where one needs to compose tool calls (tool_call_A output -> tool_call_B input) or have a separate agent interpret the tool call output (tool_call_A output triggered by Agent X -> message for Agent Y).

 

To address this problem, we asked the LLM to formulate a plan and let an executor handle the toolchaining

Motivated by how we as humans approach large, daunting tasks, we instead asked the LLM to formulate a plan. A Plan Executor Tool then carries out this plan and provides only relevant, curated context as feedback. In this way, the LLM is still able to refine its plan based on how well it performed, but is no longer required to orchestrate the toolchaining, nor is it throttled by overwhelming context.
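
Here is a minimal sketch of that plan-execute-refine loop, with the LLM calls and the sandboxed executor passed in as plain callables. All of the names are illustrative rather than the real tool's API.

from typing import Callable

def refine_until_done(
    propose_plan: Callable[[str], str],        # LLM drafts the initial plan
    revise_plan: Callable[[str, dict], str],   # LLM patches the plan from feedback
    execute_plan: Callable[[str], dict],       # sandbox runs it, returns samples and errors
    task_description: str,
    max_rounds: int = 3,
) -> str:
    plan = propose_plan(task_description)
    for _ in range(max_rounds):
        result = execute_plan(plan)
        if not result.get("errors"):
            break
        # Only curated feedback flows back to the model: a few formatted
        # examples plus truncated failures, never every raw tool output.
        feedback = {
            "sample_outputs": result.get("samples", [])[:5],
            "failures": result.get("errors", [])[:10],
        }
        plan = revise_plan(plan, feedback)
    return plan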

We found that LLMs are exceptionally strong at generating these plans and can reason through tool-based workflows, but their performance depends heavily on how tools are exposed and how plans are structured. We tested three distinct approaches to generating these plans: Plan as JSON with tool injection, Plan as Python with tool injection, and Plan as Python without tool injection (aka a stateful code interpreter). We tested these against a baseline of an MLE writing custom parsing functions to clean the provided data.

By enabling the LLM to generate a Python-based plan and leverage pre-built tools from our subject-matter experts and ML engineers, we achieved results nearly as accurate as human-only workflows at a fraction of the cost and time. Next, we dive into the details of each approach.

We tested three distinct options for plans 

Having established the need for a plan, the question becomes what format this plan should take. We explored three options: Plan as JSON with tool injection, Plan as Python with tool injection, and Plan as Python without tool injection. We tested these three options against a baseline of handwritten, custom parsing scripts.

Baseline: No LLMs, Handwritten, Custom Python Parsing Functions

To test the plan’s effectiveness, we first set a baseline with no LLM intervention. An engineer wrote custom parsing functions to clean the data. Unsurprisingly, this process proved highly effective but took a full week to develop and test. The functions were also bespoke, tightly coupled to this particular dataset.

Plan as JSON with Tool Injection

The Plan as JSON consists of a name, a description and a list of steps that the executor must carry out. This might look like:

{
  "name": "LLMPlan",
  "description": "This plan is designed for a data formatting task",
  "steps": [
    {
      "name": "load_csv",
      "action": "load_csv",
      "params": {"path": "input_data.csv"}
    },
    ...
    {
      "name": "run_python_code",
      "action": "run_python",
      "params": {"code": "<LLM generated code>"}
    },
    ...
    {
      "name": "save_formatted_csv",
      "action": "save_csv",
      "params": {"path": "output_data.csv", "df": "$stepk.result"}
    }
  ]
}

The action in each step can be a tool that the LLM has access to, or Python code that the LLM writes on its own. The crux is that, via this JSON, the LLM is effectively able to compose a sequence of tool calls by referencing the results of previous steps without ever receiving their outputs explicitly.
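
A stripped-down executor for plans of this shape might look like the sketch below. The action registry and the $<step>.result reference syntax mirror the example above; everything else (names, error handling) is illustrative.

import re

def run_plan(plan: dict, actions: dict) -> dict:
    """Execute a JSON plan. `actions` maps an action name (e.g. 'load_csv') to a callable."""
    results = {}  # step name -> output, kept inside the executor, never sent back to the LLM

    def resolve(value):
        # Swap "$<step_name>.result" references for the stored output of that step.
        if isinstance(value, str):
            match = re.fullmatch(r"\$(\w+)\.result", value)
            if match:
                return results[match.group(1)]
        return value

    for step in plan["steps"]:
        params = {key: resolve(val) for key, val in step.get("params", {}).items()}
        results[step["name"]] = actions[step["action"]](**params)
    return results

In this sketch, run_python would itself be just another registered action, so LLM-written code and prebuilt tools compose through the same mechanism.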

We found that the LLM was able to accurately reason about control flow (e.g., iterating over data, conditionally selecting formatters) and revise plans after failed executions. Overall, the method proved workable when the LLM had access to a basic Python executor capable of invoking formatter tools. When analysing the model's traces, though, we noticed that most of the heavy lifting was happening within the Python tool. This prompted the question: are we throttling the LLM by enforcing such a rigid JSON plan structure? What if, instead, we allowed it to write an arbitrary plan as Python code?

Plan as Python with Tool Injection

Next, we allowed the LLM to break free from its Plan as JSON shackles and write arbitrary Python code. Where this differed from a vanilla code-interpreter tool was that we injected all the tools the LLM would otherwise have at its disposal into its runtime. So instead of the classical tool-calling paradigm, we let the LLM focus on writing good code while providing it with the inductive bias of certain utilities it can leverage (e.g., HTML formatters written by one of our MLEs). This felt like an especially natural step given how models are getting more specialised and adept at writing code.
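
As a sketch of what tool injection can look like: the LLM's generated plan runs in a namespace pre-loaded with our utilities, so the model calls them directly instead of round-tripping their outputs through its own context. The utility name and the exec-based runner here are illustrative, and a production setup would run inside a proper sandbox.

def clean_html_answer(raw: str) -> str:
    """Stand-in for an MLE-written formatter utility."""
    return " ".join(raw.split())

def injected_namespace() -> dict:
    # Pre-load the runtime with the utilities the LLM can call directly.
    return {"clean_html_answer": clean_html_answer}

def run_llm_plan(llm_generated_code: str) -> dict:
    namespace = injected_namespace()
    exec(llm_generated_code, namespace)  # illustrative only: not sandboxed
    return namespace

# Code the model might write, leaning on the injected utility:
plan_code = """
rows = ["<p>What   is 2 + 2?</p>", "<p>Answer:    4</p>"]
cleaned = [clean_html_answer(r) for r in rows]
"""
print(run_llm_plan(plan_code)["cleaned"])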

This approach still avoids tasking the LLM with things it isn't needed for, while giving it the flexibility to structure the plan in whatever shape it wants! With minimal human input, the LLM was able to leverage the provided utilities to format the data to a level comparable to what our engineers had implemented, often revising scripts and fixing bugs as it understood the data better.

Plan as Python without Tool Injection (aka Stateful Code Interpreter)

In this experiment, we gave the LLM access to a single tool, a stateful code interpreter, and asked it to carry out the task. We wanted to see whether the utilities were indeed useful or whether the LLM does best with a clean slate.
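
For reference, "stateful" here means something like the following sketch: a single persistent namespace, so variables from one LLM turn survive into the next. Sandboxing, timeouts, and output capture are omitted, and the example turns are hypothetical.

class StatefulInterpreter:
    """A toy stateful code interpreter: one persistent namespace across calls."""

    def __init__(self):
        self._globals: dict = {}

    def run(self, code: str) -> None:
        exec(code, self._globals)  # illustrative only: no sandboxing or resource limits

    def get(self, name: str):
        return self._globals.get(name)

interp = StatefulInterpreter()
interp.run("qa_pairs = [('What is 2 + 2?', '4')]")                   # turn 1: load data
interp.run("formatted = [f'Q: {q} | A: {a}' for q, a in qa_pairs]")  # turn 2 reuses it
print(interp.get("formatted"))  # ['Q: What is 2 + 2? | A: 4']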

While the LLM was able to implement reasonable formatters, the quality of its output was in most cases inferior to that of the other methods, which leveraged the utilities. Interestingly, however, the LLM ended up implementing a completely new formatter for a question type we had not previously identified, significantly improving the quality of the question-answer parsing as a result.

 

Here’s what we found

In this experiment, we used LLM-as-a-judge (sketched after this list) to evaluate two things:

  1. Completeness: Whether the LLM was able to extract the full question-answer pair (e.g., all conditions and qualifiers should be present in the question)

  2. Formatting: Whether all parts of the question and answer were formatted correctly, accounting for different format types like tables, etc.
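
For illustration, a judging harness for these two criteria might look like the sketch below. The rubric wording and the call_llm callable are placeholders, not our production judge.

import json

JUDGE_PROMPT = """You are grading a cleaned question-answer pair.
Score each criterion from 1 to 5 and reply with JSON only,
e.g. {{"completeness": 4, "formatting": 5}}.
1. completeness: is the full question-answer pair preserved (all conditions and qualifiers)?
2. formatting: are all parts, including tables, formatted correctly?

Question: {question}
Answer: {answer}
"""

def judge_pair(call_llm, question: str, answer: str) -> dict:
    """`call_llm` is any callable that sends a prompt to a model and returns its text reply."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(reply)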

 

Overall, the baseline approach with handwritten parsing functions produced the best results. This is unsurprising given that the logic was specifically developed for and tested on this particular dataset. However, Plan as Python with Tool Injection and Plan as JSON were both able to effectively leverage these handwritten tools to yield results that were almost as good. While the two plans did not differ meaningfully in performance, Plan as Python provides more flexibility.

 

The Plan as Python approach without tool injection (aka the Code Interpreter Tool), which didn’t have access to our handwritten parsing functions, performed slightly worse overall in our LLM-as-a-judge evaluations. However, as noted above, it implemented a completely new formatter for a question type we had not previously identified, which significantly improved the quality of the question-answer parsing. So while the Code Interpreter Tool was inferior overall in our internal evaluations, for certain question types it significantly outperformed all other approaches.

Here’s a look at the results below:

To recap, we found that injecting MCP servers and tools into a Python session that the LLM can write code in is very powerful. We have shown that LLMs can help us with this data formatting task, and, in a limited capacity, they have even found solutions that were better than a human’s.

 

What does this mean for you?

If you’re tackling messy data formatting at large volumes and want to use LLMs, here’s the takeaway: planning-based approaches with toolchaining work really well. They let an LLM break down a complex job into smaller steps and use the right tools along the way, without asking the LLM to do things it isn’t suited for or, in fact, required to do.

 

  • Plan as Python is flexible and works well for open-ended tasks. You can give it your own helper tools or logic, and it will figure out how to use them.

  • Plan as JSON is more rigid but easier to read and debug. It can still tap into the same execution engine as Plan as Python when you need more flexibility.

 

Going forward, we’re looking at adding a feedback loop: the plain Python executor approach could produce small helper functions for tricky question types, and we’d feed those back into Plan as Python with Tool Injection. That way, we can get close to the accuracy of handwritten parsing while keeping the workflow general and reusable.

 

What’s next

Thanks to this work, we now have a Plan as Python tool to bring to our various engagements, allowing LLM agents to freely use premade tools defined by enterprise SMEs or Scale MLEs. As a result, we’ve seen a significant reduction in the cost of cleaning up messy data (something we know we’ll always need to do when working with enterprises 😀).

We’ve integrated these planning and tool-injection capabilities into our internal MLE toolkit via a robust Stateful Python Executor Tool. This will allow others across Scale to prototype similar workflows on live customer data with minimal setup. Future directions include extending this framework to more complex agent loops, supporting self-healing pipelines (e.g., retrying failed steps), and fine-tuning our models on the traces the LLM generates while building these plans, to further refine their ability to carry out tool-enabled loops.

Every day it seems like there’s some new “breakthrough” with AI agents, and while it’s true that the pace of new agentic capabilities is astonishing, what matters for an enterprise is whether those breakthroughs can actually work in an enterprise context. For our Enterprise research team, that’s our differentiator. We start with real customer use cases and test advanced approaches against real customer data. Those learnings are then codified back into our platform, so every research finding, like the results of this data formatting toolchaining experiment, becomes part of our delivery pipeline for customers.

Cleaning data is just one small step in the journey toward adopting AI in production. At Scale, we specialize in partnering with leading enterprises to tackle hard, high-impact problems like these. If you’re struggling with a similar issue or just find this research interesting, reach out to our team at scale.com/demo. You can also learn more at scale.com/enterprise/agentic-solutions.

