VisualToolBench (VTB), a new benchmark from Scale, reveals that models are still much better at "thinking about images" than at "thinking with images." The distinction is important. "Thinking about images" means processing and reasoning over an image as given, without modifying it. "Thinking with images," on the other hand, means transforming the image for clarity, cropping, editing, or enhancing it to uncover details and solve problems. The image becomes a "manipulable cognitive workspace."
The results of VTB are striking: no model broke 19% correctness, and a majority were below 10%.
At a glance, VisualToolBench is composed of:
1,204 tasks
2,893 images
Single and multi-turn interactions
Five domains: STEM, medicine, finance, sports, and general topics
Measurements of visual understanding, truthfulness, instruction following, reasoning, and presentation.
This post will break down what makes the study uniquely valuable, the surprising reasons why even the best models failed, and what this means for the future of AI.
VTB is the first benchmark designed to systematically evaluate Multimodal Large Language Models (MLLMs) on how well they “think with images,” requiring models to manipulate images to solve problems.
Unlike previous tests that simply included images, VTB presents models with imperfect, realistic scenarios where key information is often hidden, rotated, or poorly framed. To succeed, a model can't just passively describe what it sees; it must actively use tools to transform the image into a more useful format.
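As a concrete illustration, here is a minimal sketch of the kinds of transformations such a tool loop performs, written with Pillow. The helper names, the example file, and the coordinates are assumptions for illustration, not VisualToolBench's actual tool API.

```python
# Illustrative "thinking with images" helpers: crop, rotate, and enhance an
# image so that hidden or poorly framed details become readable.
from PIL import Image, ImageEnhance

def crop(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop to a region of interest: (left, upper, right, lower) in pixels."""
    return img.crop(box)

def rotate(img: Image.Image, degrees: float) -> Image.Image:
    """Rotate counter-clockwise, expanding the canvas so nothing is clipped."""
    return img.rotate(degrees, expand=True)

def enhance_contrast(img: Image.Image, factor: float = 2.0) -> Image.Image:
    """Boost contrast to surface faint details (factor > 1 increases contrast)."""
    return ImageEnhance.Contrast(img).enhance(factor)

# Hypothetical usage: the model judges that the photo is sideways and that the
# relevant text sits in one corner, so it transforms the image before reading it again.
photo = Image.open("menu_photo.jpg")              # assumed local file
upright = rotate(photo, -90)                      # undo the sideways framing
roi = crop(upright, (800, 1200, 1600, 2000))      # zoom into the region of interest
readable = enhance_contrast(roi)                  # sharpen faint, blurry text
readable.save("roi_enhanced.png")                 # this view goes back to the model
```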
The benchmark’s credibility is reinforced by its design, which was built on four key principles:
The tasks were authored by human domain experts and then put through a tough filtering process. A task was only included if at least two of three top-tier MLLMs failed it during pre-grading, guaranteeing a benchmark that is genuinely challenging for even the most advanced models.
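That filtering rule is simple enough to sketch directly; the dictionary-of-results layout below is an assumption for illustration, not the paper's actual pipeline.

```python
# A minimal sketch of the pre-grading filter: a task is kept only if at least
# two of the three pre-grading models failed it.
def keep_task(pregrade_results: dict[str, bool]) -> bool:
    """pregrade_results maps model name -> True if that model passed the task."""
    failures = sum(1 for passed in pregrade_results.values() if not passed)
    return failures >= 2

# Two of three models failed, so this task would be admitted to the benchmark.
print(keep_task({"model_a": False, "model_b": True, "model_c": False}))  # True
```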
The skills tested by VTB are critical for the high-stakes applications where AI is increasingly being deployed. The benchmark's tasks mirror real-world scenarios where a simple failure in perception could have serious consequences.
Examples of these challenges, taken directly from the paper, include:
High-Stakes Medical Analysis: A model must analyze an image from a live chest operation to assess the correct placement of a surgical instrument, a task that requires identifying life-threatening risks such as injuring the patient's heart or major blood vessels.
Complex Engineering Problems: A model is required to solve a structural engineering problem by interpreting a messy, hand-drawn diagram from a notebook and applying the correct formulas.
Figure: Demonstration example from VisualToolBench (single-turn, generalist domain, region-switch Q&A). The key visual content needed to solve the task is distributed across different regions of the image, requiring the model to crop multiple regions of interest (RoIs) for accurate perception and reasoning. Each task is paired with a detailed set of rubrics to evaluate the model’s responses. From these rubrics, both a weighted rubric score between 0 and 1 and a binary pass/fail outcome are derived, depending on whether the critical rubrics are satisfied.
The benchmark’s low scores are a direct result of its strict and nuanced grading. The primary metric is the Average Pass Rate (APR), and a model’s response earns a “pass” only if it satisfies ALL of the critical rubrics; there is no partial credit in the final pass/fail outcome. Failures tended to cluster around two specific weaknesses: flawed perception and inefficient tool use.
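As a sketch of how such grading can be computed, assume each rubric carries a weight and a critical flag; the field names below are illustrative, not the paper's exact schema.

```python
# Weighted rubric score, binary pass/fail, and Average Pass Rate (APR) over tasks.
from dataclasses import dataclass

@dataclass
class Rubric:
    weight: float      # contribution to the weighted rubric score
    critical: bool     # must be satisfied for the response to pass
    satisfied: bool    # did the model's response meet this rubric?

def rubric_score(rubrics: list[Rubric]) -> float:
    """Weighted rubric score in [0, 1]: satisfied weight over total weight."""
    total = sum(r.weight for r in rubrics)
    return sum(r.weight for r in rubrics if r.satisfied) / total if total else 0.0

def passes(rubrics: list[Rubric]) -> bool:
    """Binary pass/fail: every critical rubric must be satisfied (no partial credit)."""
    return all(r.satisfied for r in rubrics if r.critical)

def average_pass_rate(tasks: list[list[Rubric]]) -> float:
    """APR: the fraction of tasks whose responses pass all critical rubrics."""
    return sum(passes(t) for t in tasks) / len(tasks) if tasks else 0.0
```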
The most significant finding from the study’s error analysis is that the models’ core reasoning abilities are not the primary problem. Instead, the bottleneck is a more fundamental skill: visual perception. Across the three representative models analyzed, "visual perception errors"—the inability to correctly identify, interpret, or extract relevant information from an image—were the most common failure mode, accounting for 71% to 82% of all mistakes. This shows that even with powerful tools at their disposal, the models often struggled to see the necessary details in the first place.
VTB also revealed that simply giving a model access to tools does not guarantee it can use them effectively. The quality and precision of tool use matter far more than the number of attempts. o3, for example, made the most total tool calls at 16,116 yet was outperformed by GPT-5, which used its 10,212 calls more efficiently. Counterintuitively, tool access was not always even helpful: Gemini-2.5-Pro’s performance improved by 2.7% when it had no access to tools at all.
VTB should be seen as the first roadmap for the next generation of AI. By diagnosing the gap between passive sight and active perception, it gives developers a clear, actionable guide. The path to more capable and reliable AI now runs through teaching these models a new fundamental skill: thinking with images. In the future, that skill will let AI more effectively calculate the price of a custom pizza from a blurry menu, find the right bus departure time on a complex schedule, or get a coffee recommendation for a date from a mirrored photo.