Overview
VisualToolBench (VTB) is the first benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) can dynamically interact with and reason about visual information. It shifts the paradigm from passively “thinking about images” to actively “thinking with images,” treating them as a manipulable cognitive workspace. To solve complex, multi-step problems, models must use tools to transform visual content by cropping, editing, or enhancing it to uncover critical details. The benchmark provides leaderboard results across 16 diverse MLLMs, including reasoning, non-reasoning, open-source, and closed-source models.
Dataset Design and Composition
The VTB dataset was constructed to mirror the complexity of real-world user needs through the following features:
Expert-Curated Tasks: All 1,204 tasks were authored by human contributors with domain expertise.
The creation pipeline was rigorous: contributor training → task authoring → multi-model pre-grading → dual review → integration.
Difficulty filtering: A task was retained only if at least two of the three top MLLMs failed it during pre-grading, keeping the benchmark challenging.
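A minimal sketch of this retention rule, assuming each task's pre-grading record is simply a list of pass/fail flags for the three reference models (the function and argument names are illustrative, not from the paper):

```python
def retain_task(pregrade_passes: list[bool]) -> bool:
    """Keep a task only if at least two of the three top MLLMs failed it."""
    failures = sum(1 for passed in pregrade_passes if not passed)
    return failures >= 2

# Example: one model passed and two failed, so the task is retained.
assert retain_task([True, False, False])
```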
Diverse and Challenging Content: The dataset is composed of 1,204 open-ended questions and 2,893 images, featuring:
Interaction Types: A near-even split between 603 single-turn and 601 multi-turn conversations. Categories include Region-Switch Q&A, Hybrid Tool-Reasoning, Follow-up Test, Temporal Reasoning, and Progressive Reasoning.
Domains: Five balanced domains: STEM (19.7%), Medical (19.7%), Finance (20.2%), Sports (20.0%), and Generalist (20.4%).
Statistics: Average prompt length is 48.4 tokens; average answer length is 128.9 tokens.
Core Principles: The design emphasizes four key principles:
Non-trivial visual perception
Realistic task settings
Implicit tool-use
Multi-step, compositional reasoning
Evaluation Methodology
Evaluation is conducted in a tool-augmented environment and assessed by a detailed rubric system.
Standardized Toolset: Models are provided a compact API of six tools. The cornerstone is python_image_processing, which enables iterative manipulation and re-ingestion of images during reasoning. This is supplemented by general-purpose tools like a Python interpreter, web search, and a calculator.
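As a concrete illustration of this loop, the sketch below crops a region of interest and boosts its contrast with Pillow so the model can re-ingest a clearer view; the function name and signature are stand-ins for the benchmark's python_image_processing tool, whose exact interface is not specified here.

```python
from PIL import Image, ImageEnhance

def python_image_processing(image_path: str, box: tuple[int, int, int, int],
                            contrast: float = 1.0) -> Image.Image:
    """Illustrative stand-in: crop a region and adjust its contrast so the
    model can re-ingest a sharper view of the detail it needs."""
    image = Image.open(image_path)
    region = image.crop(box)  # box = (left, upper, right, lower)
    return ImageEnhance.Contrast(region).enhance(contrast)

# Hypothetical reasoning step: zoom into the upper-left 512x512 region of a
# chart and raise contrast before answering a question about a data point.
detail = python_image_processing("chart.png", box=(0, 0, 512, 512), contrast=1.8)
detail.save("chart_detail.png")
```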
Rubric-Based Assessment: Performance is measured using a set of 7,777 rubrics.
This system captures nuanced performance across five dimensions:
Visual Understanding
Truthfulness
Instruction Following
Reasoning
Presentation
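A single rubric entry could be represented roughly as follows; the field names and values are illustrative assumptions rather than VTB's released schema, but they show how a weighted, dimension-tagged criterion fits together:

```python
# Illustrative rubric entry (field names are assumptions, not VTB's schema).
rubric = {
    "task_id": "vtb-0001",
    "dimension": "Visual Understanding",  # one of the five dimensions above
    "criterion": "Correctly reads the value at the highlighted data point.",
    "weight": 5,  # weights of 4 or 5 are treated as critical (see APR below)
}
```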
LLM-as-Judge: For reproducibility and scale, o4-mini is used as the automatic judge, achieving roughly 88% alignment with human annotations. Agreement is higher on objective rubrics (above 90%) and lower on subjective ones (around 78%).
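The judge prompt and protocol are not given in this summary, but rubric-level judging with o4-mini could look roughly like the sketch below; the prompt wording and the yes/no convention are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def judge_rubric(question: str, response: str, criterion: str) -> bool:
    """Ask o4-mini whether a response satisfies one rubric criterion.
    The prompt format and yes/no parsing are illustrative assumptions."""
    prompt = (
        f"Question:\n{question}\n\nModel response:\n{response}\n\n"
        f"Rubric criterion:\n{criterion}\n\n"
        "Does the response satisfy the criterion? Answer 'yes' or 'no'."
    )
    completion = client.chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```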
The leaderboard relies on two main metrics derived from the rubrics:
Average Pass Rate (APR): The primary success metric. Rubrics with a weight of 4 or 5 are designated "critical," and a response passes a task only if it satisfies all of its critical rubrics; APR is the fraction of tasks passed. This strict requirement explains why APR is significantly lower than ARS.
Average Rubric Score (ARS): A granular metric that awards partial credit by calculating the weighted proportion of all satisfied rubrics for a task.
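Given those definitions, the two metrics could be computed roughly as follows; the data layout is an assumption, but the pass rule (every weight-4/5 rubric satisfied) and the weighted partial-credit score follow the descriptions above.

```python
def task_pass(rubrics: list[dict]) -> bool:
    """A response passes only if every critical rubric (weight 4 or 5) is satisfied."""
    return all(r["satisfied"] for r in rubrics if r["weight"] >= 4)

def task_rubric_score(rubrics: list[dict]) -> float:
    """Weighted proportion of satisfied rubrics for one task (partial credit)."""
    total = sum(r["weight"] for r in rubrics)
    earned = sum(r["weight"] for r in rubrics if r["satisfied"])
    return earned / total

def benchmark_metrics(tasks: list[list[dict]]) -> tuple[float, float]:
    """Average Pass Rate (APR) and Average Rubric Score (ARS) over all tasks."""
    apr = sum(task_pass(t) for t in tasks) / len(tasks)
    ars = sum(task_rubric_score(t) for t in tasks) / len(tasks)
    return apr, ars

# Toy example: the failed low-weight rubric does not block a pass,
# but the rubric score only awards partial credit (9/11 ≈ 0.82).
toy_task = [
    {"weight": 5, "satisfied": True},
    {"weight": 4, "satisfied": True},
    {"weight": 2, "satisfied": False},
]
print(benchmark_metrics([toy_task]))  # (1.0, 0.8181...)
```

This toy case also shows why the two metrics diverge: a single missed critical rubric forfeits the pass entirely, while the rubric score still awards partial credit.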
Key Findings
VTB poses a significant challenge for the current generation of MLLMs.
Highly Challenging Benchmark: All 16 evaluated MLLMs struggle significantly.
The top-performing model, GPT-5-think, achieved an overall APR of just 18.44%.
11 of the 16 models scored an APR below 10%, indicating substantial room for improvement.
Clear Performance Tiers & Open-Source Gap: OpenAI models (GPT-5, GPT-5-think, o3, o4-mini) currently lead the leaderboard, with open-source models trailing the closed-source frontier.
Visual Perception is the Main Hurdle: A detailed error analysis shows that visual perception errors are the most common failure mode, accounting for 70-82% of all failures. The bottleneck lies in perception, not logic.
Divergent & Inefficient Tool Use:
Proactivity (the percentage of tasks with at least one tool call) varies widely: GPT-5-think is highly proactive (99.5%), while some models fall below 20%. GPT-5-think's success rate on valid tool calls is 86.6%.
More calls do not equal better performance. For instance, o3 makes 16,116 total calls vs. GPT-5's 10,212, yet GPT-5 performs better, showing that tool-use efficiency is critical.
An ablation study revealed divergent behaviors: GPT-5's performance drops 11-14% without tools, whereas Gemini-2.5-pro's performance improves by 2.7%, suggesting Gemini-2.5-pro's tool use is not yet well integrated.
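For context, the usage statistics above (proactivity, valid-call success rate, total calls) could be derived from per-task interaction logs along these lines; the log format is an assumption.

```python
def tool_use_stats(task_logs: list[list[bool]]) -> dict[str, float]:
    """Compute tool-use statistics from per-task call logs.

    task_logs[i] holds one True/False outcome (valid vs. failed call) for each
    tool call the model made on task i; an empty list means no calls were made.
    """
    tasks_with_calls = sum(1 for calls in task_logs if calls)
    all_calls = [ok for calls in task_logs for ok in calls]
    return {
        "proactivity": tasks_with_calls / len(task_logs),
        "success_rate": sum(all_calls) / len(all_calls) if all_calls else 0.0,
        "total_calls": float(len(all_calls)),
    }
```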
Read the paper here.