The growing capabilities of multimodal language models have created two critical challenges in their evaluation. First, as these models become more prevalent, their potential exposure to visual test data during training compromises benchmark integrity and evaluation reliability. Second, traditional evaluation approaches that focus solely on answer accuracy provide limited insight into models' visual reasoning processes, particularly when dealing with open-ended responses where simple answer normalization might prove inadequate.
To address these challenges, we present VISTA, a novel multimodal benchmark designed to evaluate complex visual-language understanding. VISTA, short for Visual Task Assessment, tests models' abilities across both natural images and graphical content, requiring the integration of multiple perception skills, from OCR and spatial understanding to object recognition, while engaging broader reasoning capabilities in logic, calculation, and common sense. Each task is evaluated through a structured rubric of yes/no questions that decompose responses into specific testable conditions. This design, combined with the requirement that each question has challenged at least one prominent language model, ensures VISTA serves as a demanding testbed for visual reasoning capabilities.
Existing multimodal benchmarks have largely focused on specialized domains: academic knowledge (e.g., MMMU, AI2D), document understanding (e.g., DocVQA, ChartQA), or specific capabilities like mathematical reasoning (e.g., MathVista, MME). These benchmarks typically rely on multiple-choice questions or normalize free-form responses to enable systematic evaluation. In contrast, VISTA spans a broad spectrum of visual content, from natural photographs to domain-specific graphics, while requiring models to integrate multiple perception abilities in solving complex reasoning tasks. Through free-form responses and condition-based rubric evaluation, VISTA aims to provide a deeper view of models' visual reasoning capabilities.
In the following sections, we detail VISTA's methodology and demonstrate its effectiveness in evaluating state-of-the-art models.
VISTA contains 758 prompt-image pairs, comprising 747 unique images, each accompanied by an ideal response and a set of yes/no rubric questions that define the evaluation criteria. To ensure benchmark rigor, each image-prompt pair was carefully designed to challenge at least one prominent language model. The benchmark focuses on single-turn interactions with single-image tasks.
The dataset is structured along multiple dimensions. Tasks are primarily categorized by the following (illustrated in the task-record sketch after this list):
Main reasoning skill required (logic, calculation, common sense, etc.)
Image type (natural or graphic)
Required perception skills (OCR, spatial understanding, object recognition, etc.).
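To make this structure concrete, a single task record could be represented as in the sketch below. The class and field names are hypothetical and reflect only the structure described here, not Scale's internal format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricQuestion:
    """A single yes/no condition used to check part of a response."""
    question: str            # e.g., "Does the response list exactly three flights?"
    expected_answer: bool    # the answer a fully correct response should yield

@dataclass
class VistaTask:
    """Hypothetical schema for one VISTA prompt-image pair."""
    image_path: str                    # single image per task
    prompt: str                        # single-turn user prompt
    ideal_response: str                # "Main Response" plus "Reasoning"
    reasoning_skill: str               # e.g., "logic", "calculation", "common sense"
    image_type: str                    # "natural" or "graphic"
    perception_skills: List[str] = field(default_factory=list)  # e.g., ["OCR", "spatial understanding"]
    rubric: List[RubricQuestion] = field(default_factory=list)  # 1 to 7 yes/no questions
```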
To preserve benchmark integrity and to assess models' generalization capabilities, Scale will keep the current dataset private until models demonstrate significant improvement.
To maintain realistic conditions, we kept prompt lengths moderate. Rather than artificially extending prompts, annotators were instructed to create complexity through concise but challenging questions. This approach better reflects real-world usage patterns than lengthy, contrived prompts.
In designing the tasks, we initially created a balanced distribution of required perception skills across prompts. While annotators were given this foundation, they had the flexibility to adjust the assigned skills and incorporate additional ones as needed. Our analysis revealed that the most frequently required perception skills in VISTA's visual reasoning prompts are attribute recognition, object recognition, Optical Character Recognition (OCR), and spatial understanding.
Rubric questions are a crucial component of VISTA's task design, serving to assess the accuracy of model responses. Rather than enforcing a fixed number of questions per rubric, we allowed annotators to include as many questions as necessary to evaluate response correctness. As a result, rubrics vary in length, ranging from single-question assessments to more complex evaluations containing up to seven questions.
Table 1. The number of conditions that must be satisfied to assess the correctness of a response varies across tasks. The table reports the distribution of the number of rubric questions per task.
The task collection process involved 148 contributors holding Bachelor's and Master's degrees, with backgrounds in teaching, research, coding, customer service, data analysis, supervision, and STEM. In total, 1,200 tasks were collected, of which 758 (63%) passed the quality control checks and constitute the current evaluation set (see steps 4 and 5 of the data collection workflow).
The VISTA dataset was created through a rigorous multi-stage process designed to ensure task quality and complexity. The pipeline consists of five major stages:
Task Assignment: Each contributor receives a specific combination of requirements (primary reasoning skill, image type, and two suggested perception skills) to guide their work in creating challenging and well-defined tasks.
Development Phase: In this phase, contributors source appropriate images and create prompts that align with their assigned parameters. To ensure task complexity, each prompt must engage at least two distinct perception skills. Two validation checks are applied to confirm both the task's challenging nature and its genuine requirement for visual understanding (a code sketch of these checks follows the list):
Visual Difficulty: the assigned model must fail the task
Visual Requirement: the assigned model must not be able to solve the task without seeing the image
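A minimal sketch of how these two checks could be automated is shown below; `generate_answer` and `is_correct` are placeholder callables (a model query and an answer check) supplied by the caller, not part of Scale's published tooling, and the `task` fields follow the hypothetical record sketched earlier.

```python
def passes_validation_checks(task, generate_answer, is_correct):
    """Check that a candidate task is both visually difficult and visually grounded.

    Assumed caller-supplied helpers:
      - generate_answer(prompt, image=None): queries the assigned model,
        optionally attaching the image
      - is_correct(answer, task): compares the answer with the task's ideal response
    """
    # Visual Difficulty: the assigned model must fail the task when given the image.
    answer_with_image = generate_answer(task.prompt, image=task.image_path)
    if is_correct(answer_with_image, task):
        return False  # too easy: the model already solves the task

    # Visual Requirement: the model must not solve the task from the prompt alone.
    answer_without_image = generate_answer(task.prompt, image=None)
    if is_correct(answer_without_image, task):
        return False  # not visually grounded: solvable without seeing the image

    return True
```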
Response Documentation: Contributors provide comprehensive documentation for each task, ensuring clarity and evaluability of responses through structured answer components. This consists of an Ideal Response composed of a "Main Response" that synthesizes the final answer, and a "Reasoning" section that details the logical process leading to that conclusion.
Quality control: Our quality control process involves multiple layers of review and refinement.
Assess Correctness of Ideal Response: Initial reviewers verify both the correctness of ideal responses and prompt compliance with task requirements.
Correctness Rubrics: Following this, specialized contributors develop a set of yes/no questions (with their expected answers) that define specific criteria for response correctness. These questions pinpoint the parts of a response that must be validated to assess its correctness and serve as the foundation for our evaluation methodology.
Final Auditing and Filtering: The final stage employs independent annotators for thorough quality audits, with improvements assessed and implemented by highly trained reviewers.
Through this comprehensive process, we retained a final set of 758 high-quality tasks that meet our rigorous standards. Common reasons for discarding tasks were prompt ambiguity, insufficiently specific rubrics, and incorrect ideal responses.
Models are evaluated through a comprehensive dual-assessment approach. Each evaluation is repeated three times for statistical robustness, with results presented as averages with standard deviations.
Free-form Response Assessment: Our first evaluation approach assesses the overall correctness of model responses. Three LLM judges independently evaluate whether a model's response reaches the same conclusion as the ideal response's "Main Response" section. This is a boolean assessment where responses are labeled as either correct or incorrect, with final decisions made through majority voting; the aggregation is sketched in code after the metric summary below. This jury system helps avoid potential biases or preferences toward specific models.
Metric: Accuracy [0,1]
Represents the percentage of tasks correctly answered by the model
Based on majority vote across three judges
Higher values indicate better performance
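The aggregation itself is straightforward. The sketch below is a minimal illustration under the assumption that each judge returns one boolean verdict per task; the judge-prompting step is omitted, and the function name is ours, not Scale's.

```python
from statistics import mean

def freeform_accuracy(judge_verdicts):
    """Compute free-form accuracy from per-task boolean verdicts of three judges.

    judge_verdicts: list of tasks, each a list of three booleans
                    (True = the judge deems the response correct).
    Returns the fraction of tasks judged correct by majority vote.
    """
    majority_correct = [sum(verdicts) >= 2 for verdicts in judge_verdicts]
    return mean(majority_correct)

# Example: 3 tasks, each judged by 3 LLM judges
print(freeform_accuracy([[True, True, False], [False, False, True], [True, True, True]]))  # ≈ 0.67
```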
Rubric-based Assessment: Our second approach provides a more granular evaluation of response quality. As with the free-form assessment, we employ a jury of three LLM judges to mitigate potential biases. Each judge independently evaluates the response against each rubric question, with separate prompting for each question. The judges' assessments are first aggregated per question, and the aggregated outcomes are then combined at the task level to compute a task-specific accuracy rate; this computation is sketched in code after the metric summary below.
Metric: Average Task Accuracy (ATA) [0,1]
Task Accuracy: Percentage of rubric questions satisfied per task
ATA: Average of task accuracies across all evaluation tasks
Higher values indicate better satisfaction of correctness criteria
Provides insight into partial correctness of responses
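The ATA computation can be sketched as follows. How the three judges' answers to each rubric question are combined is not fully specified above, so the per-question majority vote in this sketch is an assumption; the function name is ours.

```python
from statistics import mean

def average_task_accuracy(tasks):
    """Compute Average Task Accuracy (ATA) over all evaluation tasks.

    tasks: list of tasks; each task is a list of rubric questions; each rubric
           question is a list of three judge verdicts (True = the response
           satisfies that question). Majority vote per question is an assumed
           way of aggregating individual judges' assessments.
    """
    task_accuracies = []
    for rubric in tasks:
        satisfied = [sum(judges) >= 2 for judges in rubric]   # aggregate judges per question
        task_accuracies.append(mean(satisfied))               # fraction of questions satisfied
    return mean(task_accuracies)                              # average across tasks

# Example: two tasks with 2 and 3 rubric questions respectively
print(average_task_accuracy([
    [[True, True, False], [False, False, True]],                        # task accuracy = 0.5
    [[True, True, True], [True, False, True], [False, False, False]],   # task accuracy ≈ 0.67
]))  # ≈ 0.58
```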
The key difference between these approaches lies in their granularity. While the free-form assessment provides a binary measure of complete correctness, the rubric-based assessment captures partial correctness through more nuanced evaluation. This granularity is particularly valuable when evaluating advanced models, as it can reveal how they not only reach correct final answers but also satisfy additional correctness criteria defined by annotators. This deeper evaluation is reflected in the leaderboards, where the rubric-based assessment typically shows smoother performance differences between models, especially among top performers who may achieve similar accuracy scores but differ in their ability to meet these additional quality requirements.
The LLMs composing the jury for both evaluation approaches are claude-3-5-sonnet-20240620, gpt-4o, and gemini-1.5-pro.
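For concreteness, a per-question judge prompt might look like the sketch below. The template text, variable names, and helper function are illustrative assumptions, not Scale's actual judging prompt; one such prompt would be issued per rubric question, per judge.

```python
# Hypothetical judge prompt template (illustrative only).
JUDGE_PROMPT_TEMPLATE = """You are grading a model's answer to a visual reasoning task.

Task prompt:
{prompt}

Model response:
{response}

Rubric question (answer strictly "yes" or "no"):
{rubric_question}
"""

def build_judge_prompt(prompt: str, response: str, rubric_question: str) -> str:
    """Fill the template for a single rubric question."""
    return JUDGE_PROMPT_TEMPLATE.format(
        prompt=prompt, response=response, rubric_question=rubric_question
    )
```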
Rubric-based assessments provide a more fine-grained evaluation by breaking down task completion into multiple checkpoints. Rather than a binary right/wrong judgment, this approach can reveal when models partially succeed at complex tasks, offering a more nuanced view of model performance. For instance, a model might correctly identify key elements in an image but fail to draw accurate conclusions from them.
Free-form evaluations, conversely, mirror real-world scenarios where partial success may not be sufficient. This method sets a higher bar by requiring complete and accurate responses, similar to how users would judge model responses in practical applications.
Given these distinct perspectives, using both evaluation methods provides complementary insights into model capabilities. The rubric-based approach helps quantify partial successes, while free-form evaluation measures practical effectiveness. Together, they offer a more complete picture of current model limitations and capabilities.
To establish a human performance baseline for VISTA, we conducted a controlled evaluation study with 16 full-time employees across 101 tasks. Each participant was given three minutes per task and had access to high-resolution images and internet resources, simulating realistic visual reasoning scenarios while maintaining consistent evaluation conditions.
To prioritize accuracy over explanation, participants were asked to provide only their final answers within the time constraint, without requiring detailed reasoning. The study revealed a human accuracy rate of 55.40% across all tasks. When compared to the average performance of models currently on the leaderboard (see Figure 8), the results expose a substantial gap between human and model capabilities in visual reasoning tasks, highlighting significant room for improvement in current systems.
We tracked two key factors during the study: whether participants felt the 3-minute time limit was sufficient and whether they used internet resources. By analyzing this feedback alongside response accuracy, we found that participants considered approximately 75% of tasks to be completable within the 3-minute window. Notably, even for tasks deemed manageable within the time limit, participants still produced a significant number of incorrect responses.
This finding is particularly interesting as it reveals that tasks considered straightforward by humans within a limited timeframe still prove challenging for current models. Table 2 breaks down response accuracy based on participants' perceived time needs:
Table 2. Response accuracy percentages categorized by participants' assessed time needs for task completion.
While Scale has invested significant effort in developing this novel rubric-based evaluation framework, we acknowledge several important limitations. The process of crafting effective rubric-based evaluations is still evolving, and the extensive training required for contributors to write precise, comprehensive rubrics represents a substantial challenge. We recognize that the current rubrics, while functional, have considerable room for improvement and enrichment.
This project was made possible by the dedicated efforts of a team of expert annotators. We extend our gratitude to everyone involved in the human studies and refinement of the dataset and the verification methodology, especially to Spencer Whitehead and Chen Xing for the initial task taxonomy definition.
Scale AI Team: Cristina Menghini, Ernesto Hernandez, Diego Mares, Dean Lee, Mike Lunati, Summer Yue
Image Type
Image Type | Description | Examples |
---|---|---|
Natural Images | Images depicting objects or scenes from the real world. | Photographs, Paintings, Sketches, Portraits, Comics/Cartoons |
Graphics | Images presenting information visually, like charts, graphs, or technical diagrams. | Diagrams, Tables, Plots, Maps, Timelines, Infographics, Puzzle testing |
Reasoning Skills
Skill Name | Description | Examples of prompts (w/o image) |
---|---|---|
Calculations / Counting | Prompts that test the model’s ability to perform numerical operations based on the image or count visual elements within images or scenes. | Can you provide the total revenue of films that Disney owns the right to? <br> I'm cooking for six people and I'd like to have a half ear of corn for each person. How many more ears should I add to the grill? |
Knowledge: General, Coding, Math & Geometry, Miscellaneous (chemistry, music, biology, others) | Prompts that require specific knowledge of a topic or area in order to be correctly answered. | In the image, if the line "a = 30" was added directly in the middle of the print statements (overwriting existing code lines), what line number would it be and what would be the output for the entire code? <br> Which three cell lines had the lowest expression using the anti CDK2 antibody? Can you explain how you interpreted this from the graph? |
Logic | Prompts that request for some logical conclusion based on the image and, potentially, some given hypotheses or logical statements. | Can you give me two puzzle words that overlap on the grid? |
Common Sense | Prompts that request information that can be reasonably inferred from the visual content but is not necessarily self-evident or even representable visually. | According to the form of the pit fire, considering how many chairs are around its half, how many extra chairs, besides the ones that already are there, could I fit around it keeping the same space between them as in the image? |
Hallucination | Prompts that request information that is not present or deducible from the image. | Is this the beginning or the end of the interview? How can you tell? <br> In the image, identify the name of the florist shop and determine whether it is likely that they have a large inventory or a small inventory. <br> Sum the percentages of the only candy above 15% and the only candy below 2.5% from the chart. |
Information Extraction | Prompts that request to find information in the image and re-organize it following specific formats such as lists, csv, tables, sorting, visual sorting, etc. | This is a picture of Carrigogunnell Castle in County Limerick, Ireland. There are a couple of clues in this picture that indicate if the weather is likely to be temperate. Put the clues in a bulleted list and put your conclusion in bold. <br> Create a markdown table with headers for the teams and the number of games drawn in descending order. Do not include teams based in London. <br> Based on this image, provide a list of the five flights with the highest flight number, in order from highest to lowest. Of those flights, what is the time difference between the earliest and latest scheduled departure? |
Perception Skills
Skill | Description |
---|---|
OCR | Prompts that require the model to extract text from images or scanned documents. |
Object recognition | Prompts that require the model to identify and categorize objects within an image or scene. |
Attribute/relationship recognition | Prompts that require the model to identify specific characteristics (e.g., color, shape, texture) or semantic relationships (e.g., person holding a ball, dog’s paw) associated with objects in an image. |
Action recognition | Prompts that require the model to identify and categorize actions or activities of objects in the image. |
Spatial understanding | Prompts that request spatial information for objects in the image. They require a model’s understanding of the layout of objects in the scene to answer. |
Named visual entity recognition | Prompts that require the model to identify and categorize by name well-known visual elements, such as public figures, landmarks, logos, etc. |
Emotion/sentiment recognition | Prompts that require the model to identify the emotional state or sentiment of individuals in the image. |
[Leaderboard: per-model scores reported as mean ± standard deviation over three evaluation runs. Evaluated models include Claude 3.5 Sonnet (October 2024), Claude 3.5 Sonnet (June 2024), ChatGPT-4o-latest (November 2024), Gemini 1.5 Pro, GPT-4o (August 2024), Gemini 1.5 Flash 002, Pixtral Large (November 2024), Qwen2-VL-72B-Instruct, Claude 3 Opus, Nova Pro, Pixtral 12B (September 2024), Nova Lite, Llama 3.2 90B Vision Instruct, Llama 3.2 11B Vision-Instruct, and Phi 3.5 Vision-Instruct.]
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
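This definition translates directly into code. The sketch below is a minimal illustration assuming each model's confidence interval is given as a (lower, upper) pair; the function name is ours.

```python
def rank_ub(intervals):
    """Compute Rank (UB) for each model from (lower, upper) confidence bounds.

    A model's rank is 1 plus the number of models whose lower bound
    exceeds its upper bound (i.e., models that are statistically better).
    """
    ranks = []
    for _, upper in intervals:
        better = sum(1 for lower, _ in intervals if lower > upper)
        ranks.append(1 + better)
    return ranks

# Example with three hypothetical models (confidence intervals as (lower, upper))
print(rank_ub([(53.2, 56.1), (50.9, 52.7), (50.3, 51.5)]))  # [1, 2, 2]
```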