
AI moved fast in 2025. Models became better at reasoning, following long instructions, analyzing images, debugging code, interacting with tools, and helping people solve real problems. To recognize the year’s best models, we’re introducing the first-ever SEAL Models of the Year Awards, based entirely on performance from our SEAL Leaderboards and SEAL Showdown.
Scale’s SEAL Leaderboards measure model performance across key capability areas including reasoning, agentic workflows, multimodal inputs, and safety alignment. Each leaderboard is grounded in high-quality datasets and structured evaluation criteria designed to reflect how models perform in the real world. In 2025, we introduced 15 new benchmarks and published more than 450 evals across more than 50 models. Today, we celebrate the best of the best.
Our inaugural award winners include the best models across six categories. These LLMs delivered the strongest results in our leaderboard analysis and in real-world comparisons:
Best Composite Performance Model: Composite Performance reflects a model’s aggregate results across all of the SEAL Leaderboards.

Best Reasoning Model: Based on SEAL’s reasoning leaderboards, including Humanity’s Last Exam, EnigmaEval, MultiNRC, and Professional Reasoning. These leaderboards test how well a model can work through complicated questions, explain its thinking, and solve multi-step problems.

Best Safety Model: Safety evaluations measure how a model responds under pressure, including prompts designed to deceive, disorient, or elicit harmful behavior. Leaderboards such as MASK and Fortress assess whether models stay consistent, follow safety guidelines, and avoid unsafe outputs even in difficult scenarios.

Best Multimodal Model: Multimodal evaluations test how well a model understands images alongside text. This includes interpreting charts, diagrams, photos, or other visual information and giving accurate answers. The winner was determined by performance across leaderboards like TutorBench and VisualToolBench.

Best Agentic Model: Agentic evaluations assess whether a model can take effective action on ambiguous tasks. Benchmarks require models to plan multi-step tasks, call tools correctly, debug code, invoke APIs, and produce reliable end-to-end solutions. Performance across leaderboards like SWE-Bench Pro, Remote Labor Index, and MCP Atlas determined the winner.

People’s Favorite Model: The People’s Favorite award is determined by performance on SEAL Showdown, a large-scale, in-flow preference evaluation where real users pick the better response in blind head-to-head comparisons. It reflects how models perform in real conversations rather than on test questions or benchmarks.

To assign categories and winners, we used distinct SEAL Leaderboards and SEAL Showdown results that map to each core capability area. Each model received a score based on its placement on each leaderboard, those scores were tallied within each award category, and the highest-scoring model in each category was named the winner.
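
For readers who want a concrete picture of how placement-based tallying can work, here is a minimal sketch in Python. It is purely illustrative: the point scheme (3/2/1 for first/second/third), the leaderboard-to-category mapping, and the model placements are all assumptions, not SEAL's actual scoring data or formula.

```python
from collections import defaultdict

# Illustrative sketch of placement-based scoring: leaderboard placements are
# converted to points, points are tallied within each award category, and the
# highest total in a category takes the award. All values below are made up.

# Map each award category to the leaderboards that feed it (names from the post,
# grouping assumed for illustration).
CATEGORY_LEADERBOARDS = {
    "Best Reasoning Model": ["Humanity's Last Exam", "EnigmaEval", "MultiNRC"],
    "Best Agentic Model": ["SWE-Bench Pro", "Remote Labor Index", "MCP Atlas"],
}

# Hypothetical placements: leaderboard -> models ordered 1st, 2nd, 3rd.
placements = {
    "Humanity's Last Exam": ["model-a", "model-b", "model-c"],
    "EnigmaEval": ["model-b", "model-a", "model-c"],
    "MultiNRC": ["model-a", "model-c", "model-b"],
    "SWE-Bench Pro": ["model-c", "model-a", "model-b"],
    "Remote Labor Index": ["model-a", "model-c", "model-b"],
    "MCP Atlas": ["model-c", "model-b", "model-a"],
}

def placement_points(rank: int) -> int:
    """Convert a 1-indexed placement into points (assumed 3/2/1 scheme)."""
    return max(0, 4 - rank)

def category_winners() -> dict[str, str]:
    """Tally points per category and return the top-scoring model in each."""
    winners = {}
    for category, boards in CATEGORY_LEADERBOARDS.items():
        totals: dict[str, int] = defaultdict(int)
        for board in boards:
            for rank, model in enumerate(placements[board], start=1):
                totals[model] += placement_points(rank)
        winners[category] = max(totals, key=totals.get)
    return winners

if __name__ == "__main__":
    for category, model in category_winners().items():
        print(f"{category}: {model}")
```

Any scheme of this shape rewards consistent top placements across a category's leaderboards rather than a single outlier result, which is the intent of tallying per category.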
The models highlighted here represent meaningful steps forward, and the frontier is still wide open. Congrats to the winning teams! We’re excited to see what you build in 2026.
