
Leaderboards
Frontier AI Evaluations
Discover the SEAL LLM Leaderboards: precise, reliable rankings of leading large language models (LLMs), evaluated with a rigorous methodology.
Developed by Scale’s Safety, Evaluations, and Alignment Lab (SEAL), these leaderboards use private datasets to keep results fair and uncontaminated. Regular updates keep the leaderboards current with the latest AI advancements, making them an essential resource for understanding the performance and safety of top LLMs.
Frontier AI Evaluations
We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.
Scaling with Human Expertise
Humans design complex evaluations and define precise criteria to assess models, while LLMs scale those evaluations, keeping them efficient and aligned with human judgment.
Robust Datasets
Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting and open-source datasets for broad benchmarking and comparability.
Learn more about our LLM evaluation methodology
Frontier Multimodal Benchmark
Gemini 2.5 Pro Experimental (March 2025)
Claude 3.7 Sonnet Thinking (February 2025)
Gemini 2.0 Flash Thinking (January 2025)
Gemini 2.0 Pro Experimental (February 2025)
GPT-4.5 Preview (February 2025)
Llama 3.2 90B Vision Instruct
Gemini 2.0 Flash Experimental (December 2024)
View Full Ranking ->
Puzzle Solving
Claude 3.7 Sonnet Thinking (February 2025)
Gemini 2.5 Pro Experimental (March 2025)
GPT-4.5 Preview (February 2025)
Claude 3.7 Sonnet (February 2025)
Gemini 2.0 Flash Thinking (January 2025)
Claude 3.5 Sonnet (October 2024)
Pixtral Large (November 2024)
Gemini 2.0 Pro Experimental (February 2025)
View Full Ranking ->
Realistic Multi-Turn Conversation
Gemini 2.5 Pro Experimental (March 2025)
Claude 3.7 Sonnet Thinking (February 2025)
GPT-4.5 Preview (February 2025)
Claude 3.5 Sonnet (October 2024)
Claude 3.7 Sonnet (February 2025)
Gemini 2.0 Pro Experimental (February 2025)
Gemini 2.0 Flash Thinking Experimental (January 2025)
View Full Ranking ->
Visual Language Understanding
Gemini 2.5 Pro Experimental (March 2025)
Claude 3.7 Sonnet Thinking (February 2025)
Gemini 2.0 Flash Thinking Experimental (January 2025)
Gemini 2.0 Pro Experimental (February 2025)
Claude 3.7 Sonnet (February 2025)
GPT-4.5 Preview (February 2025)
Gemini 2.0 Flash Experimental (December 2024)
Gemini 2.0 Flash (February 2025)
Claude 3.5 Sonnet (October 2024)
View Full Ranking ->
Gemini 2.5 Pro Experimental (March 2025)
Claude 3.7 Sonnet Thinking (February 2025)
Claude 3.7 Sonnet (February 2025)
GPT-4.5 Preview (February 2025)
View Full Ranking ->
Gemini 2.5 Pro Experimental (March 2025)
Gemini 2.0 Pro Experimental (February 2025)
Gemini 2.0 Flash Thinking Experimental (January 2025)
GPT-4.5 Preview (February 2025)
View Full Ranking ->
If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To preserve leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.