The Hitchhiker's Guide to Open Source Models: Part 1

Benchmarks and Leaderboards
With more than 2 million models available on Hugging Face, it can be overwhelming to know where to get started. In this series of posts I'll break down how to evaluate the best model for a variety of use cases and provide links to useful resources for evaluating options in the future. In part 1 I'll dive into how LLM performance is generally evaluated across various tasks, cover some of the challenges with these approaches, and wrap up with an overview of useful links and resources relevant to specific aspects of model performance.
What Goes Into A Benchmark
Before we talk about leaderboards, it's worth briefly describing how AI model performance is evaluated and compared across model variants.
The standard approach is to ask each model (or an agent built on it) to work through a benchmark: a collection of tasks with well-defined solutions. Performance on the benchmark is then scored, giving us numbers we can use to compare models. Constructing a good benchmark can be surprisingly difficult though; large language models have the interesting feature of memorizing content they've been trained on, and in many cases their training data includes virtually the entire corpus of written human language. This means benchmarks often need to be written from scratch and refreshed constantly as new models are released.
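To make that concrete, here's a minimal sketch of what a simple benchmark harness might look like. The `query_model` function is a hypothetical stand-in for however you actually call the model (an API client, a local inference server, etc.), and the tasks are made up for illustration:

```python
def exact_match_accuracy(tasks, query_model):
    """Score a model on a benchmark of (prompt, expected_answer) pairs.

    `query_model` is a stand-in for however you call the model under test;
    it takes a prompt string and returns the model's text response.
    """
    correct = 0
    for prompt, expected in tasks:
        response = query_model(prompt)
        # Exact-match scoring; real benchmarks often use more forgiving
        # answer extraction, unit tests, or an LLM judge instead.
        if response.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(tasks)

# A tiny, made-up benchmark of two questions.
tasks = [
    ("What is 2 + 2? Answer with a number only.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]
# accuracy = exact_match_accuracy(tasks, query_model)  # plug in your own client here
```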
Generally speaking, benchmarks fall into two buckets: they either specialize in a task or domain (coding, high school math, ...) or generalize across as wide a range of knowledge as possible. As a rule, large models outperform small models, but small fine-tuned models regularly outperform their larger brethren on specialized tasks. It's a balancing act between need, performance, and operating cost/complexity.
Up until March of 2025, Hugging Face operated one of the largest leaderboards. It was shut down after the maintainers recognized that model designers were effectively "teaching to the test." There are, however, other approaches to performance evaluation than the one outlined above. LMArena constructs the equivalent of an Elo rating for LLMs by pitting models against each other in head-to-head queries, using human (and more recently, other LLM) judges to evaluate the qualitative merit of their responses.
What's ultimately most important is choosing benchmarks that reflect your problem domain. So with that, let's talk a bit about the types of benchmarks currently out in the wild and where each is useful.
Benchmark Taxonomy (2025)
Now that we have a little context surrounding what these benchmarks are we can talk about specific benchmarks, what they mean, and the sorts of tasks they can help you evaluate.
| Domain | Key Benchmarks | 2025 Notes |
|---|---|---|
| Reasoning | MMLU, ARC, BBH, GPQA | MMLU saturated; MMLU-Pro emerging |
| Math | GSM8K, MATH, MiniF2F | Math is still a big separator |
| Coding | HumanEval, MBPP, LiveCodeBench, SWE-Bench | HumanEval saturated; LiveCodeBench preferred |
| Chat | MT-Bench, IFEval, AlpacaEval, Arena Elo | Preference-based, multi-turn focus |
| Safety | TruthfulQA, ToxiGen, SEAL | Enterprise critical; some closed-source only |
| Long-Context | Needle-in-a-Haystack, LongBench | Key differentiator with 128k–256k contexts |
| Domain-Specific | MedQA, FinQA, LegalBench | Specialized models dominate their verticals |
Reasoning & Knowledge
- MMLU / MMLU-Pro – academic & professional knowledge across 57 subjects.
- ARC (AI2 Reasoning Challenge) – grade-school science questions that require reasoning rather than recall.
- BIG-Bench / BIG-Bench Hard (BBH) – diverse reasoning with hard cases.
- GPQA (Graduate-Level Google-Proof Q&A) – graduate-level science questions designed to be difficult to solve with keyword search alone.
As of 2025, MMLU is largely saturated, meaning that models have effectively memorized the test, making it much less useful for comparing performance across model families. A newer version called MMLU-Pro is available and remains valuable.
Just as an aside, the ARC Prize / leaderboard is a really interesting project led by François Chollet focused on building a dataset of tasks which are easy for humans but remain surprisingly difficult for language models.
Math & Logic
- GSM8K – grade-school math word problems.
- MATH – advanced competition-level math.
- DROP, ProofWriter, MiniF2F – specialized logical/mathematical reasoning tasks.
Coding & Software Engineering
- HumanEval – functional code completion (saturated).
- MBPP (Mostly Basic Python Problems) – smaller coding tasks.
- LiveCodeBench – dynamic, contamination-resistant coding benchmark.
- SWE-Bench – real-world software bug fixing and pull requests.
Chat & Instruction-Following
- MT-Bench – multi-turn chat quality, GPT-4 as judge.
- IFEval – checks strict adherence to verifiable instructions, which is particularly relevant for tasks requiring structured outputs (see the sketch after this list).
- AlpacaEval / LMSys Arena (Elo ratings) – preference-based chat evaluation.
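To give a flavor of what "verifiable instructions" means in practice, here's a minimal sketch of an IFEval-style programmatic check. The instruction ("respond in valid JSON with exactly these keys") and the checker are made up for illustration rather than taken from the actual IFEval suite:

```python
import json

def follows_json_instruction(response: str, required_keys: set) -> bool:
    """Check that a response is valid JSON containing exactly the required keys.

    The point of this style of benchmark: the instruction is verifiable with a
    small program, so no human or LLM judge is needed to score it.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed.keys()) == required_keys

# Example: the prompt asked for JSON with exactly the keys "name" and "score".
print(follows_json_instruction('{"name": "llama", "score": 7}', {"name", "score"}))          # True
print(follows_json_instruction("Sure! Here's the JSON you asked for...", {"name", "score"}))  # False
```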
Safety & Truthfulness
- TruthfulQA – factual accuracy and avoidance of common misconceptions. This is in many ways an easily gamed benchmark, but its scores appear fairly frequently in the wild.
- ToxiGen, RealToxicityPrompts – check for harmful or biased outputs.
- Safety Benchmarks (SEAL / Scale AI) – private evals on robustness, honesty, and misuse.
You can find an overview of other safety-related benchmarks here. Defining "truth" turns out to be a pretty hard task, so as a rule these benchmarks should be treated with a healthy dose of caution.
Long-Context & Memory
- Needle-in-a-Haystack – retrieval of a small fact buried in a long input (see the sketch after this list).
- LongBench – multi-document tasks with long inputs.
- Synthetic retrieval tests – verifying memory across 32k–256k token contexts.
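To make the needle-in-a-haystack idea concrete, here's a minimal sketch of how such a test can be constructed. The needle, filler text, and the `query_model` call are all hypothetical placeholders:

```python
def build_haystack(needle: str, filler_paragraph: str, num_paragraphs: int, depth: float) -> str:
    """Bury a 'needle' fact at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    paragraphs = [filler_paragraph] * num_paragraphs
    paragraphs.insert(int(depth * num_paragraphs), needle)
    return "\n\n".join(paragraphs)

needle = "The secret passphrase for the vault is 'blue-giraffe-42'."
filler = "The quarterly report covered routine operational updates. " * 10
question = "\n\nWhat is the secret passphrase for the vault?"

# Sweep the needle across depths to see where retrieval starts to degrade.
for depth in (0.0, 0.5, 0.95):
    prompt = build_haystack(needle, filler, num_paragraphs=500, depth=depth) + question
    # response = query_model(prompt)               # hypothetical call to the model under test
    # print(depth, "blue-giraffe-42" in response)  # did the model find the needle?
```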
Domain-Specific Vertical Benchmarks
- MedQA, PubMedQA – medical reasoning.
- FinQA, FiQA – financial question answering.
- LegalBench – legal reasoning and case law analysis.
Leaderboards
While far from an exhaustive list, I've tried to collate references that could be useful across each of our benchmark domains. There are a few caveats to be aware of here. First, we've included the Hugging Face Open LLM leaderboard partly out of respect for its history; however, it has been officially archived, making it less useful for comparisons across more recent models. Second, the vertical-specific benchmarks are often older and may not be as reliable. If you're looking to build in a specific vertical, your best bet may well be to evaluate on your own proprietary data.
Below you'll find a quick table for locating useful leaderboards by problem domain, followed by breakouts of some of the largest leaderboards alongside the general strengths and weaknesses of each.
| Domain | Relevant Leaderboards |
|---|---|
| Reasoning & Knowledge | Hugging Face Open LLM, Vellum AI |
| Math & Logic | Hugging Face Open LLM, Vellum AI |
| Coding & Software Eng. | LiveCodeBench, SWE-Bench, Hugging Face Open LLM |
| Chat & Instruction | LMArena, AlpacaEval, Hugging Face Open LLM |
| Safety & Alignment | SEAL (Scale AI) |
| Long-Context | LongBench, LLM-Stats.com |
| Verticals (Medical, Legal, Finance) | MedQA, LegalBench, FinQA |
| Deployment Metrics | LLM-Stats.com |
LMArena

Overview:
LMArena is a crowdsourced, head-to-head evaluation platform where users compare outputs from two LLMs and vote for the better response. It uses an Elo-style rating system to maintain a dynamic leaderboard of open and proprietary models and has become one of the most visible public benchmarks for conversational ability.
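For intuition, here's a minimal sketch of the classic Elo update that this style of leaderboard is built around. The K-factor and starting ratings are illustrative only, and the real leaderboard uses a more careful statistical fit over all votes rather than naive sequential updates:

```python
def elo_update(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """Update two models' ratings after a single head-to-head vote.

    winner is "a", "b", or "tie"; k controls how quickly ratings move.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins one vote.
print(elo_update(1000.0, 1000.0, "a"))  # (1016.0, 984.0)
```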
Strengths:
- Human preference-based: Captures qualitative performance (helpfulness, coherence) that static benchmarks miss.
- Wide coverage: Includes both open and closed models (GPT-4, Claude, Gemini, DeepSeek, Llama, etc.).
- Dynamic and real-time: Rankings evolve as new models and votes come in.
- Community engagement: High visibility and large participation base.
Limitations:
- Not easily reproducible: Results depend on subjective crowd judgments.
- Vulnerable to gaming: Meta’s “Maverick” variant was optimized for Arena but not released publicly, raising questions about fairness.
- Focus on chat quality: Does not test coding, long context, or domain-specific skills directly.
- Bias: Voting populations may not represent enterprise or real-world users.
Vellum

Overview:
Vellum provides real-time LLM evaluation dashboards, emphasizing models released after April 2024. It filters out saturated benchmarks like MMLU in favor of fresher metrics across categories like reasoning (e.g., GPQA Diamond), coding, and general performance. The open-source leaderboard can be a useful one-stop shop if you know you're not interested in using a proprietary model.
Strengths:
- Curated, up-to-date evaluation.
- Tailored leaderboards per domain (e.g., reasoning, coding).
Limitations:
- Community- or provider-submitted results may vary in rigor.
LLM-Stats.com

Overview:
An interactive dashboard comparing LLMs across metrics like context window length, inference speed, and pricing. It includes both open and proprietary models.
Strengths:
- Real-world relevance: hardware and cost metrics.
- Cross-model comparisons inclusive of proprietary APIs.
Limitations:
- Benchmark coverage may be limited, though it remains a generally useful tool.
LiveCodeBench

Overview:
LiveCodeBench is a continuously updated benchmark for code LLMs. It evaluates multiple capabilities—code generation, repair, test prediction, and execution—using fresh problems from platforms like LeetCode, AtCoder, and Codeforces. Problems are tagged with release dates to avoid training data contamination, making it more reliable than static datasets like HumanEval.
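As a rough illustration of why the release-date tagging matters, here's a minimal sketch of restricting an evaluation to problems published after a model's training cutoff. The problem records, field names, and dates are made up for the example:

```python
from datetime import date

# Hypothetical problem records; the key point is that each problem carries a release date.
problems = [
    {"id": "lc-0001", "released": date(2023, 6, 1)},
    {"id": "cf-0950", "released": date(2024, 11, 20)},
    {"id": "ac-0321", "released": date(2025, 3, 5)},
]

model_training_cutoff = date(2024, 10, 1)  # illustrative cutoff for the model under test

# Only score problems the model could not have seen during training.
eval_set = [p for p in problems if p["released"] > model_training_cutoff]
print([p["id"] for p in eval_set])  # ['cf-0950', 'ac-0321']
```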
Strengths:
- Contamination-resistant: Uses newly released competitive programming problems.
- Holistic evaluation: Tests generation, self-repair, execution correctness, not just completion.
- Continuously updated: Maintains benchmark relevance over time.
- Detailed scoring: Includes Pass@1 and Bayesian Elo ratings (see the pass@k sketch at the end of this post).
- Pro extension (LiveCodeBench Pro): 584 Olympiad-level curated problems with human expert annotations.
Limitations:
- Code-specific: Doesn’t assess general reasoning or natural language ability.
- Execution overhead: Requires a safe sandbox to run generated code, limiting accessibility.
- High difficulty: Problems are competitive-programming level, which may not reflect everyday coding use cases.
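Since Pass@1 (and pass@k more generally) shows up constantly in coding leaderboards, here's a minimal sketch of the standard unbiased estimator popularized alongside HumanEval: given n sampled solutions per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    passes, given n total samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sized subset, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of them passed.
print(round(pass_at_k(n=20, c=5, k=1), 3))   # 0.25 (equals c / n when k = 1)
print(round(pass_at_k(n=20, c=5, k=10), 3))  # 0.984
```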