The Hitchhiker's Guide to Open Source Models: Part 1

Benchmarks and Leaderboards
With more than 2 million models available on Hugging Face, it can be overwhelming to know where to get started. In this series of posts I'll break down how to evaluate the best model for a variety of use cases and provide links to useful resources for evaluating options in the future. In part 1 I'll dive into how LLM performance is generally evaluated across various tasks, cover some of the challenges with these approaches, and wrap up with an overview of useful links and resources relevant to specific aspects of model performance.
What Goes Into A Benchmark
Before we talk about leaderboards, it's worth briefly describing how AI model performance is evaluated and compared across model variants.
The standard approach is to ask each model (or an agent built on it) to work through a benchmark: a collection of tasks with well-defined solutions. Performance on the benchmark is then scored, giving us numbers we can use to compare models. Constructing a good benchmark can be surprisingly difficult though; large language models have the interesting feature of memorizing content they've been trained on, and in many cases their training data includes virtually the entire corpus of written human language. This means benchmarks often need to be written from scratch and refreshed constantly as new models are released.
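To make that concrete, here's a minimal sketch of what a simple benchmark harness might look like. The `query_model` function is a hypothetical stand-in for however you actually call the model (an API client, a local inference server, etc.), and the tasks are made up for illustration:

```python
def exact_match_accuracy(tasks, query_model):
    """Score a model on a benchmark of (prompt, expected_answer) pairs.

    `query_model` is a stand-in for however you call the model under test;
    it takes a prompt string and returns the model's text response.
    """
    correct = 0
    for prompt, expected in tasks:
        response = query_model(prompt)
        # Exact-match scoring; real benchmarks often use more forgiving
        # answer extraction, unit tests, or an LLM judge instead.
        if response.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(tasks)

# A tiny, made-up benchmark of two questions.
tasks = [
    ("What is 2 + 2? Answer with a number only.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]
# accuracy = exact_match_accuracy(tasks, query_model)  # plug in your own client here
```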
Generally speaking, benchmarks fall into two buckets: they either specialize in a task or domain (coding, high school math, ...) or generalize across as wide a range of knowledge as possible. As a rule, large models outperform small models, but small fine-tuned models regularly outperform their larger brethren on specialized tasks. It's a balancing act between need, performance, and operating cost/complexity.
Up until March of 2025, Hugging Face operated one of the largest leaderboards. It was shut down after the maintainers recognized that model designers were effectively "teaching to the test." There are, however, other approaches to performance evaluation than the one outlined above. LMArena constructs the equivalent of an Elo rating for LLMs by pitting models against each other in head-to-head queries, using human (and more recently, other LLM) judges to evaluate the qualitative merit of their responses.
What's ultimately most important is choosing benchmarks that reflect your problem domain. So with that, let's talk a bit about the types of benchmarks currently out in the wild and where each is useful.
Benchmark Taxonomy (2025)
Now that we have a little context surrounding what these benchmarks are we can talk about specific benchmarks, what they mean, and the sorts of tasks they can help you evaluate.
| Domain | Key Benchmarks | 2025 Notes |
|---|---|---|
| Reasoning | MMLU, ARC, BBH, GPQA | MMLU saturated; MMLU-Pro emerging |
| Math | GSM8K, MATH, MiniF2F | Math is still a big separator |
| Coding | HumanEval, MBPP, LiveCodeBench, SWE-Bench | HumanEval saturated; LiveCodeBench preferred |
| Chat | MT-Bench, IFEval, AlpacaEval, Arena Elo | Preference-based, multi-turn focus |
| Safety | TruthfulQA, ToxiGen, SEAL | Enterprise critical; some closed-source only |
| Long-Context | Needle-in-a-Haystack, LongBench | Key differentiator with 128k–256k contexts |
| Domain-Specific | MedQA, FinQA, LegalBench | Specialized models dominate their verticals |
Reasoning & Knowledge
- MMLU / MMLU-Pro – academic & professional knowledge across 57 subjects.
- ARC (AI2 Reasoning Challenge) – grade-school science questions that require reasoning rather than recall.
- BIG-Bench / BIG-Bench Hard (BBH) – diverse reasoning with hard cases.
- GPQA (Graduate-Level Google-Proof Q&A) – graduate-level science questions designed to be difficult to solve with keyword search alone.
As of 2025, MMLU is largely saturated, meaning that models have effectively memorized the test, making it much less useful for comparing performance across model families. A newer version called MMLU-Pro is available and remains valuable.
Just as an aside, the ARC Prize / leaderboard is a really interesting project led by François Chollet focused on building a dataset of tasks which are easy for humans but remain surprisingly difficult for language models.
Math & Logic
- GSM8K – grade-school math word problems.
- MATH – advanced competition-level math.
- DROP, ProofWriter, MiniF2F – specialized logical/mathematical reasoning tasks.
Coding & Software Engineering
- HumanEval – functional code completion (saturated).
- MBPP (Mostly Basic Python Problems) – smaller coding tasks.
- LiveCodeBench – dynamic, contamination-resistant coding benchmark.
- SWE-Bench – real-world software bug fixing and pull requests.
Chat & Instruction-Following
- MT-Bench – multi-turn chat quality, GPT-4 as judge.
- IFEval – checks strict adherence to verifiable instructions, which is particularly relevant for tasks requiring structured outputs (see the sketch after this list).
- AlpacaEval / LMSys Arena (Elo ratings) – preference-based chat evaluation.
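To give a flavor of what "verifiable instructions" means in practice, here's a minimal sketch of an IFEval-style programmatic check. The instruction ("respond in valid JSON with exactly these keys") and the checker are made up for illustration rather than taken from the actual IFEval suite:

```python
import json

def follows_json_instruction(response: str, required_keys: set) -> bool:
    """Check that a response is valid JSON containing exactly the required keys.

    The point of this style of benchmark: the instruction is verifiable with a
    small program, so no human or LLM judge is needed to score it.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed.keys()) == required_keys

# Example: the prompt asked for JSON with exactly the keys "name" and "score".
print(follows_json_instruction('{"name": "llama", "score": 7}', {"name", "score"}))          # True
print(follows_json_instruction("Sure! Here's the JSON you asked for...", {"name", "score"}))  # False
```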
Safety & Truthfulness
- TruthfulQA – factual accuracy and avoidance of common misconceptions. This is in many ways an easily gamed benchmark, but its scores appear fairly frequently in the wild.
- ToxiGen, RealToxicityPrompts – check for harmful or biased outputs.
- Safety Benchmarks (SEAL / Scale AI) – private evals on robustness, honesty, and misuse.
You can find an overview of other safety-related benchmarks here. Defining "truth" turns out to be a pretty hard task, so as a rule these benchmarks should be treated with a healthy dose of caution.
Long-Context & Memory
- Needle-in-a-Haystack – retrieval of a small fact buried in a long input (see the sketch after this list).
- LongBench – multi-document tasks with long inputs.
- Synthetic retrieval tests – verifying memory across 32k–256k token contexts.
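To make the needle-in-a-haystack idea concrete, here's a minimal sketch of how such a test can be constructed. The needle, filler text, and the `query_model` call are all hypothetical placeholders:

```python
def build_haystack(needle: str, filler_paragraph: str, num_paragraphs: int, depth: float) -> str:
    """Bury a 'needle' fact at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    paragraphs = [filler_paragraph] * num_paragraphs
    paragraphs.insert(int(depth * num_paragraphs), needle)
    return "\n\n".join(paragraphs)

needle = "The secret passphrase for the vault is 'blue-giraffe-42'."
filler = "The quarterly report covered routine operational updates. " * 10
question = "\n\nWhat is the secret passphrase for the vault?"

# Sweep the needle across depths to see where retrieval starts to degrade.
for depth in (0.0, 0.5, 0.95):
    prompt = build_haystack(needle, filler, num_paragraphs=500, depth=depth) + question
    # response = query_model(prompt)               # hypothetical call to the model under test
    # print(depth, "blue-giraffe-42" in response)  # did the model find the needle?
```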
Domain-Specific Vertical Benchmarks
- MedQA, PubMedQA – medical reasoning.
- FinQA, FiQA – financial question answering.
- LegalBench – legal reasoning and case law analysis.
Leaderboards
While far from an exhaustive list, I've tried to collate references that could be useful across each of our benchmark domains. There are a few caveats to be aware of here. First, we've included the Hugging Face Open LLM leaderboard partly out of respect for its history; however, it has been officially archived, making it less useful for comparisons across more recent models. Second, the vertical-specific benchmarks are often older and may not be as reliable. If you're looking to build in a specific vertical, your best bet may well be to evaluate on your own proprietary data.
Below you'll find a quick table for locating useful leaderboards by problem domain, followed by breakouts of some of the largest leaderboards alongside the general strengths and weaknesses of each.
| Domain | Relevant Leaderboards |
|---|---|
| Reasoning & Knowledge | Hugging Face Open LLM, Vellum AI |
| Math & Logic | Hugging Face Open LLM, Vellum AI |
| Coding & Software Eng. | LiveCodeBench, SWE-Bench, Hugging Face Open LLM |
| Chat & Instruction | LMArena, AlpacaEval, Hugging Face Open LLM |
| Safety & Alignment | SEAL (Scale AI) |
| Long-Context | LongBench, LLM-Stats.com |
| Verticals (Medical, Legal, Finance) | MedQA, LegalBench, FinQA |
| Deployment Metrics | LLM-Stats.com |
LMArena

Overview:
LMArena is a crowdsourced, head-to-head evaluation platform where users compare outputs from two LLMs and vote for the better response. It uses an Elo-style rating system to maintain a dynamic leaderboard of open and proprietary models and has become one of the most visible public benchmarks for conversational ability.
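For intuition, here's a minimal sketch of the classic Elo update that this style of leaderboard is built around. The K-factor and starting ratings are illustrative only, and the real leaderboard uses a more careful statistical fit over all votes rather than naive sequential updates:

```python
def elo_update(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    """Update two models' ratings after a single head-to-head vote.

    winner is "a", "b", or "tie"; k controls how quickly ratings move.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins one vote.
print(elo_update(1000.0, 1000.0, "a"))  # (1016.0, 984.0)
```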
Strengths:
- Human preference-based: Captures qualitative performance (helpfulness, coherence) that static benchmarks miss.
- Wide coverage: Includes both open and closed models (GPT-4, Claude, Gemini, DeepSeek, Llama, etc.).
- Dynamic and real-time: Rankings evolve as new models and votes come in.
- Community engagement: High visibility and large participation base.
Limitations:
- Not easily reproducible: Results depend on subjective crowd judgments.
- Vulnerable to gaming: Meta’s “Maverick” variant was optimized for Arena but not released publicly, raising questions about fairness.
- Focus on chat quality: Does not test coding, long context, or domain-specific skills directly.
- Bias: Voting populations may not represent enterprise or real-world users.
Vellum

Overview:
Vellum provides real-time LLM evaluation dashboards, emphasizing models released after April 2024. It filters out saturated benchmarks like MMLU in favor of fresher metrics across categories like reasoning (e.g., GPQA Diamond), coding, and general performance. The open-source leaderboard can be a useful one-stop shop if you know you're not interested in using a proprietary model.
Strengths:
- Curated, up-to-date evaluation.
- Tailored leaderboards per domain (e.g., reasoning, coding).
Limitations:
- Community- or provider-submitted results may vary in rigor.
LLM-Stats.com

Overview:
An interactive dashboard comparing LLMs across metrics like context window length, inference speed, and pricing. It includes both open and proprietary models.
Strengths:
- Real-world relevance: hardware and cost metrics.
- Cross-model comparisons inclusive of proprietary APIs.
Limitations:
- Benchmark coverage may be limited, though it remains a generally useful tool.
LiveCodeBench

Overview:
LiveCodeBench is a continuously updated benchmark for code LLMs. It evaluates multiple capabilities—code generation, repair, test prediction, and execution—using fresh problems from platforms like LeetCode, AtCoder, and Codeforces. Problems are tagged with release dates to avoid training data contamination, making it more reliable than static datasets like HumanEval.
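As a rough illustration of why the release-date tagging matters, here's a minimal sketch of restricting an evaluation to problems published after a model's training cutoff. The problem records, field names, and dates are made up for the example:

```python
from datetime import date

# Hypothetical problem records; the key point is that each problem carries a release date.
problems = [
    {"id": "lc-0001", "released": date(2023, 6, 1)},
    {"id": "cf-0950", "released": date(2024, 11, 20)},
    {"id": "ac-0321", "released": date(2025, 3, 5)},
]

model_training_cutoff = date(2024, 10, 1)  # illustrative cutoff for the model under test

# Only score problems the model could not have seen during training.
eval_set = [p for p in problems if p["released"] > model_training_cutoff]
print([p["id"] for p in eval_set])  # ['cf-0950', 'ac-0321']
```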
Strengths:
- Contamination-resistant: Uses newly released competitive programming problems.
- Holistic evaluation: Tests generation, self-repair, execution correctness, not just completion.
- Continuously updated: Maintains benchmark relevance over time.
- Detailed scoring: Includes Pass@1 and Bayesian Elo ratings (see the pass@k sketch at the end of this post).
- Pro extension (LiveCodeBench Pro): 584 Olympiad-level curated problems with human expert annotations.
Limitations:
- Code-specific: Doesn’t assess general reasoning or natural language ability.
- Execution overhead: Requires a safe sandbox to run generated code, limiting accessibility.
- High difficulty: Problems are competitive-programming level, which may not reflect everyday coding use cases.
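Since Pass@1 (and pass@k more generally) shows up constantly in coding leaderboards, here's a minimal sketch of the standard unbiased estimator popularized alongside HumanEval: given n sampled solutions per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    passes, given n total samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sized subset, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of them passed.
print(round(pass_at_k(n=20, c=5, k=1), 3))   # 0.25 (equals c / n when k = 1)
print(round(pass_at_k(n=20, c=5, k=10), 3))  # 0.984
```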