Target Leaderboards & Competitive Thresholds

Leaderboard	Primary Metric	Benchmarks	Why
EvalPlus	pass@1	HumanEval+, MBPP+	Extended test suites (80x more tests than the original for HumanEval+, 35x for MBPP+) catch solutions that only pass the weaker original tests
BigCodeBench	pass@1	1,140 practical tasks	Tests library usage, I/O, and dependencies — not yet saturated (GPT-4o scores ~61%)
LiveCodeBench	pass@1	1,055 fresh competitive problems	Continuously refreshed from LeetCode/CodeForces — contamination-resistant
BigCode Models	pass@1	HumanEval, MBPP, MultiPL-E	Code generation visibility — our primary use case
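
Every leaderboard above ranks models by pass@1. As a minimal sketch of how that number is commonly computed, the block below implements the unbiased pass@k estimator introduced with HumanEval, where n samples are generated per problem and c of them pass all tests. The source does not say which sampling setup each leaderboard uses, so the function names and the (n, c) input layout here are illustrative assumptions. With a single greedy sample per problem (n = 1), pass@1 reduces to the fraction of problems solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed every test, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: 4 problems, one greedy sample each, 3 solved.
    greedy = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(f"pass@1 = {benchmark_pass_at_k(greedy, k=1):.2%}")
```

For k = 1 the estimator is simply c / n per problem, so sampling more than once only reduces variance; it does not change what is being measured.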

3.1 Competitive Score Thresholds (2025-2026)

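HumanEval is approaching saturation: the best open model in section 3.2, Qwen2.5-Coder-32B-Instruct, already scores 92.7%. BigCodeBench and LiveCodeBench still separate strong models from weak ones, so they carry more signal.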

Benchmark	Not Competitive	Entry	Strong	SOTA (Open)
HumanEval (pass@1)	<60%	60-75%	75-85%	85-93%
HumanEval+ (pass@1)	<70%	70-80%	80-85%	85-89%
MBPP (pass@1)	<70%	70-80%	80-85%	85-91%
BigCodeBench-Full (pass@1)	<30%	30-40%	40-50%	50%+
LiveCodeBench (pass@1)	<20%	20-40%	40-60%	60%+
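
The threshold table can be applied mechanically. The sketch below encodes the band edges from the table and maps a score to its tier. The names `THRESHOLDS`, `TIERS`, and `classify` are hypothetical, and because adjacent bands in the table share their boundary value (for example 75% is both the top of Entry and the bottom of Strong), the code assumes a boundary score belongs to the higher tier.

```python
from bisect import bisect_right

# Lower edges of the Entry, Strong, and SOTA (Open) bands, in percent,
# copied from the threshold table.
THRESHOLDS = {
    "HumanEval": (60.0, 75.0, 85.0),
    "HumanEval+": (70.0, 80.0, 85.0),
    "MBPP": (70.0, 80.0, 85.0),
    "BigCodeBench-Full": (30.0, 40.0, 50.0),
    "LiveCodeBench": (20.0, 40.0, 60.0),
}

TIERS = ("Not Competitive", "Entry", "Strong", "SOTA (Open)")


def classify(benchmark: str, score: float) -> str:
    """Return the tier for a pass@1 score given in percent.

    Assumption: a score exactly on a boundary counts as the higher tier.
    """
    edges = THRESHOLDS[benchmark]
    return TIERS[bisect_right(edges, score)]


if __name__ == "__main__":
    # Scores taken from the landscape tables in section 3.2.
    print(classify("HumanEval", 92.7))      # Qwen2.5-Coder-32B-Instruct
    print(classify("LiveCodeBench", 31.4))  # Qwen2.5-Coder-32B-Instruct
    print(classify("LiveCodeBench", 61.8))  # OCR-Nemotron-32B
```

Run on the scores in section 3.2, the same model can land in different tiers on different benchmarks, which is why the table is kept per benchmark rather than collapsed into one rating.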

3.2 The Landscape: Who Holds the Crown

32B class, current SOTA (DeepSeek-Coder-V2 and Codestral are listed for reference despite their different sizes):

Model	HumanEval	HE+	MBPP	LiveCode	License
Qwen2.5-Coder-32B-Instruct	92.7%	87.2%	90.2%	31.4%	Apache-2.0
OCR-Nemotron-32B	—	—	—	61.8%	Apache-2.0
R1-Distill-Qwen-32B	—	—	—	58.1%	MIT
DeepSeek-Coder-V2 (236B MoE)	85.4%	82.3%	—	—	Restricted
Codestral 25.01 (22B)	86.6%	—	91.2%	—	Restricted
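
Because many cells in this table are unreported, comparisons have to skip missing values rather than treat them as zero. The following is a rough sketch of that lookup over the 32B rows; the tuple layout and the `leaders` helper are assumptions made for illustration, with `None` standing in for each dash in the table.

```python
from typing import Optional

Row = tuple[str, Optional[float], Optional[float], Optional[float], Optional[float]]

COLUMNS = ("HumanEval", "HE+", "MBPP", "LiveCode")

# (model, HumanEval, HE+, MBPP, LiveCode) in percent; None = not reported.
ROWS_32B: list[Row] = [
    ("Qwen2.5-Coder-32B-Instruct", 92.7, 87.2, 90.2, 31.4),
    ("OCR-Nemotron-32B", None, None, None, 61.8),
    ("R1-Distill-Qwen-32B", None, None, None, 58.1),
    ("DeepSeek-Coder-V2 (236B MoE)", 85.4, 82.3, None, None),
    ("Codestral 25.01 (22B)", 86.6, None, 91.2, None),
]


def leaders(rows: list[Row]) -> dict[str, tuple[str, float]]:
    """Best reported score per benchmark, ignoring unreported cells."""
    best: dict[str, tuple[str, float]] = {}
    for idx, column in enumerate(COLUMNS, start=1):
        reported = [(row[idx], row[0]) for row in rows if row[idx] is not None]
        if reported:
            score, model = max(reported)
            best[column] = (model, score)
    return best


if __name__ == "__main__":
    for column, (model, score) in leaders(ROWS_32B).items():
        print(f"{column}: {model} ({score}%)")
```

A leader found this way is only the best among models that reported a score for that column, so a missing cell should be read as unknown, not as a loss.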

7B class, current SOTA (DeepSeek-Coder-V2-Lite and Phi-4 are listed for reference despite their larger sizes):

Model	HumanEval	HE+	MBPP	LiveCode	License
Qwen2.5-Coder-7B-Instruct	87.8%†	84.1%	83.5%	18.2%	Apache-2.0
OCR-Nemotron-7B	—	—	—	51.3%	Apache-2.0
DeepSeek-Coder-V2-Lite (16B MoE)	81.1%	—	—	—	Restricted
Phi-4 (14B)	82.6%	—	—	—	MIT

†EvalPlus leaderboard score. Qwen model card reports 88.4% (different test harness).

Critical gap: Qwen2.5-Coder is at or near the top on the standard benchmarks (HumanEval, MBPP) but falls behind on LiveCodeBench. The difference is reasoning ability: OCR-Nemotron-32B, which is distilled from DeepSeek-R1, scores 61.8% on LiveCodeBench against 31.4% for Qwen2.5-Coder-32B-Instruct, nearly double. Closing this reasoning gap is the main improvement target.
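
The size of that gap can be checked directly from the LiveCodeBench column of the two tables. The snippet below is a small worked calculation, not part of any leaderboard tooling; the `LIVECODEBENCH` dictionary and the `gap` helper are hypothetical names, and the scores are copied from the tables in this section.

```python
# LiveCodeBench pass@1 (percent) for the Qwen2.5-Coder and OCR-Nemotron
# models in each size class, copied from the tables in section 3.2.
LIVECODEBENCH = {
    "32B": {"Qwen2.5-Coder-32B-Instruct": 31.4, "OCR-Nemotron-32B": 61.8},
    "7B": {"Qwen2.5-Coder-7B-Instruct": 18.2, "OCR-Nemotron-7B": 51.3},
}


def gap(scores: dict[str, float]) -> tuple[float, float]:
    """Return (absolute gap in points, ratio of higher to lower score)."""
    low, high = sorted(scores.values())
    return high - low, high / low


if __name__ == "__main__":
    for size, scores in LIVECODEBENCH.items():
        points, ratio = gap(scores)
        print(f"{size}: +{points:.1f} points, {ratio:.2f}x")
```

On these numbers the 32B ratio is about 1.97x and the 7B ratio about 2.82x, so the gap is proportionally larger in the smaller size class.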
