3. Target Leaderboards & Competitive Thresholds

| Leaderboard | Primary Metric | Benchmarks | Why |
|---|---|---|---|
| EvalPlus | pass@1 | HumanEval+, MBPP+ | Rigorous test suites (80x/35x more tests than the originals) expose real quality; the gold standard |
| BigCodeBench | pass@1 | 1,140 practical tasks | Tests library usage, I/O, and dependencies; not yet saturated (GPT-4o scores ~61%) |
| LiveCodeBench | pass@1 | 1,055 fresh competitive problems | Continuously refreshed from LeetCode/Codeforces; contamination-resistant |
| BigCode Models | pass@1 | HumanEval, MBPP, MultiPL-E | Visibility for code generation models; our primary use case |
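All four leaderboards report pass@1: the probability that a single sampled completion passes all hidden tests. When a model is sampled n times per problem, this is computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch of that formula follows; the function name and example numbers are illustrative, not taken from any leaderboard's codebase.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for one problem
    c: completions that pass every hidden test
    k: evaluation budget (k=1 gives pass@1)
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # pass@k = 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, varying pass counts.
per_problem = [pass_at_k(200, c, 1) for c in (37, 52, 0)]
print([round(s, 3) for s in per_problem])  # benchmark score = mean over problems
```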

3.1 Competitive Score Thresholds (2025-2026)

HumanEval is approaching saturation (open-weight SOTA 92.7%), so BigCodeBench and LiveCodeBench differentiate models more meaningfully.
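A back-of-envelope check (our own, treating pass@1 as an i.i.d. binomial proportion) makes this concrete: HumanEval has only 164 problems, so near the top of the range the 95% confidence interval on a score spans roughly four points, wider than the gaps separating the leading models.

```python
from math import sqrt

# HumanEval has 164 problems; model pass@1 as a binomial proportion
# (an illustrative simplification -- real runs add sampling noise on top).
N = 164
for p in (0.927, 0.85, 0.75):
    se = sqrt(p * (1 - p) / N)  # standard error of the mean pass rate
    print(f"pass@1 = {p:.1%}: 95% CI ~ +/-{1.96 * se:.1%}; "
          f"one problem = {1 / N:.1%} of the score")
```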

| Benchmark | Not Competitive | Entry | Strong | SOTA (Open) |
|---|---|---|---|---|
| HumanEval (pass@1) | <60% | 60-75% | 75-85% | 85-93% |
| HumanEval+ (pass@1) | <70% | 70-80% | 80-85% | 85-89% |
| MBPP (pass@1) | <70% | 70-80% | 80-85% | 85-91% |
| BigCodeBench-Full (pass@1) | <30% | 30-40% | 40-50% | 50%+ |
| LiveCodeBench (pass@1) | <20% | 20-40% | 40-60% | 60%+ |
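For gating our own checkpoints against these bands, a small lookup helper is enough. A minimal sketch with the cut-offs copied from the table above; the dictionary keys and function name are ours, not any leaderboard's API.

```python
# Lower bounds of the Entry / Strong / SOTA bands from the table above.
TIERS = {
    "humaneval":         (60, 75, 85),
    "humaneval_plus":    (70, 80, 85),
    "mbpp":              (70, 80, 85),
    "bigcodebench_full": (30, 40, 50),
    "livecodebench":     (20, 40, 60),
}

def tier(benchmark: str, pass_at_1: float) -> str:
    """Map a pass@1 score (in percent) onto the competitive bands."""
    entry, strong, sota = TIERS[benchmark]
    if pass_at_1 < entry:
        return "not competitive"
    if pass_at_1 < strong:
        return "entry"
    if pass_at_1 < sota:
        return "strong"
    return "SOTA (open)"

# Qwen2.5-Coder-32B-Instruct's 31.4% LiveCodeBench score lands in "entry".
print(tier("livecodebench", 31.4))
```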

3.2 The Landscape: Who Holds the Crown

32B class — current SOTA:

| Model | HumanEval | HE+ | MBPP | LiveCodeBench | License |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B-Instruct | 92.7% | 87.2% | 90.2% | 31.4% | Apache-2.0 |
| OCR-Nemotron-32B | – | – | – | 61.8% | Apache-2.0 |
| R1-Distill-Qwen-32B | – | – | – | 58.1% | MIT |
| DeepSeek-Coder-V2 (236B MoE) | 85.4% | 82.3% | – | – | Restricted |
| Codestral 25.01 (22B) | 86.6% | – | 91.2% | – | Restricted |

7B class — current SOTA:

| Model | HumanEval | HE+ | MBPP | LiveCodeBench | License |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | 87.8%† | 84.1% | 83.5% | 18.2% | Apache-2.0 |
| OCR-Nemotron-7B | – | – | – | 51.3% | Apache-2.0 |
| DeepSeek-Coder-V2-Lite (16B MoE) | 81.1% | – | – | – | Restricted |
| Phi-4 (14B) | 82.6% | – | – | – | MIT |

† EvalPlus leaderboard score. The Qwen model card reports 88.4% (different test harness).

Critical gap: Qwen2.5-Coder dominates the standard benchmarks (HumanEval, MBPP) but lags on LiveCodeBench. The gap is reasoning: OCR-Nemotron-32B (distilled from DeepSeek-R1) nearly doubles Qwen's LiveCodeBench score (61.8% vs. 31.4%). This is the improvement vector.