EvalPlus Leaderboard 🤗 More Leaderboards: In addition to the EvalPlus leaderboards, it is recommended to build a comprehensive picture of LLM coding ability through a diverse set of benchmarks and leaderboards, such as:
SWE-bench Leaderboards: SWE-bench Bash Only uses the SWE-bench Verified dataset with the mini-SWE-agent environment for all models. SWE-bench Lite is a subset curated for less costly evaluation. SWE-bench Verified is a human-filtered subset. SWE-bench Multimodal features issues with visual elements. Each entry reports the % Resolved metric, the percentage of instances solved (out of 2294
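The % Resolved metric described above is a simple ratio, and a minimal sketch makes the arithmetic concrete. The function name `percent_resolved` and the counts in the usage line are hypothetical, chosen only for illustration; they are not taken from any leaderboard or from the SWE-bench tooling.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return the percentage of benchmark instances that were resolved.

    Hypothetical helper for illustration; not part of any SWE-bench package.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Made-up counts for illustration only, not real leaderboard results.
score = percent_resolved(resolved=150, total=500)
print(f"{score:.1f}% resolved")  # prints "30.0% resolved"
```

Because the denominator differs between subsets, a % Resolved figure is only comparable across entries evaluated on the same subset.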
[2403.19114] Top Leaderboard Ranking = Top Coding Proficiency, Always . . . LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety.
Big Code Models Leaderboard - a Hugging Face Space by bigcode: Submit your code models for evaluation and view a leaderboard comparing various models on performance benchmarks. You need to provide the model name, revision, precision, and type. The app will add
Vibe Code Arena: Benchmark vibe coding performance across different frameworks and versions.
LLM Leaderboard 2025 - Vellum: This LLM leaderboard displays the latest public benchmark performance for SOTA model versions released after April 2024. The data comes from model providers as well as from evaluations run independently by Vellum or the open-source community. We feature results from non-saturated benchmarks, excluding outdated benchmarks (e.g., MMLU). If you want to use these models in your agents