- HumanEval: Hand-Written Evaluation Set - GitHub
This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".
- HumanEval: A Benchmark for Evaluating LLM Code Generation Capabilities
HumanEval is a benchmark dataset developed by OpenAI that evaluates the performance of large language models (LLMs) on code generation tasks. It has become a significant tool for assessing how well AI models understand and generate code.
- HumanEval-XL: A Multilingual Code Generation Benchmark for Cross . . .
By providing parallel data across multiple natural languages (NLs) and programming languages (PLs), HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing assessment of how well models understand different natural languages.
- HumanEval: LLM Benchmark for Code Generation | Deepgram
Since its introduction in mid-2021, the HumanEval benchmark has become immensely popular and has emerged as a standard evaluation tool for measuring the performance of LLMs on code generation tasks.
- HumanEval-V
HumanEval-V is a novel benchmark designed to evaluate the ability of Large Multimodal Models (LMMs) to understand and reason over complex diagrams in programming contexts. Unlike traditional multimodal or coding benchmarks, HumanEval-V challenges models to generate Python code from visual inputs that are indispensable for solving the task.
- How to Interpret HumanEval: Can this AI Actually Code? - Statology
HumanEval is a benchmark that tests AI models on their ability to write Python code by presenting them with 164 programming problems and measuring how often their generated solutions pass each problem's test suite.
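Pass rates on HumanEval are typically reported as pass@k: the probability that at least one of k generated samples for a problem passes that problem's tests. Below is a minimal sketch of the unbiased pass@k estimator described in the "Evaluating Large Language Models Trained on Code" paper; the function name and example numbers are illustrative, and the official harness may differ in detail.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem, where n samples
    were generated and c of them passed the problem's test suite."""
    if n - c < k:
        # Every possible size-k subset contains at least one passing sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples for a problem, 37 of which pass.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185 (pass@1 is simply c / n)
print(pass_at_k(n=200, c=37, k=10))  # higher; approaches 1 as k grows
```

The benchmark-level score is the mean of this per-problem estimate over all 164 problems.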