Towards a Unified Multi-Dimensional Evaluator for Text Generation - ACL Anthology. In this paper, we propose UniEval, a unified multi-dimensional evaluator for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
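A minimal sketch of the Boolean-QA framing described above: a single evaluator model answers a different yes/no question per dimension, and the probability of "Yes" serves as the score. The checkpoint (google/flan-t5-base), the prompt wording, and the question templates are illustrative assumptions, not UniEval's released model or exact templates.

```python
# Sketch: evaluate multiple dimensions with one model by asking yes/no questions.
# Model and question wording are assumptions, not UniEval's exact setup.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def boolean_qa_score(question: str, context: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for a yes/no evaluation question."""
    prompt = f"{question}\n{context}\nAnswer Yes or No."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Score only the first decoder step and compare the 'Yes' and 'No' logits.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# One question per dimension; the same evaluator is reused across dimensions.
summary = "The cat sat on the mat."
source = "A cat was observed sitting on a mat in the living room."
questions = {
    "coherence": "Is this a coherent summary of the document?",
    "consistency": "Is this summary factually consistent with the document?",
}
for dim, q in questions.items():
    score = boolean_qa_score(q, f"Summary: {summary}\nDocument: {source}")
    print(dim, round(score, 3))
```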
Generating Text from Language Models - ACL Anthology. We aim for NLP practitioners and researchers to leave our tutorial with a unified framework which they can use to evaluate and contribute to the latest research in language generation.
LLM-Eval: Unified Multi-Dimensional Automatic ... - ACL Anthology. Existing evaluation methods often rely on human annotations, ground-truth responses, or multiple LLM prompts, which can be expensive and time-consuming. To address these issues, we design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call.
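A hedged sketch of the single-call idea: one prompt asks the model to score every dimension of a unified schema at once and return structured output. The schema fields, the 0-5 scale, the model name, and the prompt wording are illustrative assumptions, not the paper's exact prompt.

```python
# Sketch: score multiple dialogue-quality dimensions in a single model call.
# Schema, scale, and model name are assumptions, not LLM-Eval's exact prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = ["appropriateness", "content", "grammar", "relevance"]

def llm_eval(context: str, response: str) -> dict:
    prompt = (
        "Evaluate the response to the dialogue context below.\n"
        f"Score each of these dimensions from 0 to 5: {', '.join(SCHEMA)}.\n"
        "Return only a JSON object mapping each dimension to a score.\n\n"
        f"Context: {context}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                          # any instruction-following chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                                # deterministic scoring
        response_format={"type": "json_object"},      # force parseable JSON
    )
    return json.loads(completion.choices[0].message.content)

print(llm_eval("How do I reset my password?",
               "Click 'Forgot password' on the login page and follow the email link."))
```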
A Unified Model for Automated Evaluation of Text Generation Systems - ERCIM. With all the recent hype around Large Language Models and ChatGPT in particular, one crucial question is still unanswered: how do we evaluate generated text, and how can this be automated? In this SNF project, we develop a theoretical framework to answer these questions.
Multi-Dimensional Evaluation of Text Summarization with In-Context Learning. Evaluation of natural language generation (NLG) is complex and multi-dimensional. Generated text can be evaluated for fluency, coherence, factuality, or any other dimensions of interest. Most frameworks that perform such multi-dimensional evaluation require training on large manually or synthetically generated datasets. In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
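A rough sketch of the in-context-learning angle: instead of training an evaluator, a few scored exemplars are placed in the prompt and the LLM rates the target summary on the requested dimension. The exemplars, the 1-5 scale, and the model are illustrative assumptions, not the paper's setup.

```python
# Sketch: few-shot (in-context) evaluation of a summary on one dimension.
# Exemplars, scale, and model are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

# Hypothetical in-context exemplars: (source, summary, score) for one dimension.
EXEMPLARS = [
    ("The city council approved the new park budget on Tuesday.",
     "The council approved the park budget.", 5),
    ("The city council approved the new park budget on Tuesday.",
     "The council rejected all budgets.", 1),
]

def icl_eval(dimension: str, source: str, summary: str) -> str:
    shots = "\n\n".join(
        f"Source: {s}\nSummary: {m}\n{dimension.capitalize()} score (1-5): {sc}"
        for s, m, sc in EXEMPLARS
    )
    prompt = (f"Rate the {dimension} of each summary on a 1-5 scale.\n\n"
              f"{shots}\n\nSource: {source}\nSummary: {summary}\n"
              f"{dimension.capitalize()} score (1-5):")
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip()

print(icl_eval("consistency",
               "Researchers released a 7B evaluator model this week.",
               "A 7B evaluator model was released."))
```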
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with ... We evaluate INSTRUCTSCORE on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4.