Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
An empirical survey on long document summarization: Datasets, models, and metrics
Long documents such as academic articles and business reports have been the standard
format for detailing important issues and complicated subjects that require extra attention. An …
G-Eval: NLG evaluation using GPT-4 with better human alignment
The quality of texts generated by natural language generation (NLG) systems is hard to
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …
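As a point of contrast with the LLM-based approach, here is a minimal sketch of what a conventional reference-based metric actually computes: ROUGE-1-style clipped unigram overlap with whitespace tokenization. The official ROUGE toolkit adds stemming, multi-reference support, and subsequence-based variants, all omitted here.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1-style unigram overlap (simplified: whitespace tokens, one reference)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a reference token is matched at most as often as it occurs.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge1("the cat sat on the mat", "a cat was sitting on the mat"))
```

Surface-overlap scores of this kind are the baseline against which G-Eval's LLM-based judgments are compared.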
Towards a unified multi-dimensional evaluator for text generation
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural
Language Generation (NLG), i.e., evaluating the generated text from multiple explainable …
SummEval: Re-evaluating summarization evaluation
The scarcity of comprehensive up-to-date studies on evaluation metrics for text
summarization and the lack of consensus regarding evaluation protocols continue to inhibit …
MAUVE: Measuring the gap between neural text and human text using divergence frontiers
As major progress is made in open-ended text generation, measuring how close machine-
generated text is to human language remains a critical open problem. We introduce MAUVE …
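To make "divergence frontiers" concrete, here is a toy sketch over two discrete distributions. MAUVE itself builds those distributions by quantizing language-model embeddings of human and machine text; the scaling constant c = 5 and the stand-in histograms below are illustrative, and the released implementation handles curve endpoints and smoothing more carefully.

```python
import numpy as np

def kl(a: np.ndarray, b: np.ndarray) -> float:
    """KL divergence between discrete distributions (natural log)."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def divergence_frontier_score(p: np.ndarray, q: np.ndarray,
                              c: float = 5.0, grid: int = 201) -> float:
    """Toy MAUVE-style score for two discrete distributions.

    Each mixture weight lam gives r = lam*p + (1-lam)*q and one frontier
    point (exp(-c*KL(q||r)), exp(-c*KL(p||r))); the score is the area
    under the traced frontier (trapezoid rule).
    """
    xs, ys = [], []
    for lam in np.linspace(1e-6, 1 - 1e-6, grid):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-c * kl(q, r)))
        ys.append(np.exp(-c * kl(p, r)))
    xs, ys = np.array(xs), np.array(ys)
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))

p = np.array([0.7, 0.2, 0.05, 0.05])      # stand-in "human text" histogram
q_near = np.array([0.4, 0.3, 0.2, 0.1])   # model close to p -> higher score
q_far = np.array([0.05, 0.05, 0.2, 0.7])  # model far from p -> lower score
print(divergence_frontier_score(p, q_near))
print(divergence_frontier_score(p, q_far))
```

Mixtures r that stay close to both p and q keep both KL terms small, so similar distributions trace a frontier bowed toward (1, 1) and earn a larger area.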
BERTScore: Evaluating text generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
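The greedy matching the abstract describes is a few lines once token embeddings are in hand: cosine similarity between every candidate/reference token pair, then a row-wise max for precision and a column-wise max for recall. The sketch below uses random matrices as stand-ins for real BERT outputs and omits the paper's IDF weighting and baseline rescaling.

```python
import numpy as np

def bertscore_from_embeddings(cand_emb: np.ndarray, ref_emb: np.ndarray) -> dict:
    """Greedy-matching BERTScore over precomputed token embeddings.

    cand_emb: (n_cand, d) contextual embeddings of candidate tokens.
    ref_emb:  (n_ref, d) contextual embeddings of reference tokens.
    IDF weighting and baseline rescaling from the paper are omitted.
    """
    # L2-normalize so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # (n_cand, n_ref) similarity matrix
    precision = sim.max(axis=1).mean()  # best match for each candidate token
    recall = sim.max(axis=0).mean()     # best match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": float(precision), "recall": float(recall), "f1": float(f1)}

rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(6, 768))  # stand-in for BERT embeddings of 6 tokens
ref_emb = rng.normal(size=(8, 768))
print(bertscore_from_embeddings(cand_emb, ref_emb))
```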
MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance
A robust evaluation metric has a profound impact on the development of text generation
systems. A desirable metric compares system output against references based on their …
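MoverScore frames evaluation as optimal transport: the minimum cost of moving the contextualized embeddings of the system output onto those of the reference. As a hedged simplification, with equal-size, uniformly weighted token sets, earth mover distance reduces to a minimum-cost perfect matching, which SciPy's assignment solver handles; the actual metric uses IDF-weighted n-gram embeddings and a general transport formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_uniform(sys_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Earth mover distance between two equal-size, uniformly weighted
    sets of token embeddings, via optimal assignment on cosine cost."""
    a = sys_emb / np.linalg.norm(sys_emb, axis=1, keepdims=True)
    b = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(1)
ref_emb = rng.normal(size=(7, 768))  # stand-in for reference-token embeddings
close = ref_emb + 0.1 * rng.normal(size=(7, 768))  # near-copy of the reference
far = rng.normal(size=(7, 768))      # unrelated embeddings
print(emd_uniform(close, ref_emb))   # small distance
print(emd_uniform(far, ref_emb))     # larger distance
```

A lower transport cost means the system's embedding cloud sits closer to the reference's, which is the sense in which the metric compares outputs "based on their" semantics rather than surface overlap.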
A survey of the usages of deep learning for natural language processing
Over the last several years, the field of natural language processing has been propelled
forward by an explosion in the use of deep learning models. This article provides a brief …
A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …