BARTScore: Evaluating generated text as text generation

W Yuan, G Neubig, P Liu - Advances in Neural Information …, 2021 - proceedings.neurips.cc
A wide variety of NLP applications, such as machine translation, summarization, and dialog,
involve text generation. One major challenge for these applications is how to evaluate …

SummEval: Re-evaluating summarization evaluation

AR Fabbri, W Kryściński, B McCann, C Xiong… - Transactions of the …, 2021 - direct.mit.edu
The scarcity of comprehensive up-to-date studies on evaluation metrics for text
summarization and the lack of consensus regarding evaluation protocols continue to inhibit …

Leveraging large language models for NLG evaluation: Advances and challenges

Z Li, X Xu, T Shen, C Xu, JC Gu, Y Lai… - Proceedings of the …, 2024 - aclanthology.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

A survey on multi-modal summarization

A Jangra, S Mukherjee, A Jatowt, S Saha… - ACM Computing …, 2023 - dl.acm.org
The new era of technology has brought us to the point where it is convenient for people to
share their opinions over an abundance of platforms. These platforms have a provision for …

MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance

W Zhao, M Peyrard, F Liu, Y Gao, CM Meyer… - arXiv preprint arXiv …, 2019 - arxiv.org
A robust evaluation metric has a profound impact on the development of text generation
systems. A desirable metric compares system output against references based on their …

Bridging the gap: A survey on integrating (human) feedback for natural language generation

P Fernandes, A Madaan, E Liu, A Farinhas… - Transactions of the …, 2023 - direct.mit.edu
Natural language generation has witnessed significant advancements due to the training of
large language models on vast internet-scale datasets. Despite these advancements, there …

Re-evaluating evaluation in text summarization

M Bhandari, P Gour, A Ashfaq, P Liu… - arXiv preprint arXiv …, 2020 - arxiv.org
Automated evaluation metrics as a stand-in for manual evaluation are an essential part of
the development of text-generation tasks such as text summarization. However, while the …

Leveraging large language models for NLG evaluation: A survey

Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization

Y Gao, W Zhao, S Eger - arXiv preprint arXiv:2005.03724, 2020 - arxiv.org
We study unsupervised multi-document summarization evaluation metrics, which require
neither human-written reference summaries nor human annotations (e.g. preferences …

What have we achieved on text summarization?

D Huang, L Cui, S Yang, G Bao, K Wang, J Xie… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning has led to significant improvement in text summarization with various
methods investigated and improved ROUGE scores reported over the years. However, gaps …