A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

BERT: a review of applications in natural language processing and understanding

MV Koroteev - arXiv preprint arXiv:2103.11943, 2021 - arxiv.org
In this review, we describe the application of one of the most popular deep learning-based
language models-BERT. The paper describes the mechanism of operation of this model, the …

BLEURT: Learning robust metrics for text generation

T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …

BERTScore: Evaluating text generation with BERT

T Zhang, V Kishore, F Wu, KQ Weinberger… - arXiv preprint arXiv …, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
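
BERTScore's token-level matching can be illustrated without a real BERT model. The sketch below is a minimal assumption-laden illustration: toy 3-dimensional vectors stand in for contextual token embeddings, and the function names (`cosine`, `bertscore_f1`) are invented for this example, not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_f1(cand, ref):
    """Greedy matching over token embeddings, BERTScore-style:
    each token is paired with its most similar counterpart."""
    # Precision: each candidate token matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With identical candidate and reference embeddings the F1 is exactly 1.0; disjoint embeddings drive it toward 0.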

Automatic machine translation evaluation in many languages via zero-shot paraphrasing

B Thompson, M Post - arXiv preprint arXiv:2004.14564, 2020 - arxiv.org
We frame the task of machine translation evaluation as one of scoring machine translation
output with a sequence-to-sequence paraphraser, conditioned on a human reference. We …

Are references really needed? Unbabel-IST 2021 submission for the metrics shared task

R Rei, AC Farinha, C Zerva, D van Stigt… - Proceedings of the …, 2021 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2021 Metrics
Shared Task. With this year's focus on Multidimensional Quality Metric (MQM) as the ground …

Results of the WMT16 metrics shared task

O Bojar, Y Graham, A Kamran… - Proceedings of the First …, 2016 - aclanthology.org
This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of
this task to score the outputs of the MT systems involved in the WMT16 Shared Translation …
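
Metrics shared tasks like WMT16 typically rank submitted metrics by how well their scores correlate with human judgments, commonly via Pearson's r at the system level. A minimal sketch of that scoring step, with made-up score lists standing in for real submissions:

```python
import math

def pearson(metric_scores, human_scores):
    """Pearson correlation between a metric's scores and human judgments
    over the same set of MT systems (or segments)."""
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in metric_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in human_scores))
    return cov / (sx * sy)
```

A perfectly rank-preserving linear metric yields r = 1.0; an inverted one yields r = -1.0.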

A survey on evaluation metrics for machine translation

S Lee, J Lee, H Moon, C Park, J Seo, S Eo, S Koo… - Mathematics, 2023 - mdpi.com
The success of Transformer architecture has seen increased interest in machine translation
(MT). The translation quality of neural network-based MT transcends that of translations …

RUSE: Regressor using sentence embeddings for automatic machine translation evaluation

H Shimanaka, T Kajiwara… - Proceedings of the Third …, 2018 - aclanthology.org
We introduce the RUSE metric for the WMT18 metrics shared task. Sentence embeddings
can capture global information that cannot be captured by local features based on character …
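
Embedding-based regressors in the RUSE style combine the MT-output and reference sentence embeddings into one feature vector before regression. The pairing scheme below (concatenation, element-wise product, absolute difference) is a common choice for such models and is an assumption here, as is the helper name `ruse_features`:

```python
def ruse_features(h_mt, h_ref):
    """Combine MT-output and reference sentence embeddings into a single
    feature vector to feed a quality-prediction regressor."""
    prod = [a * b for a, b in zip(h_mt, h_ref)]       # element-wise product
    diff = [abs(a - b) for a, b in zip(h_mt, h_ref)]  # element-wise |difference|
    return h_mt + h_ref + prod + diff                 # concatenation
```

For d-dimensional embeddings this yields a 4d-dimensional feature vector, which a standard regressor (e.g., an MLP) then maps to a quality score.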

Automatic text evaluation through the lens of Wasserstein barycenters

P Colombo, G Staerman, C Clavel… - arXiv preprint arXiv …, 2021 - arxiv.org
A new metric, BaryScore, to evaluate text generation based on deep contextualized
embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new …
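
BaryScore itself aggregates multi-layer BERT embedding distributions through Wasserstein barycenters; as a much simpler illustration of the underlying optimal-transport quantity, the sketch below computes the 1-D Wasserstein-1 distance between two equally sized samples, where the transport plan reduces to matching sorted values.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two empirical 1-D distributions with equal
    sample counts: sort both samples and average the pointwise gaps."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Identical samples give a distance of 0; shifting every point by a constant c gives exactly c.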