Hallucinations in large multilingual translation models

NM Guerreiro, DM Alves, J Waldendorf… - Transactions of the …, 2023 - direct.mit.edu
Hallucinated translations can severely undermine trust and raise safety issues when machine
translation systems are deployed in the wild. Previous research on the topic focused on …

COMET-22: Unbabel-IST 2022 submission for the metrics shared task

R Rei, JGC De Souza, D Alves, C Zerva… - Proceedings of the …, 2022 - aclanthology.org
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …

Exploring human-like translation strategy with large language models

Z He, T Liang, W Jiao, Z Zhang, Y Yang… - Transactions of the …, 2024 - direct.mit.edu
Large language models (LLMs) have demonstrated impressive capabilities in general
scenarios, exhibiting a level of aptitude that approaches, in some aspects even surpasses …

The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation

P Fernandes, D Deutsch, M Finkelstein, P Riley… - arXiv preprint arXiv …, 2023 - arxiv.org
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative
development of MT systems. While considerable progress has been made on estimating a …

xCOMET: Transparent machine translation evaluation through fine-grained error detection

NM Guerreiro, R Rei, D van Stigt, L Coheur… - arXiv preprint arXiv …, 2023 - arxiv.org
Widely used learned metrics for machine translation evaluation, such as COMET and
BLEURT, estimate the quality of a translation hypothesis by providing a single sentence …

What makes a good story and how can we measure it? A comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

TIGERScore: Towards building explainable metric for all text generation tasks

D Jiang, Y Li, G Zhang, W Huang, BY Lin… - … on Machine Learning …, 2023 - openreview.net
We present TIGERScore, a Trained metric that follows Instruction Guidance to perform
Explainable, and Reference-free evaluation over a wide …

Findings of the WMT 2023 shared task on quality estimation

F Blain, C Zerva, R Rei, NM Guerreiro… - Proceedings of the …, 2023 - aclanthology.org
We report the results of the WMT 2023 shared task on Quality Estimation, in which the
challenge is to predict the quality of the output of neural machine translation systems at the …

Efficient benchmarking (of language models)

Y Perlitz, E Bandel, A Gera, O Arviv, L Ein-Dor… - arXiv preprint arXiv …, 2023 - arxiv.org
The increasing versatility of language models (LMs) has given rise to a new class of
benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks …

The inside story: Towards better understanding of machine translation neural evaluation metrics

R Rei, NM Guerreiro, M Treviso, L Coheur… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural metrics for machine translation evaluation, such as COMET, exhibit significant
improvements in their correlation with human judgments, as compared to traditional metrics …