Hallucinations in large multilingual translation models
Hallucinated translations can severely undermine trust and raise safety issues when machine
translation systems are deployed in the wild. Previous research on the topic focused on …
COMET-22: Unbabel-IST 2022 submission for the metrics shared task
In this paper, we present the joint contribution of Unbabel and IST to the WMT 2022 Metrics
Shared Task. Our primary submission, dubbed COMET-22, is an ensemble between a …
Exploring human-like translation strategy with large language models
Large language models (LLMs) have demonstrated impressive capabilities in general
scenarios, exhibiting a level of aptitude that approaches, and in some aspects even surpasses …
The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative
development of MT systems. While considerable progress has been made on estimating a …
xcomet: Transparent machine translation evaluation through fine-grained error detection
Widely used learned metrics for machine translation evaluation, such as COMET and
BLEURT, estimate the quality of a translation hypothesis by providing a single sentence …
What makes a good story and how can we measure it? a comprehensive survey of story evaluation
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …
Tigerscore: Towards building explainable metric for all text generation tasks
We present TIGERScore, a Trained metric that follows Instruction Guidance to perform
Explainable and Reference-free evaluation over a wide …
Findings of the WMT 2023 shared task on quality estimation
We report the results of the WMT 2023 shared task on Quality Estimation, in which the
challenge is to predict the quality of the output of neural machine translation systems at the …
Efficient benchmarking (of language models)
The increasing versatility of language models (LMs) has given rise to a new class of
benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks …
The inside story: Towards better understanding of machine translation neural evaluation metrics
Neural metrics for machine translation evaluation, such as COMET, exhibit significant
improvements in their correlation with human judgments, as compared to traditional metrics …