LLM-based NLG evaluation: Current status and challenges

M Gao, X Hu, J Ruan, X Pu, X Wan - arXiv preprint arXiv:2402.01383, 2024 - arxiv.org
Evaluating natural language generation (NLG) is a vital but challenging problem in artificial
intelligence. Traditional evaluation metrics mainly capturing content (e.g., n-gram) overlap …

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

NM Guerreiro, R Rei, D van Stigt, L Coheur… - Transactions of the …, 2024 - direct.mit.edu
Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT,
estimate the quality of a translation hypothesis by providing a single sentence-level score …

Error analysis prompting enables human-like translation evaluation in large language models

Q Lu, B Qiu, L Ding, K Zhang, T Kocmi… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable
proficiency across several NLP tasks, such as machine translation, text summarization …

Adapting large language models for document-level machine translation

M Wu, TT Vu, L Qu, G Foster, G Haffari - arXiv preprint arXiv:2401.06468, 2024 - arxiv.org
Large language models (LLMs) have significantly advanced various natural language
processing (NLP) tasks. Recent research indicates that moderately-sized LLMs often …

LLaMAX: Scaling linguistic horizons of LLM by enhancing translation capabilities beyond 100 languages

Y Lu, W Zhu, L Li, Y Qiao, F Yuan - arXiv preprint arXiv:2407.05975, 2024 - arxiv.org
Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-
resource language tasks, yet their performance in low-resource languages is hindered by …

Navigating the metrics maze: Reconciling score magnitudes and accuracies

T Kocmi, V Zouhar, C Federmann, M Post - arXiv preprint arXiv …, 2024 - arxiv.org
Ten years ago a single metric, BLEU, governed progress in machine translation research.
For better or worse, there is no such consensus today, and consequently it is difficult for …

TEaR: Improving LLM-based machine translation with systematic self-refinement

Z Feng, Y Zhang, H Li, B Wu, J Liao, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have achieved impressive results in Machine Translation
(MT). However, careful evaluations by humans reveal that the translations produced by LLMs …

Assessing the Role of Context in Chat Translation Evaluation: Is Context Helpful and Under What Conditions?

S Agrawal, A Farajian, P Fernandes, R Rei… - Transactions of the …, 2024 - direct.mit.edu
Despite the recent success of automatic metrics for assessing translation quality, their
application in evaluating the quality of machine-translated chats has been limited. Unlike …

Machine translation meta evaluation through translation accuracy challenge sets

N Moghe, A Fazla, C Amrhein, T Kocmi… - Computational …, 2024 - direct.mit.edu
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with
human judgment. However, these results are often obtained by averaging predictions across …

PrExMe! Large scale prompt exploration of open source LLMs for machine translation and summarization evaluation

C Leiter, S Eger - arXiv preprint arXiv:2406.18528, 2024 - arxiv.org
Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-
context learning capabilities also enable their use as evaluation metrics for natural language …