Efficient benchmarking (of language models)

Y Perlitz, E Bandel, A Gera, O Arviv, L Ein-Dor… - arXiv preprint arXiv …, 2023 - arxiv.org
The increasing versatility of language models (LMs) has given rise to a new class of
benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks …

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

M Boubdir, E Kim, B Ermis, M Fadaee… - arXiv preprint arXiv …, 2023 - arxiv.org
Human evaluation is increasingly critical for assessing large language models, capturing
linguistic nuances, and reflecting user preferences more accurately than traditional …

Pitfalls and outlooks in using COMET

V Zouhar, P Chen, TK Lam, N Moghe… - arXiv preprint arXiv …, 2024 - arxiv.org
The COMET metric has blazed a trail in the machine translation community, given its strong
correlation with human judgements of translation quality. Its success stems from being a …

Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

B Thompson, N Mathur, D Deutsch… - arXiv preprint arXiv …, 2024 - arxiv.org
Selecting an automatic metric that best emulates human annotators is often non-trivial,
because there is no clear definition of "best emulates." A meta-metric is required to compare …

Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

PS Varadhan, A Gulati, A Sankar, S Anand… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite rapid advancements in TTS models, a consistent and robust human evaluation
framework is still lacking. For example, MOS tests fail to differentiate between similar …

Finding Replicable Human Evaluations via Stable Ranking Probability

P Riley, D Deutsch, G Foster, V Ratnakar… - arXiv preprint arXiv …, 2024 - arxiv.org
Reliable human evaluation is critical to the development of successful natural language
generation models, but achieving it is notoriously difficult. Stability is a crucial requirement …

Translation memories as baselines for low-resource machine translation

R Knowles, P Littell - … of the Thirteenth Language Resources and …, 2022 - aclanthology.org
Low-resource machine translation research often requires building baselines to benchmark
estimates of progress in translation quality. Neural and statistical phrase-based systems are …

[PDF] Robustness in Machine Translation Evaluation.

N Mathur - 2021 - core.ac.uk
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. Towards efficient machine translation
evaluation by modelling annotators. In Proceedings of the Australasian Language …

Calibration and context in human evaluation of machine translation

R Knowles, C Lo - Natural Language Processing - cambridge.org
Human evaluation of machine translation is considered the “gold standard” for evaluation,
but it remains a challenging task for which to define best practices. Recent work has focused …

[PDF] Bilingual lexicography in machine translation for a low-resource language: a processing pipeline for Amharic

M Marmonier - ertim.inalco.fr
Machine translation, as a field of scientific research and technical development, was attracting,
at the turn of the 2020s, growing interest …