Efficient Benchmarking (of Language Models)
The increasing versatility of language models (LMs) has given rise to a new class of
benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks …
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Human evaluation is increasingly critical for assessing large language models, capturing
linguistic nuances, and reflecting user preferences more accurately than traditional …
Pitfalls and outlooks in using COMET
The COMET metric has blazed a trail in the machine translation community, given its strong
correlation with human judgements of translation quality. Its success stems from being a …
Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
Selecting an automatic metric that best emulates human annotators is often non-trivial,
because there is no clear definition of "best emulates." A meta-metric is required to compare …
Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation
Despite rapid advancements in TTS models, a consistent and robust human evaluation
framework is still lacking. For example, MOS tests fail to differentiate between similar …
Finding Replicable Human Evaluations via Stable Ranking Probability
Reliable human evaluation is critical to the development of successful natural language
generation models, but achieving it is notoriously difficult. Stability is a crucial requirement …
Translation memories as baselines for low-resource machine translation
Low-resource machine translation research often requires building baselines to benchmark
estimates of progress in translation quality. Neural and statistical phrase-based systems are …
Robustness in Machine Translation Evaluation
N Mathur - 2021 - core.ac.uk
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. Towards efficient machine translation
evaluation by modelling annotators. In Proceedings of the Australasian Language …
Calibration and context in human evaluation of machine translation
Human evaluation of machine translation is considered the “gold standard” for evaluation,
but it remains a challenging task for which to define best practices. Recent work has focused …
Bilingual lexicography in machine translation for a low-resource language: a processing pipeline for Amharic
M MARMONIER - ertim.inalco.fr
Machine translation, as a field of scientific research and technical development, was attracting, at the turn of the 2020s, growing interest …