Efficient benchmarking (of language models)

Y Perlitz, E Bandel, A Gera, O Arviv, L Ein-Dor… - arXiv preprint arXiv …, 2023 - arxiv.org
The increasing versatility of language models (LMs) has given rise to a new class of
benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks …

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

M Boubdir, E Kim, B Ermis, M Fadaee… - arXiv preprint arXiv …, 2023 - arxiv.org
Human evaluation is increasingly critical for assessing large language models, capturing
linguistic nuances, and reflecting user preferences more accurately than traditional …

Pitfalls and outlooks in using COMET

V Zouhar, P Chen, TK Lam, N Moghe… - arXiv preprint arXiv …, 2024 - arxiv.org
The COMET metric has blazed a trail in the machine translation community, given its strong
correlation with human judgements of translation quality. Its success stems from being a …

Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

B Thompson, N Mathur, D Deutsch… - arXiv preprint arXiv …, 2024 - arxiv.org
Selecting an automatic metric that best emulates human annotators is often non-trivial,
because there is no clear definition of "best emulates." A meta-metric is required to compare …

Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

PS Varadhan, A Gulati, A Sankar, S Anand… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite rapid advancements in TTS models, a consistent and robust human evaluation
framework is still lacking. For example, MOS tests fail to differentiate between similar …

Finding Replicable Human Evaluations via Stable Ranking Probability

P Riley, D Deutsch, G Foster, V Ratnakar… - arXiv preprint arXiv …, 2024 - arxiv.org
Reliable human evaluation is critical to the development of successful natural language
generation models, but achieving it is notoriously difficult. Stability is a crucial requirement …

Translation memories as baselines for low-resource machine translation

R Knowles, P Littell - … of the Thirteenth Language Resources and …, 2022 - aclanthology.org
Low-resource machine translation research often requires building baselines to benchmark
estimates of progress in translation quality. Neural and statistical phrase-based systems are …

[PDF] Robustness in Machine Translation Evaluation.

N Mathur - 2021 - core.ac.uk
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. Towards efficient machine translation
evaluation by modelling annotators. In Proceedings of the Australasian Language …

Calibration and context in human evaluation of machine translation

R Knowles, C Lo - Natural Language Processing - cambridge.org
Human evaluation of machine translation is considered the “gold standard” for evaluation,
but it remains a challenging task for which to define best practices. Recent work has focused …

[PDF] Bilingual lexicography in machine translation for a low-resource language: a processing pipeline for Amharic

M Marmonier - ertim.inalco.fr
Machine translation, as a field of scientific research and technical development, was attracting,
at the turn of the 2020s, growing interest …