SESCORE2: Learning text generation evaluation via synthesizing realistic mistakes

W Xu, X Qian, M Wang, L Li, WY Wang - arXiv preprint arXiv:2212.09305, 2022 - arxiv.org
Is it possible to train a general metric for evaluating text generation quality without human-
annotated ratings? Existing learned metrics either perform unsatisfactorily across text …

Multilingual conceptual coverage in text-to-image models

M Saxon, WY Wang - arXiv preprint arXiv:2306.01735, 2023 - arxiv.org
We propose" Conceptual Coverage Across Languages"(CoCo-CroLa), a technique for
benchmarking the degree to which any generative text-to-image system provides …

A review of faithfulness metrics for hallucination assessment in Large Language Models

B Malin, T Kalganova, N Boulgouris - arXiv preprint arXiv:2501.00269, 2024 - arxiv.org
This review examines the means by which faithfulness has been evaluated across open-
ended summarization, question-answering and machine translation tasks. We find that the …

Towards fine-grained information: Identifying the type and location of translation errors

K Bao, Y Wan, D Liu, B Yang, W Lei, X He… - arXiv preprint arXiv …, 2023 - arxiv.org
Fine-grained information on translation errors is helpful for the translation evaluation
community. Existing approaches cannot simultaneously consider error position and type …

Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems

C Dandekar, W Xu, X Xu, S Ouyang, L Li - arXiv preprint arXiv:2410.10861, 2024 - arxiv.org
With the rapid advancement of machine translation research, evaluation toolkits have
become essential for benchmarking system progress. Tools like COMET and SacreBLEU …

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models

Q Lu, B Qiu, L Ding, K Zhang, T Kocmi… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable
proficiency across several NLP tasks, such as machine translation, text summarization …