Human evaluation of conversations is an open problem: comparing the sensitivity of various methods for evaluating dialogue agents

EM Smith, O Hsu, R Qian, S Roller, YL Boureau… - arXiv preprint arXiv …, 2022 - arxiv.org
At the heart of improving conversational AI is the open problem of how to evaluate
conversations. Issues with automatic metrics are well known (Liu et al., 2016, arXiv …

Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems

SE Finch, JD Finch, JD Choi - arXiv preprint arXiv:2212.09180, 2022 - arxiv.org
Despite tremendous advancements in dialogue systems, stable evaluation still requires
human judgments producing notoriously high-variance metrics due to their inherent …

Automatic evaluation and moderation of open-domain dialogue systems

C Zhang, J Sedoc, LF D'Haro, R Banchs… - arXiv preprint arXiv …, 2021 - arxiv.org
The development of Open-Domain Dialogue Systems (ODS) is a trending topic due to the
large number of research challenges, large societal and business impact, and advances in …

PoE: A panel of experts for generalized automatic dialogue assessment

C Zhang, LF D'Haro, Q Zhang… - … /ACM Transactions on …, 2023 - ieeexplore.ieee.org
Chatbots are expected to be knowledgeable across multiple domains, e.g., for daily chit-chat,
exchange of information, and grounding in emotional situations. To effectively measure the …

Psychological metrics for dialog system evaluation

S Giorgi, S Havaldar, F Ahmed, Z Akhtar… - arXiv preprint arXiv …, 2023 - arxiv.org
We present metrics for evaluating dialog systems through a psychologically-grounded
"human" lens in which conversational agents express a diversity of both states (e.g., emotion) …

Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation

SE Finch, JD Finch, JD Choi - arXiv preprint arXiv:2309.07998, 2023 - arxiv.org
Human evaluation has been widely accepted as the standard for evaluating chat-oriented
dialogue systems. However, there is a significant variation in previous work regarding who …