Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

An integrative survey on mental health conversational agents to bridge computer science and medical perspectives

YM Cho, S Rai, L Ungar, J Sedoc… - Proceedings of the …, 2023 - pmc.ncbi.nlm.nih.gov
Mental health conversational agents (aka chatbots) are widely studied for their potential to
offer accessible support to those experiencing mental health challenges. Previous surveys …

AI transparency in the age of LLMs: A human-centered research roadmap

QV Liao, JW Vaughan - ar…

Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets

SL Blodgett, G Lopez, A Olteanu, R Sim… - Proceedings of the …, 2021 - aclanthology.org
Auditing NLP systems for computational harms like surfacing stereotypes is an elusive goal.
Several recent efforts have focused on benchmark datasets consisting of pairs of contrastive …

"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset

EM Smith, M Hall, M Kambadur, E Presani… - arXiv preprint arXiv …, 2022 - arxiv.org
As language models grow in popularity, it becomes increasingly important to clearly
measure all possible markers of demographic identity in order to avoid perpetuating existing …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

Measuring attribution in natural language generation models

H Rashkin, V Nikolaev, M Lamm, L Aroyo… - Computational …, 2023 - direct.mit.edu
Large neural models have brought a new challenge to natural language generation (NLG): It
has become imperative to ensure the safety and reliability of the output of models that …

Is GPT-3 text indistinguishable from human text? Scarecrow: A framework for scrutinizing machine text

Y Dou, M Forbes, R Koncel-Kedziorski… - arXiv preprint arXiv …, 2021 - arxiv.org
Modern neural language models can produce remarkably fluent and grammatical text. So
much, in fact, that recent work by Clark et al. (2021) has reported that conventional …

The perils of using Mechanical Turk to evaluate open-ended text generation

M Karpinska, N Akoury, M Iyyer - arXiv preprint arXiv:2109.06835, 2021 - arxiv.org
Recent text generation research has increasingly focused on open-ended domains such as
story and poetry generation. Because models built for such tasks are difficult to evaluate …