The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

B Plank - arXiv preprint arXiv:2211.02570, 2022 - arxiv.org
Human variation in labeling is often considered noise. Annotation projects for machine
learning (ML) aim at minimizing human label variation, with the assumption to maximize …

Learning from disagreement: A survey

AN Uma, T Fornaciari, D Hovy, S Paun, B Plank… - Journal of Artificial …, 2021 - jair.org
Abstract Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer
evidence that humans disagree, from objective tasks such as part-of-speech tagging to more …

Computational models of anaphora

M Poesio, J Yu, S Paun, A Aloraini, P Lu… - Annual Review of …, 2023 - annualreviews.org
Interpreting anaphoric references is a fundamental aspect of our language competence that
has long attracted the attention of computational linguists. The appearance of ever-larger …

Inherent disagreements in human textual inferences

E Pavlick, T Kwiatkowski - Transactions of the Association for …, 2019 - direct.mit.edu
We analyze humans' disagreements about the validity of natural language inferences. We
show that, very often, disagreements are not dismissible as annotation “noise”, but rather …

What will it take to fix benchmarking in natural language understanding?

SR Bowman, GE Dahl - arXiv preprint arXiv:2104.02145, 2021 - arxiv.org
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and
biased systems score so highly on standard benchmarks that there is little room for …

Why don't you do it right? Analysing annotators' disagreement in subjective tasks

M Sandri, E Leonardelli, S Tonelli… - Proceedings of the 17th …, 2023 - aclanthology.org
Annotators' disagreement in linguistic data has recently been the focus of multiple initiatives
aimed at raising awareness of issues related to 'majority voting' when aggregating diverging …

SemEval-2023 task 11: Learning with disagreements (LeWiDi)

E Leonardelli, A Uma, G Abercrombie… - arXiv preprint arXiv …, 2023 - arxiv.org
NLP datasets annotated with human judgments are rife with disagreements between the
judges. This is especially true for tasks depending on subjective judgments such as …

Investigating reasons for disagreement in natural language inference

NJ Jiang, MC de Marneffe - Transactions of the Association for …, 2022 - direct.mit.edu
We investigate how disagreement in natural language inference (NLI) annotation arises. We
developed a taxonomy of disagreement sources with 10 categories spanning 3 high-level …

Quoref: A reading comprehension dataset with questions requiring coreferential reasoning

P Dasigi, NF Liu, A Marasović, NA Smith… - arXiv preprint arXiv …, 2019 - arxiv.org
Machine comprehension of texts longer than a single sentence often requires coreference
resolution. However, most current reading comprehension benchmarks do not contain …

What can we learn from collective human opinions on natural language inference data?

Y Nie, X Zhou, M Bansal - arXiv preprint arXiv:2010.03532, 2020 - arxiv.org
Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on
using the majority label, with presumably high agreement, as the ground truth. Less attention …