Data and its (dis)contents: A survey of dataset development and use in machine learning research

A Paullada, ID Raji, EM Bender, E Denton, A Hanna - Patterns, 2021 - cell.com
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …

QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …

A primer in BERTology: What we know about how BERT works

A Rogers, O Kovaleva, A Rumshisky - Transactions of the Association …, 2021 - direct.mit.edu
Transformer-based models have pushed state of the art in many areas of NLP, but our
understanding of what is behind their success is still limited. This paper is the first survey of …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

AI and the everything in the whole wide world benchmark

ID Raji, EM Bender, A Paullada, E Denton… - arXiv preprint arXiv …, 2021 - arxiv.org
There is a tendency across different subfields in AI to valorize a small collection of influential
benchmarks. These benchmarks operate as stand-ins for a range of anointed common …

Question and answer test-train overlap in open-domain question answering datasets

P Lewis, P Stenetorp, S Riedel - arXiv preprint arXiv:2008.02637, 2020 - arxiv.org
Ideally, Open-Domain Question Answering models should exhibit a number of
competencies, ranging from simply memorizing questions seen at training time, to …

What will it take to fix benchmarking in natural language understanding?

SR Bowman, GE Dahl - arXiv preprint arXiv:2104.02145, 2021 - arxiv.org
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and
biased systems score so highly on standard benchmarks that there is little room for …

WANLI: Worker and AI collaboration for natural language inference dataset creation

A Liu, S Swayamdipta, NA Smith, Y Choi - arXiv preprint arXiv:2201.05955, 2022 - arxiv.org
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often
rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We …

Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks?

TM Pham, T Bui, L Mai, A Nguyen - arXiv preprint arXiv:2012.15180, 2020 - arxiv.org
Do state-of-the-art natural language understanding models care about word order, one of the
most important characteristics of a sequence? Not always! We found 75% to 90% of the …

Beat the AI: Investigating adversarial human annotation for reading comprehension

M Bartolo, A Roberts, J Welbl, S Riedel… - Transactions of the …, 2020 - direct.mit.edu
Innovations in annotation methodology have been a catalyst for Reading Comprehension
(RC) datasets and models. One recent trend to challenge current RC models is to involve a …