Data and its (dis)contents: A survey of dataset development and use in machine learning research
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …
QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …
A primer in BERTology: What we know about how BERT works
Transformer-based models have pushed state of the art in many areas of NLP, but our
understanding of what is behind their success is still limited. This paper is the first survey of …
Dynabench: Rethinking benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …
AI and the everything in the whole wide world benchmark
There is a tendency across different subfields in AI to valorize a small collection of influential
benchmarks. These benchmarks operate as stand-ins for a range of anointed common …
Question and answer test-train overlap in open-domain question answering datasets
Ideally, Open-Domain Question Answering models should exhibit a number of
competencies, ranging from simply memorizing questions seen at training time, to …
What will it take to fix benchmarking in natural language understanding?
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and
biased systems score so highly on standard benchmarks that there is little room for …
WANLI: Worker and AI collaboration for natural language inference dataset creation
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often
rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We …
Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks?
Do state-of-the-art natural language understanding models care about word order, one of the
most important characteristics of a sequence? Not always! We found 75% to 90% of the …
Beat the AI: Investigating adversarial human annotation for reading comprehension
Innovations in annotation methodology have been a catalyst for Reading Comprehension
(RC) datasets and models. One recent trend to challenge current RC models is to involve a …