Data and its (dis)contents: A survey of dataset development and use in machine learning research

A Paullada, ID Raji, EM Bender, E Denton, A Hanna - Patterns, 2021 - cell.com
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …

SuperGLUE: A stickier benchmark for general-purpose language understanding systems

A Wang, Y Pruksachatkun, N Nangia… - Advances in Neural …, 2019 - proceedings.neurips.cc
In the last year, new models and methods for pretraining and transfer learning have driven
striking performance improvements across a range of language understanding tasks. The …

From 'F' to 'A' on the NY Regents Science Exams: An overview of the Aristo project

P Clark, O Etzioni, T Khot, D Khashabi, B Mishra… - AI Magazine, 2020 - ojs.aaai.org
AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even
Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Pretrained transformers improve out-of-distribution robustness

D Hendrycks, X Liu, E Wallace, A Dziedzic… - arXiv preprint arXiv …, 2020 - arxiv.org
Although pretrained Transformers such as BERT achieve high accuracy on in-distribution
examples, do they generalize to new distributions? We systematically measure out-of …

HateCheck: Functional tests for hate speech detection models

P Röttger, B Vidgen, D Nguyen, Z Waseem… - arXiv preprint arXiv …, 2020 - arxiv.org
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …

Certified robustness to adversarial word substitutions

R Jia, A Raghunathan, K Göksel, P Liang - arXiv preprint arXiv …, 2019 - arxiv.org
State-of-the-art NLP models can often be fooled by adversaries that apply seemingly
innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of …

MRQA 2019 shared task: Evaluating generalization in reading comprehension

A Fisch, A Talmor, R Jia, M Seo, E Choi… - arXiv preprint arXiv …, 2019 - arxiv.org
We present the results of the Machine Reading for Question Answering (MRQA) 2019
shared task on evaluating the generalization capabilities of reading comprehension …

Measure and improve robustness in NLP models: A survey

X Wang, H Wang, D Yang - arXiv preprint arXiv:2112.08313, 2021 - arxiv.org
As NLP models achieve state-of-the-art performance on benchmarks and gain wide
application, it has become increasingly important to ensure the safe deployment of these …

An empirical study on robustness to spurious correlations using pre-trained language models

L Tu, G Lalwani, S Gella, H He - Transactions of the Association for …, 2020 - direct.mit.edu
Recent work has shown that pre-trained language models such as BERT improve
robustness to spurious correlations in the dataset. Intrigued by these results, we find that the …