Data and its (dis)contents: A survey of dataset development and use in machine learning research
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …
SuperGLUE: A stickier benchmark for general-purpose language understanding systems
In the last year, new models and methods for pretraining and transfer learning have driven
striking performance improvements across a range of language understanding tasks. The …
From 'F' to 'A' on the NY Regents science exams: An overview of the Aristo project
AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even
Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge …
Dynabench: Rethinking benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …
Pretrained transformers improve out-of-distribution robustness
Although pretrained Transformers such as BERT achieve high accuracy on in-distribution
examples, do they generalize to new distributions? We systematically measure out-of …
HateCheck: Functional tests for hate speech detection models
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …
Certified robustness to adversarial word substitutions
State-of-the-art NLP models can often be fooled by adversaries that apply seemingly
innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of …
MRQA 2019 shared task: Evaluating generalization in reading comprehension
We present the results of the Machine Reading for Question Answering (MRQA) 2019
shared task on evaluating the generalization capabilities of reading comprehension …
Measure and improve robustness in NLP models: A survey
As NLP models have achieved state-of-the-art performance on benchmarks and gained wide
application, it has become increasingly important to ensure the safe deployment of these …
An empirical study on robustness to spurious correlations using pre-trained language models
Recent work has shown that pre-trained language models such as BERT improve
robustness to spurious correlations in the dataset. Intrigued by these results, we find that the …