Data and its (dis)contents: A survey of dataset development and use in machine learning research
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …
From 'F' to 'A' on the NY Regents science exams: An overview of the Aristo project
AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even
Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge …
Dynabench: Rethinking benchmarking in NLP
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …
Pretrained transformers improve out-of-distribution robustness
Although pretrained Transformers such as BERT achieve high accuracy on in-distribution
examples, do they generalize to new distributions? We systematically measure out-of …
SuperGLUE: A stickier benchmark for general-purpose language understanding systems
In the last year, new models and methods for pretraining and transfer learning have driven
striking performance improvements across a range of language understanding tasks. The …
HateCheck: Functional tests for hate speech detection models
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …
Certified robustness to adversarial word substitutions
State-of-the-art NLP models can often be fooled by adversaries that apply seemingly
innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of …
Measure and improve robustness in NLP models: A survey
As NLP models achieved state-of-the-art performances over benchmarks and gained wide
applications, it has been increasingly important to ensure the safe deployment of these …
MRQA 2019 shared task: Evaluating generalization in reading comprehension
We present the results of the Machine Reading for Question Answering (MRQA) 2019
shared task on evaluating the generalization capabilities of reading comprehension …
How can we accelerate progress towards human-like linguistic generalization?
T Linzen - arXiv preprint arXiv:2005.00955, 2020 - arxiv.org
This position paper describes and critiques the Pretraining-Agnostic Identically Distributed
(PAID) evaluation paradigm, which has become a central tool for measuring progress in …