"John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility
In current NLP research, large-scale language models and their abilities are widely being
discussed. Some recent works have also found notable failures of these models. Often these …
Model cascading: Towards jointly improving efficiency and accuracy of NLP systems
Do all instances need inference through the big models for a correct prediction? Perhaps
not; some instances are easy and can be answered correctly by even small capacity models …
Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling
Human evaluation is viewed as a reliable evaluation method for NLG which is expensive
and time-consuming. In order to save labor and costs, researchers usually perform human …
Towards Improving Selective Prediction Ability of NLP Systems
It's better to say "I can't answer" than to answer incorrectly. This selective prediction ability is
crucial for NLP systems to be reliably deployed in real-world applications. Prior work has …
Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?
Recent state-of-the-art open-domain QA models are typically based on a two-stage retriever-
reader approach in which the retriever first finds the relevant knowledge/passages and the …
Discover, Explain, Improve: An Automatic Slice Detection Benchmark for Natural Language Processing
Pretrained natural language processing (NLP) models have achieved high overall
performance, but they still make systematic errors. Instead of manual error analysis …
ActiveAED: A Human in the Loop Improves Annotation Error Detection
Manually annotated datasets are crucial for training and evaluating Natural Language
Processing models. However, recent work has discovered that even widely-used benchmark …
Assessing Out-of-Domain Language Model Performance from Few Examples
While pretrained language models have exhibited impressive generalization capabilities,
they still behave unpredictably under certain domain shifts. In particular, a model may learn …
Assessing and Scoring Difficulty of Hard-to-Solve Data in Summarization Tasks
In recent years, data-driven and machine learning-based natural language processing
(NLP) technologies have effectively addressed various challenges. To further enhance the …
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
The development of state-of-the-art systems in different applied areas of machine learning
(ML) is driven by benchmarks, which have shaped the paradigm of evaluating …