" John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility

H Gupta, N Varshney, S Mishra, KK Pal… - arXiv preprint arXiv …, 2022 - arxiv.org
In current NLP research, large-scale language models and their abilities are widely being
discussed. Some recent works have also found notable failures of these models. Often these …

Model cascading: Towards jointly improving efficiency and accuracy of NLP systems

N Varshney, C Baral - arXiv preprint arXiv:2210.05528, 2022 - arxiv.org
Do all instances need inference through the big models for a correct prediction? Perhaps
not; some instances are easy and can be answered correctly by even small capacity models …

Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

J Ruan, X Pu, M Gao, X Wan, Y Zhu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive
and time-consuming. In order to save labor and costs, researchers usually perform human …

Towards Improving Selective Prediction Ability of NLP Systems

N Varshney, S Mishra, C Baral - arXiv preprint arXiv:2008.09371, 2020 - arxiv.org
It's better to say "I can't answer" than to answer incorrectly. This selective prediction ability is
crucial for NLP systems to be reliably deployed in real-world applications. Prior work has …

Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?

N Varshney, M Luo, C Baral - arXiv preprint arXiv:2211.12707, 2022 - arxiv.org
Recent state-of-the-art open-domain QA models are typically based on a two-stage retriever-
reader approach in which the retriever first finds the relevant knowledge/passages and the …

Discover, Explain, Improve: An Automatic Slice Detection Benchmark for Natural Language Processing

W Hua, L Jin, L Song, H Mi, Y Zhang… - Transactions of the …, 2023 - direct.mit.edu
Pretrained natural language processing (NLP) models have achieved high overall
performance, but they still make systematic errors. Instead of manual error analysis …

ActiveAED: A human in the loop improves annotation error detection

L Weber, B Plank - arXiv preprint arXiv:2305.20045, 2023 - arxiv.org
Manually annotated datasets are crucial for training and evaluating Natural Language
Processing models. However, recent work has discovered that even widely-used benchmark …

Assessing out-of-domain language model performance from few examples

P Singhal, J Forristal, X Ye, G Durrett - arXiv preprint arXiv:2210.06725, 2022 - arxiv.org
While pretrained language models have exhibited impressive generalization capabilities,
they still behave unpredictably under certain domain shifts. In particular, a model may learn …

Assessing and Scoring Difficulty of Hard-to-Solve Data in Summarization Tasks

J Jung, H Seo, H Namgoong, S Jung - IEEE Access, 2024 - ieeexplore.ieee.org
In recent years, data-driven and machine learning-based natural language processing
(NLP) technologies have effectively addressed various challenges. To further enhance the …

Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

M Rofin, V Mikhailov, M Florinskiy… - arXiv preprint arXiv …, 2022 - arxiv.org
The development of state-of-the-art systems in different applied areas of machine learning
(ML) is driven by benchmarks, which have shaped the paradigm of evaluating …