" John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility

H Gupta, N Varshney, S Mishra, KK Pal… - arXiv preprint arXiv …, 2022 - arxiv.org
In current NLP research, large-scale language models and their abilities are widely being
discussed. Some recent works have also found notable failures of these models. Often these …

Model cascading: Towards jointly improving efficiency and accuracy of NLP systems

N Varshney, C Baral - arXiv preprint arXiv:2210.05528, 2022 - arxiv.org
Do all instances need inference through the big models for a correct prediction? Perhaps
not; some instances are easy and can be answered correctly by even small capacity models …

Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

J Ruan, X Pu, M Gao, X Wan, Y Zhu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive
and time-consuming. In order to save labor and costs, researchers usually perform human …

Towards Improving Selective Prediction Ability of NLP Systems

N Varshney, S Mishra, C Baral - arXiv preprint arXiv:2008.09371, 2020 - arxiv.org
It's better to say "I can't answer" than to answer incorrectly. This selective prediction ability is
crucial for NLP systems to be reliably deployed in real-world applications. Prior work has …

Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?

N Varshney, M Luo, C Baral - arXiv preprint arXiv:2211.12707, 2022 - arxiv.org
Recent state-of-the-art open-domain QA models are typically based on a two-stage retriever-
reader approach in which the retriever first finds the relevant knowledge/passages and the …

Discover, Explain, Improve: An Automatic Slice Detection Benchmark for Natural Language Processing

W Hua, L Jin, L Song, H Mi, Y Zhang… - Transactions of the …, 2023 - direct.mit.edu
Pretrained natural language processing (NLP) models have achieved high overall
performance, but they still make systematic errors. Instead of manual error analysis …

ActiveAED: A human in the loop improves annotation error detection

L Weber, B Plank - arXiv preprint arXiv:2305.20045, 2023 - arxiv.org
Manually annotated datasets are crucial for training and evaluating Natural Language
Processing models. However, recent work has discovered that even widely-used benchmark …

Assessing out-of-domain language model performance from few examples

P Singhal, J Forristal, X Ye, G Durrett - arXiv preprint arXiv:2210.06725, 2022 - arxiv.org
While pretrained language models have exhibited impressive generalization capabilities,
they still behave unpredictably under certain domain shifts. In particular, a model may learn …

Assessing and Scoring Difficulty of Hard-to-Solve Data in Summarization Tasks

J Jung, H Seo, H Namgoong, S Jung - IEEE Access, 2024 - ieeexplore.ieee.org
In recent years, data-driven and machine learning-based natural language processing
(NLP) technologies have effectively addressed various challenges. To further enhance the …

Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

M Rofin, V Mikhailov, M Florinskiy… - arXiv preprint arXiv …, 2022 - arxiv.org
The development of state-of-the-art systems in different applied areas of machine learning
(ML) is driven by benchmarks, which have shaped the paradigm of evaluating …