A survey of language model confidence estimation and calibration

J Geng, F Cai, Y Wang, H Koeppl, P Nakov… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models (LMs) have demonstrated remarkable capabilities across a wide range of
tasks in various domains. Despite their impressive performance, the reliability of their output …

A Survey of Confidence Estimation and Calibration in Large Language Models

J Geng, F Cai, Y Wang, H Koeppl… - Proceedings of the …, 2024 - aclanthology.org
Large language models (LLMs) have demonstrated remarkable capabilities across a wide
range of tasks in various domains. Despite their impressive performance, they can be …

Adaptation with self-evaluation to improve selective prediction in LLMs

J Chen, J Yoon, S Ebrahimi, SO Arik, T Pfister… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have recently shown great advances in a variety of tasks,
including natural language understanding and generation. However, their use in high …

Mitigating temporal misalignment by discarding outdated facts

MJQ Zhang, E Choi - arXiv preprint arXiv:2305.14824, 2023 - arxiv.org
While large language models are able to retain vast amounts of world knowledge seen
during pretraining, such knowledge is prone to going out of date and is nontrivial to update …

Do LLMs know when to not answer? Investigating abstention abilities of large language models

N Madhusudhan, ST Madhusudhan, V Yadav… - arXiv preprint arXiv …, 2024 - arxiv.org
Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability,
referring to an LLM's capability to withhold responses when uncertain or lacking a definitive …

Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?

A Agarwal, N Patel, N Varshney, M Parmar… - arXiv preprint arXiv …, 2023 - arxiv.org
Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a
variety of language understanding tasks, they primarily focus on questions that have a …

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?

U Khurana, E Nalisnick, A Fokkens… - arXiv preprint arXiv …, 2024 - arxiv.org
Subjective tasks in NLP have been mostly relegated to objective standards, where the gold
label is decided by taking the majority vote. This obfuscates annotator disagreement and the …

Accelerating LLM inference by enabling intermediate layer decoding

N Varshney, A Chatterjee, M Parmar… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have achieved remarkable performance across a wide
variety of natural language tasks; however, their large size makes their inference slow and …

Ambiguity meets uncertainty: Investigating uncertainty estimation for word sense disambiguation

Z Liu, Y Liu - arXiv preprint arXiv:2305.13119, 2023 - arxiv.org
Word sense disambiguation (WSD), which aims to determine an appropriate sense for a
target word given its context, is crucial for natural language understanding. Existing …

LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements

V Basmov, Y Goldberg, R Tsarfaty - arXiv preprint arXiv:2404.06283, 2024 - arxiv.org
The task of reading comprehension (RC), often implemented as context-based question
answering (QA), provides a primary means to assess language models' natural language …