When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

R Kamoi, Y Zhang, N Zhang, J Han… - Transactions of the …, 2024 - direct.mit.edu
Self-correction is an approach to improving responses from large language models (LLMs)
by refining the responses using LLMs during inference. Prior work has proposed various self …

Next-generation database interfaces: A survey of LLM-based text-to-SQL

Z Hong, Z Yuan, Q Zhang, H Chen, J Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating accurate SQL from natural language questions (text-to-SQL) is a long-standing
challenge due to the complexities in user question understanding, database schema …

Siren's song in the AI ocean: a survey on hallucination in large language models

Y Zhang, Y Li, L Cui, D Cai, L Liu, T Fu… - arXiv preprint arXiv …, 2023 - arxiv.org
While large language models (LLMs) have demonstrated remarkable capabilities across a
range of downstream tasks, a significant concern revolves around their propensity to exhibit …

Augmented language models: a survey

G Mialon, R Dessì, M Lomeli, C Nalmpantis… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey reviews works in which language models (LMs) are augmented with reasoning
skills and the ability to use tools. The former is defined as decomposing a potentially …

Large language models can be easily distracted by irrelevant context

F Shi, X Chen, K Misra, N Scales… - International …, 2023 - proceedings.mlr.press
Large language models have achieved impressive performance on various natural
language processing tasks. However, so far they have been evaluated primarily on …

LEVER: Learning to verify language-to-code generation with execution

A Ni, S Iyer, D Radev, V Stoyanov… - International …, 2023 - proceedings.mlr.press
The advent of large language models trained on code (code LLMs) has led to significant
progress in language-to-code generation. State-of-the-art approaches in this area combine …

DS-1000: A natural and reliable benchmark for data science code generation

Y Lai, C Li, Y Wang, T Zhang, R Zhong… - International …, 2023 - proceedings.mlr.press
We introduce DS-1000, a code generation benchmark with a thousand data science
problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior …

Language models are multilingual chain-of-thought reasoners

F Shi, M Suzgun, M Freitag, X Wang, S Srivats… - arXiv preprint arXiv …, 2022 - arxiv.org
We evaluate the reasoning abilities of large language models in multilingual settings. We
introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating …

CodeT: Code generation with generated tests

B Chen, F Zhang, A Nguyen, D Zan, Z Lin… - arXiv preprint arXiv …, 2022 - arxiv.org
The task of generating code solutions for a given programming problem can benefit from the
use of pre-trained language models such as Codex, which can produce multiple diverse …

Ask me anything: A simple strategy for prompting language models

S Arora, A Narayan, MF Chen, L Orr, N Guha… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a
natural language prompt that demonstrates how to perform the task and no additional …