Mathematical discoveries from program search with large language models

B Romera-Paredes, M Barekatain, A Novikov, M Balog… - Nature, 2024 - nature.com
Large language models (LLMs) have demonstrated tremendous capabilities in solving
complex tasks, from quantitative reasoning to understanding natural language. However …
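
The work behind this entry (FunSearch) pairs an LLM that proposes programs with an automatic evaluator inside an evolutionary loop. A minimal sketch of that pattern, with `llm_propose` and `evaluate` as hypothetical stand-ins for a real model call and a problem-specific scorer:

```python
import random

def llm_propose(parents: list[str]) -> str:
    """Hypothetical stand-in for an LLM call that takes the best
    programs so far as few-shot context and returns a new variant."""
    return random.choice(parents)  # placeholder "mutation"

def evaluate(program: str) -> float:
    """Problem-specific scorer; the real system executes the program
    and measures its result. Placeholder: prefer shorter programs."""
    return -len(program)

def program_search(seed: str, rounds: int = 100, pool_size: int = 10) -> str:
    """Evolutionary loop: keep a scored pool, ask the model for
    variants of the top members, retain only the best pool_size."""
    pool = [(evaluate(seed), seed)]
    for _ in range(rounds):
        parents = [p for _, p in sorted(pool, reverse=True)[:2]]
        child = llm_propose(parents)
        pool.append((evaluate(child), child))
        pool = sorted(pool, reverse=True)[:pool_size]  # truncation selection
    return max(pool)[1]
```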

Quiet-STaR: Language models can teach themselves to think before speaking

E Zelikman, GR Harik, Y Shao, V Jayasiri… - First Conference on …, 2024 - openreview.net
When writing and talking, people sometimes pause to think. Although reasoning-focused
works have often framed reasoning as a method of answering questions or completing …

CoderEval: A benchmark of pragmatic code generation with generative pre-trained models

H Yu, B Shen, D Ran, J Zhang, Q Zhang, Y Ma… - Proceedings of the 46th …, 2024 - dl.acm.org
Code generation models based on the pre-training and fine-tuning paradigm have been
increasingly attempted by both academia and industry, resulting in well-known industrial …

Buffer of thoughts: Thought-augmented reasoning with large language models

L Yang, Z Yu, T Zhang, S Cao, M Xu… - Advances in …, 2025 - proceedings.neurips.cc
We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented
reasoning approach for enhancing accuracy, efficiency and robustness of large language …

Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement

L Qiu, L Jiang, X Lu, M Sclar, V Pyatkin… - arXiv preprint arXiv …, 2023 - arxiv.org
The ability to derive underlying principles from a handful of observations and then
generalize to novel situations--known as inductive reasoning--is central to human …
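
The hypothesis refinement in the title can be read as a propose-test-revise loop: draft a rule from a handful of observations, check it against all of them, and feed failures back to the model. A minimal sketch under that reading; `propose_hypothesis` is a hypothetical stand-in for the model call, and `eval` is used only for illustration:

```python
def propose_hypothesis(examples, feedback=""):
    """Hypothetical stand-in for an LLM call that writes a candidate
    rule as a Python expression over x."""
    return "x * 2"

def refine_loop(examples, max_rounds=5):
    """Propose a rule, test it on every observation, and refine with
    the failing cases as feedback until the rule is consistent."""
    feedback = ""
    for _ in range(max_rounds):
        rule = propose_hypothesis(examples, feedback)
        failures = [(x, y) for x, y in examples
                    if eval(rule, {"x": x}) != y]  # test on all observations
        if not failures:
            return rule  # consistent with every observation
        feedback = f"rule {rule!r} failed on {failures}"
    return None

print(refine_loop([(1, 2), (3, 6)]))  # -> "x * 2"
```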

CRUXEval: A benchmark for code reasoning, understanding and execution

A Gu, B Rozière, H Leather, A Solar-Lezama… - arXiv preprint arXiv …, 2024 - arxiv.org
We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a
benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an …
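
CRUXEval pairs each function with an input-output example and asks models to predict one side from the other. An illustrative item in that style (the function and assertions below are invented, not drawn from the benchmark):

```python
# A CRUXEval-style item: a short Python function plus one
# input-output pair.
def f(text):
    return text.replace(" ", "_").upper()

# Output prediction: given f and the input, complete the right side.
assert f("hello world") == "HELLO_WORLD"

# Input prediction: given f and the output, supply any input that works.
assert f("a b") == "A_B"
```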

If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents

K Yang, J Liu, J Wu, C Yang, YR Fung, S Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The prominent large language models (LLMs) of today differ from past language models not
only in size, but also in the fact that they are trained on a combination of natural language …

SelfEvolve: A code evolution framework via large language models

S Jiang, Y Wang, Y Wang - arXiv preprint arXiv:2306.02907, 2023 - arxiv.org
Large language models (LLMs) have already revolutionized code generation, after being
pretrained on publicly available code data. However, while various methods have been …
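
One common shape for such LLM-driven code evolution is a draft-execute-revise loop in which interpreter errors become the revision signal. A minimal sketch under that assumption; `llm_write_code` and `check` are hypothetical stand-ins for a real model call and a task-specific test:

```python
import traceback

def llm_write_code(task: str, error: str = "") -> str:
    """Hypothetical stand-in for an LLM call that drafts (or, given a
    traceback, revises) a solution; stub returns a fixed snippet."""
    return "def solve(n):\n    return n * n"

def check(solve):
    assert solve(3) == 9  # task-specific unit test

def evolve(task: str, max_rounds: int = 3):
    """Draft code, execute the tests, and feed any traceback back to
    the model as the signal for the next revision."""
    error = ""
    for _ in range(max_rounds):
        code = llm_write_code(task, error)
        namespace: dict = {}
        try:
            exec(code, namespace)        # define the candidate
            check(namespace["solve"])    # run the tests
            return code                  # tests pass: accept revision
        except Exception:
            error = traceback.format_exc()  # interpreter feedback
    return None

print(evolve("square a number"))
```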

Language model crossover: Variation through few-shot prompting

E Meyerson, MJ Nelson, H Bradley, A Gaier… - ACM Transactions on …, 2024 - dl.acm.org
This article pursues the insight that language models naturally enable an intelligent variation
operator similar in spirit to evolutionary crossover. In particular, language models of …
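
The crossover operator the abstract describes can be approximated by concatenating parent genomes into a few-shot prompt and treating the model's continuation as the offspring. A minimal sketch; `llm_complete` is a hypothetical stub for a real sampling call:

```python
import random

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for sampling a continuation from a
    language model (stub: echoes one of the parents)."""
    return prompt.splitlines()[-2]

def crossover(population: list[str], k: int = 3) -> str:
    """Few-shot crossover: list k parents as a prompt and let the
    model's continuation act as the child, blending their traits."""
    parents = random.sample(population, k)
    prompt = "\n".join(parents) + "\n"
    return llm_complete(prompt)

pop = ["red circle", "blue square", "green triangle", "red square"]
print(crossover(pop))
```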

Execution-based evaluation for open-domain code generation

Z Wang, S Zhou, D Fried, G Neubig - arXiv preprint arXiv:2212.10481, 2022 - arxiv.org
To extend the scope of coding queries to more realistic settings, we propose ODEX, the first
Open-Domain EXecution-based natural language (NL) to Python code generation dataset …
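
Execution-based evaluation judges a generated program by running it against test cases rather than by string match against a reference. A minimal harness in that spirit (not the ODEX evaluator itself):

```python
def passes(candidate: str, tests: str) -> bool:
    """Execute a generated candidate and its test cases; the candidate
    counts as correct only if every assertion runs without error."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)   # define the candidate function
        exec(tests, namespace)       # run the accompanying test cases
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes(candidate, tests))  # True
```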