Self-play fine-tuning converts weak language models to strong language models

Z Chen, Y Deng, H Yuan, K Ji, Q Gu - arXiv preprint arXiv:2401.01335, 2024 - arxiv.org
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is
pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the …

Many-shot in-context learning

R Agarwal, A Singh, L Zhang… - Advances in …, 2025 - proceedings.neurips.cc
Large language models (LLMs) excel at few-shot in-context learning (ICL): learning from a
few examples provided in context at inference, without any weight updates. Newly expanded …

Iterative reasoning preference optimization

RY Pang, W Yuan, H He, K Cho… - Advances in …, 2025 - proceedings.neurips.cc
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks. In …

ReST-MCTS*: LLM self-training via process reward guided tree search

D Zhang, S Zhoubian, Z Hu, Y Yue… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent methodologies in LLM self-training mostly rely on LLM generating responses and
filtering those with correct output answers as training data. This approach often yields a low …

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a
pivotal methodology for transferring advanced capabilities from leading proprietary LLMs …

LoRA learns less and forgets less

D Biderman, J Portes, JJG Ortiz, M Paul… - … on Machine Learning …, 2024 - openreview.net
Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for
large language models. LoRA saves memory by training only low-rank perturbations to …

Self-play preference optimization for language model alignment

Y Wu, Z Sun, H Yuan, K Ji, Y Yang, Q Gu - arXiv preprint arXiv:2405.00675, 2024 - arxiv.org
Standard reinforcement learning from human feedback (RLHF) approaches relying on
parametric models like the Bradley-Terry model fall short in capturing the intransitivity and …

DART-Math: Difficulty-aware rejection tuning for mathematical problem-solving

Y Tong, X Zhang, R Wang, R Wu… - Advances in Neural …, 2025 - proceedings.neurips.cc
Solving mathematical problems requires advanced reasoning abilities and presents notable
challenges for large language models. Previous works usually synthesize data from …

Self-training: A survey

MR Amini, V Feofanov, L Pauletto, L Hadjadj… - Neurocomputing, 2025 - Elsevier
Self-training methods have gained significant attention in recent years due to their
effectiveness in leveraging small labeled datasets and large unlabeled observations for …

Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …