Self-play fine-tuning converts weak language models to strong language models

Z Chen, Y Deng, H Yuan, K Ji, Q Gu - arXiv preprint arXiv:2401.01335, 2024 - arxiv.org
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is
pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the …

LoRA learns less and forgets less

D Biderman, J Portes, JJG Ortiz, M Paul… - arXiv preprint arXiv …, 2024 - arxiv.org
Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for
large language models. LoRA saves memory by training only low rank perturbations to …
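
To make the mechanism concrete, here is a minimal sketch of a LoRA-style layer in PyTorch, assuming a frozen linear projection; the class name LoRALinear and the rank/alpha values are illustrative choices, not details taken from the paper.

# Sketch of a LoRA-style layer: the pretrained weight is frozen and only the
# low-rank perturbation B @ A (scaled by alpha / r) is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))  # gradients flow only through A and B

With r = 8 and d_in = d_out = 768, this trains roughly 12K parameters per layer instead of about 590K, which is where the memory savings come from.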

Many-shot in-context learning

R Agarwal, A Singh, LM Zhang, B Bohnet… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) excel at few-shot in-context learning (ICL)--learning from a
few examples provided in context at inference, without any weight updates. Newly expanded …
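
As a rough illustration of what "many-shot" prompting amounts to in practice, the sketch below stacks hundreds of solved examples into the context ahead of the query; the Q/A formatting and the commented-out call_model function are assumptions, not the paper's setup.

# Build a many-shot prompt: solved examples are concatenated before the test
# question, and the model adapts in-context with no weight updates.
def build_many_shot_prompt(examples, query):
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

examples = [("2 + 2 = ?", "4"), ("3 * 5 = ?", "15")] * 128  # hundreds of shots, not just a few
prompt = build_many_shot_prompt(examples, "7 * 6 = ?")
# answer = call_model(prompt)  # hypothetical inference call against a long-context model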

Self-training: A survey

MR Amini, V Feofanov, L Pauletto, L Hadjadj… - Neurocomputing, 2025 - Elsevier
Self-training methods have gained significant attention in recent years due to their
effectiveness in leveraging small labeled datasets and large unlabeled observations for …
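
A compact sketch of the classic pseudo-labeling loop that self-training methods build on, using scikit-learn; the logistic-regression model, confidence threshold, and round count are illustrative assumptions.

# Classic self-training: fit on the small labeled set, pseudo-label the unlabeled
# points the model is confident about, fold them in, and refit.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
    X, y = X_lab.copy(), y_lab.copy()
    model = LogisticRegression(max_iter=1000).fit(X, y)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break
        X = np.vstack([X, X_unlab[confident]])
        y = np.concatenate([y, model.classes_[probs[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
        model = LogisticRegression(max_iter=1000).fit(X, y)
    return model

scikit-learn also ships a SelfTrainingClassifier wrapper (sklearn.semi_supervised) that implements essentially this loop around any probabilistic estimator.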

ReST-MCTS*: LLM self-training via process reward guided tree search

D Zhang, S Zhoubian, Z Hu, Y Yue, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent methodologies in LLM self-training mostly rely on LLM generating responses and
filtering those with correct output answers as training data. This approach often yields a low …
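
The generate-then-filter recipe described in this snippet (as opposed to the paper's process-reward-guided tree search) can be sketched roughly as follows; sample_fn and answer_of stand in for an actual LLM sampler and answer parser.

# Rejection-sampling style self-training: sample several candidate solutions per
# problem, keep only those whose final answer matches the reference, and reuse
# the survivors as supervised fine-tuning data.
def build_self_training_set(problems, sample_fn, answer_of, n_samples=8):
    dataset = []
    for prob in problems:
        for _ in range(n_samples):
            solution = sample_fn(prob["question"])
            if answer_of(solution) == prob["answer"]:
                dataset.append({"prompt": prob["question"], "completion": solution})
    return dataset

# Toy usage with stub functions standing in for the model and the parser.
problems = [{"question": "2 + 3 = ?", "answer": "5"}]
data = build_self_training_set(
    problems,
    sample_fn=lambda q: "2 + 3 = 5",
    answer_of=lambda s: s.split("=")[-1].strip(),
)
# data would then be passed to a supervised fine-tuning step.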

Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …

Generative verifiers: Reward modeling as next-token prediction

L Zhang, A Hosseini, H Bansal, M Kazemi… - arXiv preprint arXiv …, 2024 - arxiv.org
Verifiers or reward models are often used to enhance the reasoning performance of large
language models (LLMs). A common approach is the Best-of-N method, where N candidate …
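
For reference, a minimal sketch of the Best-of-N selection scheme the snippet mentions: draw N candidates and return the one the verifier scores highest. sample_fn and score_fn are placeholders for an actual generator and reward model.

# Best-of-N: sample N candidate answers and keep the one with the highest
# verifier / reward-model score.
import random

def best_of_n(prompt, sample_fn, score_fn, n=8):
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_fn(prompt, c))

# Toy usage with stand-in functions for the generator and the verifier.
answer = best_of_n(
    "What is 12 * 12?",
    sample_fn=lambda p: random.choice(["144", "124", "142"]),
    score_fn=lambda p, c: 1.0 if c == "144" else 0.0,
)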

LLM2LLM: Boosting LLMs with novel iterative data enhancement

N Lee, T Wattanawong, S Kim, K Mangalam… - arXiv preprint arXiv …, 2024 - arxiv.org
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast
majority of natural language processing tasks. While many real-world applications still …

Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling

H Bansal, A Hosseini, R Agarwal, VQ Tran… - arXiv preprint arXiv …, 2024 - arxiv.org
Training on high-quality synthetic data from strong language models (LMs) is a common
strategy to improve the reasoning performance of LMs. In this work, we revisit whether this …

A survey on knowledge distillation of large language models

X Xu, M Li, C Tao, T Shen, R Cheng, J Li, C Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an in-depth exploration of knowledge distillation (KD) techniques
within the realm of Large Language Models (LLMs), spotlighting the pivotal role of KD in …