On student-teacher deviations in distillation: does it pay to disobey?

V Nagarajan, AK Menon… - Advances in …, 2023 - proceedings.neurips.cc
Abstract Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network …

Cluster-aware semi-supervised learning: Relational knowledge distillation provably learns clustering

Y Dong, K Miller, Q Lei, R Ward - Advances in Neural …, 2023 - proceedings.neurips.cc
Despite the empirical success and practical significance of (relational) knowledge distillation
that matches (the relations of) features between teacher and student models, the …
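One common way to "match the relations of features" is a distance-wise relational loss that compares pairwise feature distances across a batch. The sketch below is a generic instantiation of that idea, not necessarily the exact objective analyzed in the paper.

```python
# Sketch of a distance-wise relational KD term: match pairwise feature distances
# between teacher and student representations of the same batch.
import torch
import torch.nn.functional as F

def pairwise_distances(feats):
    # feats: (batch, dim) -> (batch, batch) matrix of Euclidean distances.
    return torch.cdist(feats, feats, p=2)

def relational_kd_loss(student_feats, teacher_feats):
    d_s = pairwise_distances(student_feats)
    d_t = pairwise_distances(teacher_feats)
    # Normalize by the mean distance so the two matrices are on comparable scales.
    d_s = d_s / (d_s.mean() + 1e-8)
    d_t = d_t / (d_t.mean() + 1e-8)
    return F.smooth_l1_loss(d_s, d_t)
```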

Data upcycling knowledge distillation for image super-resolution

Y Zhang, W Li, S Li, H Chen, Z Tu, W Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation (KD) compresses deep neural networks by transferring task-related
knowledge from cumbersome pre-trained teacher models to compact student models …

A little help goes a long way: Efficient llm training by leveraging small lms

AS Rawat, V Sadhanala, A Rostamizadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
A primary challenge in large language model (LLM) development is their onerous pre-
training cost. Typically, such pre-training involves optimizing a self-supervised objective …
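The self-supervised objective referred to here is, in the standard setup, next-token prediction over unlabeled text. A minimal sketch of that loss (independent of how the small LM is leveraged, which the snippet does not specify) is:

```python
# Minimal sketch of the self-supervised next-token objective used in LLM pre-training.
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits: (batch, seq_len, vocab); token_ids: (batch, seq_len)
    # Predict token t+1 from positions up to t, so shift logits and targets by one.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_targets = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_targets)
```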

Towards the fundamental limits of knowledge transfer over finite domains

Q Zhao, B Zhu - arXiv preprint arXiv:2310.07838, 2023 - arxiv.org
We characterize the statistical efficiency of knowledge transfer through $n$ samples from a
teacher to a probabilistic student classifier with input space $\mathcal{S}$ over labels …

The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

D Wu, IV Modoranu, M Safaryan… - Advances in …, 2025 - proceedings.neurips.cc
The rising footprint of machine learning has led to a focus on imposing model sparsity as a
means of reducing computational and memory costs. For deep neural networks (DNNs), the …
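For reference, the classical Optimal Brain Surgeon framework that this line of work builds on removes the weight whose deletion least increases a second-order (Hessian-based) approximation of the loss, and compensates the remaining weights via the inverse Hessian; in standard notation (not the paper's iterative variant):

```latex
% Classical Optimal Brain Surgeon: prune the weight with the smallest saliency
% and apply a compensating update using the inverse Hessian H^{-1}.
\begin{align}
  q^\star &= \arg\min_q \; \frac{w_q^2}{2\,[H^{-1}]_{qq}}
    && \text{(saliency of removing weight } w_q\text{)} \\
  \delta w &= -\frac{w_{q^\star}}{[H^{-1}]_{q^\star q^\star}}\, H^{-1} e_{q^\star}
    && \text{(compensating update; } e_{q^\star}\text{ is the unit vector)}
\end{align}
```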

Learning Neural Networks with Sparse Activations

P Awasthi, N Dikkala, P Kamath… - The Thirty Seventh …, 2024 - proceedings.mlr.press
A core component present in many successful neural network architectures is an MLP block
of two fully connected layers with a non-linear activation in between. An intriguing …
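Concretely, the MLP block described here is the familiar two-layer feed-forward unit; a minimal sketch follows, with ReLU chosen as the non-linearity (the choice that makes the hidden activations exactly sparse).

```python
# Sketch of the MLP block described above: two fully connected layers with a
# non-linearity in between; ReLU produces exact zeros, i.e. sparse activations.
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```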

Progressive distillation induces an implicit curriculum

A Panigrahi, B Liu, S Malladi, A Risteski… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation leverages a teacher model to improve the training of a student model.
A persistent challenge is that a better teacher does not always yield a better student, to …

Distillation Scaling Laws

D Busbridge, A Shidani, F Weers, J Ramapuram… - arXiv preprint arXiv …, 2025 - arxiv.org
We provide a distillation scaling law that estimates distilled model performance based on a
compute budget and its allocation between the student and teacher. Our findings reduce the …
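For orientation only: pre-training scaling laws of this kind typically take a power-law form in model size $N$ and data $D$ (the Chinchilla-style expression below). The snippet does not give the paper's actual distillation scaling law, which additionally depends on how compute is split between student and teacher.

```latex
% Generic Chinchilla-style scaling-law form, shown only as an illustration of
% the functional shape such laws take; not the paper's distillation law.
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```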

On information captured by neural networks: connections with memorization and generalization

H Harutyunyan - arXiv preprint arXiv:2306.15918, 2023 - arxiv.org
Despite the popularity and success of deep learning, there is limited understanding of when,
how, and why neural networks generalize to unseen examples. Since learning can be seen …