Understanding self-distillation in the presence of label noise
Self-distillation (SD) is the process of first training a "teacher" model and then using its
predictions to train a "student" model that has the same architecture. Specifically, the …
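As an aside to the snippet above, a minimal PyTorch sketch of the two-stage self-distillation recipe it describes (make_model and loader are assumed placeholders, not names from the paper):

    # Self-distillation sketch: train a teacher on (possibly noisy) hard labels,
    # then train a student of the same architecture on the teacher's predictions.
    import torch
    import torch.nn.functional as F

    def train(model, loader, loss_fn, epochs=10, lr=1e-2):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), x, y).backward()
                opt.step()
        return model

    # Stage 1: teacher fit to the observed (possibly noisy) labels.
    teacher = train(make_model(), loader,
                    lambda logits, x, y: F.cross_entropy(logits, y))

    # Stage 2: student of the same architecture fit to the teacher's soft predictions.
    def sd_loss(student_logits, x, y):
        with torch.no_grad():
            soft_targets = F.softmax(teacher(x), dim=-1)
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        soft_targets, reduction="batchmean")

    student = train(make_model(), loader, sd_loss)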
SeqNAS: Neural architecture search for event sequence classification
Neural Architecture Search (NAS) methods are widely used in various industries to obtain
high-quality, task-specific solutions with minimal human intervention. Event Sequences …
On student-teacher deviations in distillation: does it pay to disobey?
Knowledge distillation (KD) has been widely used to improve the test accuracy of a
"student" network, by training it to mimic the soft probabilities of a trained "teacher" network …
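One way to make the "deviation" in the title concrete is the small diagnostic below (illustrative only, not the paper's metric), which measures how far a distilled student strays from its teacher on held-out data:

    # Illustrative student-teacher deviation check: top-1 agreement rate and
    # mean per-example KL divergence between student and teacher predictions.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def deviation_stats(student, teacher, loader):
        agree, total, kl_sum = 0, 0, 0.0
        for x, _ in loader:
            s_log = F.log_softmax(student(x), dim=-1)
            t_prob = F.softmax(teacher(x), dim=-1)
            agree += (s_log.argmax(dim=-1) == t_prob.argmax(dim=-1)).sum().item()
            kl_sum += F.kl_div(s_log, t_prob, reduction="sum").item()
            total += x.size(0)
        return agree / total, kl_sum / total   # agreement in [0, 1], mean KL >= 0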
Induced Model Matching: Restricted Models Help Train Full-Featured Models
U Muneeb, MI Ohannessian - Advances in Neural …, 2025 - proceedings.neurips.cc
We consider scenarios where a very accurate (often small) predictive model using restricted
features is available when training a full-featured (often larger) model. This restricted model …
What Mechanisms Does Knowledge Distillation Distill?
Knowledge distillation is a commonly-used compression method in ML due to the
popularity of increasingly large-scale models, but it is unclear if all the information a teacher …
Trans-LoRA: Towards Data-Free Transferable Parameter Efficient Finetuning
Low-rank adapters (LoRA) and their variants are popular parameter-efficient finetuning
(PEFT) techniques that closely match full model fine-tune performance while requiring only a …
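Since the snippet leans on the LoRA formulation, here is a minimal sketch of a low-rank adapter around a frozen linear layer (the generic LoRA update y = Wx + (alpha/r)·BAx, not the Trans-LoRA transfer procedure itself):

    # Generic LoRA adapter: base weights stay frozen, only the low-rank
    # factors A and B are trained.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # freeze the pretrained weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # y = W x + (alpha / r) * B A x
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)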
Bayesian Optimization Meets Self-Distillation
Bayesian optimization (BO) has contributed greatly to improving model performance by
suggesting promising hyperparameter configurations iteratively based on observations from …
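To ground the snippet's description of BO, a minimal loop with a Gaussian-process surrogate and expected-improvement acquisition over a single hyperparameter (objective is a placeholder black box; the paper's coupling with self-distillation is not shown here):

    # Minimal Bayesian-optimization sketch: GP surrogate + expected improvement.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def expected_improvement(mu, sigma, best, xi=0.01):
        sigma = np.maximum(sigma, 1e-9)
        z = (best - mu - xi) / sigma        # minimisation: improvement = best - mu
        return (best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    def bayes_opt(objective, bounds=(1e-4, 1e-1), n_init=3, n_iter=15, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.uniform(*bounds, size=(n_init, 1))
        y = np.array([objective(x[0]) for x in X])
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        for _ in range(n_iter):
            gp.fit(X, y)
            cand = rng.uniform(*bounds, size=(256, 1))
            mu, sigma = gp.predict(cand, return_std=True)
            x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
            X = np.vstack([X, x_next[None, :]])
            y = np.append(y, objective(x_next[0]))
        return X[np.argmin(y)], y.min()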
LoRA-X: Bridging Foundation Models with Training-Free Cross-Model Adaptation
The rising popularity of large foundation models has led to a heightened demand for
parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), which offer …
Incremental Soft Pruning to Get the Sparse Neural Network During Training
K Zhu, F Hu, Y Ding, Y Dong… - 2024 International Joint …, 2024 - ieeexplore.ieee.org
The traditional three-stage pruning pipeline is first to train an original dense network, then
identify redundant parts of the network for pruning based on the evaluation metrics of the …
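For contrast with the incremental scheme, a sketch of the traditional pipeline's pruning step described in the snippet: global magnitude pruning applied after dense training (the baseline, not the paper's soft-pruning method):

    # Global magnitude pruning: zero out the smallest-magnitude weights after
    # the dense network has been trained; masks can be re-applied while fine-tuning.
    import torch

    @torch.no_grad()
    def magnitude_prune(model, sparsity=0.8):
        weights = [p for p in model.parameters() if p.dim() > 1]
        scores = torch.cat([w.abs().flatten() for w in weights])
        k = max(1, int(sparsity * scores.numel()))
        threshold = scores.kthvalue(k).values     # k-th smallest magnitude
        masks = []
        for w in weights:
            mask = (w.abs() > threshold).float()
            w.mul_(mask)                          # prune in place
            masks.append(mask)
        return masks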