Conditional adapters: Parameter-efficient transfer learning with fast inference

T Lei, J Bai, S Brahma, J Ainslie… - Advances in …, 2023 - proceedings.neurips.cc
We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning
method that also improves inference efficiency. CoDA generalizes beyond standard adapter …
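
Since the abstract refers to the "standard adapter" design that CoDA generalizes, a minimal sketch of such a bottleneck adapter may help. This is an illustrative baseline only, not the CoDA method; the class and parameter names are our own.

# Minimal sketch of a standard bottleneck adapter (the baseline CoDA is said to
# generalize); NOT the CoDA method itself.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: only the small down/up projections are trained,
        # while the surrounding transformer weights stay frozen.
        return x + self.up(self.act(self.down(x)))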

DistillSpec: Improving speculative decoding via knowledge distillation

Y Zhou, K Lyu, AS Rawat, AK Menon… - arXiv preprint arXiv …, 2023 - arxiv.org
Speculative decoding (SD) accelerates large language model inference by employing a
faster draft model for generating multiple tokens, which are then verified in parallel by the …
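
The abstract describes the basic speculative decoding loop (a small draft model proposes several tokens, the large target model verifies them in parallel). The sketch below illustrates that loop only, under assumed model interfaces (`draft_probs`, `target_probs` are illustrative stand-ins, not a real API); DistillSpec's distillation of the draft model is not shown.

# Hedged sketch of vanilla speculative decoding: draft proposes k tokens,
# target scores them in one parallel pass, tokens accepted with prob min(1, p/q).
import torch

def speculative_step(prefix, draft_probs, target_probs, k=4):
    # 1) Draft model proposes k tokens autoregressively.
    proposed, q = [], []
    ctx = list(prefix)
    for _ in range(k):
        dist = draft_probs(ctx)              # (vocab,) draft distribution
        tok = int(torch.multinomial(dist, 1))
        proposed.append(tok); q.append(dist); ctx.append(tok)

    # 2) Target model scores all k positions in a single parallel pass.
    p = target_probs(prefix, proposed)       # list of k (vocab,) target distributions

    # 3) Accept each proposal with prob min(1, p(tok)/q(tok)); on the first
    #    rejection, resample from the (normalized) residual and stop.
    out = list(prefix)
    for i, tok in enumerate(proposed):
        if torch.rand(()) < min(1.0, float(p[i][tok] / q[i][tok])):
            out.append(tok)
        else:
            residual = torch.clamp(p[i] - q[i], min=0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return out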

GKD: Generalized knowledge distillation for auto-regressive sequence models

R Agarwal, N Vieillard, P Stanczyk, S Ramos… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge distillation is commonly used for compressing neural networks to reduce their
inference cost and memory footprint. However, current distillation methods for auto …
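
For reference, the standard token-level distillation objective the abstract builds on is a softened KL divergence between teacher and student distributions. The sketch below shows only that baseline loss under assumed logit shapes; GKD's generalizations (on-policy student sequences, alternative divergences) are not reproduced here.

# Minimal sketch of the standard token-wise KD loss (forward KL, teacher -> student).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # logits: (batch, seq_len, vocab), assumed shapes for illustration
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # Forward KL, scaled by T^2 as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)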

When attention meets fast recurrence: Training language models with reduced compute

T Lei - arXiv preprint arXiv:2102.12459, 2021 - arxiv.org
Large language models have become increasingly difficult to train because of the growing
computation time and cost. In this work, we present SRU++, a highly-efficient architecture …

Learning to generate better than your LLM

JD Chang, K Brantley, R Ramamurthy, D Misra… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large
Language Models (LLMs) for conditional text generation. In particular, recent LLMs such as …

Driver behavioral cloning for route following in autonomous vehicles using task knowledge distillation

G Li, Z Ji, S Li, X Luo, X Qu - IEEE Transactions on Intelligent …, 2022 - ieeexplore.ieee.org
Planning an appropriate driving trajectory for route following is an important function for
autonomous driving. Behavioral cloning, which allows automatic trajectory learning and …

DistiLLM: Towards streamlined distillation for large language models

J Ko, S Kim, T Chen, SY Yun - arXiv preprint arXiv:2402.03898, 2024 - arxiv.org
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …

Multi-teacher distillation with single model for neural machine translation

X Liang, L Wu, J Li, T Qin, M Zhang… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org
Knowledge distillation (KD) is an effective strategy for neural machine translation (NMT) to
improve the performance of a student model. Usually, the teacher can guide the student to …
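
One common way to combine guidance from several teachers, which the abstract alludes to, is to average their output distributions into a single soft target for the student. The sketch below shows only that generic idea, with uniform teacher weights assumed; the paper's specific single-model multi-teacher scheme is not reproduced.

# Hedged sketch of simple multi-teacher distillation via averaged soft targets.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature: float = 2.0):
    t = temperature
    # Average the teachers' softened distributions (uniform weights assumed).
    teacher_p = torch.stack(
        [F.softmax(logits / t, dim=-1) for logits in teacher_logits_list]
    ).mean(dim=0)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)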

Teaching autoregressive language models complex tasks by demonstration

G Recchia - arXiv preprint arXiv:2109.02102, 2021 - arxiv.org
This paper demonstrates that by fine-tuning an autoregressive language model (GPT-Neo)
on appropriately structured step-by-step demonstrations, it is possible to teach it to execute a …

Target-side augmentation for document-level machine translation

G Bao, Z Teng, Y Zhang - arXiv preprint arXiv:2305.04505, 2023 - arxiv.org
Document-level machine translation faces the challenge of data sparsity due to its long input
length and a small amount of training data, increasing the risk of learning spurious patterns …