Conditional adapters: Parameter-efficient transfer learning with fast inference
We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning
method that also improves inference efficiency. CoDA generalizes beyond standard adapter …
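For context on the adapter family this entry builds on, a minimal sketch of a standard bottleneck adapter layer is given below; the module name, bottleneck size, and residual formulation are illustrative assumptions, not details from the CoDA paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal bottleneck adapter: down-project, nonlinearity, up-project,
    added residually to the (frozen) backbone's hidden states."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only the adapter parameters are trained; the backbone stays frozen.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```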
DistillSpec: Improving speculative decoding via knowledge distillation
Speculative decoding (SD) accelerates large language model inference by employing a
faster draft model for generating multiple tokens, which are then verified in parallel by the …
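As a rough illustration of the draft-and-verify loop this entry refers to, here is a minimal greedy speculative decoding sketch; `target` and `draft` are assumed to be callables returning per-position logits, and the acceptance rule shown is the simple greedy-match variant, not DistillSpec's method.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, context: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy draft-and-verify step. `target` and `draft` map a token id
    sequence of shape (1, T) to logits of shape (1, T, V)."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal = context
    for _ in range(k):
        next_id = draft(proposal)[:, -1].argmax(dim=-1, keepdim=True)
        proposal = torch.cat([proposal, next_id], dim=-1)

    # 2) Target model scores the whole proposal in a single parallel pass.
    target_ids = target(proposal)[:, :-1].argmax(dim=-1)  # target's greedy choice per position

    # 3) Accept the longest prefix where draft and target agree; take the
    #    target's token at the first disagreement, matching greedy target decoding.
    start = context.shape[1]
    accepted = context
    for t in range(start, proposal.shape[1]):
        tgt_tok = target_ids[:, t - 1:t]
        accepted = torch.cat([accepted, tgt_tok], dim=-1)
        if tgt_tok.item() != proposal[0, t].item():
            break
    return accepted
```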
GKD: Generalized knowledge distillation for auto-regressive sequence models
Knowledge distillation is commonly used for compressing neural networks to reduce their
inference cost and memory footprint. However, current distillation methods for auto …
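For reference, below is a minimal token-level distillation loss of the kind such methods start from, using the forward KL between teacher and student distributions; this is a generic sketch under assumed logit shapes, not GKD's generalized objective.

```python
import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """Forward-KL distillation over the vocabulary at every position.
    Both logit tensors are assumed to have shape (batch, seq_len, vocab)."""
    vocab = student_logits.size(-1)
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over all token positions.
    kl = F.kl_div(s_logprobs.view(-1, vocab), t_logprobs.view(-1, vocab),
                  log_target=True, reduction="batchmean")
    return kl * (temperature ** 2)
```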
When attention meets fast recurrence: Training language models with reduced compute
T Lei - arXiv preprint arXiv:2102.12459, 2021 - arxiv.org
Large language models have become increasingly difficult to train because of the growing
computation time and cost. In this work, we present SRU++, a highly-efficient architecture …
Learning to generate better than your LLM
Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning Large
Language Models (LLMs) for conditional text generation. In particular, recent LLMs such as …
Driver behavioral cloning for route following in autonomous vehicles using task knowledge distillation
Planning an appropriate driving trajectory for route following is an important function for
autonomous driving. Behavioral cloning, which allows automatic trajectory learning and …
DistiLLM: Towards streamlined distillation for large language models
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller
student model, reducing its inference cost and memory footprint while preserving model …
Multi-teacher distillation with single model for neural machine translation
Knowledge distillation (KD) is an effective strategy for neural machine translation (NMT) to
improve the performance of a student model. Usually, the teacher can guide the student to …
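As a rough sketch of the multi-teacher setting, one common baseline averages the teachers' predictive distributions into a single soft target; the function below assumes per-token logits and uniform teacher weighting, which may differ from the cited paper's combination scheme.

```python
from typing import List
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits: torch.Tensor,
                          teacher_logits_list: List[torch.Tensor]) -> torch.Tensor:
    """Distill from several teachers by uniformly averaging their predictive
    distributions into one soft target (a simple multi-teacher baseline)."""
    vocab = student_logits.size(-1)
    teacher_probs = torch.stack(
        [F.softmax(t, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    s_logprobs = F.log_softmax(student_logits, dim=-1)
    # KL between the averaged teacher distribution and the student, per token.
    return F.kl_div(s_logprobs.reshape(-1, vocab),
                    teacher_probs.reshape(-1, vocab),
                    reduction="batchmean")
```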
Teaching autoregressive language models complex tasks by demonstration
G Recchia - arXiv preprint arXiv:2109.02102, 2021 - arxiv.org
This paper demonstrates that by fine-tuning an autoregressive language model (GPT-Neo)
on appropriately structured step-by-step demonstrations, it is possible to teach it to execute a …
Target-side augmentation for document-level machine translation
Document-level machine translation faces the challenge of data sparsity due to its long input
length and a small amount of training data, increasing the risk of learning spurious patterns …