Initializing Variable-sized Vision Transformers from Learngene with Learnable Transformation

S Xia, Y Zu, X Yang, X Geng - Advances in Neural …, 2025 - proceedings.neurips.cc
In practical scenarios, it is necessary to build variable-sized models to accommodate diverse
resource constraints, where weight initialization serves as a crucial step preceding training …

Superposed decoding: Multiple generations from a single autoregressive inference pass

E Shen, A Fan, SM Pratt, JS Park, M Wallingford… - arXiv preprint arXiv …, 2024 - arxiv.org
Many applications today provide users with multiple auto-complete drafts as they type,
including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto …

Neural Metamorphosis

X Yang, X Wang - European Conference on Computer Vision, 2024 - Springer
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta),
which aims to build self-morphable neural networks. Contrary to crafting separate models for …

Efficient stagewise pretraining via progressive subnetworks

A Panigrahi, N Saunshi, K Lyu, S Miryoosefi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in large language models have sparked interest in efficient
pretraining methods. Stagewise training approaches to improve efficiency, like gradual …

Progressive ensemble distillation: Building ensembles for efficient inference

D Dennis, A Shetty, AP Sevekari… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge distillation is commonly used to compress an ensemble of models into a
single model. In this work we study the problem of progressive ensemble distillation: Given a …

Starbucks: Improved Training for 2D Matryoshka Embeddings

S Zhuang, S Wang, B Koopman, G Zuccon - arXiv preprint arXiv …, 2024 - arxiv.org
Effective approaches that can scale embedding model depth (i.e., layers) and embedding size
allow for the creation of models that are highly scalable across different computational …
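
A minimal sketch of the embedding-size half of this idea, assuming a Matryoshka-style setup in which the first d dimensions of a trained embedding are kept and re-normalized to serve as a smaller embedding; the helper name and the dims values below are illustrative, not taken from the paper:

import numpy as np

def matryoshka_truncate(embedding, dims=(64, 128, 256, 768)):
    # Hypothetical helper: derive nested sub-embeddings by keeping only the
    # first d dimensions and re-normalizing (illustrative sketch only).
    embedding = np.asarray(embedding, dtype=np.float32)
    subs = {}
    for d in dims:
        sub = embedding[:d]
        subs[d] = sub / (np.linalg.norm(sub) + 1e-12)
    return subs

# Example: one full 768-d embedding yields usable 64/128/256/768-d variants.
full = np.random.randn(768).astype(np.float32)
nested = matryoshka_truncate(full)
print({d: v.shape for d, v in nested.items()})

The same nesting idea applied to model depth (taking representations from earlier layers) is the second axis the title's "2D" refers to.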

MatMamba: A Matryoshka State Space Model

A Shukla, S Vemprala, A Kusupati, A Kapoor - arXiv preprint arXiv …, 2024 - arxiv.org
State Space Models (SSMs) like Mamba2 are a promising alternative to Transformers, with
faster theoretical training and inference times--especially for long context lengths. Recent …

AdANNS: A framework for adaptive semantic search

A Rege, A Kusupati, A Fan, Q Cao… - Advances in …, 2023 - proceedings.neurips.cc
Web-scale search systems learn an encoder to embed a given query which is then hooked
into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points …
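
A minimal sketch of the pipeline this snippet describes (encode a query, then retrieve its nearest neighbors), assuming unit-normalized corpus embeddings and using exact brute-force cosine search as a stand-in for a real ANNS index; the encode and retrieve helpers are hypothetical placeholders for a learned encoder and an ANNS library:

import numpy as np

# Toy corpus embeddings; a web-scale system would store these in an ANNS index.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 256)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def encode(text: str) -> np.ndarray:
    # Stand-in for a learned query encoder (illustrative only).
    vec = rng.standard_normal(256).astype(np.float32)
    return vec / np.linalg.norm(vec)

def retrieve(query: str, k: int = 5):
    q = encode(query)
    scores = corpus @ q  # cosine similarity, since all vectors are unit-norm
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

print(retrieve("example query"))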

When One LLM Drools, Multi-LLM Collaboration Rules

S Feng, W Ding, A Liu, Z Wang, W Shi, Y Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
This position paper argues that in many realistic (i.e., complex, contextualized, subjective)
scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo …

From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

K Nishu, S Mehta, S Abnar, M Farajtabar… - arXiv preprint arXiv …, 2025 - arxiv.org
Training large language models (LLMs) for different inference constraints is computationally
expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these …