AlphaPruning: Using heavy-tailed self-regularization theory for improved layer-wise pruning of large language models

H Lu, Y Zhou, S Liu, Z Wang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent work on pruning large language models (LLMs) has shown that one can eliminate a
large number of parameters without compromising performance, making pruning a …
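
The core idea of heavy-tailed self-regularization (HT-SR) based pruning is to diagnose how well trained each layer is from the tail of its weight matrix's empirical spectral density, then prune heavier-tailed (better-trained) layers less. The sketch below is a minimal illustration of that allocation scheme, not the authors' code: the Hill tail-exponent estimator and the linear alpha-to-sparsity mapping are assumptions made for the example.

```python
import numpy as np

def hill_alpha(W, k_frac=0.1):
    """Hill estimator of the power-law tail exponent of the ESD of W (squared singular values)."""
    evals = np.linalg.svd(W, compute_uv=False) ** 2
    evals = np.sort(evals)[::-1]
    k = max(2, int(k_frac * len(evals)))
    tail = evals[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.standard_normal((256, 256)) for i in range(4)}

alphas = {name: hill_alpha(W) for name, W in layers.items()}
lo, hi = min(alphas.values()), max(alphas.values())
target = 0.5  # desired global sparsity
for name, a in alphas.items():
    # Assumed mapping: smaller alpha (heavier tail, "better trained") -> less pruning.
    s = target * (0.5 + (a - lo) / (hi - lo + 1e-12))
    W = layers[name]
    thresh = np.quantile(np.abs(W), s)  # magnitude pruning at layer sparsity s
    layers[name] = np.where(np.abs(W) >= thresh, W, 0.0)
    print(f"{name}: alpha = {a:.2f}, layer sparsity = {s:.2f}")
```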

Are Gaussian data all you need? The extents and limits of universality in high-dimensional generalized linear estimation

L Pesce, F Krzakala, B Loureiro… - … on Machine Learning, 2023 - proceedings.mlr.press
In this manuscript we consider the problem of generalized linear estimation on Gaussian
mixture data with labels given by a single-index model. Our first result is a sharp asymptotic …
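
A quick way to see what such universality questions ask is to compare a simple estimator on mixture data against moment-matched Gaussian data. The simulation below is an illustrative experiment in that spirit, not the paper's sharp asymptotic analysis; the ridge estimator, the sign single-index link, and the two-component mixture are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 400, 1e-2
w_star = rng.standard_normal(d) / np.sqrt(d)
mu = np.ones(d) / np.sqrt(d)

def mixture(m):
    # two balanced Gaussian clusters centered at +mu and -mu
    signs = rng.choice([-1.0, 1.0], size=m)
    return signs[:, None] * mu + rng.standard_normal((m, d))

def matched_gaussian(m):
    # Gaussian with the mixture's mean (zero) and covariance I + mu mu^T
    return rng.standard_normal((m, d)) + rng.standard_normal((m, 1)) * mu

def ridge_test_err(sample):
    X, Xt = sample(n), sample(n)
    y, yt = np.sign(X @ w_star), np.sign(Xt @ w_star)  # single-index labels
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return np.mean((Xt @ w - yt) ** 2)

print("mixture data     :", ridge_test_err(mixture))
print("matched Gaussian :", ridge_test_err(matched_gaussian))
```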

A theory of non-linear feature learning with one gradient step in two-layer neural networks

B Moniri, D Lee, H Hassani, E Dobriban - arXiv preprint arXiv:2310.07891, 2023 - arxiv.org
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …

Temperature balancing, layer-wise weight analysis, and neural network training

Y Zhou, T Pang, K Liu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Regularization in modern machine learning is crucial, and it can take various forms in
algorithmic design: training set, model family, error function, regularization terms, and …
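
One concrete form such layer-wise weight analysis can take is assigning each layer its own learning rate based on a spectral statistic of its weights. The PyTorch sketch below uses an assumed schedule for illustration (a Hill-style tail exponent and a linear rescaling around the mean), not the paper's algorithm.

```python
import torch

def tail_alpha(W, k_frac=0.1):
    # Hill-style estimate of the tail exponent of the weight matrix's ESD
    evals = torch.linalg.svdvals(W) ** 2
    evals, _ = torch.sort(evals, descending=True)
    k = max(2, int(k_frac * evals.numel()))
    tail = evals[:k]
    return 1.0 + k / torch.log(tail / tail[-1]).sum()

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
base_lr = 1e-3
linears = [m for m in model if isinstance(m, torch.nn.Linear)]
alphas = [tail_alpha(m.weight.detach()) for m in linears]
mean_a = sum(alphas) / len(alphas)
# Assumed schedule: larger alpha (lighter tail, "less trained") -> larger lr.
groups = [{"params": m.parameters(), "lr": base_lr * float(a / mean_a)}
          for m, a in zip(linears, alphas)]
opt = torch.optim.SGD(groups, lr=base_lr)
for g in opt.param_groups:
    print(g["lr"])
```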

DiffDomain enables identification of structurally reorganized topologically associating domains

D Hua, M Gu, X Zhang, Y Du, H Xie, L Qi, X Du… - Nature …, 2024 - nature.com
Topologically associating domains (TADs) are critical structural units in the three-dimensional organization of mammalian genomes. Dynamic reorganizations of TADs between …

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

C Paquette, E Paquette, B Adlam… - Mathematical …, 2024 - Springer
We develop a stochastic differential equation, called homogenized SGD, for analyzing the
dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares …
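
The flavor of such "exact dynamics" results can be previewed in the simplest isotropic case, where the expected excess risk of one-pass SGD on least squares obeys a closed deterministic recursion. The sketch below compares that textbook recursion with simulated SGD; it is not the paper's homogenized SDE, which covers far more general settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, gamma, sigma, runs = 200, 500, 0.2 / 200, 0.5, 20
w_star = rng.standard_normal(d) / np.sqrt(d)

def sgd_dist2():
    # one-pass SGD on y = <w*, x> + noise with x ~ N(0, I_d)
    w, out = np.zeros(d), []
    for _ in range(steps):
        x = rng.standard_normal(d)
        y = x @ w_star + sigma * rng.standard_normal()
        w -= gamma * (x @ w - y) * x
        out.append(np.sum((w - w_star) ** 2))
    return np.array(out)

sim = np.mean([sgd_dist2() for _ in range(runs)], axis=0)

# exact recursion for D_t = E||w_t - w*||^2 under isotropic Gaussian data
D, theory = np.sum(w_star ** 2), []
for _ in range(steps):
    D = (1 - 2 * gamma + gamma**2 * (d + 2)) * D + gamma**2 * sigma**2 * d
    theory.append(D)

print("final simulated D_t :", sim[-1])
print("final predicted D_t :", theory[-1])
```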

Sketched ridgeless linear regression: The role of downsampling

X Chen, Y Zeng, S Yang, Q Sun - … Conference on Machine …, 2023 - proceedings.mlr.press
Overparametrization often helps improve the generalization performance. This paper
presents a dual view of overparametrization suggesting that downsampling may also help …
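
The following numerical sketch illustrates why downsampling can help ridgeless (minimum-norm) regression: near the interpolation threshold n ≈ d the test risk spikes, so discarding training rows can reduce it. This is a toy double-descent experiment, not the paper's sketching analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 600, 500, 0.5
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ w_star + sigma * rng.standard_normal(n)
Xt = rng.standard_normal((2000, d))
yt = Xt @ w_star

for m in [600, 520, 500, 480, 400, 250]:
    idx = rng.choice(n, size=m, replace=False)  # keep a random subset of rows
    w = np.linalg.pinv(X[idx]) @ y[idx]         # min-norm least squares
    err = np.mean((Xt @ w - yt) ** 2)
    print(f"rows kept m={m:4d}: test MSE = {err:.3f}")
```

The risk is worst when m equals the feature dimension d = 500, so both keeping more rows and, perhaps counterintuitively, keeping fewer rows can improve generalization.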

Demystifying disagreement-on-the-line in high dimensions

D Lee, B Moniri, X Huang… - International …, 2023 - proceedings.mlr.press
Evaluating the performance of machine learning models under distribution shifts is
challenging, especially when we only have unlabeled data from the shifted (target) domain …
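
A toy simulation of the phenomenon: train pairs of classifiers on a source distribution and compare their disagreement rates on source versus shifted target data, which tend to vary together roughly linearly. The setup below (mean-shifted Gaussians, bootstrap logistic regression) is an assumption made for illustration, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w_star = rng.standard_normal(d)

def sample(m, shift=0.0):
    X = rng.standard_normal((m, d)) + shift  # mean shift models the distribution shift
    return X, (X @ w_star > 0).astype(float)

def fit(X, y, steps=200, lr=0.5):
    w = np.zeros(d)
    for _ in range(steps):  # plain logistic regression by gradient descent
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

Xs, _ = sample(2000)             # held-out source data
Xt, _ = sample(2000, shift=0.3)  # held-out shifted target data
for _ in range(10):
    m = int(rng.integers(60, 400))  # vary training size to vary disagreement
    w1, w2 = fit(*sample(m)), fit(*sample(m))
    dis = lambda X: np.mean((X @ w1 > 0) != (X @ w2 > 0))
    print(f"source disagreement {dis(Xs):.3f} -> target disagreement {dis(Xt):.3f}")
```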

AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality

P Qing, C Gao, Y Zhou, X Diao, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), are known
to enhance training efficiency in Large Language Models (LLMs). Due to the limited …
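
For context, a LoRA adapter adds a trainable low-rank update B A to a frozen pretrained linear layer. The sketch below shows a standard LoRA module together with a placeholder per-layer "training quality" score used to allocate ranks; the scores and the allocation rule are hypothetical stand-ins for AlphaLoRA's actual layer-quality metric.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # frozen base output plus the scaled low-rank update B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical per-layer quality scores (lower = better trained): give layers
# that look less well trained more adaptation capacity (higher rank).
scores = {"layers.0": 2.1, "layers.1": 3.4, "layers.2": 5.0}
ranks = {k: max(2, int(4 * v / min(scores.values()))) for k, v in scores.items()}
print(ranks)

layer = LoRALinear(torch.nn.Linear(512, 512), r=ranks["layers.2"])
print(layer(torch.randn(8, 512)).shape)
```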

" Lossless" compression of deep neural networks: a high-dimensional neural tangent kernel approach

Y Du, D Xie, S Pu, R Qiu, Z Liao - Advances in Neural …, 2022 - proceedings.neurips.cc
Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the
price of increased depth and having more parameters per layer, making their training and …
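
The kind of "lossless" compression at play can be previewed with random features: in high dimensions, the Gram matrix of a wide layer depends on the weight distribution only through its low moments, so Gaussian weights can be swapped for moment-matched ternary ones with little effect. The check below illustrates that concentration; it is not the paper's NTK derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100, 200, 8000
X = rng.standard_normal((n, d)) / np.sqrt(d)
relu = lambda z: np.maximum(z, 0.0)

def gram(W):
    # Gram matrix of the random-feature map x -> relu(W x), normalized by width
    Phi = relu(X @ W.T)
    return Phi @ Phi.T / W.shape[0]

W_gauss = rng.standard_normal((p, d))
# ternary weights in {-s, 0, +s} with mean 0 and variance 1 matched to Gaussian
q = 2.0 / 3.0            # probability of a nonzero entry
s = 1.0 / np.sqrt(q)     # chosen so that E[w^2] = q * s^2 = 1
W_tern = rng.choice([-s, 0.0, s], size=(p, d), p=[q / 2, 1 - q, q / 2])

G1, G2 = gram(W_gauss), gram(W_tern)
rel = np.linalg.norm(G1 - G2) / np.linalg.norm(G1)
print(f"relative Frobenius gap between Gram matrices: {rel:.3f}")
```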