AlphaPruning: Using heavy-tailed self-regularization theory for improved layer-wise pruning of large language models
Recent work on pruning large language models (LLMs) has shown that one can eliminate a
large number of parameters without compromising performance, making pruning a …
Are Gaussian data all you need? The extents and limits of universality in high-dimensional generalized linear estimation
In this manuscript we consider the problem of generalized linear estimation on Gaussian
mixture data with labels given by a single-index model. Our first result is a sharp asymptotic …
A theory of non-linear feature learning with one gradient step in two-layer neural networks
Feature learning is thought to be one of the fundamental reasons for the success of deep
neural networks. It is rigorously known that in two-layer fully-connected neural networks …
Temperature balancing, layer-wise weight analysis, and neural network training
Regularization in modern machine learning is crucial, and it can take various forms in
algorithmic design: training set, model family, error function, regularization terms, and …
DiffDomain enables identification of structurally reorganized topologically associating domains
Topologically associating domains (TADs) are critical structural units in the three-dimensional
genome organization of mammalian genomes. Dynamic reorganizations of TADs between …
Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties
We develop a stochastic differential equation, called homogenized SGD, for analyzing the
dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares …
Sketched ridgeless linear regression: The role of downsampling
Overparametrization often helps improve the generalization performance. This paper
presents a dual view of overparametrization suggesting that downsampling may also help …
Demystifying disagreement-on-the-line in high dimensions
Evaluating the performance of machine learning models under distribution shifts is
challenging, especially when we only have unlabeled data from the shifted (target) domain …
AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality
Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), are known
to enhance training efficiency in Large Language Models (LLMs). Due to the limited …
" Lossless" compression of deep neural networks: a high-dimensional neural tangent kernel approach
Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the
price of increased depth and having more parameters per layer, making their training and …