Tensor attention training: Provably efficient learning of higher-order transformers
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …
Conv-basis: A new paradigm for efficient attention inference and gradient computation in transformers
The self-attention mechanism is the key to the success of transformers in recent Large
Language Models (LLMs). However, the quadratic computational cost $O(n^2)$ in the …
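As a hedged aside on where that quadratic term comes from: a plain attention pass forms an n-by-n score matrix, one row per query. The NumPy sketch below shows only this generic baseline, not the conv-basis construction the paper proposes; all names and sizes in it are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: (n, d) arrays for a single head.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # (n, n) score matrix -- the O(n^2) bottleneck
    scores -= scores.max(axis=1, keepdims=True)  # shift each row for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                           # (n, d) output

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (1024, 64); the intermediate score matrix alone holds n * n = 1,048,576 entries
```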
Multi-layer transformers gradient can be approximated in almost linear time
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
HSR-enhanced sparse attention acceleration
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
applications, but their performance on long-context tasks is often limited by the …
Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …
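To make "multi-step gradient descent" concrete, here is a hedged sketch of the optimization picture only: solving an in-context least-squares problem by looping plain gradient steps. It is not the looped-transformer construction or the paper's analysis; the step size, iteration count, and problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 64, 8
X = rng.standard_normal((n, d))                  # in-context examples
w_star = rng.standard_normal(d)
y = X @ w_star + 0.01 * rng.standard_normal(n)   # noisy labels

w = np.zeros(d)
lr = n / np.linalg.norm(X, 2) ** 2               # safe step size for the least-squares objective
for step in range(100):                          # each loop iteration = one gradient-descent step
    grad = X.T @ (X @ w - y) / n                 # gradient of (1/2n) * ||Xw - y||^2
    w = w - lr * grad

print(np.linalg.norm(w - w_star))                # small after enough steps
```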
Differentially private attention computation
Large language models (LLMs) have had a profound impact on numerous aspects of daily
life including natural language processing, content generation, research methodologies and …
Advancing the understanding of fixed point iterations in deep neural networks: A detailed analytical study
Recent empirical studies have identified fixed point iteration phenomena in deep neural
networks, where the hidden state tends to stabilize after several layers, showing minimal …
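As a hedged illustration of the phenomenon being described, the sketch below repeatedly applies a single contractive layer map h ↦ tanh(Wh + b) and reports when successive hidden states stop changing. The map, its scaling, and the tolerance are assumptions made for the demo, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
W = rng.standard_normal((d, d)) * (0.3 / np.sqrt(d))  # scaled so the layer map is contractive
b = 0.1 * rng.standard_normal(d)

def layer(h):
    return np.tanh(W @ h + b)                          # the same layer applied at every depth

h = rng.standard_normal(d)
for t in range(1, 101):
    h_next = layer(h)
    delta = np.linalg.norm(h_next - h)                 # per-layer change of the hidden state
    h = h_next
    if delta < 1e-8:
        print(f"hidden state stabilized after {t} layers (update norm {delta:.1e})")
        break
```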
The computational limits of state-space models and mamba via the lens of circuit complexity
In this paper, we analyze the computational limitations of Mamba and State-space Models
(SSMs) by using the circuit complexity framework. Despite Mamba's stateful design and …
On the expressive power of modern hopfield networks
Modern Hopfield networks (MHNs) have emerged as powerful tools in deep learning,
capable of replacing components such as pooling layers, LSTMs, and attention …
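For readers unfamiliar with the retrieval rule behind MHNs, here is a hedged sketch of the standard continuous Hopfield update ξ ← X softmax(β Xᵀξ) (Ramsauer et al.), the attention-like lookup alluded to above. It illustrates the mechanism only, not this paper's expressivity results; the sizes and temperature β are arbitrary choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(3)
d, N, beta = 16, 10, 8.0
X = rng.standard_normal((d, N))                  # N stored patterns, one per column
xi = X[:, 0] + 0.3 * rng.standard_normal(d)      # noisy query near pattern 0

for _ in range(3):                               # a few updates usually suffice to converge
    xi = X @ softmax(beta * (X.T @ xi))          # attention-like retrieval step

print(int(np.argmax(X.T @ xi)))                  # expected: 0, the closest stored pattern
```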
Fast Second-order Method for Neural Networks under Small Treewidth Setting
Training neural networks is a fundamental problem in theoretical machine learning. Second-
order methods are rarely used in practice due to their high computational cost, even though they …