Tensor attention training: Provably efficient learning of higher-order transformers

Y Liang, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2405.16411, 2024 - arxiv.org
Tensor Attention, a multi-view attention that is able to capture high-order correlations among
multiple modalities, can overcome the representational limitations of classical matrix …

Conv-basis: A new paradigm for efficient attention inference and gradient computation in transformers

Y Liang, H Liu, Z Shi, Z Song, Z Xu, J Yin - arXiv preprint arXiv:2405.05219, 2024 - arxiv.org
The self-attention mechanism is the key to the success of transformers in recent Large
Language Models (LLMs). However, the quadratic computational cost $O(n^2)$ in the …
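
(Illustrative aside, not code from the cited paper: the $O(n^2)$ cost mentioned in this snippet comes from the $n \times n$ score matrix of standard softmax attention. A minimal NumPy sketch of that naive computation is below; the function name and the sizes n and d are purely illustrative assumptions.)

    import numpy as np

    def naive_softmax_attention(Q, K, V):
        # Q, K, V: arrays of shape (n, d) for a single attention head.
        # The score matrix S below has shape (n, n), so time and memory
        # scale quadratically with the sequence length n.
        d = Q.shape[1]
        S = Q @ K.T / np.sqrt(d)              # (n, n) score matrix: the quadratic bottleneck
        S = S - S.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        A = np.exp(S)
        A = A / A.sum(axis=1, keepdims=True)  # row-wise softmax
        return A @ V                          # (n, d) output

    # Illustrative sizes only: doubling n quadruples the work spent on S.
    n, d = 1024, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    out = naive_softmax_attention(Q, K, V)

(Efficient-attention approaches such as the conv-basis paradigm above aim to avoid materializing this $n \times n$ matrix.)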

Multi-layer transformers gradient can be approximated in almost linear time

Y Liang, Z Sha, Z Shi, Z Song, Y Zhou - arXiv preprint arXiv:2408.13233, 2024 - arxiv.org
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …

HSR-enhanced sparse attention acceleration

B Chen, Y Liang, Z Sha, Z Shi, Z Song - arXiv preprint arXiv:2410.10165, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
applications, but their performance on long-context tasks is often limited by the …

Bypassing the exponential dependency: Looped transformers efficiently learn in-context by multi-step gradient descent

B Chen, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11268, 2024 - arxiv.org
In-context learning has been recognized as a key factor in the success of Large Language
Models (LLMs). It refers to the model's ability to learn patterns on the fly from provided in …

Differentially private attention computation

Y Gao, Z Song, X Yang, Y Zhou - arXiv preprint arXiv:2305.04701, 2023 - arxiv.org
Large language models (LLMs) have had a profound impact on numerous aspects of daily
life including natural language processing, content generation, research methodologies and …

Advancing the understanding of fixed point iterations in deep neural networks: A detailed analytical study

Y Ke, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2410.11279, 2024 - arxiv.org
Recent empirical studies have identified fixed point iteration phenomena in deep neural
networks, where the hidden state tends to stabilize after several layers, showing minimal …

The computational limits of state-space models and Mamba via the lens of circuit complexity

Y Chen, X Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2412.06148, 2024 - arxiv.org
In this paper, we analyze the computational limitations of Mamba and State-space Models
(SSMs) by using the circuit complexity framework. Despite Mamba's stateful design and …

On the expressive power of modern Hopfield networks

X Li, Y Li, Y Liang, Z Shi, Z Song - arXiv preprint arXiv:2412.05562, 2024 - arxiv.org
Modern Hopfield networks (MHNs) have emerged as powerful tools in deep learning,
capable of replacing components such as pooling layers, LSTMs, and attention …

Fast Second-order Method for Neural Networks under Small Treewidth Setting

X Li, J Long, Z Song, T Zhou - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Training neural networks is a fundamental problem in theoretical machine learning. Second-
order methods are rarely used in practice due to their high computational cost, even though they …