Pure transformers are powerful graph learners

J Kim, D Nguyen, S Min, S Cho… - Advances in Neural …, 2022 - proceedings.neurips.cc
We show that standard Transformers without graph-specific modifications can lead to
promising results in graph learning both in theory and practice. Given a graph, we simply …

Rethinking attention with performers

K Choromanski, V Likhosherstov, D Dohan… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce Performers, Transformer architectures which can estimate regular (softmax)
full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to …

Random feature attention

H Peng, N Pappas, D Yogatama, R Schwartz… - arXiv preprint arXiv …, 2021 - arxiv.org
Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their
core is an attention function which models pairwise interactions between the inputs at every …
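
This entry and the Performers entry above both replace exact softmax attention with a random feature map, so attention can be computed in time linear rather than quadratic in sequence length. Below is a minimal NumPy sketch of the shared idea, using the positive (exponential) feature map associated with the softmax kernel; it illustrates the general technique, not the exact FAVOR+ or RFA constructions, and the feature count m is an arbitrary choice here.

```python
import numpy as np

def softmax_kernel_features(X, W):
    """Positive random features phi(x) with E[phi(q) . phi(k)] = exp(q . k)."""
    m = W.shape[0]                                   # W: (m, d), rows ~ N(0, I_d)
    proj = X @ W.T                                   # (n, m)
    sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)
    return np.exp(proj - sq_norm) / np.sqrt(m)       # (n, m), all entries positive

def linear_attention(Q, K, V, m=256, seed=0):
    """Approximate softmax(Q K^T / sqrt(d)) V in O(n * m * d) time."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    Qf = softmax_kernel_features(Q / d ** 0.25, W)   # fold the 1/sqrt(d) scaling into Q, K
    Kf = softmax_kernel_features(K / d ** 0.25, W)
    KV = Kf.T @ V                                    # (m, d_v): keys/values summarized once
    normalizer = Qf @ Kf.sum(axis=0)                 # (n,): approximates the softmax denominator
    return (Qf @ KV) / normalizer[:, None]

# Quick check against exact softmax attention on a small problem
n, d = 64, 32
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) / d ** 0.5 for _ in range(3))
scores = Q @ K.T / np.sqrt(d)
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
exact = probs @ V
approx = linear_attention(Q, K, V, m=4096)
print(np.abs(exact - approx).max())                  # shrinks as m grows
```

The key point is that the keys and values are compressed into the single (m, d_v) matrix KV once, after which every query costs O(m d) instead of O(n d).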

Monarch: Expressive structured matrices for efficient and accurate training

T Dao, B Chen, NS Sohoni, A Desai… - International …, 2022 - proceedings.mlr.press
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense …
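
The dense-replacement idea above can be illustrated with the simplest member of this family: a matrix written as block-diagonal · permutation · block-diagonal, which multiplies a length-n vector with two batched m x m matmuls (O(n^1.5) work for n = m^2) instead of one dense n x n matmul. The sketch below is a simplified illustration in that spirit, not the paper's exact Monarch parametrization.

```python
import numpy as np

def monarch_like_matvec(L_blocks, R_blocks, x):
    """Multiply x by M = L . P . R, where L and R are block-diagonal
    (m blocks of size m x m) and P is the perfect-shuffle permutation."""
    m = L_blocks.shape[0]                      # number of blocks = block size
    y = x.reshape(m, m)                        # split x into m chunks of length m
    y = np.einsum('bij,bj->bi', R_blocks, y)   # apply R block-wise
    y = y.T                                    # the stride/shuffle permutation P
    y = np.einsum('bij,bj->bi', L_blocks, y)   # apply L block-wise
    return y.reshape(-1)

# Compare against the materialized dense matrix on a tiny example
m = 4
n = m * m
rng = np.random.default_rng(0)
L_blocks = rng.standard_normal((m, m, m))
R_blocks = rng.standard_normal((m, m, m))

L = np.zeros((n, n)); R = np.zeros((n, n))
for b in range(m):
    L[b*m:(b+1)*m, b*m:(b+1)*m] = L_blocks[b]
    R[b*m:(b+1)*m, b*m:(b+1)*m] = R_blocks[b]
P = np.eye(n)[np.arange(n).reshape(m, m).T.reshape(-1)]

x = rng.standard_normal(n)
assert np.allclose(monarch_like_matvec(L_blocks, R_blocks, x), L @ P @ R @ x)
```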

Federated learning: Strategies for improving communication efficiency

J Konečný, HB McMahan, FX Yu, P Richtárik… - arXiv preprint arXiv …, 2016 - arxiv.org
Federated Learning is a machine learning setting where the goal is to train a high-quality
centralized model while training data remains distributed over a large number of clients …

Random features for kernel approximation: A survey on algorithms, theory, and beyond

F Liu, X Huang, Y Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
The class of random features is one of the most popular techniques to speed up kernel
methods in large-scale problems. Related works have been recognized by the NeurIPS Test …
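
The survey above concerns feature maps z(·) chosen so that z(x) . z(y) is an unbiased estimate of a kernel k(x, y); the classic example is random Fourier features for the Gaussian (RBF) kernel. A short NumPy sketch of that standard construction follows; the feature dimension D and bandwidth sigma are arbitrary choices made here.

```python
import numpy as np

def random_fourier_features(X, D=1000, sigma=1.0, seed=0):
    """Map rows of X to z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((D, d)) / sigma    # samples from the kernel's spectral measure
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# The Gram matrix of the features approximates the exact kernel matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
Z = random_fourier_features(X, D=20000)
approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / 2.0)
print(np.abs(approx - exact).max())            # small for large D
```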

Distributed mean estimation with limited communication

AT Suresh, FX Yu, S Kumar… - … on machine learning, 2017 - proceedings.mlr.press
Motivated by the need for distributed learning and optimization algorithms with low
communication cost, we study communication efficient algorithms for distributed mean …
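
A concrete instance of the problem in the entry above: each of n clients holds a vector, and the server wants their mean while each client transmits only a few bits per coordinate. The sketch below implements simple unbiased stochastic binary quantization (one bit per coordinate plus two floats per client); it illustrates the basic scheme only, not the rotation-based or variable-length-coded variants studied in the paper.

```python
import numpy as np

def quantize_1bit(x, rng):
    """Stochastically round each coordinate to x.min() or x.max() so that
    the quantized vector is an unbiased estimate of x."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.full_like(x, lo)
    p = (x - lo) / (hi - lo)                  # probability of rounding up
    return np.where(rng.random(x.shape) < p, hi, lo)

rng = np.random.default_rng(0)
clients = [rng.standard_normal(10_000) for _ in range(50)]

true_mean = np.mean(clients, axis=0)
quantized_mean = np.mean([quantize_1bit(x, rng) for x in clients], axis=0)

# Each quantized vector is unbiased, so the averaged estimate improves with more clients
print(np.mean((quantized_mean - true_mean) ** 2))
```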

Modeling the influence of data structure on learning in neural networks: The hidden manifold model

S Goldt, M Mézard, F Krzakala, L Zdeborová - Physical Review X, 2020 - APS
Understanding the reasons for the success of deep neural networks trained using stochastic
gradient-based methods is a key open problem for the nascent theory of deep learning. The …

Multiplicative filter networks

R Fathony, AK Sahu, D Willmott… - … Conference on Learning …, 2020 - openreview.net
Although deep networks are typically used to approximate functions over high dimensional
inputs, recent work has increased interest in neural networks as function approximators for …
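
For the entry above, the architecture can be summarized in a few lines: each layer multiplies a sinusoidal filter of the raw input elementwise with a linear transform of the previous hidden state, so nonlinearities are never composed. A minimal NumPy sketch of the Fourier-filter variant follows; layer sizes and initialization scales are arbitrary choices here, not the paper's settings.

```python
import numpy as np

def init_mfn(d_in, hidden, n_layers, d_out, seed=0, omega_scale=30.0):
    """Random parameters for a FourierNet-style multiplicative filter network."""
    rng = np.random.default_rng(seed)
    return {
        "filters": [(omega_scale * rng.standard_normal((hidden, d_in)),
                     rng.uniform(0, 2 * np.pi, hidden)) for _ in range(n_layers)],
        "linears": [(rng.standard_normal((hidden, hidden)) / np.sqrt(hidden),
                     np.zeros(hidden)) for _ in range(n_layers - 1)],
        "out": (rng.standard_normal((d_out, hidden)) / np.sqrt(hidden), np.zeros(d_out)),
    }

def mfn_forward(params, x):
    """z1 = g1(x);  z_{i+1} = g_{i+1}(x) * (W_i z_i + b_i);  y = W_out z_k + b_out.
    Each filter g_i(x) = sin(Omega_i x + phi_i) acts on the raw input x."""
    W0, phi0 = params["filters"][0]
    z = np.sin(x @ W0.T + phi0)
    for (Wf, phif), (W, b) in zip(params["filters"][1:], params["linears"]):
        z = np.sin(x @ Wf.T + phif) * (z @ W.T + b)
    Wo, bo = params["out"]
    return z @ Wo.T + bo

# Forward pass on 2-D coordinates (e.g. image pixel locations)
coords = np.random.default_rng(1).uniform(-1, 1, size=(128, 2))
params = init_mfn(d_in=2, hidden=64, n_layers=4, d_out=3)
print(mfn_forward(params, coords).shape)    # (128, 3)
```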

Memory attention networks for skeleton-based action recognition

C Li, C **e, B Zhang, J Han, X Zhen… - IEEE Transactions on …, 2021‏ - ieeexplore.ieee.org
Skeleton-based action recognition has been extensively studied, but it remains an unsolved
problem because of the complex variations of skeleton joints in 3-D spatiotemporal space …