A review of sparse expert models in deep learning

W Fedus, J Dean, B Zoph - arXiv preprint arXiv:2209.01667, 2022 - arxiv.org
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in
deep learning. This class of architecture encompasses Mixture-of-Experts, Switch …

Branch-train-merge: Embarrassingly parallel training of expert language models

M Li, S Gururangan, T Dettmers, M Lewis… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for
embarrassingly parallel training of large language models (LLMs). We show it is possible to …

Continual pre-training of large language models: How to (re) warm your model?

K Gupta, B Thérien, A Ibrahim, ML Richter… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart
the process over again once new data becomes available. A much cheaper and more …

Unified scaling laws for routed language models

A Clark, D de Las Casas, A Guy… - International …, 2022 - proceedings.mlr.press
The performance of a language model has been shown to be effectively modeled as a
power-law in its parameter count. Here we study the scaling behaviors of Routing Networks …

Dynamically expandable graph convolution for streaming recommendation

B He, X He, Y Zhang, R Tang, C Ma - … of the ACM Web Conference 2023, 2023 - dl.acm.org
Personalized recommender systems have been widely studied and deployed to reduce
information overload and satisfy users' diverse needs. However, conventional …

ProgFed: effective, communication, and computation efficient federated learning by progressive training

HP Wang, S Stich, Y He, M Fritz - … Conference on Machine …, 2022 - proceedings.mlr.press
Federated learning is a powerful distributed learning scheme that allows numerous edge
devices to collaboratively train a model without sharing their data. However, training is …

Learning equi-angular representations for online continual learning

M Seo, H Koh, W Jeung, M Lee, S Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com
Online continual learning suffers from an underfitted solution due to insufficient training for
prompt model updates (e.g., single-epoch training). To address the challenge, we propose an …

NEVIS'22: A stream of 100 tasks sampled from 30 years of computer vision research

J Bornschein, A Galashov, R Hemsley… - Journal of Machine …, 2023 - jmlr.org
A shared goal of several machine learning communities like continual learning, meta-
learning and transfer learning, is to design algorithms and models that efficiently and …

Just say the name: Online continual learning with category names only via data generation

M Seo, S Cho, M Lee, D Misra, H Choi, SJ Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Requiring extensive human supervision is often impractical for continual learning due to its
cost, leading to the emergence of 'name-only continual learning' that only provides the name …

When does re-initialization work?

S Zaidi, T Berariu, H Kim, J Bornschein… - Proceedings …, 2023 - proceedings.mlr.press
Re-initializing a neural network during training has been observed to improve generalization
in recent works. Yet it is neither widely adopted in deep learning practice nor is it often used …