MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Demystifying softmax gating function in Gaussian mixture of experts

H Nguyen, TT Nguyen, N Ho - Advances in Neural …, 2023 - proceedings.neurips.cc
Understanding the parameter estimation of softmax gating Gaussian mixture of experts has
remained a long-standing open problem in the literature. It is mainly due to three …
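
For context, the softmax-gated Gaussian mixture of experts studied in this line of work is usually written as a mixture of Gaussian regression components whose mixing weights are a softmax over affine functions of the input. The notation below is a generic illustration, not necessarily the paper's exact parameterization.

```latex
% Generic softmax-gated Gaussian mixture of experts (illustrative notation,
% not necessarily the paper's exact parameterization): the gate is a softmax
% over affine scores of x, and each expert is a Gaussian regression component.
\[
  p(y \mid x) \;=\; \sum_{i=1}^{K}
    \frac{\exp\!\left(\beta_{0i} + \beta_{1i}^{\top} x\right)}
         {\sum_{j=1}^{K} \exp\!\left(\beta_{0j} + \beta_{1j}^{\top} x\right)}\,
    \mathcal{N}\!\left(y \,\middle|\, a_i^{\top} x + b_i,\; \sigma_i^{2}\right)
\]
```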

Automatic expert selection for multi-scenario and multi-task search

X Zou, Z Hu, Y Zhao, X Ding, Z Liu, C Li… - Proceedings of the 45th …, 2022 - dl.acm.org
Multi-scenario learning (MSL) enables a service provider to cater to users' fine-grained
demands by separating services for different user sectors, e.g., by a user's geographical region …

Scaling diffusion transformers to 16 billion parameters

Z Fei, M Fan, C Yu, D Li, J Huang - arXiv preprint arXiv:2407.11633, 2024 - arxiv.org
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is
scalable and competitive with dense networks while exhibiting highly optimized inference …

Mastering stock markets with efficient mixture of diversified trading experts

S Sun, X Wang, W Xue, X Lou, B An - Proceedings of the 29th ACM …, 2023 - dl.acm.org
Quantitative stock investment is a fundamental financial task that relies heavily on accurate
prediction of market status and profitable investment decision-making. Despite recent …

On least squares estimation in softmax gating mixture of experts

H Nguyen, N Ho, A Rinaldo - arXiv preprint arXiv:2402.02952, 2024 - arxiv.org
The mixture of experts (MoE) model is a statistical machine learning design that aggregates
multiple expert networks using a softmax gating function in order to form a more intricate and …
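
As a minimal illustration of the model class described here, the sketch below (assuming PyTorch; the class and parameter names are hypothetical) evaluates every expert and combines their outputs with softmax gating weights computed from the input. It is a generic dense MoE layer, not the paper's estimator.

```python
import torch
import torch.nn as nn

# Minimal dense softmax-gated MoE layer (illustrative sketch): every expert is
# evaluated and the outputs are combined with softmax gating weights.
class SoftmaxGatedMoE(nn.Module):
    def __init__(self, d_in, d_out, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_in, num_experts)          # gating scores per expert
        self.experts = nn.ModuleList(
            nn.Linear(d_in, d_out) for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (batch, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)      # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, d_out)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # gate-weighted sum
```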

TA-MoE: Topology-aware large scale mixture-of-expert training

C Chen, M Li, Z Wu, D Yu… - Advances in Neural …, 2022 - proceedings.neurips.cc
Sparsely gated Mixture-of-Experts (MoE) has demonstrated its effectiveness in
scaling up deep neural networks to an extreme scale. Although numerous efforts have …

MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

Y **e, Z Zhang, D Zhou, C **e, Z Song, X Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption
and redundancy in experts. Pruning MoE can reduce network weights while maintaining …
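
The sketch below is a generic baseline for router-guided compression, assuming PyTorch and a hypothetical function name: experts are ranked by the average gate probability the router assigns them on calibration data, and only the most-used experts are kept. MoE-Pruner itself prunes network weights, so its actual criterion may differ from this expert-level illustration.

```python
import torch

# Hypothetical expert-level pruning guided by router statistics: rank experts
# by average routing probability on a calibration set and keep the top fraction.
# This is a common MoE compression baseline, not necessarily MoE-Pruner's rule.
def rank_experts_by_router_usage(router_logits, keep_ratio=0.5):
    # router_logits: (num_tokens, num_experts) collected on calibration data
    probs = torch.softmax(router_logits, dim=-1)
    usage = probs.mean(dim=0)                        # average gate prob per expert
    num_keep = max(1, int(keep_ratio * usage.numel()))
    keep_idx = torch.topk(usage, num_keep).indices   # experts to retain
    return keep_idx.sort().values

# Example: 1000 calibration tokens routed over 8 experts, keep the top 4.
logits = torch.randn(1000, 8)
print(rank_experts_by_router_usage(logits, keep_ratio=0.5))
```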

CompeteSMoE--Effective Training of Sparse Mixture of Experts via Competition

Q Pham, G Do, H Nguyen, TT Nguyen, C Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse mixture of experts (SMoE) offers an appealing solution for scaling up model
complexity beyond simply increasing the network's depth or width. However, effective …

COMET: Learning cardinality constrained mixture of experts with trees and local search

S Ibrahim, W Chen, H Hazimeh… - Proceedings of the 29th …, 2023 - dl.acm.org
The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity
in various domains, such as natural language processing and vision. Sparse-MoEs select a …
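
To make the selection step concrete, here is a minimal top-k routing sketch (assuming PyTorch; the class name is hypothetical): each input activates only its k highest-scoring experts, and the retained gate scores are renormalized with a softmax. This is generic Sparse-MoE routing, not COMET's cardinality-constrained tree gate.

```python
import torch
import torch.nn as nn

# Minimal top-k sparse routing sketch: each input keeps only its k highest
# gate scores, renormalized with a softmax over the selected experts.
class TopKRouter(nn.Module):
    def __init__(self, d_in, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(d_in, num_experts)
        self.k = k

    def forward(self, x):                                  # x: (batch, d_in)
        scores = self.gate(x)                              # (batch, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)  # keep the k best experts
        weights = torch.softmax(topk_vals, dim=-1)         # renormalize over the k
        return topk_idx, weights                           # which experts, how much

# In a full layer, only the experts indexed by topk_idx would be evaluated
# for each input, which is what makes the MoE sparse at inference time.
```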