MM1: methods, analysis and insights from multimodal LLM pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …
Demystifying softmax gating function in Gaussian mixture of experts
Understanding the parameter estimation of softmax gating Gaussian mixture of experts has
remained a long-standing open problem in the literature. It is mainly due to three …
Automatic expert selection for multi-scenario and multi-task search
Multi-scenario learning (MSL) enables a service provider to cater for users' fine-grained
demands by separating services for different user sectors, e.g., by user's geographical region …
Scaling diffusion transformers to 16 billion parameters
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is
scalable and competitive with dense networks while exhibiting highly optimized inference …
Mastering stock markets with efficient mixture of diversified trading experts
Quantitative stock investment is a fundamental financial task that relies heavily on accurate
prediction of market status and profitable investment decision-making. Despite recent …
On least squares estimation in softmax gating mixture of experts
The mixture of experts (MoE) model is a statistical machine learning design that aggregates
multiple expert networks using a softmax gating function in order to form a more intricate and …
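For reference, the softmax-gated MoE model this abstract describes is commonly written as a gate-weighted combination of expert outputs; the notation below (K experts h_i with gating parameters a_i, b_i) is generic and not taken from the paper:

    f(x) = \sum_{i=1}^{K} \frac{\exp(a_i^\top x + b_i)}{\sum_{j=1}^{K} \exp(a_j^\top x + b_j)} \, h_i(x),

where the softmax weights are nonnegative and sum to one, so the output is a convex combination of the individual expert networks h_1, ..., h_K.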
Ta-moe: Topology-aware large scale mixture-of-expert training
Sparsely gated Mixture-of-Experts (MoE) has demonstrated its effectiveness in
scaling up deep neural networks to an extreme scale. Although numerous efforts have …
MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption
and redundancy in experts. Pruning MoE can reduce network weights while maintaining …
CompeteSMoE--Effective Training of Sparse Mixture of Experts via Competition
Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model
complexity beyond the means of increasing the network's depth or width. However, effective …
Comet: Learning cardinality constrained mixture of experts with trees and local search
The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity
in various domains, such as natural language processing and vision. Sparse-MoEs select a …
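For context on the expert-selection step mentioned above, the sketch below illustrates generic top-k sparse gating in Python; the function name, shapes, and default k are assumptions for illustration, and it does not reproduce COMET's tree-based, local-search approach.

import numpy as np

def sparse_moe_forward(x, gate_w, experts, k=2):
    # Hypothetical illustration of top-k sparse gating; not taken from the COMET paper.
    # x: (d,) input vector; gate_w: (num_experts, d) gating weights;
    # experts: list of callables mapping (d,) -> output; k: experts activated per input.
    logits = gate_w @ x                              # one gating score per expert
    top_k = np.argsort(logits)[-k:]                  # indices of the k highest-scoring experts
    w = np.exp(logits[top_k] - logits[top_k].max())  # softmax restricted to the selected experts
    w /= w.sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, top_k))

With, for example, eight experts and k = 2, only two expert forward passes are evaluated per input, which is where the efficiency of sparse MoE layers comes from.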