LLM Inference Serving: Survey of Recent Advances and Opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv …, 2024 - arxiv.org

R Kong, Y Li, Q Feng, W Wang, L Kong… - arXiv preprint arXiv …, 2023 - arxiv.org
Mixture of experts (MoE) is a popular technique in deep learning that improves model
capacity with conditionally-activated parallel neural network modules (experts). However …
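The snippet above captures the core MoE idea: a gating network scores many parallel experts per token but activates only a few of them. The sketch below is a minimal, illustrative NumPy rendering of that conditional activation; the function names, top-2 routing choice, and toy shapes are assumptions for illustration and do not come from any of the cited papers.

    # Minimal sketch of conditionally-activated experts (top-k gated MoE).
    # All names and shapes are illustrative assumptions, not any paper's implementation.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def moe_forward(tokens, gate_w, expert_ws, k=2):
        """tokens: (n, d); gate_w: (d, E); expert_ws: list of E (d, d) matrices."""
        scores = softmax(tokens @ gate_w)            # (n, E) routing probabilities
        topk = np.argsort(-scores, axis=1)[:, :k]    # k highest-scoring experts per token
        out = np.zeros_like(tokens)
        for i, token in enumerate(tokens):
            for e in topk[i]:                        # only k of E experts are evaluated
                out[i] += scores[i, e] * np.tanh(token @ expert_ws[e])
        return out

    # Toy usage: 4 tokens, hidden size 8, 4 experts, top-2 routing.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    gate = rng.normal(size=(8, 4))
    experts = [rng.normal(size=(8, 8)) for _ in range(4)]
    print(moe_forward(x, gate, experts).shape)       # (4, 8)

Because only k of the E experts run per token, parameter count grows with E while per-token compute stays roughly constant, which is the property the serving and scheduling papers listed here exploit.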

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

J Li, S Tripathi, L Rastogi, Y Lei, R Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
As machine learning models scale in size and complexity, their computational requirements
become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by …

FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA

X Lin, H Tian, W Xue, L Ma, J Cao, M Zhang… - Proceedings of the 61st …, 2024 - dl.acm.org
The MoE (Mixture-of-Experts) mechanism has been widely adopted in transformer-based models
to facilitate further expansion of model parameter size and enhance generalization …

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

Y Wei, J Du, J Jiang, X Shi, X Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org
Recently, the sparsely-gated Mixture-Of-Experts (MoE) architecture has garnered significant
attention. To benefit a wider audience, fine-tuning MoE models on more affordable clusters …