LLM Inference Serving: Survey of Recent Advances and Opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv …, 2024 - arxiv.org

R Kong, Y Li, Q Feng, W Wang, L Kong… - arXiv preprint arXiv …, 2023 - arxiv.org
Mixture of experts (MoE) is a popular technique in deep learning that improves model
capacity with conditionally-activated parallel neural network modules (experts). However …
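The snippet above captures the core MoE idea: a gating network scores many parallel experts per token but activates only a few of them. The sketch below is a minimal, illustrative NumPy rendering of that conditional activation; the function names, top-2 routing choice, and toy shapes are assumptions for illustration and do not come from any of the cited papers.

    # Minimal sketch of conditionally-activated experts (top-k gated MoE).
    # All names and shapes are illustrative assumptions, not any paper's implementation.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def moe_forward(tokens, gate_w, expert_ws, k=2):
        """tokens: (n, d); gate_w: (d, E); expert_ws: list of E (d, d) matrices."""
        scores = softmax(tokens @ gate_w)            # (n, E) routing probabilities
        topk = np.argsort(-scores, axis=1)[:, :k]    # k highest-scoring experts per token
        out = np.zeros_like(tokens)
        for i, token in enumerate(tokens):
            for e in topk[i]:                        # only k of E experts are evaluated
                out[i] += scores[i, e] * np.tanh(token @ expert_ws[e])
        return out

    # Toy usage: 4 tokens, hidden size 8, 4 experts, top-2 routing.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    gate = rng.normal(size=(8, 4))
    experts = [rng.normal(size=(8, 8)) for _ in range(4)]
    print(moe_forward(x, gate, experts).shape)       # (4, 8)

Because only k of the E experts run per token, parameter count grows with E while per-token compute stays roughly constant, which is the property the serving and scheduling papers listed here exploit.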

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

J Li, S Tripathi, L Rastogi, Y Lei, R Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
As machine learning models scale in size and complexity, their computational requirements
become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by …

FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA

X Lin, H Tian, W Xue, L Ma, J Cao, M Zhang… - Proceedings of the 61st …, 2024 - dl.acm.org
The MoE (Mixture-of-Experts) mechanism has been widely adopted in transformer-based models
to facilitate further expansion of model parameter size and enhance generalization …

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

Y Wei, J Du, J Jiang, X Shi, X Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org
Recently, the sparsely-gated Mixture-Of-Experts (MoE) architecture has garnered significant
attention. To benefit a wider audience, fine-tuning MoE models on more affordable clusters …