The Rise of Sparse Mixture-of-Experts: A Survey from Algorithmic Foundations to Decentralized Architectures and Vertical Domain Applications

Dong Pan; Bingtao Li; Yongsheng Zheng; Jiren Ma; Victor Fei

doi:10.54097/bvpfjj49

Keywords:

Mixture-of-Experts, Decentralized Learning, LLM, Transformer

Abstract

The sparse Mixture-of-Experts (MoE) architecture has emerged as a powerful approach for scaling deep learning models to more parameters at comparable computational cost. As an important branch of large language models (LLMs), MoE models activate only a subset of experts, selected by a routing network. This sparse conditional computation mechanism significantly improves computational efficiency, paving a promising path toward greater scalability and cost-efficiency. It not only enhances downstream applications in horizontal domains such as natural language processing, computer vision, and multimodal learning, but also exhibits broad applicability across vertical domains including medical diagnosis, autonomous driving, financial analysis, and business intelligence. Despite the growing popularity and application of MoE models, a systematic exploration of recent MoE advancements is still lacking in many important fields. Existing surveys on MoE are limited by incomplete coverage or insufficiently deep exploration of key areas. This survey seeks to fill these gaps. First, we examine the foundational principles of MoE, with an in-depth exploration of its two core components: the routing network and the expert network. Subsequently, we extend beyond the centralized paradigm to the decentralized paradigm, which unlocks the untapped potential of decentralized infrastructure, enables democratization of MoE development for broader communities, and delivers greater scalability and cost-efficiency. Furthermore, we explore vertical domain applications of MoE. Finally, we identify key challenges and promising future research directions. To the best of our knowledge, this survey is currently the most comprehensive review in the field of MoE. We aim for this article to serve as a valuable resource for both researchers and practitioners, enabling them to navigate and stay up to date with the latest advancements.
