Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems. X Miao, G Oliaro, Z Zhang, X Cheng, H Jin, T Chen, Z Jia. arXiv preprint arXiv:2312.15234, 2023. Cited by 70.
TensorIR: An Abstraction for Automatic Tensorized Program Optimization. S Feng, B Hou, H Jin, W Lin, J Shao, R Lai, Z Ye, L Zheng, CH Yu, Y Yu, ... Proceedings of the 28th ACM International Conference on Architectural …, 2023. Cited by 70.
Tensor Program Optimization with Probabilistic Programs. J Shao, X Zhou, S Feng, B Hou, R Lai, H Jin, W Lin, M Masuda, CH Yu, ... Advances in Neural Information Processing Systems 35, 35783-35796, 2022. Cited by 28.
Accelerating Self-Attentions for LLM Serving with FlashInfer. Z Ye, L Chen, R Lai, Y Zhao, S Zheng, J Shao, B Hou, H Jin, Y Zuo, L Yin, ... URL https://flashinfer.ai/2024/02/02/introduce-flashinfer.html, 2024. Cited by 11.
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. R Lai, J Shao, S Feng, SS Lyubomirsky, B Hou, W Lin, Z Ye, H Jin, Y Jin, ... arXiv preprint arXiv:2311.02103, 2023. Cited by 9.
WebLLM: A High-Performance In-Browser LLM Inference Engine. CF Ruan, Y Qin, X Zhou, R Lai, H Jin, Y Dong, B Hou, MS Yu, Y Zhai, ... arXiv preprint arXiv:2412.15803, 2024.
A System for Microserving of LLMs. H Jin, R Lai, CF Ruan, Y Wang, TC Mowry, X Miao, Z Jia, T Chen. arXiv preprint arXiv:2412.12488, 2024.