PET: Optimizing tensor programs with partially equivalent transformations and automated corrections
H Wang, J Zhai, M Gao, Z Ma, S Tang, L Zheng, Y Li, K Rong, Y Chen, ...
15th USENIX Symposium on Operating Systems Design and Implementation (OSDI …, 2021
Cited by 75, 2021

FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
J He, J Zhai, T Antunes, H Wang, F Luo, S Shi, Q Li
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of …, 2022
Cited by 67, 2022

BaGuaLu: targeting brain scale pretrained models with over 37 million cores
Z Ma, J He, J Qiu, H Cao, Y Wang, Z Sun, L Zheng, H Wang, S Tang, ...
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of …, 2022
Cited by 59, 2022

HyQuas: hybrid partitioner based quantum circuit simulation system on GPU
C Zhang, Z Song, H Wang, K Rong, J Zhai
Proceedings of the 35th ACM International Conference on Supercomputing, 443-454, 2021
Cited by 28, 2021

Spindle: Informed memory access monitoring
H Wang, J Zhai, X Tang, B Yu, X Ma, W Chen
2018 USENIX Annual Technical Conference (USENIX ATC 18), 561-574, 2018
Cited by 26, 2018

Scaling graph traversal to 281 trillion edges with 40 million cores
H Cao, Y Wang, H Wang, H Lin, Z Ma, W Yin, W Chen
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of …, 2022
Cited by 23, 2022

Spread-n-share: improving application performance and cluster throughput with resource-aware job placement
X Tang, H Wang, X Ma, N El-Sayed, J Zhai, W Chen, A Aboulnaga
Proceedings of the International Conference for High Performance Computing …, 2019
Cited by 18, 2019

FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs
S Tang, J Zhai, H Wang, L Jiang, L Zheng, Z Yuan, C Zhang
Proceedings of the 43rd ACM SIGPLAN International Conference on Programming …, 2022
Cited by 14, 2022

: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs
J Huang, H Wang, X Fei, X Wang, W Chen
IEEE Transactions on Parallel and Distributed Systems 33 (11), 3067-3078, 2021
Cited by 13, 2021

UniQ: A unified programming model for efficient quantum circuit simulation
C Zhang, H Wang, Z Ma, L Xie, Z Song, J Zhai
SC22: International Conference for High Performance Computing, Networking …, 2022
Cited by 12, 2022

ScalAna: Automating scaling loss detection with graph analysis
Y Jin, H Wang, T Yu, X Tang, T Hoefler, X Liu, J Zhai
SC20: International Conference for High Performance Computing, Networking …, 2020
Cited by 12, 2020

PerFlow: A domain specific framework for automatic performance analysis of parallel applications
Y Jin, H Wang, R Zhong, C Zhang, J Zhai
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of …, 2022
Cited by 10, 2022

Vapro: Performance variance detection and diagnosis for production-run parallel applications
L Zheng, J Zhai, X Tang, H Wang, T Yu, Y Jin, SL Song, W Chen
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of …, 2022
Cited by 9, 2022

Efficiently emulating high-bitwidth computation with low-bitwidth hardware
Z Ma, H Wang, G Feng, C Zhang, L Xie, J He, S Chen, J Zhai
Proceedings of the 36th ACM International Conference on Supercomputing, 1-12, 2022
Cited by 6, 2022

LotusSQL: SQL engine for high-performance big data systems
X Li, B Yu, G Feng, H Wang, W Chen
Big Data Mining and Analytics 4 (4), 252-265, 2021
Cited by 6, 2021

Identifying scalability bottlenecks for large-scale parallel programs with graph analysis
Y Jin, H Wang, X Tang, T Hoefler, X Liu, J Zhai
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of …, 2020
Cited by 4, 2020

An Efficient Sparse CNNs Accelerator on FPGA
Y Zhang, H Jiang, X Li, H Wang, D Dong, Y Cao
2022 IEEE International Conference on Cluster Computing (CLUSTER), 504-505, 2022
Cited by 2, 2022

OLLIE: Derivation-based tensor program optimizer
L Zheng, H Wang, J Zhai, M Hu, Z Ma, T Wang, S Tang, L Xie, K Huang, ...
arXiv preprint arXiv:2208.02025, 2022
Cited by 2, 2022

Detecting performance variance for parallel applications without source code
J Zhai, L Zheng, F Zhang, X Tang, H Wang, T Yu, Y Jin, SL Song, W Chen
IEEE Transactions on Parallel and Distributed Systems 33 (12), 4239-4255, 2022
Cited by 2, 2022

Sparker: Efficient reduction for more scalable machine learning with spark
B Yu, H Cao, T Shan, H Wang, X Tang, W Chen
Proceedings of the 50th International Conference on Parallel Processing, 1-11, 2021
Cited by 2, 2021