Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

A Li, SL Song, J Chen, J Li, X Liu… - … on Parallel and …, 2019 - ieeexplore.ieee.org
High performance multi-GPU computing becomes an inevitable trend due to the ever-
increasing demand on computation capability in emerging domains such as deep learning …

SV-Sim: scalable PGAS-based state vector simulation of quantum circuits

A Li, B Fang, C Granade, G Prawiroatmodjo… - Proceedings of the …, 2021 - dl.acm.org
High-performance quantum circuit simulation in a classic HPC is still imperative in the NISQ
era. Observing that the major obstacle of scalable state-vector quantum simulation arises …

APNN-TC: Accelerating arbitrary precision neural networks on Ampere GPU tensor cores

B Feng, Y Wang, T Geng, A Li, Y Ding - Proceedings of the international …, 2021 - dl.acm.org
Over the years, accelerating neural networks with quantization has been widely studied.
Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are …

Density matrix quantum circuit simulation via the BSP machine on modern GPU clusters

A Li, O Subasi, X Yang… - … conference for high …, 2020 - ieeexplore.ieee.org
As quantum computers evolve, simulations of quantum programs on classical computers will
be essential in validating quantum algorithms, understanding the effect of system noise, and …

Tartan: evaluating modern GPU interconnect via a multi-GPU benchmark suite

A Li, SL Song, J Chen, X Liu, N Tallent… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
High performance multi-GPU computing becomes an inevitable trend due to the ever-
increasing demand on computation capability in emerging domains such as deep learning …

Register optimizations for stencils on GPUs

PS Rawat, F Rastello, A Sukumaran-Rajam… - Proceedings of the 23rd …, 2018 - dl.acm.org
The recent advent of compute-intensive GPU architecture has allowed application
developers to explore high-order 3D stencils for better computational accuracy. A common …

Accelerating binarized neural networks via bit-tensor-cores in Turing GPUs

A Li, S Su - IEEE Transactions on Parallel and Distributed …, 2020 - ieeexplore.ieee.org
Despite foreseeing tremendous speedups over conventional deep neural networks, the
performance advantage of binarized neural networks (BNNs) has merely been showcased …

BSTC: A novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets

A Li, T Geng, T Wang, M Herbordt, SL Song… - Proceedings of the …, 2019 - dl.acm.org
Binarized neural networks (or BNNs) promise tremendous performance improvement over
traditional DNNs through simplified bit-level computation and significantly reduced memory …

MAPA: Multi-accelerator pattern allocation policy for multi-tenant GPU servers

K Ranganath, JD Suetterlein, JB Manzano… - Proceedings of the …, 2021 - dl.acm.org
Multi-accelerator servers are increasingly being deployed in shared multi-tenant
environments (such as in cloud data centers) in order to meet the demands of large-scale …

Adaptive auto-tuning framework for global exploration of stencil optimization on GPUs

Q Sun, Y Liu, H Yang, Z Jiang, Z Luan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Stencil computations are widely used in high performance computing (HPC) applications.
Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil …