CCAMP: An integrated translation and optimization framework for OpenACC and OpenMP

J Lambert, S Lee, JS Vetter… - … Conference for High …, 2020 - ieeexplore.ieee.org
Heterogeneous computing and exploration into specialized accelerators are inevitable in
current and future supercomputers. Although this diversity of devices is promising for …

ESA: An efficient sequence alignment algorithm for biological database search on Sunway TaihuLight

H Zhang, Z Huang, Y Chen, J Liang, X Gao - Parallel computing, 2023 - Elsevier
In computational biology, biological database search has been playing a very important role.
Since the COVID-19 outbreak, it has provided significant help in identifying common …

CUBE–Towards an Optimal Scaling of Cosmological N-body Simulations

S Cheng, HR Yu, D Inman, Q Liao… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
N-body simulations are essential tools in physical cosmology to understand the large-scale
structure (LSS) formation of the universe. Large-scale simulations with high resolution are …

An empirical study of HPC workloads on Huawei Kunpeng 916 processor

YC Wang, JK Chen, BR Li, SC Zuo… - 2019 IEEE 25th …, 2019 - ieeexplore.ieee.org
The ARM-based server processors have been gaining momentum in high performance
computing (HPC). While not designed specifically for HPC, Huawei Kunpeng 916 processor …

Implementation and performance of Barnes-Hut N-body algorithm on extreme-scale heterogeneous many-core architectures

M Iwasawa, D Namekata, R Sakamoto… - … Journal of High …, 2020 - journals.sagepub.com
In this paper, we report the implementation and measured performance of our extreme-scale
whole planetary ring simulation code on Sunway TaihuLight and two PEZY-SC2 systems …

OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight

J Liang, R Hua, W Zhu, Y Ye, Y Fu, H Zhang - Parallel Computing, 2022 - Elsevier
The Silicon-Crystal application based on molecular dynamics (MD) is used to
simulate the thermal conductivity of the crystal, which adopts the Tersoff potential to simulate …

The 16,384-node parallelism of 3D-CNN training on an Arm CPU-based supercomputer

A Tabuchi, K Shirahata, M Yamazaki… - 2021 IEEE 28th …, 2021 - ieeexplore.ieee.org
As the computational cost and datasets available for deep neural network training continue
to increase, there is a significant demand for fast distributed training on supercomputers …

NeoMPX: characterizing and improving estimation of multiplexing hardware counters for PAPI

YC Wang, J Wang, JK Chen, SC Zuo… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org
Modern processors provide hundreds of low-level hardware events (such as cache miss
rate), but offer only a small number (usually 6–12) of hardware counters to collect these …

Accelerating Science with Directive-Based Programming on Heterogeneous Machines and Future Technologies

JB Lambert - 2021 - search.proquest.com
Accelerator-based heterogeneous computing has become the de facto standard in
contemporary high-performance machines, including upcoming exascale machines. These …