DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …

Hierarchical roofline analysis: How to collect data using performance tools on Intel CPUs and NVIDIA GPUs

C Yang - arXiv preprint arXiv:2009.02449, 2020 - arxiv.org
This paper surveys a range of methods to collect necessary performance data on Intel CPUs
and NVIDIA GPUs for hierarchical Roofline analysis. As of mid-2020, two vendor …

[BOOK][B] An instruction roofline model for GPUs

N Ding, S Williams - 2019 - ieeexplore.ieee.org
The Roofline performance model provides an intuitive approach to identify performance
bottlenecks and guide performance optimization. However, the classic FLOP-centric …

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

C Yang, T Kurth, S Williams - Concurrency and Computation …, 2020 - Wiley Online Library
The Roofline performance model provides an intuitive and insightful approach to identifying
performance bottlenecks and guiding performance optimization. In preparation for the next …

A comprehensive methodology to optimize FPGA designs via the roofline model

M Siracusa, E Del Sozzo, M Rabozzi… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
With reconfigurable fabrics delivering increasing performance over the years, Field-
Programmable Gate Arrays (FPGAs) are becoming an appealing solution for next …

Capability models for manycore memory systems: A case-study with Xeon Phi KNL

S Ramos, T Hoefler - 2017 IEEE International Parallel and …, 2017 - ieeexplore.ieee.org
Increasingly complex memory systems and on-chip interconnects are developed to mitigate
the data movement bottlenecks in manycore processors. One example of such a complex …

High-performance matrix-matrix multiplications of very small matrices

I Masliah, A Abdelfattah, A Haidar, S Tomov… - Euro-Par 2016: Parallel …, 2016 - Springer
The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for
obtaining high performance in many scientific computing applications. GEMMs for small …

An empirical roofline methodology for quantitatively assessing performance portability

C Yang, R Gayatri, T Kurth, P Basu… - 2018 IEEE/ACM …, 2018 - ieeexplore.ieee.org
System and node architectures continue to diversify to better balance on-node computation,
memory capacity, memory bandwidth, interconnect bandwidth, power, and cost for specific …

Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels

A Li, W Liu, MRB Kristensen, B Vinter, H Wang… - Proceedings of the …, 2017 - dl.acm.org
High-bandwidth On-Package Memory (OPM) innovates the conventional memory hierarchy
by augmenting a new on-package layer between classic on-chip cache and off-chip DRAM …

GIRAF: General purpose in-storage resistive associative framework

L Yavits, R Kaplan, R Ginosar - IEEE Transactions on Parallel …, 2021 - ieeexplore.ieee.org
GIRAF is a General purpose In-storage Resistive Associative Framework based on resistive
content addressable memory (RCAM), which functions simultaneously as a storage and a …