Tiny but mighty: designing and realizing scalable latency tolerance for manycore SoCs

M Orenes-Vera, A Manocha, J Balkind, F Gao… - Proceedings of the 49th …, 2022 - dl.acm.org
Modern computing systems employ significant heterogeneity and specialization to meet
performance targets at manageable power. However, memory latency bottlenecks remain …

Decoupled vector runahead

A Naithani, J Roelandts, S Ainsworth… - Proceedings of the 56th …, 2023 - dl.acm.org
We present Decoupled Vector Runahead (DVR), an in-core prefetching technique,
executing separately to the main application thread, that exploits massive amounts of …

Precise runahead execution

A Naithani, J Feliu, A Adileh… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
Runahead execution improves processor performance by accurately prefetching long-
latency memory accesses. When a long-latency load causes the instruction window to fill up …

Performance and power analysis of HPC workloads on heterogeneous multi-node clusters

F Mantovani, E Calore - Journal of Low Power Electronics and …, 2018 - mdpi.com
Performance analysis tools allow application developers to identify and characterize the
inefficiencies that cause performance degradation in their codes, allowing for application …

Phloem: Automatic acceleration of irregular applications with fine-grain pipeline parallelism

QM Nguyen, D Sanchez - 2023 IEEE International Symposium …, 2023 - ieeexplore.ieee.org
Irregular applications are increasingly common in diverse domains, like graph analytics and
sparse linear algebra. Accelerating these applications is challenging because of their …

Vector runahead

A Naithani, S Ainsworth, TM Jones… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
The memory wall places a significant limit on performance for many modern workloads.
These applications feature complex chains of dependent, indirect memory accesses, which …

NOELLE Offers Empowering LLVM Extensions

A Matni, EA Deiana, Y Su, L Gross… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
Modern and emerging architectures demand increasingly complex compiler analyses and
transformations. As the emphasis on compiler infrastructure moves beyond support for …

The forward slice core microarchitecture

K Lakshminarasimhan, A Naithani, J Feliu… - Proceedings of the …, 2020 - dl.acm.org
Superscalar out-of-order cores deliver high performance at the cost of increased complexity
and power budget. In-order cores, in contrast, are less complex and have a smaller power …

HePREM: Enabling predictable GPU execution on heterogeneous SoC

B Forsberg, L Benini, A Marongiu - 2018 Design, Automation & …, 2018 - ieeexplore.ieee.org
Heterogeneous systems-on-a-chip are increasingly embracing shared memory designs, in
which a single DRAM is used for both the main CPU and an integrated GPU. This …

Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access

L Wang, X Zhang, S Wang, Z Jiang, T Lu… - ACM Transactions on …, 2024 - dl.acm.org
The growing memory demands of modern applications have driven the adoption of far
memory technologies in data centers to provide cost-effective, high-capacity memory …