A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications

L Liu, J Zhu, Z Li, Y Lu, Y Deng, J Han, S Yin… - ACM Computing …, 2019 - dl.acm.org
As general-purpose processors have hit the power wall and chip fabrication cost escalates
alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing …

Simba: Scaling deep-learning inference with multi-chip-module-based architecture

YS Shao, J Clemons, R Venkatesan, B Zimmer… - Proceedings of the …, 2019 - dl.acm.org
Package-level integration using multi-chip-modules (MCMs) is a promising approach for
building large-scale systems. Compared to a large monolithic die, an MCM combines many …

DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …

GraphP: Reducing communication for PIM-based graph processing with efficient data partition

M Zhang, Y Zhuo, C Wang, M Gao, Y Wu… - … Symposium on High …, 2018 - ieeexplore.ieee.org
Processing-In-Memory (PIM) is an effective technique that reduces data movements by
integrating processing units within memory. The recent advance of “big data” and 3D …

Tangram: Optimized coarse-grained dataflow for scalable NN accelerators

M Gao, X Yang, J Pu, M Horowitz… - Proceedings of the Twenty …, 2019 - dl.acm.org
The use of increasingly larger and more complex neural networks (NNs) makes it critical to
scale the capabilities and efficiency of NN accelerators. Tiled architectures provide an …

GraphQ: Scalable PIM-based graph processing

Y Zhuo, C Wang, M Zhang, R Wang, D Niu… - Proceedings of the …, 2019 - dl.acm.org
Processing-In-Memory (PIM) architectures based on recent technology advances (e.g.,
Hybrid Memory Cube) demonstrate great potential for graph processing. However, existing …

NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling

G Singh, D Diamantopoulos… - … Conference on Field …, 2020 - ieeexplore.ieee.org
Ongoing climate change calls for fast and accurate weather and climate modeling. However,
when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU …

KPart: A hybrid cache partitioning-sharing technique for commodity multicores

N El-Sayed, A Mukkara, PA Tsai… - … Symposium on High …, 2018 - ieeexplore.ieee.org
Cache partitioning is now available in commercial hardware. In theory, software can
leverage cache partitioning to use the last-level cache better and improve performance. In …

Mira: A program-behavior-guided far memory system

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

VeloC: Towards high performance adaptive asynchronous checkpointing at large scale

B Nicolae, A Moody, E Gonsiorowski… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern
of many HPC applications. However, given the limited I/O throughput of external storage …