A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications
As general-purpose processors have hit the power wall and chip fabrication cost escalates
alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing …
Simba: Scaling deep-learning inference with multi-chip-module-based architecture
Package-level integration using multi-chip-modules (MCMs) is a promising approach for
building large-scale systems. Compared to a large monolithic die, an MCM combines many …
DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …
GraphP: Reducing communication for PIM-based graph processing with efficient data partition
Processing-In-Memory (PIM) is an effective technique that reduces data movements by
integrating processing units within memory. The recent advance of “big data” and 3D …
Tangram: Optimized coarse-grained dataflow for scalable NN accelerators
The use of increasingly larger and more complex neural networks (NNs) makes it critical to
scale the capabilities and efficiency of NN accelerators. Tiled architectures provide an …
GraphQ: Scalable PIM-based graph processing
Processing-In-Memory (PIM) architectures based on recent technology advances (e.g.,
Hybrid Memory Cube) demonstrate great potential for graph processing. However, existing …
NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling
Ongoing climate change calls for fast and accurate weather and climate modeling. However,
when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU …
KPart: A hybrid cache partitioning-sharing technique for commodity multicores
Cache partitioning is now available in commercial hardware. In theory, software can
leverage cache partitioning to use the last-level cache better and improve performance. In …
Mira: A program-behavior-guided far memory system
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …
VeloC: Towards high performance adaptive asynchronous checkpointing at large scale
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern
of many HPC applications. However, given the limited I/O throughput of external storage …