TOP-PIM: Throughput-oriented programmable processing in memory

D Zhang, N Jayasena, A Lyashevsky… - Proceedings of the 23rd …, 2014 - dl.acm.org
As computation becomes increasingly limited by data movement and energy consumption,
exploiting locality throughout the memory hierarchy becomes critical to continued …

Locality exists in graph processing: Workload characterization on an Ivy Bridge server

S Beamer, K Asanovic… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
Graph processing is an increasingly important application domain and is typically
communication-bound. In this work, we analyze the performance characteristics of three …

Modular routing design for chiplet-based systems

J Yin, Z Lin, O Kayiran, M Poremba… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
System-on-Chip (SoC) complexity and the increasing costs of silicon motivate the breaking
of an SoC into smaller "chiplets." A chiplet-based SoC design process has the promise to …

Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach

M Yan, X Hu, S Li, A Basak, H Li, X Ma… - Proceedings of the …, 2019 - dl.acm.org
Graph analytics is an emerging application which extracts insights by processing large
volumes of highly connected data, namely graphs. The parallel processing of graphs has …

A compiler for throughput optimization of graph algorithms on GPUs

S Pai, K Pingali - Proceedings of the 2016 ACM SIGPLAN International …, 2016 - dl.acm.org
Writing high-performance GPU implementations of graph algorithms can be challenging. In
this paper, we argue that three optimizations called throughput optimizations are key to high …

Crono: A benchmark suite for multithreaded graph algorithms executing on futuristic multicores

M Ahmad, F Hijaz, Q Shi, O Khan - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
Algorithms operating on a graph setting are known to be highly irregular and unstructured.
This leads to workload imbalance and data locality challenge when these algorithms are …

Bandwidth-effective DRAM cache for GPUs with storage-class memory

J Hong, S Cho, G Park, W Yang… - … Symposium on High …, 2024 - ieeexplore.ieee.org
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-
Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity …

Graph processing on GPUs: Where are the bottlenecks?

Q Xu, H Jeon, M Annavaram - 2014 IEEE International …, 2014 - ieeexplore.ieee.org
Large graph processing is now a critical component of many data analytics. Graph
processing is used from social networking Web sites that provide context-aware services …

Adaptive page migration for irregular data-intensive applications under GPU memory oversubscription

D Ganguly, Z Zhang, J Yang… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
Unified Memory in heterogeneous systems serves a wide range of applications. However,
limited capacity of the device memory becomes a first order performance bottleneck for data …

Not all GPUs are created equal: characterizing variability in large-scale, accelerator-rich systems

P Sinha, A Guliani, R Jain, B Tran… - … Conference for High …, 2022 - ieeexplore.ieee.org
Scientists are increasingly exploring and utilizing the massive parallelism of general-
purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters …