Snake: A variable-length chain-based prefetching for GPUs

S Mostofi, H Falahati, N Mahani… - Proceedings of the 56th …, 2023 - dl.acm.org
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …

Agents of Autonomy: A Systematic Study of Robotics on Modern Hardware

M Bakhshalipour, PB Gibbons - … of the ACM on Measurement and …, 2023 - dl.acm.org
As robots increasingly permeate modern society, it is crucial for the system and hardware
research community to bridge its long-standing gap with robotics. This divide has persisted …

Morpheus: Extending the last level cache capacity in GPU systems using idle GPU core resources

S Darabi, M Sadrosadati, N Akbarzadeh… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel
applications. In many GPU applications, GPU memory bandwidth bottlenecks performance …

Cross-core Data Sharing for Energy-efficient GPUs

H Falahati, M Sadrosadati, Q Xu… - ACM Transactions on …, 2024 - dl.acm.org
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …

Tartan: Microarchitecting a Robotic Processor

M Bakhshalipour, PB Gibbons - 2024 ACM/IEEE 51st Annual …, 2024 - ieeexplore.ieee.org
This paper presents Tartan, a CPU architecture designed for a wide range of robotic
applications. Tartan provides architectural support for common robotic kernels, ensuring its …

Slightly Off-Axis Digital Holography Using a Transmission Grating and GPU-Accelerated Parallel Phase Reconstruction

H Bai, J Chen, L Sun, L Li, J Zhang - Photonics, 2023 - mdpi.com
Slightly off-axis digital holography is proposed using transmission grating to obtain
quantitative phase distribution. The experimental device is based on an improved 4f optical …

SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism

T Guo, X Huang, K Wu, X Zhang, N Xiao - … of the 61st ACM/IEEE Design …, 2024 - dl.acm.org
While designed for massive parallelism, GPUs are frequently suffering from low thread
occupancy and limited data throughput, which are typically attributed to constrained on-chip …

Systematic Review of Accelerating Time-Series Biosignal Machine Learning Processes Using GPU Architectures

E Ketola, M Imtiaz - 2024 - preprints.org
Background: Time-series biosignal data, representative of a physiological process, is often
applied to time-sensitive machine learning applications that benefit from acceleration …

A Bandwidth-Adaptive On-Chip Storage Network Architecture

K Wang, N Yu, D Tian, L Yang… - 2024 9th International …, 2024 - ieeexplore.ieee.org
A scalable bandwidth-adaptive on-chip storage network architecture is proposed to address
the severe data conflict and low bus parallelism in existing multi-level storage, Crossbar …

Pomelo: Alternative mechanism of threads communication for accelerating convolution on SIMT based processor

Z Feng, L Yang, Y Zhang - 2024 9th International Conference …, 2024 - ieeexplore.ieee.org
Single Instruction Multiple Thread (SIMT) based processor and parallel model are effective
ways to solve computation problems exist in big data era. Commonly, work load is organized …