Snake: A variable-length chain-based prefetching for gpus
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …
Agents of Autonomy: A Systematic Study of Robotics on Modern Hardware
As robots increasingly permeate modern society, it is crucial for the system and hardware
research community to bridge its long-standing gap with robotics. This divide has persisted …
Morpheus: Extending the last level cache capacity in GPU systems using idle GPU core resources
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel
applications. In many GPU applications, GPU memory bandwidth bottlenecks performance …
Cross-core Data Sharing for Energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …
Tartan: Microarchitecting a Robotic Processor
This paper presents Tartan, a CPU architecture designed for a wide range of robotic
applications. Tartan provides architectural support for common robotic kernels, ensuring its …
Slightly Off-Axis Digital Holography Using a Transmission Grating and GPU-Accelerated Parallel Phase Reconstruction
H Bai, J Chen, L Sun, L Li, J Zhang - Photonics, 2023 - mdpi.com
Slightly off-axis digital holography is proposed using transmission grating to obtain
quantitative phase distribution. The experimental device is based on an improved 4f optical …
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism
While designed for massive parallelism, GPUs are frequently suffering from low thread
occupancy and limited data throughput, which are typically attributed to constrained on-chip …
Systematic Review of Accelerating Time-Series Biosignal Machine Learning Processes Using GPU Architectures
E Ketola, M Imtiaz - 2024 - preprints.org
Background: Time-series biosignal data, representative of a physiological process, is often
applied to time-sensitive machine learning applications that benefit from acceleration …
A Bandwidth-Adaptive On-Chip Storage Network Architecture
K Wang, N Yu, D Tian, L Yang… - 2024 9th International …, 2024 - ieeexplore.ieee.org
A scalable bandwidth-adaptive on-chip storage network architecture is proposed to address
the severe data conflict and low bus parallelism in existing multi-level storage, Crossbar …
Pomelo: Alternative mechanism of threads communication for accelerating convolution on SIMT based processor
Z Feng, L Yang, Y Zhang - 2024 9th International Conference …, 2024 - ieeexplore.ieee.org
Single Instruction Multiple Thread (SIMT) based processor and parallel model are effective
ways to solve computation problems exist in big data era. Commonly, work load is organized …