Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers

Q Chen, H Yang, M Guo, RS Kannan, J Mars… - Proceedings of the …, 2017 - dl.acm.org
Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving
server utilization through application co-location is important yet challenging in modern …

The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

CODA: Enabling co-location of computation and data for multiple GPU systems

H Kim, R Hadidi, L Nai, H Kim, N Jayasena… - ACM Transactions on …, 2018 - dl.acm.org
To exploit parallelism and scalability of multiple GPUs in a system, it is critical to place
compute and data together. However, two key techniques that have been used to hide …

CUDASTF: Bridging the Gap Between CUDA and Task Parallelism

C Augonnet, A Alexandrescu… - … Conference for High …, 2024 - ieeexplore.ieee.org
Organizing computation as asynchronous tasks with data-driven dependencies is a simple
and efficient model for single- and multi-GPU programs. Sequential Task Flow (STF) is such …

Converting data-parallelism to task-parallelism by rewrites: purely functional programs across multiple GPUs

BJ Svensson, M Vollmer, E Holk, TL McDonell… - Proceedings of the 4th …, 2015 - dl.acm.org
High-level domain-specific languages for array processing on the GPU are increasingly
common, but they typically only run on a single GPU. As computational power is distributed …

HOMP: Automated distribution of parallel loops and data in highly parallel accelerator-based systems

Y Yan, J Liu, KW Cameron… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Heterogeneous computing systems, e.g., those with accelerators other than the host CPUs, offer
accelerated performance for a variety of workloads. However, most parallel …

Dynamic Task Scheduling Scheme for a GPGPU Programming Framework

K Ohno, R Yamamoto - 2015 Third International Symposium on …, 2015 - ieeexplore.ieee.org
The computational power and the physical memory size of a single GPU device are often
insufficient for large-scale problems. Using CUDA, the user must explicitly partition such …

Dynamic task scheduling scheme for a GPGPU programming framework

K Ohno, R Yamamoto, H Tanaka - International Journal of …, 2016 - jstage.jst.go.jp
The computational power and the physical memory size of a single GPU device are often
insufficient for large-scale problems. Using CUDA, the user must explicitly partition such …

Enhancing Programmability, Portability, and Performance with Rich Cross-layer Abstractions

N Vijaykumar - 2019 - search.proquest.com
Programmability, performance portability, and resource efficiency have emerged as critical
challenges in harnessing complex and diverse architectures today to obtain high …

3-D Viewer for interpretation of multiple scan sections

B Baxter - Proceedings of the May 19-22, 1980, national …, 1980 - dl.acm.org
A new viewing device is being constructed which will allow a physician to examine multiple
scan sections simultaneously in their proper orientation in all three dimensions. Test images …