An optimizing pipeline stall reduction algorithm for power and performance on multi-core CPUs

V Saravanan, KD Pralhaddas, DP Kothari… - … -centric Computing and …, 2015 - Springer
The power-performance trade-off is one of the major considerations in micro-architecture
design. Pipelined architecture has brought a radical change in the design to capitalize on …

Efficient warp execution in presence of divergence with collaborative context collection

F Khorasani, R Gupta, LN Bhuyan - Proceedings of the 48th International …, 2015 - dl.acm.org
GPU's SIMD architecture is a double-edged sword confronting parallel tasks with control
flow divergence. On the one hand, it provides a high performance yet power-efficient …

Generative data models for validation and evaluation of visualization techniques

C Schulz, A Nocaj, M El-Assady, S Frey… - Proceedings of the …, 2016 - dl.acm.org
We argue that there is a need for substantially more research on the use of generative data
models in the validation and evaluation of visualization techniques. For example, user …

GPU subwarp interleaving

S Damani, M Stephenson, R Rangan… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Raytracing applications have naturally high thread divergence, low warp occupancy and are
limited by memory latency. In this paper, we present an architectural enhancement called …

Speculative reconvergence for improved SIMT efficiency

S Damani, DR Johnson, M Stephenson… - Proceedings of the 18th …, 2020 - dl.acm.org
GPUs perform most efficiently when all threads in a warp execute the same sequence of
instructions convergently. However, when threads in a warp encounter a divergent branch …

Device and method for scheduling multiple thread groups on SIMD lanes upon divergence in a single thread group

SH ** - US Patent 10,831,490, 2020 - Google Patents
Provided are an apparatus and a method for effectively managing threads diverged by a
conditional branch based on Single Instruction Multiple-based Data (SIMD). The apparatus …

An efficient algorithm for the calculation of sub-grid distances for higher-order LBM boundary conditions in a GPU simulation environment

D Mierke, CF Janßen, T Rung - Computers & Mathematics with Applications, 2020 - Elsevier
This paper presents a new and efficient algorithm for the calculation of sub-grid distances in
the context of a lattice Boltzmann method (LBM). LBMs usually operate on equidistant …

Eliminating intra-warp load imbalance in irregular nested patterns via collaborative task engagement

F Khorasani, B Rowe, R Gupta… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Nested patterns are one of the most frequently occurring algorithmic themes in GPU
applications where coarse-grained tasks are constituted from a number of fine-grained ones …

System, method, and computer program product for managing divergences and synchronization points during thread block execution by using a double sided queue …

O Giroux, GF Diamos - US Patent 9,459,876, 2016 - Google Patents
Threads (i.e., an abstract construct of an instance of a program executing on
a processor) have a basic guarantee of forward progress. In other words, if one thread …

CUIRRE: An open-source library for load balancing and characterizing irregular applications on GPUs

T Zhang, W Shu, MY Wu - Journal of Parallel and Distributed Computing, 2014 - Elsevier
While Graphics Processing Units (GPUs) show high performance for problems with
regular structures, they do not perform well for irregular tasks due to the mismatches …