Unleashing fine-grained parallelism on embedded many-core accelerators with lightweight OpenMP tasking

G Tagliavini, D Cesarini… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
In recent years, programmable many-core accelerators (PMCAs) have been introduced in
embedded systems to satisfy stringent performance/Watt requirements. This has increased …

Comparison of threading programming models

S Salehian, J Liu, Y Yan - 2017 IEEE International Parallel and …, 2017 - ieeexplore.ieee.org
In this paper, we provide a comparison of language features and runtime systems of
commonly used threading parallel programming models for high performance computing …

Grain graphs: OpenMP performance analysis made easy

A Muddukrishna, PA Jonsson, A Podobas… - Proceedings of the 21st …, 2016 - dl.acm.org
Average programmers struggle to solve performance problems in OpenMP programs with
tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task …

Callisto-RTS: Fine-Grain Parallel Loops

T Harris, S Kaestle - … Annual Technical Conference (USENIX ATC 15), 2015 - usenix.org
We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-
memory machines. It supports very fine-grained scheduling of parallel loops—down to …

Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks

F Liu, A Fredriksson, S Markidis - European Conference on Parallel …, 2023 - Springer
Cholesky factorization is a method for solving linear systems involving symmetric, positive-
definite matrices, and can be an attractive choice in applications where a high degree of …

The tiny-tasks granularity trade-off: Balancing overhead versus performance in parallel systems

S Bora, B Walker, M Fidler - IEEE Transactions on Parallel and …, 2023 - ieeexplore.ieee.org
Models of parallel processing systems typically assume that one has workers and jobs are
split into an equal number of tasks. Splitting jobs into smaller tasks, i.e., using “tiny tasks”, can …

Analyzing the performance trade-off in implementing user-level threads

S Iwasaki, A Amer, K Taura… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
User-level threads have been widely adopted as a means of achieving lightweight
concurrent execution without the costs of OS-level threads. Nevertheless, the costs of …

Solvers for Electronic Structure in the Strong Scaling Limit

N Bock, M Challacombe, LV Kalé - SIAM Journal on Scientific Computing, 2016 - SIAM
We present a hybrid OpenMP/Charm++ framework for solving the O(N) self-consistent-field
eigenvalue problem with parallelism in the strong scaling regime, P≫N, where P is the …

Lessons learned from analyzing dynamic promotion for user-level threading

S Iwasaki, A Amer, K Taura… - … Conference for High …, 2018 - ieeexplore.ieee.org
A performance vs. practicality trade-off exists between user-level threading techniques. The
community has settled mostly on a black-and-white perspective; fully fledged threads …

Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores

D Los, I Petushkov - International Journal of Open Information …, 2024 - cyberleninka.ru
Nowadays, latency-critical, high-performance applications are parallelized even on power-
constrained client systems to improve performance. However, an important scenario of fine …