Unleashing fine-grained parallelism on embedded many-core accelerators with lightweight OpenMP tasking
In recent years, programmable many-core accelerators (PMCAs) have been introduced in
embedded systems to satisfy stringent performance/Watt requirements. This has increased …
Comparison of threading programming models
In this paper, we provide a comparison of language features and runtime systems of
commonly used threading parallel programming models for high performance computing …
Grain graphs: OpenMP performance analysis made easy
Average programmers struggle to solve performance problems in OpenMP programs with
tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task …
Callisto-RTS: Fine-Grain Parallel Loops
We introduce Callisto-RTS, a parallel runtime system designed for multi-socket shared-
memory machines. It supports very fine-grained scheduling of parallel loops—down to …
Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks
Cholesky factorization is a method for solving linear systems involving symmetric, positive-
definite matrices, and can be an attractive choice in applications where a high degree of …
The tiny-tasks granularity trade-off: Balancing overhead versus performance in parallel systems
Models of parallel processing systems typically assume that one has workers and jobs are
split into an equal number of tasks. Splitting jobs into smaller tasks, i.e., using “tiny tasks”, can …
Analyzing the performance trade-off in implementing user-level threads
User-level threads have been widely adopted as a means of achieving lightweight
concurrent execution without the costs of OS-level threads. Nevertheless, the costs of …
Solvers for Electronic Structure in the Strong Scaling Limit
We present a hybrid OpenMP/Charm++ framework for solving the O(N) self-consistent-field
eigenvalue problem with parallelism in the strong scaling regime, P≫N, where P is the …
Lessons learned from analyzing dynamic promotion for user-level threading
A performance vs. practicality trade-off exists between user-level threading techniques. The
community has settled mostly on a black-and-white perspective; fully fledged threads …
Exploring Fine-grained Task Parallelism on Simultaneous Multithreading Cores
D Los, I Petushkov - International Journal of Open Information …, 2024 - cyberleninka.ru
Nowadays, latency-critical, high-performance applications are parallelized even on power-
constrained client systems to improve performance. However, an important scenario of fine …