GPS: A global publish-subscribe model for multi-GPU memory management
Suboptimal management of memory and bandwidth is one of the primary causes of low
performance on systems comprising multiple GPUs. Existing memory management solutions …
Almost deterministic work stealing
With task parallel models, programmers can easily parallelize divide-and-conquer
algorithms by using nested fork-join structures. Work stealing, which is a popular scheduling …
FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to
communicate among devices in multi-GPU systems is a promising path to achieve strong …
Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management
Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …
Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers
Despite continuing research into inter-GPU communication mechanisms, extracting
performance from multi-GPU systems remains a significant challenge. Inter-GPU …
Reducing data movement on large shared memory systems by exploiting computation dependencies
Shared memory systems are becoming increasingly complex as they typically integrate
several storage devices. This brings different access latencies or bandwidth rates …
Using data dependencies to improve task-based scheduling strategies on NUMA architectures
The recent addition of data dependencies to the OpenMP 4.0 standard provides the
application programmer with a more flexible way of synchronizing tasks. Using such an …
Reducing cache coherence traffic with a NUMA-aware runtime approach
P Caheny, L Alvarez, S Derradji… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the
benefits they provide for scaling core count and memory capacity. Also, the flat memory …
DuctTeip: An efficient programming model for distributed task-based parallel computing
A Zafari, E Larsson, M Tillenius - Parallel Computing, 2019 - Elsevier
Current high-performance computer systems used for scientific computing typically combine
shared memory computational nodes in a distributed memory environment. Extracting high …
Dense matrix computations on NUMA architectures with distance-aware work stealing
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in
the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of …