GPS: A global publish-subscribe model for multi-GPU memory management

H Muthukrishnan, D Lustig, D Nellans… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Suboptimal management of memory and bandwidth is one of the primary causes of low
performance on systems comprising multiple GPUs. Existing memory management solutions …

Almost deterministic work stealing

S Shiina, K Taura - Proceedings of the International Conference for High …, 2019 - dl.acm.org
With task parallel models, programmers can easily parallelize divide-and-conquer
algorithms by using nested fork-join structures. Work stealing, which is a popular scheduling …
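As a rough illustration of the nested fork-join pattern this snippet refers to, the following C++/OpenMP sketch parallelizes a divide-and-conquer sum with tasks; the recursive sum, the cutoff value, and all identifiers are illustrative and not taken from the paper.

// Minimal fork-join divide-and-conquer sketch with OpenMP tasks.
// The recursive sum, the cutoff, and all names are illustrative.
#include <cstdio>
#include <vector>

static long parallel_sum(const long* a, long n) {
    if (n <= 1024) {                      // sequential cutoff
        long s = 0;
        for (long i = 0; i < n; ++i) s += a[i];
        return s;
    }
    long left = 0, right = 0;
    #pragma omp task shared(left)         // fork: left half becomes a child task
    left = parallel_sum(a, n / 2);
    right = parallel_sum(a + n / 2, n - n / 2);  // parent handles the right half
    #pragma omp taskwait                  // join: wait for the forked child
    return left + right;
}

int main() {
    std::vector<long> data(1 << 20, 1);
    long total = 0;
    #pragma omp parallel
    #pragma omp single                    // one thread spawns the task tree
    total = parallel_sum(data.data(), (long)data.size());
    std::printf("sum = %ld\n", total);    // expect 1048576
    return 0;
}

A work-stealing scheduler executes such a task tree by letting idle threads steal the forked subproblems from busy threads' queues.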

FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems

H Muthukrishnan, D Lustig, O Villa… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to
communicate among devices in multi-GPU systems is a promising path to achieve strong …
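For context on the baseline capability this entry builds on, the host-side C++ sketch below enables CUDA peer access between two GPUs, which is what allows fine-grained P2P loads and stores between device memories; this is generic CUDA runtime setup, not the FinePack mechanism itself.

// Generic CUDA peer-to-peer setup sketch (host side only).
// Illustrates the baseline P2P capability, not the FinePack technique.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 2) { std::printf("need at least 2 GPUs\n"); return 0; }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 access GPU 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { std::printf("no P2P path\n"); return 0; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0's address space
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // After this, a kernel running on GPU 0 can dereference a pointer returned
    // by cudaMalloc on GPU 1 directly (a fine-grained P2P load/store), rather
    // than staging data through bulk cudaMemcpyPeer transfers.
    std::printf("peer access enabled between GPU 0 and GPU 1\n");
    return 0;
}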

Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management

A Drebes, A Pop, K Heydemann, A Cohen… - Proceedings of the 2016 …, 2016 - dl.acm.org
Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …

Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers

H Muthukrishnan, D Nellans, D Lustig… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Despite continuing research into inter-GPU communication mechanisms, extracting
performance from multi-GPU systems remains a significant challenge. Inter-GPU …

Reducing data movement on large shared memory systems by exploiting computation dependencies

I Sánchez Barrera, M Moretó, E Ayguadé… - Proceedings of the …, 2018 - dl.acm.org
Shared memory systems are becoming increasingly complex as they typically integrate
several storage devices. That brings different access latencies or bandwidth rates …

Using data dependencies to improve task-based scheduling strategies on NUMA architectures

P Virouleau, F Broquedis, T Gautier… - Euro-Par 2016: Parallel …, 2016 - Springer
The recent addition of data dependencies to the OpenMP 4.0 standard provides the
application programmer with a more flexible way of synchronizing tasks. Using such an …
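As a minimal illustration of the OpenMP 4.0 data dependencies mentioned here, the C++ sketch below chains three tasks with depend clauses; the arrays and task bodies are placeholders, not code from the paper.

// Sketch of OpenMP 4.0 task data dependencies (the depend clause).
// The buffers and task bodies are illustrative placeholders.
#include <cstdio>

int main() {
    const int N = 1000;
    static double a[1000], b[1000];

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)                // producer: writes a
        for (int i = 0; i < N; ++i) a[i] = i;

        #pragma omp task depend(in: a) depend(out: b)  // runs only after a is ready
        for (int i = 0; i < N; ++i) b[i] = 2.0 * a[i];

        #pragma omp task depend(in: b)                 // consumer: runs after b is ready
        std::printf("b[42] = %f\n", b[42]);

        #pragma omp taskwait
    }
    return 0;
}

The depend clauses let the runtime order the three tasks without explicit barriers, and a NUMA-aware scheduler can additionally use the same dependency information to place consumers near their producers' data.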

Reducing cache coherence traffic with a NUMA-aware runtime approach

P Caheny, L Alvarez, S Derradji… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the
benefits they provide for scaling core count and memory capacity. Also, the flat memory …
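To make "NUMA-aware" concrete, the C++ sketch below binds an allocation to a specific NUMA node with libnuma so that nearby threads touch local memory; this is a generic illustration of NUMA-aware placement, not the runtime approach proposed in the paper.

// Explicit NUMA-aware placement with libnuma (generic illustration only).
// Build with: g++ numa_place.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("libnuma not available on this system\n");
        return 0;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    // Allocate a buffer whose pages are bound to node 0, so threads
    // scheduled near node 0 touch local memory instead of remote memory,
    // reducing cross-node traffic over the coherence interconnect.
    size_t bytes = 64 * 1024 * 1024;
    double* buf = static_cast<double*>(numa_alloc_onnode(bytes, 0));
    if (!buf) { std::printf("allocation failed\n"); return 1; }

    for (size_t i = 0; i < bytes / sizeof(double); ++i) buf[i] = 0.0;

    numa_free(buf, bytes);
    return 0;
}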

DuctTeip: An efficient programming model for distributed task-based parallel computing

A Zafari, E Larsson, M Tillenius - Parallel Computing, 2019 - Elsevier
Current high-performance computer systems used for scientific computing typically combine
shared memory computational nodes in a distributed memory environment. Extracting high …

Dense matrix computations on NUMA architectures with distance-aware work stealing

R Al-Omairy, G Miranda, H Ltaief, RM Badia… - Supercomputing …, 2015 - superfri.org
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in
the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of …
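As a generic sketch of what distance-aware victim selection can look like, the C++ snippet below orders steal attempts by NUMA distance from the thief's node; the distance matrix and queue layout are invented for illustration and are not OmpSs internals.

// Generic sketch of distance-aware victim selection for work stealing:
// a thief tries queues on nearby NUMA nodes before distant ones.
// The distance table is invented; this is not the OmpSs implementation.
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<int> steal_order(int my_node, const std::vector<std::vector<int>>& dist) {
    std::vector<int> order;
    for (int n = 0; n < (int)dist.size(); ++n)
        if (n != my_node) order.push_back(n);
    // Sort candidate victim nodes by NUMA distance from the thief's node.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return dist[my_node][a] < dist[my_node][b];
    });
    return order;
}

int main() {
    // Example 4-node distance matrix in the style of `numactl --hardware`.
    std::vector<std::vector<int>> dist = {
        {10, 21, 31, 31},
        {21, 10, 31, 31},
        {31, 31, 10, 21},
        {31, 31, 21, 10},
    };
    for (int v : steal_order(0, dist)) std::printf("try node %d\n", v);  // node 1, then 2, 3
    return 0;
}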