GPS: A global publish-subscribe model for multi-GPU memory management

H Muthukrishnan, D Lustig, D Nellans… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Suboptimal management of memory and bandwidth is one of the primary causes of low
performance on systems comprising multiple GPUs. Existing memory management solutions …

Almost deterministic work stealing

S Shiina, K Taura - Proceedings of the International Conference for High …, 2019 - dl.acm.org
With task parallel models, programmers can easily parallelize divide-and-conquer
algorithms by using nested fork-join structures. Work stealing, which is a popular scheduling …
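As a rough illustration of the nested fork-join pattern this snippet refers to, the following C++/OpenMP sketch parallelizes a divide-and-conquer sum with tasks; the recursive sum, the cutoff value, and all identifiers are illustrative and not taken from the paper.

// Minimal fork-join divide-and-conquer sketch with OpenMP tasks.
// The recursive sum, the cutoff, and all names are illustrative.
#include <cstdio>
#include <vector>

static long parallel_sum(const long* a, long n) {
    if (n <= 1024) {                      // sequential cutoff
        long s = 0;
        for (long i = 0; i < n; ++i) s += a[i];
        return s;
    }
    long left = 0, right = 0;
    #pragma omp task shared(left)         // fork: left half becomes a child task
    left = parallel_sum(a, n / 2);
    right = parallel_sum(a + n / 2, n - n / 2);  // parent handles the right half
    #pragma omp taskwait                  // join: wait for the forked child
    return left + right;
}

int main() {
    std::vector<long> data(1 << 20, 1);
    long total = 0;
    #pragma omp parallel
    #pragma omp single                    // one thread spawns the task tree
    total = parallel_sum(data.data(), (long)data.size());
    std::printf("sum = %ld\n", total);    // expect 1048576
    return 0;
}

A work-stealing scheduler executes such a task tree by letting idle threads steal the forked subproblems from busy threads' queues.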

FinePack: Transparently improving the efficiency of fine-grained transfers in multi-GPU systems

H Muthukrishnan, D Lustig, O Villa… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to
communicate among devices in multi-GPU systems is a promising path to achieve strong …
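For context on the baseline capability this entry builds on, the host-side C++ sketch below enables CUDA peer access between two GPUs, which is what allows fine-grained P2P loads and stores between device memories; this is generic CUDA runtime setup, not the FinePack mechanism itself.

// Generic CUDA peer-to-peer setup sketch (host side only).
// Illustrates the baseline P2P capability, not the FinePack technique.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 2) { std::printf("need at least 2 GPUs\n"); return 0; }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 access GPU 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { std::printf("no P2P path\n"); return 0; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0's address space
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // After this, a kernel running on GPU 0 can dereference a pointer returned
    // by cudaMalloc on GPU 1 directly (a fine-grained P2P load/store), rather
    // than staging data through bulk cudaMemcpyPeer transfers.
    std::printf("peer access enabled between GPU 0 and GPU 1\n");
    return 0;
}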

Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management

A Drebes, A Pop, K Heydemann, A Cohen… - Proceedings of the 2016 …, 2016 - dl.acm.org
Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …

Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers

H Muthukrishnan, D Nellans, D Lustig… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Despite continuing research into inter-GPU communication mechanisms, extracting
performance from multi-GPU systems remains a significant challenge. Inter-GPU …

Reducing data movement on large shared memory systems by exploiting computation dependencies

I Sánchez Barrera, M Moretó, E Ayguadé… - Proceedings of the …, 2018 - dl.acm.org
Shared memory systems are becoming increasingly complex as they typically integrate
several storage devices. That brings different access latencies or bandwidth rates …

Using data dependencies to improve task-based scheduling strategies on NUMA architectures

P Virouleau, F Broquedis, T Gautier… - Euro-Par 2016: Parallel …, 2016 - Springer
The recent addition of data dependencies to the OpenMP 4.0 standard provides the
application programmer with a more flexible way of synchronizing tasks. Using such an …
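As a minimal illustration of the OpenMP 4.0 data dependencies mentioned here, the C++ sketch below chains three tasks with depend clauses; the arrays and task bodies are placeholders, not code from the paper.

// Sketch of OpenMP 4.0 task data dependencies (the depend clause).
// The buffers and task bodies are illustrative placeholders.
#include <cstdio>

int main() {
    const int N = 1000;
    static double a[1000], b[1000];

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)                // producer: writes a
        for (int i = 0; i < N; ++i) a[i] = i;

        #pragma omp task depend(in: a) depend(out: b)  // runs only after a is ready
        for (int i = 0; i < N; ++i) b[i] = 2.0 * a[i];

        #pragma omp task depend(in: b)                 // consumer: runs after b is ready
        std::printf("b[42] = %f\n", b[42]);

        #pragma omp taskwait
    }
    return 0;
}

The depend clauses let the runtime order the three tasks without explicit barriers, and a NUMA-aware scheduler can additionally use the same dependency information to place consumers near their producers' data.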

Reducing cache coherence traffic with a NUMA-aware runtime approach

P Caheny, L Alvarez, S Derradji… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the
benefits they provide for scaling core count and memory capacity. Also, the flat memory …
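To make "NUMA-aware" concrete, the C++ sketch below binds an allocation to a specific NUMA node with libnuma so that nearby threads touch local memory; this is a generic illustration of NUMA-aware placement, not the runtime approach proposed in the paper.

// Explicit NUMA-aware placement with libnuma (generic illustration only).
// Build with: g++ numa_place.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("libnuma not available on this system\n");
        return 0;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    // Allocate a buffer whose pages are bound to node 0, so threads
    // scheduled near node 0 touch local memory instead of remote memory,
    // reducing cross-node traffic over the coherence interconnect.
    size_t bytes = 64 * 1024 * 1024;
    double* buf = static_cast<double*>(numa_alloc_onnode(bytes, 0));
    if (!buf) { std::printf("allocation failed\n"); return 1; }

    for (size_t i = 0; i < bytes / sizeof(double); ++i) buf[i] = 0.0;

    numa_free(buf, bytes);
    return 0;
}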

DuctTeip: An efficient programming model for distributed task-based parallel computing

A Zafari, E Larsson, M Tillenius - Parallel Computing, 2019 - Elsevier
Current high-performance computer systems used for scientific computing typically combine
shared memory computational nodes in a distributed memory environment. Extracting high …

Dense matrix computations on NUMA architectures with distance-aware work stealing

R Al-Omairy, G Miranda, H Ltaief, RM Badia… - Supercomputing …, 2015 - superfri.org
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in
the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of …
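As a generic sketch of what distance-aware victim selection can look like, the C++ snippet below orders steal attempts by NUMA distance from the thief's node; the distance matrix and queue layout are invented for illustration and are not OmpSs internals.

// Generic sketch of distance-aware victim selection for work stealing:
// a thief tries queues on nearby NUMA nodes before distant ones.
// The distance table is invented; this is not the OmpSs implementation.
#include <algorithm>
#include <cstdio>
#include <vector>

std::vector<int> steal_order(int my_node, const std::vector<std::vector<int>>& dist) {
    std::vector<int> order;
    for (int n = 0; n < (int)dist.size(); ++n)
        if (n != my_node) order.push_back(n);
    // Sort candidate victim nodes by NUMA distance from the thief's node.
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return dist[my_node][a] < dist[my_node][b];
    });
    return order;
}

int main() {
    // Example 4-node distance matrix in the style of `numactl --hardware`.
    std::vector<std::vector<int>> dist = {
        {10, 21, 31, 31},
        {21, 10, 31, 31},
        {31, 31, 10, 21},
        {31, 31, 21, 10},
    };
    for (int v : steal_order(0, dist)) std::printf("try node %d\n", v);  // node 1, then 2, 3
    return 0;
}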