Affinity-based thread and data mapping in shared memory systems
Shared memory architectures have recently experienced a large increase in thread-level
parallelism, leading to complex memory hierarchies with multiple cache memory levels and …
Argobots: A lightweight low-level threading and tasking framework
In the past few decades, a number of user-level threading and tasking models have been
proposed in the literature to address the shortcomings of OS-level threads, primarily with …
memif: Towards Programming Heterogeneous Memory Asynchronously
To harness a heterogeneous memory hierarchy, it is advantageous to integrate application
knowledge in guiding frequent memory move, i.e., replicating or migrating virtual memory …
A tool to analyze the performance of multithreaded programs on NUMA architectures
Almost all of today's microprocessors contain memory controllers and directly attach to
memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is …
Learning intermediate representations using graph neural networks for NUMA and prefetchers optimization
There is a large space of NUMA and hardware prefetcher configurations that can
significantly impact the performance of an application. Previous studies have demonstrated …
Locality-centric data and threadblock management for massive GPUs
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …
Modeling and optimizing NUMA effects and prefetching with machine learning
Both NUMA thread/data placement and hardware prefetcher configuration have significant
impacts on HPC performance. Optimizing both together leads to a large and complex design …
Efficient thread/page/parallelism autotuning for NUMA systems
Current multi-socket systems have complex memory hierarchies with significant Non-
Uniform Memory Access (NUMA) effects: memory performance depends on the location of …
Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management
Dynamic task-parallel programming models are popular on shared-memory systems,
promising enhanced scalability, load balancing and locality. Yet these promises are …
NumaMMA: NUMA memory analyzer
Non Uniform Memory Access (NUMA) architectures are nowadays common for running High-
Performance Computing (HPC) applications. In such architectures, several distinct physical …