Complexity-effective multicore coherence
Much of the complexity and overhead (directory, state bits, invalidations) of a typical
directory coherence implementation stems from the effort to make it "invisible" even to the …
A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications
With rapid advances in network hardware, far memory has gained a great deal of traction
due to its ability to break the memory capacity wall. Existing far memory systems fall into one …
System and method for simplifying cache coherence using multiple write policies
Abstract: System and methods for cache coherence in a multi-core processing environment
having a local/shared cache hierarchy. The system includes multiple processor cores, a …
Locality-centric data and threadblock management for massive GPUs
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …
Compiler support for selective page migration in NUMA architectures
G Piccoli, HN Santos, RE Rodrigues, C Pousa… - Proceedings of the 23rd …, 2014 - dl.acm.org
Current high-performance multicore processors provide users with a non-uniform memory
access model (NUMA). These systems perform better when threads access data on memory …
A software approach for combating asymmetries of non-volatile memories
The recent advances in non-volatile memory technologies promise the delivery of future
high performance and low power computing systems. While these technologies provide …
Locality‐Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors
Performance degradation due to nonuniform data access latencies has worsened on NUMA
systems and can now be felt on‐chip in manycore processors. Distributing data across …
Practically private: Enabling high performance cmps through compiler-assisted data classification
State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver
computing power across many types of applications. Potentially significant performance …
Racer: TSO consistency via race detection
Several recent efforts aim to simplify coherence and its associated costs (e.g., directory size,
complexity) in multicores. The bulk of these efforts rely on program data-race-free (DRF) …
Temporal-aware mechanism to detect private data in chip multiprocessors
Most of the data referenced by sequential and parallel applications running in current chip
multiprocessors are referenced by only one thread and can be considered as private data. A …