Complexity-effective multicore coherence

A Ros, S Kaxiras - Proceedings of the 21st international conference on …, 2012 - dl.acm.org
Much of the complexity and overhead (directory, state bits, invalidations) of a typical
directory coherence implementation stems from the effort to make it "invisible" even to the …

A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications

L Chen, S Liu, C Wang, H Ma, Y Qiao, Z Wang… - … USENIX Symposium on …, 2024 - usenix.org
With rapid advances in network hardware, far memory has gained a great deal of traction
due to its ability to break the memory capacity wall. Existing far memory systems fall into one …

System and method for simplifying cache coherence using multiple write policies

S Kaxiras, A Ros - US Patent 9,274,960, 2016 - Google Patents
Abstract: System and methods for cache coherence in a multi-core processing environment
having a local/shared cache hierarchy. The system includes multiple processor cores, a …

Locality-centric data and threadblock management for massive GPUs

M Khairy, V Nikiforov, D Nellans… - 2020 53rd Annual IEEE …, 2020 - ieeexplore.ieee.org
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …

Compiler support for selective page migration in NUMA architectures

G Piccoli, HN Santos, RE Rodrigues, C Pousa… - Proceedings of the 23rd …, 2014 - dl.acm.org
Current high-performance multicore processors provide users with a non-uniform memory
access model (NUMA). These systems perform better when threads access data on memory …

A software approach for combating asymmetries of non-volatile memories

Y Li, Y Chen, AK Jones - Proceedings of the 2012 ACM/IEEE …, 2012 - dl.acm.org
The recent advances in non-volatile memory technologies promise the delivery of future
high performance and low power computing systems. While these technologies provide …

Locality‐Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

A Muddukrishna, PA Jonsson… - Scientific …, 2015 - Wiley Online Library
Performance degradation due to nonuniform data access latencies has worsened on NUMA
systems and can now be felt on‐chip in manycore processors. Distributing data across …

Practically private: Enabling high performance CMPs through compiler-assisted data classification

Y Li, R Melhem, AK Jones - … of the 21st international conference on …, 2012 - dl.acm.org
State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver
computing power across many types of applications. Potentially significant performance …

Racer: TSO consistency via race detection

A Ros, S Kaxiras - 2016 49th Annual IEEE/ACM International …, 2016 - ieeexplore.ieee.org
Several recent efforts aim to simplify coherence and its associated costs (e.g., directory size,
complexity) in multicores. The bulk of these efforts rely on program data-race-free (DRF) …

Temporal-aware mechanism to detect private data in chip multiprocessors

A Ros, B Cuesta, ME Gómez… - … on Parallel Processing, 2013 - ieeexplore.ieee.org
Most of the data referenced by sequential and parallel applications running in current chip
multiprocessors are referenced by only one thread and can be considered as private data. A …