New attacks and defense for encrypted-address cache

MK Qureshi - Proceedings of the 46th International Symposium on …, 2019 - dl.acm.org
Conflict-based cache attacks can allow an adversary to infer the access pattern of a co-
running application by orchestrating evictions via cache conflicts. Such attacks can be …
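The eviction mechanism behind such attacks can be illustrated with a toy prime+probe sketch: the attacker fills one cache set, lets the victim run, then re-accesses its own lines; a miss reveals a conflicting victim access. This is a minimal, illustrative model (a single LRU set; the class and names are hypothetical, not from the paper):

```python
# Toy model of one set in a 4-way set-associative cache with LRU
# replacement, illustrating a conflict-based "prime+probe" attack.
# All names are illustrative; real attacks measure access latency
# on hardware rather than querying a simulator.

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []  # most-recently-used tag at the end

    def access(self, tag):
        """Access a line; return True on hit, False on miss."""
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)  # evict the LRU line
        self.lines.append(tag)
        return False

def prime_probe(victim_touches_set):
    s = CacheSet(ways=4)
    attacker = [f"A{i}" for i in range(4)]
    for t in attacker:            # prime: fill the whole set
        s.access(t)
    if victim_touches_set:        # victim runs and conflicts,
        s.access("V")             # evicting one attacker line
    # probe: any miss implies the victim touched this set
    return any(not s.access(t) for t in attacker)

print(prime_probe(True))   # True  -> victim access detected
print(prime_probe(False))  # False -> set undisturbed
```

Randomizing the address-to-set mapping (as the encrypted-address defense in the paper does) breaks the attacker's ability to construct such a conflicting "prime" group in the first place.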

Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems

V Young, A Jaleel, E Bolotin, E Ebrahimi… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org
Historically, improvement in GPU performance has been tightly coupled with transistor
scaling. As Moore's Law slows down, performance of single GPUs may ultimately plateau …

Bandwidth-effective DRAM cache for GPUs with storage-class memory

J Hong, S Cho, G Park, W Yang… - … Symposium on High …, 2024 - ieeexplore.ieee.org
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-
Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity …

ABNDP: Co-optimizing data access and load balance in near-data processing

B Tian, Q Chen, M Gao - Proceedings of the 28th ACM International …, 2023 - dl.acm.org
Near-Data Processing (NDP) has been a promising architectural paradigm to address the
memory wall challenge for data-intensive applications. Typical NDP systems based on 3D …

Performance evaluation of intel optane memory for managed workloads

S Akram - ACM Transactions on Architecture and Code …, 2021 - dl.acm.org
Intel Optane memory offers non-volatility, byte addressability, and high capacity. It suits
managed workloads that prefer large main memory heaps. We investigate Optane as the …

Baryon: Efficient hybrid memory management with compression and sub-blocking

Y Li, M Gao - 2023 IEEE International Symposium on High …, 2023 - ieeexplore.ieee.org
Hybrid memory systems are able to achieve both high performance and large capacity when
combining fast commodity DDR memories with larger but slower non-volatile memories in a …

DUCATI: High-performance address translation by extending TLB reach of GPU-accelerated systems

A Jaleel, E Ebrahimi, S Duncan - ACM Transactions on Architecture and …, 2019 - dl.acm.org
Conventional on-chip TLB hierarchies are unable to fully cover the growing application
working-set sizes. To make things worse, Last-Level TLB (LLT) misses require multiple …

Reducing load latency with cache level prediction

M Jalili, M Erez - 2022 IEEE International Symposium on High …, 2022 - ieeexplore.ieee.org
High load latency that results from deep cache hierarchies and relatively slow main memory
is an important limiter of single-thread performance. Data prefetch helps reduce this latency …

Enabling design space exploration of DRAM caches for emerging memory systems

M Babaie, A Akram… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The increasing growth of applications' memory capacity and performance demands has led
the CPU vendors to deploy heterogeneous memory systems either within a single system or …

Locality-aware optimizations for improving remote memory latency in multi-gpu systems

L Belayneh, H Ye, KY Chen, D Blaauw… - Proceedings of the …, 2022 - dl.acm.org
With generational gains from transistor scaling, GPUs have been able to accelerate
traditional computation-intensive workloads. But with the obsolescence of Moore's Law …