Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

R Jain, VM Bhasi, A Jog… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Personalized recommendation is a ubiquitous application on the internet, with many
industries and hyperscalers extensively leveraging Deep Learning Recommendation …

SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism

T Guo, X Huang, K Wu, X Zhang, N Xiao - … of the 61st ACM/IEEE Design …, 2024 - dl.acm.org
While designed for massive parallelism, GPUs are frequently suffering from low thread
occupancy and limited data throughput, which are typically attributed to constrained on-chip …

Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

J Ren, B Ma, S Yang, B Francis, EK Ardestani, M Si… - pasalabs.org
Deep learning recommendation models (DLRMs) are widely used in industry, and their
memory capacity requirements reach the terabyte scale. Tiered memory architectures …