Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs
Personalized recommendation is a ubiquitous application on the internet, with many
industries and hyperscalers extensively leveraging Deep Learning Recommendation …
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism
T Guo, X Huang, K Wu, X Zhang, N Xiao - … of the 61st ACM/IEEE Design …, 2024 - dl.acm.org
While designed for massive parallelism, GPUs frequently suffer from low thread
occupancy and limited data throughput, which are typically attributed to constrained on-chip …
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
Deep learning recommendation models (DLRMs) are widely used in industry, and their
memory capacity requirements reach the terabyte scale. Tiered memory architectures …