DietCode: Automatic optimization for dynamic tensor programs

B Zheng, Z Jiang, CH Yu, H Shen… - Proceedings of …, 2022 - proceedings.mlsys.org
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …

A survey of multi-tenant deep learning inference on gpu

F Yu, D Wang, L Shangguan, M Zhang, C Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing
hardware such as NVIDIA GPUs has also demonstrated strong compute-scaling trends, with 2x …
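
The snippet is truncated, but as background, one common building block for multi-tenant GPU inference is launching independent tenants' models on separate CUDA streams so their kernels can overlap on one device. A minimal PyTorch sketch of that idea (the models, batch sizes, and tenant assignment are illustrative assumptions, not taken from the survey):

# Illustrative sketch only: sharing one GPU between two tenants' models via
# separate CUDA streams. This is general background, not the survey's API.
import torch
import torchvision.models as models

device = torch.device("cuda")
model_a = models.resnet18(weights=None).eval().to(device)      # tenant A (assumed)
model_b = models.mobilenet_v2(weights=None).eval().to(device)  # tenant B (assumed)

stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()

with torch.no_grad():
    with torch.cuda.stream(stream_a):
        x_a = torch.randn(8, 3, 224, 224, device=device)
        out_a = model_a(x_a)        # kernels enqueued on stream A
    with torch.cuda.stream(stream_b):
        x_b = torch.randn(8, 3, 224, 224, device=device)
        out_b = model_b(x_b)        # enqueued on stream B; may overlap with A
torch.cuda.synchronize()            # wait for both tenants' work to finish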

Fast tree-field integrators: From low displacement rank to topological transformers

K Choromanski, A Sehanobish… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a new class of fast polylog-linear algorithms based on the theory of structured
matrices (in particular low displacement rank) for integrating tensor fields defined on …
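
As background on the "low displacement rank" machinery the abstract invokes (the standard definition from structured-matrix theory, not this paper's specific notation): a matrix M has displacement rank r with respect to operator matrices A and B when

% Standard displacement-rank definition (general background, not this paper's notation)
\[
  \operatorname{rank}\big( A M - M B \big) \le r .
\]

For example, every n-by-n Toeplitz matrix has displacement rank at most 2 under the shift operators Z_1 and Z_{-1}; it is this low-rank structure that enables near-linear-time algorithms.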

Exploring the Diversity of Multiple Job Deployments over GPUs for Efficient Resource Sharing

T Adufu, J Ha, Y Kim - 2024 International Conference on …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are gradually becoming a mainstream computing resource
for the efficient execution of applications both on-premises and in the cloud. Currently, however, …

Pruning one more token is enough: Leveraging latency-workload non-linearities for vision transformers on the edge

NJ Eliopoulos, P Jajal, J Davis, G Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates how to efficiently deploy vision transformers on edge devices for
small workloads. Recent methods reduce the latency of transformer neural networks by …
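
The abstract is cut off, but the "latency-workload non-linearity" in the title can be made concrete with a small timing sweep: GPU latency often stays flat as the token count shrinks, then drops only at hardware-dependent thresholds. A hypothetical sketch of such a sweep (the layer dimensions and sweep range are assumptions, not the paper's setup):

# Illustrative sketch, not the paper's method: time a transformer encoder
# layer at decreasing token counts to expose the stepwise ("non-linear")
# latency/workload relationship that makes removing one more token pay off
# only at certain thresholds.
import time
import torch

device = torch.device("cuda")
layer = torch.nn.TransformerEncoderLayer(
    d_model=768, nhead=12, batch_first=True).eval().to(device)

def latency_ms(num_tokens, iters=50):
    x = torch.randn(1, num_tokens, 768, device=device)
    with torch.no_grad():
        for _ in range(10):                 # warm-up
            layer(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

for n in range(197, 120, -8):               # 197 = ViT-Base token count at 224x224
    print(n, f"{latency_ms(n):.3f} ms")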

LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs

Y Zhang, H Jiang, Y Zhu, R Zhang, Y Cao… - The Journal of …, 2023 - Springer
Channel pruning has recently become a widely used model compression method. However,
most existing channel pruning methods only prune to decrease the model size, such as the …
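
The snippet contrasts size-oriented pruning with latency-oriented pruning. As a toy sketch of the general idea (not LOCP's actual criterion, which the truncated abstract does not specify): rank a convolution's output channels by an importance proxy such as the filter L1 norm, then choose how many to keep from measured device latency rather than from parameter counts.

# Toy sketch of latency-aware channel pruning (illustrative, not LOCP's
# algorithm): keep the top-k output channels of a conv by L1 filter norm,
# and pick k from measured GPU latency instead of parameters/FLOPs.
import torch

conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1)

def prune_out_channels(conv, keep):
    # importance proxy: L1 norm of each output filter (a common heuristic)
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    idx = scores.argsort(descending=True)[:keep]
    pruned = torch.nn.Conv2d(conv.in_channels, keep,
                             conv.kernel_size, conv.stride, conv.padding)
    pruned.weight.data = conv.weight.data[idx].clone()
    pruned.bias.data = conv.bias.data[idx].clone()
    return pruned

# A latency-aware pruner would now time each candidate on the target GPU
# and keep the smallest k whose measured latency meets the budget.
for keep in (128, 96, 64, 32):
    candidate = prune_out_channels(conv, keep)
    # ... measure candidate(x) latency on the device here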

Application-aware Resource Sharing using Software and Hardware Partitioning on Modern GPUs

T Adufu, J Ha, Y Kim - … 2024-2024 IEEE Network Operations and …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are known for the large computing capabilities they offer
users compared to traditional CPUs. However, the issue of resource under-utilization is …

nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms

H Chu, X Zheng, L Liu, H Ma - Proceedings of the 21st ACM Conference …, 2023 - dl.acm.org
We present nnPerf, a real-time on-device profiler designed to collect and analyze DNN
model runtime inference latency on mobile platforms. nnPerf demystifies the hidden layers …
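
nnPerf targets mobile runtimes, but the core idea of attributing end-to-end latency to individual hidden layers can be illustrated with framework hooks. A minimal PyTorch sketch of per-layer latency attribution (this is the general concept only; it is not nnPerf's implementation):

# Concept sketch only: per-layer latency attribution via forward hooks.
# (nnPerf profiles mobile runtimes; this is not its implementation.)
import time
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
timings = {}

def make_hooks(name):
    def pre_hook(module, inputs):
        timings[name] = time.perf_counter()
    def post_hook(module, inputs, output):
        # note: modules invoked more than once report only their last call
        timings[name] = (time.perf_counter() - timings[name]) * 1e3  # ms
    return pre_hook, post_hook

for name, module in model.named_modules():
    if len(list(module.children())) == 0:   # instrument leaf layers only
        pre, post = make_hooks(name)
        module.register_forward_pre_hook(pre)
        module.register_forward_hook(post)

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for name, ms in sorted(timings.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {ms:.3f} ms")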

CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU

Y Zhu, H Jiang, R Zhang, Y Zhang… - 2022 IEEE Intl Conf on …, 2022 - ieeexplore.ieee.org
Channel pruning is one of the most widely used methods in current network model compression.
However, existing channel pruning methods lack effective hardware runtime latency …

Automatic Compiler-based Optimizations for Deep Neural Networks

B Zheng - 2024 - tspace.library.utoronto.ca
Deep neural networks (DNNs) are the current state-of-the-art machine learning algorithms in
various application domains. Due to their importance, it is crucial that we guarantee their …