DietCode: Automatic optimization for dynamic tensor programs
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …
A survey of multi-tenant deep learning inference on GPU
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing
hardware such as NVIDIA GPUs has also demonstrated strong compute scaling trends, with 2x …
Fast tree-field integrators: From low displacement rank to topological transformers
We present a new class of fast polylog-linear algorithms based on the theory of structured
matrices (in particular low displacement rank) for integrating tensor fields defined on …
Exploring the Diversity of Multiple Job Deployments over GPUs for Efficient Resource Sharing
Graphics Processing Units (GPUs) are gradually becoming a mainstream computing resource
for efficient execution of applications both on-premises and in the cloud. Currently however …
Pruning one more token is enough: Leveraging latency-workload non-linearities for vision transformers on the edge
This paper investigates how to efficiently deploy vision transformers on edge devices for
small workloads. Recent methods reduce the latency of transformer neural networks by …
LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs
Y Zhang, H Jiang, Y Zhu, R Zhang, Y Cao… - The Journal of …, 2023 - Springer
Channel pruning has recently become a widely used model compression method. However,
most existing channel pruning methods only prune to decrease the model size, such as the …
Application-aware Resource Sharing using Software and Hardware Partitioning on Modern GPUs
Graphics Processing Units (GPUs) are known for the large computing capabilities they offer
users compared to traditional CPUs. However, the issue of resource under-utilization is …
nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms
We present nnPerf, a real-time on-device profiler designed to collect and analyze the DNN
model run-time inference latency on mobile platforms. nnPerf demystifies the hidden layers …
CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU
Y Zhu, H Jiang, R Zhang, Y Zhang… - 2022 IEEE Intl Conf on …, 2022 - ieeexplore.ieee.org
Channel pruning is one of the main methods used in current network model compression.
However, existing channel pruning methods lack effective hardware runtime latency …
Automatic Compiler-based Optimizations for Deep Neural Networks
B Zheng - 2024 - tspace.library.utoronto.ca
Deep neural networks (DNNs) are the current state-of-the-art machine learning algorithms in
various application domains. Due to their importance, it is crucial that we guarantee their …