DietCode: Automatic optimization for dynamic tensor programs
Achieving high performance for compute-intensive operators in machine learning (ML)
workloads is a crucial but challenging task. Many ML and system practitioners rely on …
A survey of multi-tenant deep learning inference on GPU
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing
hardware such as NVIDIA GPUs has also demonstrated strong compute scaling trends, with 2x …
Fast tree-field integrators: From low displacement rank to topological transformers
We present a new class of fast polylog-linear algorithms based on the theory of structured
matrices (in particular low displacement rank) for integrating tensor fields defined on …
Exploring the Diversity of Multiple Job Deployments over GPUs for Efficient Resource Sharing
Graphics Processing Units (GPUs) are gradually becoming a mainstream computing resource
for efficient execution of applications both on-premises and in the cloud. Currently however …
Pruning one more token is enough: Leveraging latency-workload non-linearities for vision transformers on the edge
This paper investigates how to efficiently deploy vision transformers on edge devices for
small workloads. Recent methods reduce the latency of transformer neural networks by …
LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs
Y Zhang, H Jiang, Y Zhu, R Zhang, Y Cao… - The Journal of …, 2023 - Springer
Channel pruning has recently become a widely used model compression method. However,
most existing channel pruning methods only prune to decrease the model size, such as the …
Application-aware Resource Sharing using Software and Hardware Partitioning on Modern GPUs
Graphics Processing Units (GPUs) are known for the large computing capabilities they offer
users compared to traditional CPUs. However, the issue of resource under-utilization is …
nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms
We present nnPerf, a real-time on-device profiler designed to collect and analyze the DNN
model run-time inference latency on mobile platforms. nnPerf demystifies the hidden layers …
CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU
Y Zhu, H Jiang, R Zhang, Y Zhang… - 2022 IEEE Intl Conf on …, 2022 - ieeexplore.ieee.org
Channel pruning is one of the main methods used in current network model compression.
However, existing channel pruning methods lack effective hardware runtime latency …
Automatic Compiler-based Optimizations for Deep Neural Networks
B Zheng - 2024 - tspace.library.utoronto.ca
Deep neural networks (DNNs) are the current state-of-the-art machine learning algorithms in
various application domains. Due to their importance, it is crucial that we guarantee their …