Efficient exascale discretizations: High-order finite element methods
Efficient exploitation of exascale architectures requires rethinking the numerical
algorithms used in many large-scale applications. These architectures favor algorithms that …
DeepCPU: Serving RNN-based deep learning models 10x faster
Recurrent neural networks (RNNs) are an important class of deep learning (DL) models.
Existing DL frameworks have unsatisfactory performance for online serving: many RNN …
CLBlast: A tuned OpenCL BLAS library
C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …
A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver,
capable of revealing the coupled electrothermal properties of atomically-resolved nano …
A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit
In recent years, the heterogeneity of both commodity and supercomputer hardware has
increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often …
Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs
A Abdelfattah, S Tomov… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Matrix multiplication (GEMM) is the most important operation in dense linear algebra.
Because it is a compute-bound operation that is rich in data reuse, many applications from …
High-performance tensor contractions for GPUs
We present a computational framework for high-performance tensor contractions on GPUs.
High performance is difficult to obtain using existing libraries, especially for many …
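As a minimal illustration of the kind of tensor contraction such a framework accelerates (the tensor names and shapes below are illustrative, not taken from the paper), NumPy's `einsum` can express one; on GPUs this pattern is typically mapped to batched GEMM calls:

```python
import numpy as np

# Contract a rank-3 tensor A with a matrix B over the shared index k:
#   C[i, j, l] = sum_k A[i, j, k] * B[k, l]
A = np.random.rand(4, 5, 6)
B = np.random.rand(6, 7)
C = np.einsum('ijk,kl->ijl', A, B)
print(C.shape)  # (4, 5, 7)
```

The same contraction can be computed as a stack of small matrix multiplies (`np.matmul(A, B)`), which is why high-performance tensor-contraction libraries lean so heavily on optimized batched GEMM kernels.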
Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …
A set of batched basic linear algebra subprograms and LAPACK routines
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
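The core Batched BLAS idea — many independent GEMMs on small matrices, launched as a single operation — can be sketched with NumPy's stacked `matmul` (a stand-in for illustration only; the actual BBLAS API described in the article is a C interface, not this):

```python
import numpy as np

# A batch of many small, independent GEMM problems of identical size.
batch, m, k, n = 1000, 8, 8, 8
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)

# One "batched GEMM" call computes C[i] = A[i] @ B[i] for every i,
# amortizing launch overhead that per-matrix calls would pay 1000 times.
C = np.matmul(A, B)
print(C.shape)  # (1000, 8, 8)
```

Grouping the small problems into one call is what lets GPU implementations keep the device saturated, since any single 8x8 GEMM is far too small to do so on its own.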
Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs
C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …