Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Optimizing depthwise separable convolution operations on GPUs

G Lu, W Zhang, Z Wang - IEEE Transactions on Parallel and …, 2021 - ieeexplore.ieee.org
The depthwise separable convolution is commonly seen in convolutional neural networks
(CNNs), and is widely used to reduce the computation overhead of a standard multi-channel …

Multi-level encoding and decoding in a scalable photonic tensor processor with a photonic general matrix multiply (GeMM) compiler

Z Guo, AN Tait, BA Marquez… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
The resurgence of artificial intelligence enabled by deep learning and high performance
computing has seen a dramatic increase of demand in the accuracy of deep learning model …

Towards functional safety compliance of matrix–matrix multiplication for machine learning-based autonomous systems

J Fernández, J Perez, I Agirre, I Allende, J Abella… - Journal of Systems …, 2021 - Elsevier
Autonomous systems execute complex tasks to perceive the environment and take self-aware
decisions with limited human interaction. This autonomy is commonly achieved with …

NIOT: A Novel Inference Optimization of Transformers on Modern CPUs

Z Zhang, Y Chen, B He, Z Zhang - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In the machine learning era, model inference efficiency is one of the most important issues
for machine learning systems. It is a major challenge to find the optimal configuration in a …

A methodology for efficient tile size selection for affine loop kernels

V Kelefouras, K Djemame, G Keramidas… - International Journal of …, 2022 - Springer
Reducing the number of data accesses in memory hierarchy is of paramount importance on
modern computer systems. One of the key optimizations addressing this problem is loop …

Full-wave-equation depth extrapolation for migration using matrix multiplication

J You, J Cao - Geophysics, 2020 - library.seg.org
To investigate wavefield depth extrapolation using the full-wave equation, we have derived
a new depth extrapolation scheme for migration using functions of the vertical wavenumber …

How does the performance of NEAT compare to Reinforcement Learning?

M Andersson - 2022 - diva-portal.org
This study examined the relative performance of Deep Reinforcement Learning compared to
a neuroevolution algorithm called NEAT when used to train AIs in a discrete game …

Coordinated DMA: improving the DRAM access efficiency for matrix multiplication

S Ma, Z Liu, S Chen, L Huang, Y Guo… - … on Parallel and …, 2019 - ieeexplore.ieee.org
High performance implementation of matrix multiplication is essential for scientific
computing. The memory access procedure is quite possible to be the bottleneck of matrix …

Optimizing General Matrix Multiplications on Modern Multi-core DSPs

K Yu, X Qi, P Zhang, J Fang, D Dong… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
General Matrix Multiplication (GEMM) is a key subprogram in high-performance computing
(HPC) and deep learning workloads. With the rising significance of power and energy …