Analytical modeling is enough for high-performance BLIS
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides
a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation …
Parallel Deep Learning with a hybrid BP-PSO framework for feature extraction and malware classification
Malicious software (Malware) is a key threat to security of digital networks and systems.
While traditional machine learning methods have been widely used for malware detection …
An ensemble-based parallel deep learning classifier with PSO-BP optimization for malware detection
Digital networks and systems are susceptible to malicious software (malware) attacks. Deep
learning (DL) models have recently emerged as effective methods to classify and detect …
A methodology for efficient tile size selection for affine loop kernels
Reducing the number of data accesses in memory hierarchy is of paramount importance on
modern computer systems. One of the key optimizations addressing this problem is loop …
An approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP
I Safonov, A Kornilov, D Makienko - Electronics, 2022 - mdpi.com
Matrix multiplication is an important operation for many engineering applications.
Sometimes new features that include matrix multiplication should be added to existing and …
HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs
We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both
central processing units (CPUs) and graphics processing units (GPUs) for large-scale …
A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures
Current compilers cannot generate code that can compete with hand-tuned code in
efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in …
Automatic generation of fast BLAS3-GEMM: A portable compiler approach
X Su, X Liao, J Xue - 2017 IEEE/ACM International Symposium …, 2017 - ieeexplore.ieee.org
GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in
assembly code or generated from C code by general-purpose compilers (guided by …
Performance evaluation of implicit and explicit SIMDization
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions
to exploit data-level-parallelism in their General Purpose Processors (GPPs). Each SIMD …
Design and implementation of a highly efficient DGEMM for 64-bit ARMv8 multi-core processors
This paper presents the design and implementation of a highly efficient Double-precision
General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core …