Analytical modeling is enough for high-performance BLIS

TM Low, FD Igual, TM Smith… - ACM Transactions on …, 2016 - dl.acm.org
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides
a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation …

Parallel Deep Learning with a hybrid BP-PSO framework for feature extraction and malware classification

MN Al-Andoli, SC Tan, KS Sim, CP Lim, PY Goh - Applied Soft Computing, 2022 - Elsevier
Malicious software (Malware) is a key threat to security of digital networks and systems.
While traditional machine learning methods have been widely used for malware detection …

An ensemble-based parallel deep learning classifier with PSO-BP optimization for malware detection

MN Al-Andoli, KS Sim, SC Tan, PY Goh, CP Lim - IEEE Access, 2023 - ieeexplore.ieee.org
Digital networks and systems are susceptible to malicious software (malware) attacks. Deep
learning (DL) models have recently emerged as effective methods to classify and detect …

A methodology for efficient tile size selection for affine loop kernels

V Kelefouras, K Djemame, G Keramidas… - International Journal of …, 2022 - Springer
Reducing the number of data accesses in memory hierarchy is of paramount importance on
modern computer systems. One of the key optimizations addressing this problem is loop …

An approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP

I Safonov, A Kornilov, D Makienko - Electronics, 2022 - mdpi.com
Matrix multiplication is an important operation for many engineering applications.
Sometimes new features that include matrix multiplication should be added to existing and …

HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs

H Kang, HC Kwon, D Kim - Computing, 2020 - Springer
We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both
central processing units (CPUs) and graphics processing units (GPUs) for large-scale …

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

V Kelefouras, A Kritikakou, I Mporas… - The Journal of …, 2016 - Springer
Current compilers cannot generate code that can compete with hand-tuned code in
efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in …

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

X Su, X Liao, J Xue - 2017 IEEE/ACM International Symposium …, 2017 - ieeexplore.ieee.org
GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in
assembly code or generated from C code by general-purpose compilers (guided by …

Performance evaluation of implicit and explicit SIMDization

H Amiri, A Shahbahrami, A Pohl, B Juurlink - Microprocessors and …, 2018 - Elsevier
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions
to exploit data-level-parallelism in their General Purpose Processors (GPPs). Each SIMD …

Design and implementation of a highly efficient DGEMM for 64-bit ARMv8 multi-core processors

F Wang, H Jiang, K Zuo, X Su, J Xue… - 2015 44th International …, 2015 - ieeexplore.ieee.org
This paper presents the design and implementation of a highly efficient Double-precision
General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core …