Optimizing CUDA code by kernel fusion: application on BLAS

J Filipovič, M Madzin, J Fousek, L Matyska - The Journal of …, 2015 - Springer
Contemporary GPUs have significantly higher arithmetic throughput than memory
throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …

Automating the generation of composed linear algebra kernels

G Belter, ER Jessup, I Karlin, JG Siek - Proceedings of the Conference …, 2009 - dl.acm.org
Memory bandwidth limits the performance of important kernels in many scientific
applications. Such applications often use sequences of Basic Linear Algebra Subprograms …

Pipelined approach to fused kernels for optimization of machine learning workloads on graphical processing units

A Ashari, M Boehm, KW Campbell… - US Patent …, 2018 - Google Patents
A method for optimization of machine learning (ML) workloads on a
graphics processor unit (GPU). The method includes identifying a computation having a …

Programming abstractions for data locality

A Tate, A Kamil, A Dubey, A Groblinger, B Chamberlain… - 2014 - repository.kaust.edu.sa
Item Type: Technical Report. Authors: Tate, Adrian; Kamil, Amir; Dubey, Anshu; Groblinger …

Build to order linear algebra kernels

JG Siek, I Karlin, ER Jessup - 2008 IEEE International …, 2008 - ieeexplore.ieee.org
The performance bottleneck for many scientific applications is the cost of memory access
inside linear algebra kernels. Tuning such kernels for memory efficiency is a complex task …

Design and implementation for nonblocking execution in GraphBLAS: Tradeoffs and performance

A Mastoras, S Anagnostidis, AJN Yzelman - ACM Transactions on …, 2022 - dl.acm.org
GraphBLAS is a recent standard that allows the expression of graph algorithms in the
language of linear algebra and enables automatic code parallelization and optimization …

Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

M Viñas, Z Bozkus, BB Fraguela - Journal of Parallel and Distributed …, 2013 - Elsevier
While recognition of the advantages of heterogeneous computing is steadily growing, the
issues of programmability and portability hinder its exploitation. The introduction of the …

The numerical template toolbox: A modern C++ design for scientific computing

P Esterie, J Falcou, M Gaunard, JT Lapresté… - Journal of Parallel and …, 2014 - Elsevier
The design and implementation of high level tools for parallel programming is a major
challenge as the complexity of modern architectures increases. Domain Specific Languages …

Optimization techniques for efficient HTA programs

BB Fraguela, G Bikshandi, J Guo, MJ Garzarán… - Parallel Computing, 2012 - Elsevier
Object oriented languages can be easily extended with new data types, which facilitate
prototyping new language extensions. A very challenging problem is the development of …

FlashR: parallelize and scale R for machine learning using SSDs

D Zheng, D Mhembere, JT Vogelstein… - Proceedings of the 23rd …, 2018 - dl.acm.org
R is one of the most popular programming languages for statistics and machine learning, but
it is slow and unable to scale to large datasets. The general approach for having an efficient …