Google Scholar

Optimizing CUDA code by kernel fusion: application on BLAS

J Filipovič, M Madzin, J Fousek, L Matyska - The Journal of …, 2015 - Springer

Contemporary GPUs have significantly higher arithmetic throughput than a memory
throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …

Cited by 106 · All 14 versions

[PDF] psu.edu

Automating the generation of composed linear algebra kernels

G Belter, ER Jessup, I Karlin, JG Siek - Proceedings of the Conference …, 2009 - dl.acm.org

Memory bandwidth limits the performance of important kernels in many scientific
applications. Such applications often use sequences of Basic Linear Algebra Subprograms …

Cited by 93 · All 10 versions

[PDF] googleapis.com

Pipelined approach to fused kernels for optimization of machine learning workloads on graphical processing units

A Ashari, M Boehm, KW Campbell… - US Patent …, 2018 - Google Patents

A method for optimization of machine learning (ML) workloads on a
graphics processor unit (GPU). The method includes identifying a computation having a …

Cited by 37 · All 4 versions

[PDF] kaust.edu.sa

[PDF] Programming abstractions for data locality

A Tate, A Kamil, A Dubey, A Groblinger, B Chamberlain… - 2014 - repository.kaust.edu.sa

Item type: Technical Report. Authors: Tate, Adrian; Kamil, Amir; Dubey, Anshu; Groblinger …

Cited by 52 · All 10 versions

[PDF] academia.edu

Build to order linear algebra kernels

JG Siek, I Karlin, ER Jessup - 2008 IEEE International …, 2008 - ieeexplore.ieee.org

The performance bottleneck for many scientific applications is the cost of memory access
inside linear algebra kernels. Tuning such kernels for memory efficiency is a complex task …

Cited by 70 · All 6 versions

[PDF] acm.org

Design and implementation for nonblocking execution in GraphBLAS: Tradeoffs and performance

A Mastoras, S Anagnostidis, AJN Yzelman - ACM Transactions on …, 2022 - dl.acm.org

GraphBLAS is a recent standard that allows the expression of graph algorithms in the
language of linear algebra and enables automatic code parallelization and optimization …

Cited by 7 · All 4 versions

[PDF] udc.es

Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

M Viñas, Z Bozkus, BB Fraguela - Journal of Parallel and Distributed …, 2013 - Elsevier

While recognition of the advantages of heterogeneous computing is steadily growing, the
issues of programmability and portability hinder its exploitation. The introduction of the …

Cited by 45 · All 11 versions

[PDF] lri.fr

The numerical template toolbox: A modern C++ design for scientific computing

P Esterie, J Falcou, M Gaunard, JT Lapresté… - Journal of Parallel and …, 2014 - Elsevier

The design and implementation of high level tools for parallel programming is a major
challenge as the complexity of modern architectures increases. Domain Specific Languages …

Cited by 31 · All 10 versions

[PDF] academia.edu

Optimization techniques for efficient HTA programs

BB Fraguela, G Bikshandi, J Guo, MJ Garzarán… - Parallel Computing, 2012 - Elsevier

Object oriented languages can be easily extended with new data types, which facilitate
prototyping new language extensions. A very challenging problem is the development of …

Cited by 28 · All 10 versions

[PDF] acm.org

FlashR: parallelize and scale R for machine learning using SSDs

D Zheng, D Mhembere, JT Vogelstein… - Proceedings of the 23rd …, 2018 - dl.acm.org

R is one of the most popular programming languages for statistics and machine learning, but
it is slow and unable to scale to large datasets. The general approach for having an efficient …

Cited by 5 · All 7 versions
