The design of OpenMP tasks

E Ayguadé, N Copty, A Duran… - … on Parallel and …, 2008 - ieeexplore.ieee.org
OpenMP has been very successful in exploiting structured parallelism in applications. With
increasing application complexity, there is a growing need for addressing irregular …

SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

E Chan, FG Van Zee, P Bientinesi… - Proceedings of the 13th …, 2008 - dl.acm.org
This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for
SMP and/or multi-core architectures. We use this system to demonstrate how code …

A proposal for task parallelism in OpenMP

E Ayguadé, N Copty, A Duran, J Hoeflinger… - … Workshop on OpenMP, 2007 - Springer
This paper presents a novel proposal to define task parallelism in OpenMP. Task parallelism
has been lacking in the OpenMP language for a number of years already. As we show, this …

An experimental evaluation of the new OpenMP tasking model

E Ayguadé, A Duran, J Hoeflinger, F Massaioli… - … on Languages and …, 2007 - Springer
The OpenMP standard was conceived to parallelize dense array-based applications, and it
has achieved much success with that. Recently, a novel tasking proposal to handle …

Rank-Polymorphism for Shape-Guided Blocking

A Šinkarovs, T Koopman, SB Scholz - Proceedings of the 11th ACM …, 2023 - dl.acm.org
Many numerical algorithms on matrices or tensors can be formulated in a blocking style
which improves performance due to better cache locality. In imperative languages, blocking …

Scaling LAPACK panel operations using parallel cache assignment

AM Castaldo, RC Whaley - ACM Sigplan Notices, 2010 - dl.acm.org
In LAPACK many matrix operations are cast as block algorithms which iteratively process a
panel using an unblocked algorithm and then update a remainder matrix using the high …

Toward scalable matrix multiply on multithreaded architectures

B Marker, FG Van Zee, K Goto, G Quintana-Ortí… - Euro-Par 2007 Parallel …, 2007 - Springer
We show empirically that some of the issues that affected the design of linear algebra
libraries for distributed memory architectures will also likely affect such libraries for shared …

[BOOK] Library generation for linear transforms

Y Voronenko - 2008 - search.proquest.com
The development of high-performance numeric libraries has become extraordinarily difficult
due to multiple processor cores, vector instruction sets, and deep memory hierarchies. To …

Scaling LAPACK panel operations using parallel cache assignment

AM Castaldo, RC Whaley, S Samuel - ACM Transactions on …, 2013 - dl.acm.org
In LAPACK many matrix operations are cast as block algorithms which iteratively process a
panel using an unblocked algorithm and then update a remainder matrix using the high …

[PDF] A DAG-based parallel Cholesky factorization for multicore systems

JD Hogg - Technical Report RAL-TR-2008-029, Rutherford …, 2008 - researchgate.net
Modern processors have multiple cores, making multiprocessing essential for competitive
desktop linear algebra. Asynchronous processing with much inherent parallelism can be …