Evaluating the cost of atomic operations on modern architectures
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are
ubiquitous in parallel programming. Yet, performance tradeoffs between these operations …
Memory performance of AMD EPYC Rome and Intel Cascade Lake SP server processors
Modern processors, in particular within the server segment, integrate more cores with each
generation. This increases their complexity in general, and that of the memory hierarchy in …
A comparison of binarization methods for historical archive documents
J He, QDM Do, AC Downton… - … Conference on Document …, 2005 - ieeexplore.ieee.org
This paper compares several alternative binarization algorithms for historical archive
documents, by evaluating their effect on end-to-end word recognition performance in a …
Test-driving Intel Xeon Phi
Based on Intel's Many Integrated Core (MIC) architecture, Intel Xeon Phi is one of the few
truly many-core CPUs, featuring around 60 fairly powerful cores, two levels of caches, and …
Capability models for manycore memory systems: A case-study with Xeon Phi KNL
Increasingly complex memory systems and on-chip interconnects are developed to mitigate
the data movement bottlenecks in manycore processors. One example of such a complex …
A survey of performance modeling and simulation techniques for accelerator-based computing
U Lopez-Novoa, A Mendiburu… - IEEE Transactions on …, 2014 - ieeexplore.ieee.org
The high performance computing landscape is shifting from collections of homogeneous
nodes towards heterogeneous systems, in which nodes consist of a combination of …
Parallel transposition of sparse data structures
Many applications in computational sciences and social sciences exploit sparsity and
connectivity of acquired data. Even though many parallel sparse primitives such as sparse …
Exploiting locality in sparse matrix-matrix multiplication on many-core architectures
Exploiting spatial and temporal localities is investigated for efficient row-by-row
parallelization of general sparse matrix-matrix multiplication (SpGEMM) operation of the …
Ultra-scalable CPU-MIC acceleration of mesoscale atmospheric modeling on Tianhe-2
In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D
compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first …
Energy, memory, and runtime tradeoffs for implementing collective communication operations
Collective operations are among the most important communication operations in shared-
and distributed-memory parallel applications. In this paper, we analyze the tradeoffs …