Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction
RL Graham, D Bureddy, P Lui… - … in HPC (COMHPC), 2016 - ieeexplore.ieee.org
Increased system size and a greater reliance on utilizing system parallelism to achieve
computational needs, requires innovative system architectures to meet the simulation …
An evaluation of the CORAL interconnects
The US Department of Energy deployed the Summit and Sierra supercomputers with the
latest state-of-the-art network interconnect technology in 2018 and both systems entered …
Hierarchical redesign of classic MPI reduction algorithms
Optimization of MPI collective communication operations has been an active research topic
since the advent of MPI in 1990s. Many general and architecture-specific collective …
Efficient process arrival pattern aware collective communication for deep learning
MPI collective communication operations are used extensively in parallel applications. As
such, researchers have been investigating how to improve their performance and scalability …
Tascade: Hardware support for atomic-free, asynchronous and efficient reduction trees
Graph search and sparse data-structure traversal workloads contain challenging irregular
memory patterns on global data structures that need to be modified atomically. Distributed …
Energy-efficient collective reduce and allreduce operations on distributed GPUs
GPUs gain high popularity in High Performance Computing, due to their massive parallelism
and high performance per Watt. Despite their popularity, data transfer between multiple …
Unified collective communication (UCC): An unified library for CPU, GPU, and DPU collectives
MG Venkata, V Petrov, S Lebedev… - … IEEE Symposium on …, 2024 - ieeexplore.ieee.org
Unified Collective Communication (UCC) is an API and library implementation of collective
communication operations. The goal of UCC is to provide a unified API and library serving …
Designing a Parallel Programs on the Base of the Conception of Q-Determinant
V Aleeva - … : 4th Russian Supercomputing Days, RuSCDays 2018 …, 2019 - Springer
The paper describes a design method of parallel programs for numerical algorithms based
on their representation in the form of Q-determinant. The result of the method is Q-effective …
High-Performance Computing Using Application of Q-determinant of Numerical Algorithms
The conception of Q-determinant is one of the approaches to parallelizing numerical
algorithms. The basic notion of the conception is Q-determinant of the algorithm. Here Q is …
Unified Collective Communication (UCC): A Unified Library for CPU, GPU, and DPU Collectives
M GorentlaVenkata, V Petrov, S Lebedev… - IEEE Micro, 2025 - ieeexplore.ieee.org
Unified Collective Communication (UCC) is an API and library implementation of collective
communication operations. The goal of UCC is to provide a unified API and library serving …