Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …
A survey on malleability solutions for high-performance distributed computing
Maintaining a high rate of productivity, in terms of completed jobs per unit of time, in High-
Performance Computing (HPC) facilities is a cornerstone in the next generation of exascale …
DMRlib: easy-coding and efficient resource management for job malleability
Process malleability has proved to have a highly positive impact on the resource utilization
and global productivity in data centers compared with the conventional static resource …
Hybrid workload scheduling on HPC systems
Traditionally, on-demand, rigid, and malleable applications have been scheduled and
executed on separate systems. The ever-growing workload demands and rapidly …
Dynamic spawning of MPI processes applied to malleability
Malleability allows computing facilities to adapt their workloads through resource
management systems to maximize the throughput of the facility and the efficiency of the …
Adaptive parallel applications: from shared memory architectures to fog computing (2002–2022)
The evolution of parallel architectures points to dynamic environments where the number of
available resources or configurations may vary during the execution of applications. This …
Transparent resource elasticity for task-based cluster environments with work stealing
J Posner, C Fohry - 50th International Conference on Parallel …, 2021 - dl.acm.org
Resource elasticity allows to dynamically change the resources of running jobs, which may
significantly improve the throughput on supercomputers. Elasticity requires support from …
DMR API: Improving cluster productivity by turning applications into malleable
Adaptive workloads can change on–the–fly the configuration of their jobs, in terms of
number of processes. To carry out these job reconfigurations, we have designed a …
Efficient scalable computing through flexible applications and adaptive workloads
In this paper we introduce a methodology for dynamic job reconfiguration driven by the
programming model runtime in collaboration with the global resource manager. We improve …
An study of the effect of process malleability in the energy efficiency on GPU-based clusters
The adoption of graphic processor units (GPU) in high-performance computing (HPC)
infrastructures determines, in many cases, the energy consumption of those facilities. For …