Fault tolerance of MPI applications in exascale systems: The ULFM solution

N Losada, P González, MJ Martín, G Bosilca… - Future Generation …, 2020 - Elsevier
The growth in the number of computational resources used by high-performance computing
(HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become …

A survey on malleability solutions for high-performance distributed computing

JI Aliaga, M Castillo, S Iserte, I Martín-Álvarez… - Applied Sciences, 2022 - mdpi.com
Maintaining a high rate of productivity, in terms of completed jobs per unit of time, in
High-Performance Computing (HPC) facilities is a cornerstone in the next generation of exascale …

DMRlib: easy-coding and efficient resource management for job malleability

S Iserte, R Mayo, ES Quintana-Ortí… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Process malleability has proved to have a highly positive impact on the resource utilization
and global productivity in data centers compared with the conventional static resource …

Hybrid workload scheduling on HPC systems

Y Fan, Z Lan, P Rich, W Allcock… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Traditionally, on-demand, rigid, and malleable applications have been scheduled and
executed on separate systems. The ever-growing workload demands and rapidly …

Dynamic spawning of MPI processes applied to malleability

I Martín-Álvarez, JI Aliaga, M Castillo… - … Journal of High …, 2024 - journals.sagepub.com
Malleability allows computing facilities to adapt their workloads through resource
management systems to maximize the throughput of the facility and the efficiency of the …

Adaptive parallel applications: from shared memory architectures to fog computing (2002–2022)

G Galante, R da Rosa Righi - Cluster Computing, 2022 - Springer
The evolution of parallel architectures points to dynamic environments where the number of
available resources or configurations may vary during the execution of applications. This …

Transparent resource elasticity for task-based cluster environments with work stealing

J Posner, C Fohry - 50th International Conference on Parallel …, 2021 - dl.acm.org
Resource elasticity allows to dynamically change the resources of running jobs, which may
significantly improve the throughput on supercomputers. Elasticity requires support from …

DMR API: Improving cluster productivity by turning applications into malleable

S Iserte, R Mayo, ES Quintana-Ortí, V Beltran… - Parallel Computing, 2018 - Elsevier
Adaptive workloads can change on-the-fly the configuration of their jobs, in terms of
number of processes. To carry out these job reconfigurations, we have designed a …

Efficient scalable computing through flexible applications and adaptive workloads

S Iserte, R Mayo, ES Quintana-Ortí… - 2017 46th …, 2017 - ieeexplore.ieee.org
In this paper we introduce a methodology for dynamic job reconfiguration driven by the
programming model runtime in collaboration with the global resource manager. We improve …

An study of the effect of process malleability in the energy efficiency on GPU-based clusters

S Iserte, K Rojek - The Journal of Supercomputing, 2020 - Springer
The adoption of graphic processor units (GPU) in high-performance computing (HPC)
infrastructures determines, in many cases, the energy consumption of those facilities. For …