Open MPI: A high-performance, heterogeneous MPI

RL Graham, GM Shipman, BW Barrett… - 2006 IEEE …, 2006 - ieeexplore.ieee.org
The growth in the number of generally available, distributed, heterogeneous computing
systems places increasing importance on the development of user-friendly tools that enable …

Algorithm-based fault tolerance for fail-stop failures

Z Chen, J Dongarra - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org
Fail-stop failures in distributed environments are often tolerated by checkpointing or
message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix …

Optimal controllers for hybrid systems: Stability and piecewise linear explicit form

A Bemporad, F Borrelli, M Morari - Proceedings of the 39th …, 2000 - ieeexplore.ieee.org
We propose a procedure for synthesizing piecewise linear optimal controllers for hybrid
systems and investigate conditions for closed-loop stability. Hybrid systems are modeled in …

Fault tolerant high performance computing by a coding approach

Z Chen, GE Fagg, E Gabriel, J Langou… - Proceedings of the …, 2005 - dl.acm.org
As the number of processors in today's high performance computers continues to grow, the
mean-time-to-failure of these computers is becoming significantly shorter than the …

MPI on millions of cores

P Balaji, D Buntinas, D Goodell, W Gropp… - Parallel Processing …, 2011 - World Scientific
Petascale parallel computers with more than a million processing cores are expected to be
available in a couple of years. Although MPI is the dominant programming interface today for …

VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

L Ramakrishnan, C Koelbel, YS Kee, R Wolski… - Proceedings of the …, 2009 - dl.acm.org
Today's scientific workflows use distributed heterogeneous resources through diverse grid
and cloud interfaces that are often hard to program. In addition, especially for time-sensitive …

Fault tolerance and recovery of scientific workflows on computational grids

G Kandaswamy, A Mandal… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
In this paper, we describe the design and implementation of two mechanisms for
fault-tolerance and recovery for complex scientific workflows on computational grids. We present …

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

Z Chen, J Dongarra - Proceedings 20th IEEE International …, 2006 - ieeexplore.ieee.org
As the size of today's high performance computers increases from hundreds, to thousands,
and even tens of thousands of processors, node failures in these computers are becoming …

Abstractions and middleware for petascale computing and beyond

IF Sbalzarini - … Integration Advancements in Distributed Systems and …, 2012 - igi-global.com
As high-performance computing moves to the petascale and beyond, a number of
algorithmic and software challenges need to be addressed. This paper reviews the main …

Adaptive simulation of soft bodies in real-time

G Debunne, M Desbrun, MP Cani… - Proceedings computer …, 2000 - ieeexplore.ieee.org
This paper presents an adaptive technique to animate deformable bodies in real-time. Our
method relies on a mixed finite-volume/finite-element method applied to an arbitrary non …