Richard L. Graham: Approaches For Parallel Applications Fault Tolerance
Abstract.
As the complexity of high performance computer systems increases or the level of end-to-end engineering integration decreases, the likelihood of software or hardware failure increases. It is, therefore, important to deal effectively with these failures in order to maintain application mean-time-to-failure at levels acceptable to users. This talk will describe the work done in the Open MPI collaboration to recover from several failure scenarios. This work builds on research already done in the context of the LA-MPI, FT-MPI, LAM/MPI, and PACX-MPI projects and deals with transient and catastrophic network errors, as well as several approaches to handling process failure. The talk will address how failures are detected, the mechanisms used to work around these failures and allow the applications to continue running, and what level of support, if any, is needed from the application to successfully deploy these solutions. In addition, the performance impact of these solutions on several applications will be discussed.
About the speaker.
Richard Graham is the Computer Systems and Software Environment (ASC) program manager and the acting group leader of the Advanced Computing Laboratory at the Los Alamos National Laboratory. He joined LANL’s Advanced Computing Laboratory (ACL) as a technical staff member in 1999. As team leader for the Resilient Technologies Team he started the LA-MPI project, and he is one of the founders of the Open MPI collaboration. Prior to joining the ACL, he spent seven years working at Cray Research and SGI. Rich obtained his PhD in Theoretical Chemistry from Texas A&M University in 1990 and did post-doctoral work at the James Franck Institute of the University of Chicago. His BS in chemistry was from Seattle Pacific University.