Al Geist: Too Big for MPI?
Abstract.
In 2008 the National Leadership Computing Facility at Oak Ridge National Laboratory will have a petaflop system in place. This system will have tens of thousands of processors and petabytes of memory. This capability system will focus on application problems that are so hard that they require weeks on the full system to achieve breakthrough science in nanotechnology, medicine, and energy. With long-running jobs on such huge computing systems the question arises: Are the computers and applications getting too big for MPI? This talk will address several reasons why the answer to this question may be yes.
The first reason is the growing need for fault tolerance. This talk will review the recent efforts in adding fault
tolerance to MPI and the broader need for holistic fault tolerance across petascale machines (the sketch at the end of this abstract illustrates how little the standard itself offers). The second reason is these applications' potential need for new features or capabilities that don't exist in the MPI standard. A third reason is the emergence of new languages and programming paradigms.
This talk will discuss the DARPA High Productivity Computing Systems project and the new languages Chapel, Fortress, and X10, being developed by Cray, Sun, and IBM respectively.
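As a minimal sketch of the fault-tolerance gap, the C program below shows the only portable failure hook that the MPI-1/MPI-2 standards provide: replacing the default error handler so that errors are returned to the caller instead of aborting the job. Even then, the standard gives no way to repair a communicator after a rank fails; that recovery step is exactly what extensions such as FT-MPI add. (This sketch is an illustration for this text, not code from the talk.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* The default handler, MPI_ERRORS_ARE_FATAL, aborts the whole job
           on any failure. MPI_ERRORS_RETURN at least reports errors back
           to the caller. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank, sum = 0.0;
        int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                               MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            /* Standard MPI offers no way to rebuild MPI_COMM_WORLD here;
               surviving the failure requires an extension such as FT-MPI. */
            fprintf(stderr, "rank %d: collective failed (rc=%d)\n", rank, rc);
            MPI_Abort(MPI_COMM_WORLD, rc);
        }

        MPI_Finalize();
        return 0;
    }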
About the speaker.
Al Geist is a Corporate Research Fellow at Oak Ridge National Laboratory (ORNL), where he leads the 35-member Computer Science Research Group. He is one of the original developers of PVM (Parallel Virtual Machine), which became a worldwide de facto standard for heterogeneous distributed computing. Al was actively involved in both the MPI-1 and MPI-2 design teams and, more recently, in the development of FT-MPI, a fault-tolerant MPI implementation. Today he leads a national Scalable Systems Software effort, involving all the DOE and NSF supercomputer sites, with the goal of defining standardized interfaces between system software
components. Al is co-PI of a national Genomes to Life (GTL) center. The goal of his GTL center is to develop new algorithms and computational infrastructure for understanding protein machines and regulatory pathways in cells. He also heads a project developing self-adapting, fault-tolerant algorithms for 100,000-processor systems.
In his 20 years at ORNL, he has published two books and over 190 papers in areas ranging from heterogeneous distributed computing, numerical linear algebra, parallel computing, and collaboration technologies to solar energy, materials science, biology, and solid-state physics. You can find out more at Al’s web site.