Below is a chronological listing of the technical paper sessions with titles, abstracts and slides of the presentations (if provided by the presenting author). The technical papers that have been seleted for the Outstanding Papers session are listed on the respective page.
Time: 18th Sep 2006, 10:00 – 10:50
Automatic Memory Optimizations for Improving MPI Derived Datatype Performance
MPI derived datatypes allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. This powerful feature provides an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, many implementations of MPI derived datatypes perform poorly, which makes application developers avoid using this feature. In this paper, we present a technique to automatically select templates that are optimized for memory performance based on the access pattern of derived datatypes. We implement this mechanism in the MPICH2 source code. The performance of our implementation is compared to well-written manual packing/unpacking routines and original MPICH2 implementation. We show that performance for various derived datatypes is significantly improved and comparable to that of optimized manual routines.
Improving the Dynamic Creation of Processes in MPI-2
The MPI-2 standard has been implemented for a few years in most of the MPI distributions. Yet, the dynamic creation of processes, enabled by MPI-2, turns it harder to handle their scheduling manually. This paper presents a scheduler module, that has been implemented with MPI-2, that determines, on-line, on which processor a newly spawned process should be run. The scheduler can apply a basic Round-Robin mechanism or use load information to apply a list scheduling policy, for MPI-2 programs with dynamic creation of processes. A rapid presentation of the scheduler is given, followed by experimental evaluations on three test programs: the Fibonacci computation, the N-Queens benchmark and a computation of prime numbers. Even with the basic mechanisms that have been implemented, a clear gain is obtained regarding the run-time, the load balance, and consequently regarding the number of processes that can be run by the MPI program.
Object-Oriented Message Passing
Time: 18th Sep 2006, 10:00 – 10:50
Modernizing the C++ Interface to MPI
MPI is the de facto standard for writing message passing applications. Much of MPI’s power stems from its ability to provide a high-performance, consistent interface across C, Fortran, and C++. Unfortunately, with cross-language consistency at the forefront, MPI supports only the lowest common denominator of the three languages, providing a level of abstraction far lower than what C++ programmers expect. For instance, MPI does not inherently support C++ containers and iterators, nor does it provide seamless support for user-defined classes. To map these common C++ constructs into MPI, programmers must often write non-trivial boiler-plate code and weaken the type-safety guarantees provided by C++. This paper describes several ideas for modernizing the C++ interface to MPI, providing a more natural syntax along with seamless support for user-defined types and C++ Standard Library constructs and sketch the C++ techniques required to realize this interface without sacrificing performance.
Limitations and Extensions
Time: 18th Sep 2006, 11:20 – 12:35
Can MPI Be Used for Persistent Parallel Services?
MPI is routinely used for writing parallel applications, but it is not commonly used for writing long-running parallel services, such as parallel file systems or job schedulers. Nonetheless, MPI does have many features that are potentially useful for writing such software. Using the PVFS2 parallel file system as a motivating example, we studied the needs of software that provide persistent parallel services and evaluated whether MPI is a good match for those needs. We also ran experiments to determine the gaps between what the MPI Standard enables and what MPI implementations currently support. The results of our study indicate that MPI can enable persistent parallel systems to be developed with less effort and can provide high performance, but MPI implementations will need to provide better support for certain features. We also describe an area where additions to the MPI Standard would be useful.
Observations on MPI-2 Support for Hybrid Master/Slave Applications in Dynamic and Heterogeneous Environments
Large-scale MPI programs must work with dynamic and heterogeneous resources. While many of the involved issues can be handled by the MPI implementation, some require support at the application level. This paper considers a master/slave application, in which MPI processes internally use a different number of threads created by OpenMP. We modify the standard master/slave pattern to allow for dynamic addition and withdrawal of slaves. Moreover, the application dynamically adapts to use processors for either processes or threads. The paper evaluates the support that MPI-2 provides for implementing the scheme, partly referring to experiments with the MPICH2 implementation. We found that most requirements can be met if optional parts of the standard are used, but slave crashes require additional functionality.
What MPI could (and cannot) do for Mesh- partitioning on Non-homogeneous Networks
We discuss the mesh-partitioning load-balancing problem for non-homogeneous communication systems, and investigate whether the MPI process topology functionality can aid in solving the problem. An example kernel shows that specific communication patterns can benefit substantially from a non-trivial MPI topology implementation, achieving improvements beyond a factor of five for certain system configurations. However, the topology functionality also lacks certain expressivity to deal effectively with the mesh-partitioning problem. A mild extension to MPI is suggested, which, however, still cannot exclude possibly sub-optimal partitioning results. Solving instead the mesh-partitioning problem completely outside of MPI requires knowledge of the communication system. We discuss ways in which such could be provided by MPI in a portable way. Finally, we formulate and discuss a more general affinity scheduling problem .
Time: 18th Sep 2006, 11:20 – 12:35
Scalable Parallel Trace-Based Performance Analysis
Automatic trace analysis is an effective method for identifying complex performance phenomena in parallel applications. However, as the size of parallel systems grows further and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file, as done by KOJAK’s EXPERT trace analyzer, becomes increasingly constrained by the large number of events. In this article, we present a scalable version of the EXPERT analysis based on analyzing separate local trace files with a parallel tool which ‘replays’ the target application’s communication behavior. We describe the new parallel analyzer architecture and discuss first empirical results.
TAUg: Runtime Global Performance Data Access Using pvmmpi06
To enable a scalable parallel application to view its global performance state, we designed and developed TAUg , a portable runtime framework layered on the TAU parallel performance system. TAUg leverages the MPI library to communicate between application processes, creating an abstraction of a global performance space from which profile views can be retrieved. We describe the TAUg design and implementation and show its use on two test benchmarks up to 512 processors. Overhead evaluation for the use of TAUg is included in our analysis. Future directions for improvement are discussed. Keywords: parallel, performance, runtime, MPI, measurement.
Tracing the MPI-IO Calls’ Disk Accesses
With parallel file I/O we are faced with the situation that we do not have appropriate tools to get an insight into the I/O server behavior depending on the I/O calls in the corresponding parallel MPI program. We present an approach that allows us to also get event traces from the I/O server environment and to merge them with the client trace. Corresponding events will be matched and visualized. We integrate this functionality into the parallel file system PVFS2 and the MPICH2 tool Jumpshot.
Debugging and Verification
Time: 18th Sep 2006, 14:50 – 16:05
An Interface to Support the Identification of Dynamic MPI 2 Processes for Scalable Parallel Debugging
This paper proposes an interface that will allow MPI 2 dynamic programs, those using MPI SPAWN, CONNECT/ACCEPT or JOIN, to provide information to parallel debuggers such as TotalView about the set of processes that constitute an individual application. The TotalView parallel debugger currently obtains information about the identify the processes directly from the MPI library using a widely accepted proctable interface. The existing interface does not support MPI 2 dynamic operations. The proposed interface supports MPI 2 dynamic operations, subset debugging, and helps the parallel debugger assign meaningful names to processes.
Correctness Checking of MPI One-sided Communication Using Marmot
The MPI-2 standard defines functions for Remote Memory Access (RMA) by allowing one process to specify all communication parameters both for the sending and the receiving side, which is also referred to as one-sided communication. Having experienced parallel programming as a complex and error-prone task, we have developed the MPI correctness checking tool MARMOT covering the MPI-1.2 standard and are now aiming at extending it to support application developers also for the more frequently used parts of MPI-2 such as one-sided communication. In this paper we describe our tool, which is designed to check the correct usage of the MPI API automatically at run-time, and we also analyse to what extent it is possible to do so for RMA.
Modeling and Verification of MPI Based Distributed Software
Communication between processes in distributed environment is implemented using either shared memory or message passing paradigm. The message passing paradigm is used more often due to the lesser hardware requirements. MPI is a standardized message passing API with several independent implementations. Specification and verification of distributed systems is generally a challenging task. In this paper we present a case study of specification and verification of MPI based software using abstract state machines (ASMs).
Time: 18th Sep 2006, 14:50 – 16:05
High Performance RDMA Protocols in HPC
Modern network interconnects that use RDMA may offer significant performance advantages over conventional send/recv network semantics. However, this high performance often comes with hidden costs such as requiring an exchange of information prior to an RDMA operation and requiring initiator/target to preserve physical to virtual memory mappings during the RDMA. This paper describes a unique MPI library Â$(Api(Bpeline’ protocol that addresses these constraints while avoiding the pitfalls of existing techniques. By effectively overlapping registration with RDMA operations this protocol provides good performance for any memory usage pattern. This approach avoids the use of non-portable memory hooks or not returning pages to the OS. Through this approach, bandwidth may be increased up to 67% when buffers are not reused while providing superior performance in the effective bandwidth benchmark. Several user level protocols are explored using Open MPI and compared to this Bpeline’ protocol.
Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem
This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks as well as application benchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
MPI/CTP: A Reconfigurable MPI for HPC Applications
Modern MPI applications have diverse communication requirements, with trends showing that they are moving from static communication requirements to more dynamic and evolving communication requirements. However, MPI libraries, which integrate MPI applications with the hardware, are not flexible enough to accommodate these diverse needs. This lack of flexibility leads to degraded performance of the applications. In this paper, we present the design of a protocol development framework and an MPI library implemented using our proposed framework that support compile-time and boot-time protocol configuration, as well as runtime protocol reconfiguration based on dynamic application requirements. Experimental results on the initial prototype of this design show that this prototype is able to dynamically reconfigure at runtime to optimize bandwidth under changing MPI requirements.
Fault Tolerance I
Time: 19th Sep 2006, 14:50 – 15:40
FT-MPI, Fault Tolerant Metacomputing and Generic Name Services : a Case Study
There is a growing interest in deploying MPI over very large numbers of heterogenous, geographically distributed resources. FT-MPI provides the fault-tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module – its name service with fault-tolerant replacements. We present an algorithm for improving performance of operations over the proxies. We evaluate its performance in a comparison using the original name service, OpenLDAP and current Emory research project HDNS.
Scalable Fault Tolerant Protocol for Parallel Runtime Environments
The number of processors embedded on high performance computing platforms is growing daily to satisfy users desire for solving larger and more complex problems. Parallel runtime environments have to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic environments. This paper presents the design of a scalable and fault tolerant protocol for supporting parallel runtime environment communications. The protocol is designed to support transmission of messages across multiple nodes with in a self-healing topology to protect against recursive node and process failures. A formal protocol verification has validated the protocol for both the normal and failure cases. We have implemented multiple routing algorithms for the protocol and concluded that the variant rule-based routing algorithm yields the best overall results for damaged and incomplete topologies.
Parallel I/O I
Time: 19th Sep 2006, 14:50 – 15:40
Exploiting Shared Memory to Improve Parallel I/O Performance
We explore several methods utilizing system-wide shared memory to improve the performance of MPI-IO, particularly for non-contiguous file access. We introduce an abstraction called the datatype iterator that permits efficient, dynamic generation of (offset, length) pairs for a given MPI derived datatype. Combining datatype iterators with overlapped I/O and computation, we demonstrate how a shared memory MPI implementation can utilize more than 90% of the available disk bandwidth (in some cases representing a 5x performance improvement over existing methods) even for extreme cases of non-contiguous datatypes. We generalize our results to suggest possible parallel I/O performance improvements on systems without global shared memory.
Self-Adaptive Hints for Collective I/O
The processing of MPI-IO operations can be controlled via the MPI API using file hints, which are passed to the MPI library as MPI info objects. A file hint can affect how the MPI library accesses the file on the file system level, it can set buffer sizes, turn special optimizations on and off or whatever parameters the MPI implementation provides. However, experience shows that file hints are rarely used for reasons that will be discussed in the paper.We present a new approach which dynamically determines the optimal setting for file hints related to collective MPI-IO operations. The chosen settings adapt to the actual file access pattern, the topology of the MPI processes and the available memory resources and consider the characteristics of the underlying file system. We evaluate our approach which has been implemented in MPI/SX, NEC’s MPI implementation for the SX series of vector supercomputers.
Time: 20th Sep 2006, 11:30 – 12:45
Benchmarking MPI: The Challenges of Getting it Right
Benchmarking MPI is a contentious subject at best. Microbenchmarks are used because they are easy to port and, hypothetically, measure an important system characteristic in isolation. The unfortunate reality is that it is remarkably difficult to create a benchmark that is a fair measurement in the context of modern system. Software optimizations and
modern processor architecture perform extremely efficiently on benchmarks, where it would not in an application context. This paper explores the challenges faced when benchmarking the network in a modern microprocessor climate and the remarkable impacts on the results that are obtained.
Implementation and Usage of the PERUSE-Interface in Open MPI
In this paper we describe the implementation, usage and experience with the MPI performance revealing extension interface ( Peruse ) into the Open MPI implementation. While the PMPI-interface allows timing MPI-functions through wrappers, it can not provide MPI-internal information on MPI-states and lower-level network performance. We introduce the general design criteria of the interface implementation and analyze the overhead generated by this functionality. To support performance evaluation of large-scale applications, tools for visualization are imperative. We extend the tracing library of the Paraver-toolkit to support tracing Peruse-events and show how this helps detecting performance bottlenecks. A test-suite and a real-world application are traced and visualized using Paraver.
Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces
In evaluating new high-speed network interfaces, the usual metrics of latency and bandwidth are commonly measured and reported. There are numerous other message passing characteristics that can have a dramatic effect on application performance, and they too should be analyzed when evaluating a new interconnect. One such metric is overhead, which dictates the networks ability to allow the application to perform non-message passing work while a transfer is taking place. A method for measuring overhead, and hence calculating application availability is presented. Results for several next generation network interfaces are also presented.
Time: 20th Sep 2006, 11:25 – 13:15
An approach for parallel fluid-structure interaction on unstructured meshes
The simulation of fluid-structure interaction (FSI) problems is a challenge in contemporary science and engineering. This contribution presents an approach to FSI problems with incompressible Newtonian fluids and elastic structures and discusses its realization in a general purpose parallel finite element research code. The resulting algorithm is robust and effcient and scales well on parallel machines. Recent attempts on effciency improvements are discussed and a numerical example is shown.
MPJ Express Meets Gadget: Towards a Java Code for Cosmological Simulations
Gadget-2 is a massively parallel structure formation code for cosmological simulations. In this paper, we present a Java version of Gadget-2. We evaluated the performance of the Java version by running a colliding galaxy simulation and found that it can achieve around 70% of C Gadget-2’s performance.
Optimizing a Conjugate Gradient Solver with Non Blocking Collective Operations
This paper presents a case study about the applicability and usage of non blocking collective operations. These operations provide the ability to overlap communication with computation and to avoid unnecessary synchronization. We introduce our NBC library, a portable low-overhead implementation of non blocking collectives on top of MPI-1. We demonstrate the easy usage of the NBC library with the optimization of a conjugate gradient solver with only minor changes to the traditional parallel implementation of the program. The optimized solver runs up to 34% faster and is able to overlap most of the communication. We show that there is, due to the overlap, no performance difference between Gigabit Ethernet and InfiniBand for our calculation.
Parallel DSMC gasflow simulation of an in-line coater for reactive sputtering
There is an increasing demand for high precision coatings on large areas via in-line reactive sputtering, which requires advanced process control techniques. Thus, an improved theoretical understanding of the reactive sputtering process kinetics is mandatory for further technical improvement. We present a detailed Direct Simulation Monte Carlo (DSMC) gas flow model of an in-line sputtering coater for large area architectural glazing. With this model, the pressure fluctuations caused by a moving substrate are calculated in comparison with the experiment. The model reveals a significant phase shift in the pressure fluctuations between the areas above the center and the edges of the substrate. This is a geometric effect and is e. g. independent of the substrate travelling direction. Consequently, a long sputtering source will observe pressure fluctuations at its center and edges, which are out of phase. For a heuristic model of the reactive sputtering process, we show that in certain cases a two-dimensional model treatment is suffcient for predicting the film thickness distribution on the moving substrate. In other cases, a strong phase shift between averaged pressure fluctuations and reactive sputtering process response is observed indicating that a three-dimensional model treatment is required for a realistic simulation of the in-line deposition process.
Parallel simulation of T-M processes in underground repository of spent nuclear waste
The contribution deals with mathematical (finite element) simulation of the KBS prototype nuclear waste repository in a simplified form, i.e. as a thermo-elasticity problem. It describes the solvers developed for such kind of problems and principles and benefits of their parallelization, both in MPI and OpenMP.
Fault Tolerance II
Time: 20th Sep 2006, 14:50 – 15:40
An Intelligent Management of Fault Tolerance in cluster using RADICMPI
A solution to implement fault tolerance in the modern cluster must be efficient, transparent and scalable. In order to attend such requisites we developed an architecture called RADIC, Redundant Array of Distributed Independent Checkpoints. Such architecture bases on a fully distributed array of processes that collaborate in order to create a fault tolerance controller that manages the fault tolerance activities transparently to the application. Using the RADIC concepts, RADICMPI implements standard MPI directives and a fault tolerance mechanism based on message-log rollback-recovery protocol. Such mechanism efficiently manages all fault tolerance activities transparently to the users.
Extended mpiJava for Distributed Checkpointing and Recovery
In this paper we describe an mpiJava extension that implements a parallel checkpointing/recovery service. This checkpointing/recovery facility is transparent to applications, i.e. no instrumentation is needed. We use a distributed approach for taking the checkpoints, which means that the processes take their local checkpoints with little coordination. This approach reduce communication between processes and a central server for checkpoint storage is not needed. We present some experiments which suggest that the benefits of this extended MPI functionality do not have as a side effect a significant performance penalty, apart from the well-known penalties related to the local checkpoint generation.
Parallel I/O II
Time: 20th Sep 2006, 14:50 – 15:40
Effective Seamless Remote MPI-I/O Operations with Derived Data Types Using PVFS2
Parallel computation outputs intermediate data periodically, and typically the outputs are accessed for visualization in remote operation. Parallel NetCDF provides parallel I/O operations inside a computer, but it is not available in remote operations. It is required to realize remote MPI-I/O operations with derived data types among computers which have different MPI libraries for the operations. To realize this kind of operations, a Stampi library was proposed. For effective data-intensive I/O, a PVFS2 file system has been supported in its remote MPI-I/O operations by introducing MPICH as an underlying MPI library. This mechanism has been evaluated on interconnected PC clusters, and sufficient performance has been achieved with huge amount of data. In this paper, architecture, execution mechanism, and preliminary performance results are reported and discussed.
High-Bandwidth Remote Parallel I/O with the Distributed Memory Filesystem MEMFS
The enormous advance in computational power of supercomputers enables scientific applications to process problems of increasing size. This is often correlated with an increasing amount of data stored in (parallel) filesystems. As the increase in bandwith of common disk based I/O devices can not keep up with the evolution of computational power, the access to this data becomes the bottleneck in many applications. MEMFS takes the approach to distribute \io data among multiple dedicated remote servers on a user-level basis. It stores files in the accumulated main memory of these I/O nodes and is able to deliver this data with high bandwidth. We describe how MEMFS manages a memory based distributed filesystem, how it stores data among the participating I/O servers and how it assigns servers to application clients. Results are given for a usage in a grid project with high-bandwidth \wan connections.
Time: 20th Sep 2006, 16:10 – 18:00:00
Efficient Allgather for Regular SMP-Clusters
We show how to adapt and extend a well-known allgather (all-to-all broadcast) algorithm to parallel systems with a hierarchical communication system like clusters of SMP nodes. For small problem sizes, the new algorithm requires a logarithmic number of communication rounds in the number of SMP nodes, and gracefully degrades towards a linear algorithm as problem size increases. The algorithm has been used to implement the MPI_Allgather collective operation of MPI in the MPI/SX library. Performance measurements on a 72 node SX-8 system shows that graceful degradation provides a smooth transition from logarithmic to linear behavior, and significantly outperforms a standard, linear algorithm. The performance of the latter is furthermore highly sensitive to the distribution of MPI processes over the physical processors.
presentation slides not available
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
MPI_Allgather is an important collective operation in MPI. With the next generation systems going multi-core the clusters deployed would enable a high process count per node. The traditional implementations of Allgather use two separate channels, network channel for communication across the nodes and shared memory channel for intra-node communication. This approach prevents sharing of communication buffers across the channels. This results in extra copying of data with in a node yielding sub-optimal performance. Especially for a collective involving large number of processes this problem become more critical. Our approach eliminates the extra copy costs by sharing the communication buffers for both intra and inter node communication. Further, we optimize the performance by allowing overlap of network operations with intra-node shared memory copies. On a 32, 2-way node cluster, we observe an improvement upto a factor of two for MPI_Allgather compared to the original implementation.
MPI Collective Algorithm Selection and Quadtree Encoding
In this paper, we focus on MPI collective algorithm selection process and explore the applicability of the quadtree encoding method to this problem. During the algorithm selection process, a particular MPI collective algorithm is selected based on the collective operation parameters. We construct quadtrees with different properties from the measured algorithm performance data and analyze the quality and performance of decision functions generated from these trees. The experimental data indicates that in some cases, the decision function based on quadtree structure with a mean depth of 3 can incur as little as a 5% performance penalty on average. The exact, experimentally measured, decision function for all tested collectives could be fully represented using quadtrees with a maximum of 6 levels. These results indicate that quadtrees may be a feasible choice for both processing of the performance data and automatic decision function generation.
Parallel Prefix (Scan) Algorithms for MPI
We describe and experimentally compare three theoretically well-known algorithms for the parallel prefix (or scan , in MPI terms) operation, and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm. Bidirectional interconnects can benefit from this implementation. We present results from a 32 node AMD Cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. On both systems, we observe improvements by more than a factor two over the straight-forward binomial-tree algorithm found in many MPI implementations. We also discuss adapting the algorithms to clusters of SMP nodes.
presentation slides not available
Metacomputing and Grid
Time: 20th Sep 2006, 16:10 – 18:00:00