Below is a chronological listing of the technical paper sessions with
titles, abstracts and slides of the presentations (if provided by
the presenting author). The technical papers that have been seleted
for the Outstanding Papers session are listed on the respective
page.
Implementation Issues
Time: 18th Sep 2006, 10:00 - 10:50 Location: K1/K
|
Automatic Memory Optimizations for Improving MPI Derived Datatype Performance
Surendra Byna, Xian-He Sun, Rajeev Thakur, William Gropp
MPI derived datatypes allow users to describe
noncontiguous memory layout and communicate noncontiguous data with a
single communication function. This powerful feature provides an MPI
implementation to optimize the transfer of noncontiguous data. In
practice, however, many implementations of MPI derived datatypes
perform poorly, which makes application developers avoid using this
feature. In this paper, we present a technique to automatically select
templates that are optimized for memory performance based on the access
pattern of derived datatypes. We implement this mechanism in the MPICH2
source code. The performance of our implementation is compared to
well-written manual packing/unpacking routines and original MPICH2
implementation. We show that performance for various derived datatypes
is significantly improved and comparable to that of optimized manual
routines.
presentation slides [PDF]
|
Improving the Dynamic Creation of Processes in MPI-2
Marcia Cristina Cera
The MPI-2 standard has been implemented for a few years in most of
the MPI distributions. Yet, the dynamic creation of processes, enabled
by MPI-2, turns it harder to handle their scheduling manually. This
paper presents a scheduler module, that has been implemented with
MPI-2, that determines, on-line, on which processor a newly spawned
process should be run. The scheduler can apply a basic Round-Robin
mechanism or use load information to apply a list scheduling policy,
for MPI-2 programs with dynamic creation of processes. A rapid
presentation of the scheduler is given, followed by experimental
evaluations on three test programs: the Fibonacci computation, the
N-Queens benchmark and a computation of prime numbers. Even with the
basic mechanisms that have been implemented, a clear gain is obtained
regarding the run-time, the load balance, and consequently regarding
the number of processes that can be run by the MPI program.
presentation slides [PDF] |
|
Object-Oriented Message Passing
Time: 18th Sep 2006, 10:00 - 10:50 Location: S1/S2
|
Modernizing the C++ Interface to MPI
Prabhanjan Kambadur, Douglas Gregor, Andrew Lumsdaine, Amey Dharurkar
MPI is the de facto standard for writing
message passing applications. Much of MPI's power stems from its
ability to provide a high-performance, consistent interface across C,
Fortran, and C++. Unfortunately, with cross-language consistency at the
forefront, MPI supports only the lowest common denominator of the three
languages, providing a level of abstraction far lower than what C++
programmers expect. For instance, MPI does not inherently support C++
containers and iterators, nor does it provide seamless support for
user-defined classes. To map these common C++ constructs into MPI,
programmers must often write non-trivial boiler-plate code and weaken
the type-safety guarantees provided by C++. This paper describes
several ideas for modernizing the C++ interface to MPI, providing a
more natural syntax along with seamless support for user-defined types
and C++ Standard Library constructs and sketch the C++ techniques
required to realize this interface without sacrificing performance.
presentation slides [PDF]
|
Non-blocking Java Communications Support on Clusters
Guillermo L. Taboada and Juan Touriño and Ramón Doallo
This paper presents communication strategies for supporting efficient
non-blocking Java communication on clusters. Communication performance
is critical for the overall cluster performance, whereas non-blocking
communications can reduce significantly the communication
overhead. The library proposed is focused on reducing the startup time
of communications, avoiding unnecessary copying and overlapping
computation and communication. This project provides the basis for a
Java Message-passing library to be implemented on top of it. The
quality of the solution has been evaluated on a Scalable
Coherent Interface (SCI) and Gigabit Ethernet (GbE) testbed
cluster. Experimental results show that the proposed library reduces
efficiently the communication overhead and increases computation and
communication overlapping. presentation slides [PDF]
|
|
Limitations and Extensions
Time: 18th Sep 2006, 11:20 - 12:35 Location: K1/K2
|
Can MPI Be Used for Persistent Parallel Services?
Robert Latham, Robert Ross, Rajeev Thakur
MPI is routinely used for writing parallel applications, but it is not
commonly used for writing long-running parallel services, such as
parallel file systems or job schedulers. Nonetheless, MPI does have
many features that are potentially useful for writing such
software. Using the PVFS2 parallel file system as a motivating
example, we studied the needs of software that provide persistent
parallel services and evaluated whether MPI is a good match for those
needs. We also ran experiments to determine the gaps between what the
MPI Standard enables and what MPI implementations currently
support. The results of our study indicate that MPI can enable
persistent parallel systems to be developed with less effort and can
provide high performance, but MPI implementations will need to provide
better support for certain features. We also describe an area where
additions to the MPI Standard would be useful. presentation slides [PDF]
|
Observations on MPI-2 Support for Hybrid Master/Slave Applications in Dynamic and Heterogeneous Environments
Claudia Leopold, Michael Süβ
Large-scale MPI programs must work with dynamic and heterogeneous
resources. While many of the involved issues can be handled by the MPI
implementation, some require support at the application level. This
paper considers a master/slave application, in which MPI processes
internally use a different number of threads created by OpenMP. We
modify the standard master/slave pattern to allow for dynamic addition
and withdrawal of slaves. Moreover, the application dynamically adapts
to use processors for either processes or threads. The paper
evaluates the support that MPI-2 provides for implementing the scheme,
partly referring to experiments with the MPICH2 implementation. We
found that most requirements can be met if optional parts of the
standard are used, but slave crashes require additional functionality.
presentation slides
[PDF] |
What MPI could (and cannot) do for Mesh- partitioning on Non-homogeneous Networks
Guntram Berti and Jesper Larsson Träff
We discuss the mesh-partitioning load-balancing problem for
non-homogeneous communication systems, and investigate whether the MPI
process topology functionality can aid in solving the
problem. An example kernel shows that specific communication patterns
can benefit substantially from a non-trivial MPI topology
implementation, achieving improvements beyond a factor of five for
certain system configurations. However, the topology functionality
also lacks certain expressivity to deal effectively with the
mesh-partitioning problem. A mild extension to MPI is suggested,
which, however, still cannot exclude possibly sub-optimal partitioning
results. Solving instead the mesh-partitioning problem completely
outside of MPI requires knowledge of the communication system. We
discuss ways in which such could be provided by MPI in a portable way.
Finally, we formulate and discuss a more general affinity
scheduling problem . presentation slides [PDF]
|
|
Performance ToolsTime: 18th Sep 2006, 11:20 - 12:35 Location: S1/S2
|
Scalable Parallel Trace-Based Performance Analysis
Markus Geimer, Felix Wolf, Brian J. N. Wylie, Bernd Mohr
Automatic trace analysis is an effective
method for identifying complex performance phenomena in parallel
applications. However, as the size of parallel systems grows further
and the number of processors used by individual applications is
continuously raised, the traditional approach of analyzing a single
global trace file, as done by KOJAK's EXPERT trace analyzer,
becomes increasingly constrained by the large number of events. In
this article, we present a scalable version of the EXPERT analysis
based on analyzing separate local trace files with a parallel tool
which 'replays' the target application's communication
behavior. We describe the new parallel analyzer architecture and
discuss first empirical results.
presentation slides [PDF]
|
TAUg: Runtime Global Performance Data Access Using MPI
Kevin A. Huck
To enable a scalable parallel application to view its global
performance state, we designed and developed TAUg , a
portable runtime framework layered on the TAU parallel performance
system. TAUg leverages the MPI library to communicate between
application processes, creating an abstraction of a global performance
space from which profile views can be retrieved. We describe the TAUg
design and implementation and show its use on two test benchmarks up
to 512 processors. Overhead evaluation for the use of TAUg is
included in our analysis. Future directions for improvement are
discussed. Keywords: parallel, performance, runtime, MPI,
measurement.
presentation slides [PDF]
|
Tracing the MPI-IO Calls' Disk Accesses
Thomas Ludwig, Stephan Krempel, Julian Kunkel, Frank Panse, Dulip Withanage
With parallel file I/O we are faced with the
situation that we do not have appropriate tools to get an insight into
the I/O server behavior depending on the I/O calls in the
corresponding parallel MPI program. We present an approach that allows
us to also get event traces from the I/O server environment and to
merge them with the client trace. Corresponding events will be matched
and visualized. We integrate this functionality into the parallel file
system PVFS2 and the MPICH2 tool
Jumpshot.
presentation slides [PDF]
|
|
Debugging and VerificationTime: 18th Sep 2006, 14:50 - 16:05 Location: K1/K2
|
An Interface to Support the Identification of Dynamic MPI 2 Processes for Scalable Parallel Debugging
Christopher L Gottbrath
This paper proposes an interface that will
allow MPI 2 dynamic programs, those using MPI SPAWN, CONNECT/ACCEPT or
JOIN, to provide information to parallel debuggers such as TotalView
about the set of processes that constitute an individual
application. The TotalView parallel debugger currently obtains
information about the identify the processes directly from the MPI
library using a widely accepted proctable interface. The existing
interface does not support MPI 2 dynamic operations. The proposed
interface supports MPI 2 dynamic operations, subset debugging, and
helps the parallel debugger assign meaningful names to processes.
presentation slides [PDF]
|
Correctness Checking of MPI One-sided Communication Using Marmot
Bettina Krammer
The MPI-2 standard defines functions for
Remote Memory Access (RMA) by allowing one process to specify all
communication parameters both for the sending and the receiving side,
which is also referred to as one-sided communication. Having
experienced parallel programming as a complex and error-prone task, we
have developed the MPI correctness checking tool MARMOT covering the
MPI-1.2 standard and are now aiming at extending it to support
application developers also for the more frequently used parts of
MPI-2 such as one-sided communication. In this paper we describe our
tool, which is designed to check the correct usage of the MPI API
automatically at run-time, and we also analyse to what extent it is
possible to do so for RMA.
presentation slides [PDF]
|
Modeling and Verification of MPI Based Distributed Software
Igor Grudenic, Nikola Bogunovic
Communication between processes in
distributed environment is implemented using either shared memory or
message passing paradigm. The message passing paradigm is used more
often due to the lesser hardware requirements. MPI is a standardized
message passing API with several independent implementations.
Specification and verification of distributed systems is generally a
challenging task. In this paper we present a case study of
specification and verification of MPI based software using abstract
state machines (ASMs).
presentation slides [PDF]
|
|
Communication ProtocolsTime: 18th Sep 2006, 14:50 - 16:05 Location: S1/S2
|
High Performance RDMA Protocols in HPC
Galen Mark Shipman, Tim S. Woodall, George Bosilca, Rich L. Graham, Arthur B. Maccabe
Modern network interconnects that use RDMA
may offer significant performance advantages over conventional
send/recv network semantics. However, this high performance often
comes with hidden costs such as requiring an exchange of information
prior to an RDMA operation and requiring initiator/target to preserve
physical to virtual memory mappings during the RDMA. This paper
describes a unique MPI library Â$(Api(Bpeline' protocol that
addresses these constraints while avoiding the pitfalls of existing
techniques. By effectively overlapping registration with RDMA
operations this protocol provides good performance for any memory
usage pattern. This approach avoids the use of non-portable memory
hooks or not returning pages to the OS. Through this approach,
bandwidth may be increased up to 67% when buffers are not reused while
providing superior performance in the effective bandwidth
benchmark. Several user level protocols are explored using Open MPI
and compared to this Â$(Api(Bpeline' protocol.
presentation slides [PDF]
|
Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem
Darius Buntinas, Guillaume Mercier, and William Gropp
This paper presents the implementation of
MPICH2 over the Nemesis communication subsystem and the evaluation of
its shared-memory performance. We describe design issues as well as
some of the optimization techniques we employed. We conducted a
performance evaluation over shared memory using microbenchmarks as
well as application benchmarks. The evaluation shows that MPICH2
Nemesis has very low communication overhead, making it suitable for
smaller-grained applications.
presentation slides [PDF]
|
MPI/CTP: A Reconfigurable MPI for HPC Applications
Manjunath Gorentla Venkata and Patrick G. Bridges
Modern MPI applications have diverse communication requirements, with
trends showing that they are moving from static communication
requirements to more dynamic and evolving communication
requirements. However, MPI libraries, which integrate MPI applications
with the hardware, are not flexible enough to accommodate these
diverse needs. This lack of flexibility leads to degraded performance
of the applications. In this paper, we present the design of a
protocol development framework and an MPI library implemented using
our proposed framework that support compile-time and boot-time
protocol configuration, as well as runtime protocol reconfiguration
based on dynamic application requirements. Experimental results on the
initial prototype of this design show that this prototype is able to
dynamically reconfigure at runtime to optimize bandwidth under
changing MPI requirements.
presentation slides [PDF]
| |
Fault Tolerance ITime: 19th Sep 2006, 14:50 - 15:40 Location: K1/K2
|
FT-MPI, Fault Tolerant Metacomputing and Generic Name Services : a Case Study
David Dewolfs
There is a growing interest in deploying MPI
over very large numbers of heterogenous, geographically distributed
resources. FT-MPI provides the fault-tolerance necessary at this
scale, but presents some issues when crossing multiple administrative
domains. Using the H2O metacomputing framework, we add
cross-administrative domain interoperability and
"pluggability" to FT-MPI. The latter feature allows us,
using proxies, to transparently replace one vulnerable module - its
name service with fault-tolerant replacements. We present an algorithm
for improving performance of operations over the proxies. We evaluate
its performance in a comparison using the original name service,
OpenLDAP and current Emory research project HDNS.
presentation slides [PDF]
|
Scalable Fault Tolerant Protocol for Parallel Runtime Environments
Thara Angskun, Graham E. Fagg, George Bosilca, Jelena Pjesivac-Grbovic and Jack J. Dongarra
The number of processors embedded on high
performance computing platforms is growing daily to satisfy users
desire for solving larger and more complex problems. Parallel runtime
environments have to support and adapt to the underlying libraries and
hardware which require a high degree of scalability in dynamic
environments. This paper presents the design of a scalable and fault
tolerant protocol for supporting parallel runtime environment
communications. The protocol is designed to support transmission of
messages across multiple nodes with in a self-healing topology to
protect against recursive node and process failures. A formal protocol
verification has validated the protocol for both the normal and
failure cases. We have implemented multiple routing algorithms for the
protocol and concluded that the variant rule-based routing algorithm
yields the best overall results for damaged and incomplete
topologies.
presentation slides [PDF]
|
|
Parallel I/O ITime: 19th Sep 2006, 14:50 - 15:40 Location: S1/S2
|
Exploiting Shared Memory to Improve Parallel I/O Performance
Andrew B. Hastings and Alok Choudhary
We explore several methods utilizing
system-wide shared memory to improve the performance of MPI-IO,
particularly for non-contiguous file access. We introduce an
abstraction called the datatype iterator that permits
efficient, dynamic generation of (offset, length) pairs for a given
MPI derived datatype. Combining datatype iterators with overlapped I/O
and computation, we demonstrate how a shared memory MPI implementation
can utilize more than 90% of the available disk bandwidth (in some
cases representing a 5x performance improvement over existing methods)
even for extreme cases of non-contiguous datatypes. We generalize our
results to suggest possible parallel I/O performance improvements on
systems without global shared memory.
presentation slides [PDF]
|
Self-Adaptive Hints for Collective I/O
Joachim Worringen
The processing of MPI-IO operations can be
controlled via the MPI API using file hints, which are passed to the
MPI library as MPI info objects. A file hint can affect how the MPI
library accesses the file on the file system level, it can set buffer
sizes, turn special optimizations on and off or whatever parameters
the MPI implementation provides. However, experience shows that file
hints are rarely used for reasons that will be discussed in the
paper.We present a new approach which dynamically determines the
optimal setting for file hints related to collective MPI-IO
operations. The chosen settings adapt to the actual file access
pattern, the topology of the MPI processes and the available memory
resources and consider the characteristics of the underlying file
system. We evaluate our approach which has been implemented in MPI/SX,
NEC's MPI implementation for the SX series of vector
supercomputers.
presentation slides [PDF]
|
|
Performance MeasurementTime: 20th Sep 2006, 11:30 - 12:45 Location: K1/K2
|
Benchmarking MPI: The Challenges of Getting it Right
Keith D. Underwood
Benchmarking MPI is a contentious subject at
best. Microbenchmarks are used because they are easy to port and,
hypothetically, measure an important system characteristic in
isolation. The unfortunate reality is that it is remarkably difficult
to create a benchmark that is a fair measurement in the context of
modern system. Software optimizations and modern processor
architecture perform extremely efficiently on benchmarks, where it
would not in an application context. This paper explores the
challenges faced when benchmarking the network in a modern
microprocessor climate and the remarkable impacts on the results that
are obtained.
presentation slides [PDF]
|
Implementation and Usage of the PERUSE-Interface in Open MPI
Rainer Keller, George Bosilca, Graham Fagg, Michael Resch, Jack Dongarra
In this paper we describe the
implementation, usage and experience with the MPI performance
revealing extension interface ( Peruse ) into the Open MPI
implementation. While the PMPI-interface allows timing MPI-functions
through wrappers, it can not provide MPI-internal information on
MPI-states and lower-level network performance. We introduce the
general design criteria of the interface implementation and analyze
the overhead generated by this functionality. To support performance
evaluation of large-scale applications, tools for visualization are
imperative. We extend the tracing library of the
Paraver-toolkit to support tracing Peruse-events and show how this
helps detecting performance bottlenecks. A test-suite and a real-world
application are traced and visualized using Paraver.
presentation slides [PDF]
|
Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces
Douglas Doerfler and Ron Brightwell
In evaluating new high-speed network
interfaces, the usual metrics of latency and bandwidth are commonly
measured and reported. There are numerous other message passing
characteristics that can have a dramatic effect on application
performance, and they too should be analyzed when evaluating a new
interconnect. One such metric is overhead, which dictates the networks
ability to allow the application to perform non-message passing work
while a transfer is taking place. A method for measuring overhead, and
hence calculating application availability is presented. Results for
several next generation network interfaces are also presented.
presentation slides [PDF]
|
|
ParSim 2006Time: 20th Sep 2006, 11:25 - 13:15 Location: S1/S2
|
An approach for parallel fluid-structure interaction on unstructured meshes
Ulrich Küttler and Wolfgang A. Wall
The simulation of fluid-structure
interaction (FSI) problems is a challenge in contemporary science and
engineering. This contribution presents an approach to FSI problems
with incompressible Newtonian fluids and elastic structures and
discusses its realization in a general purpose parallel finite element
research code. The resulting algorithm is robust and effcient and
scales well on parallel machines. Recent attempts on effciency
improvements are discussed and a numerical example is shown.
presentation slides [PDF]
|
MPJ Express Meets Gadget: Towards a Java Code for Cosmological Simulations
Mark Baker, Bryan Carpenter, and Aamir Shafi
Gadget-2 is a massively parallel structure formation code for
cosmological simulations. In this paper, we present a Java version of
Gadget-2. We evaluated the performance of the Java version by running
a colliding galaxy simulation and found that it can achieve around 70%
of C Gadget-2's performance.
presentation slides [PDF]
|
Optimizing a Conjugate Gradient Solver with Non Blocking Collective Operations
Torsten Hoefler
This paper presents a case study about the applicability and usage of
non blocking collective operations. These operations provide the
ability to overlap communication with computation and to avoid
unnecessary synchronization. We introduce our NBC library, a portable
low-overhead implementation of non blocking collectives on top of
MPI-1. We demonstrate the easy usage of the NBC library with the
optimization of a conjugate gradient solver with only minor changes to
the traditional parallel implementation of the program. The optimized
solver runs up to 34% faster and is able to overlap most of the
communication. We show that there is, due to the overlap, no
performance difference between Gigabit Ethernet and InfiniBand for our
calculation.
presentation slides [PDF]
|
Parallel DSMC gasflow simulation of an in-line coater for reactive sputtering
Andreas Pflug, M. Siemers, B. Szyszka
There is an increasing demand for high precision coatings on large
areas via in-line reactive sputtering, which requires advanced process
control techniques. Thus, an improved theoretical understanding of the
reactive sputtering process kinetics is mandatory for further
technical improvement. We present a detailed Direct Simulation Monte
Carlo (DSMC) gas flow model of an in-line sputtering coater for large
area architectural glazing. With this model, the pressure fluctuations
caused by a moving substrate are calculated in comparison with the
experiment. The model reveals a significant phase shift in the pressure
fluctuations between the areas above the center and the edges of the
substrate. This is a geometric effect and is e. g. independent of the
substrate travelling direction. Consequently, a long sputtering source
will observe pressure fluctuations at its center and edges, which
are out of phase. For a heuristic model of the reactive sputtering
process, we show that in certain cases a two-dimensional model
treatment is suffcient for predicting the film thickness distribution on
the moving substrate. In other cases, a strong phase shift between
averaged pressure fluctuations and reactive sputtering process response
is observed indicating that a three-dimensional model treatment is
required for a realistic simulation of the in-line deposition
process.
presentation slides [PDF]
|
Parallel simulation of T-M processes in underground repository of spent nuclear waste
Jiri Stary, Ondrej Jakl, Roman Kohut
The contribution deals with mathematical
(finite element) simulation of the KBS prototype nuclear waste
repository in a simplified form, i.e. as a thermo-elasticity problem.
It describes the solvers developed for such kind of problems and
principles and benefits of their parallelization, both in MPI and
OpenMP.
presentation slides [PDF]
|
|
Fault Tolerance IITime: 20th Sep 2006, 14:50 - 15:40 Location: K1/K2
|
An Intelligent Management of Fault Tolerance in cluster using RADICMPI
Angelo A Duarte, Dolores Rexachs and Emilio Luque
A solution to implement fault tolerance in
the modern cluster must be efficient, transparent and scalable. In
order to attend such requisites we developed an architecture called
RADIC, Redundant Array of Distributed Independent Checkpoints. Such
architecture bases on a fully distributed array of processes that
collaborate in order to create a fault tolerance controller that
manages the fault tolerance activities transparently to the
application. Using the RADIC concepts, RADICMPI implements standard
MPI directives and a fault tolerance mechanism based on message-log
rollback-recovery protocol. Such mechanism efficiently manages all
fault tolerance activities transparently to the users.
presentation slides [PDF]
|
Extended mpiJava for Distributed Checkpointing and Recovery
Emilio Hernandez,Yudith Cardinale and Wilmer Pereira
In this paper we describe an mpiJava extension that implements a
parallel checkpointing/recovery service. This checkpointing/recovery
facility is transparent to applications, i.e. no instrumentation is
needed. We use a distributed approach for taking the checkpoints,
which means that the processes take their local checkpoints with
little coordination. This approach reduce communication between
processes and a central server for checkpoint storage is not
needed. We present some experiments which suggest that the benefits of
this extended MPI functionality do not have as a side effect a
significant performance penalty, apart from the well-known penalties
related to the local checkpoint generation.
presentation slides [PDF]
| |
Parallel I/O IITime: 20th Sep 2006, 14:50 - 15:40 Location: S1/S2
|
Effective Seamless Remote MPI-I/O Operations with Derived Data Types Using PVFS2
Yuichi Tsujita
Parallel computation outputs intermediate
data periodically, and typically the outputs are accessed for
visualization in remote operation. Parallel NetCDF provides parallel
I/O operations inside a computer, but it is not available in remote
operations. It is required to realize remote MPI-I/O operations with
derived data types among computers which have different MPI libraries
for the operations. To realize this kind of operations, a Stampi
library was proposed. For effective data-intensive I/O, a PVFS2 file
system has been supported in its remote MPI-I/O operations by
introducing MPICH as an underlying MPI library. This mechanism has
been evaluated on interconnected PC clusters, and sufficient
performance has been achieved with huge amount of data. In this paper,
architecture, execution mechanism, and preliminary performance results
are reported and discussed.
presentation slides [PDF]
|
High-Bandwidth Remote Parallel I/O with the Distributed Memory Filesystem MEMFS
Jan Seidel, Rudolf Berrendorf, Marcel Birkner, Marc-Andre Hermanns
The enormous advance in computational power
of supercomputers enables scientific applications to process problems
of increasing size. This is often correlated with an increasing amount
of data stored in (parallel) filesystems. As the increase in bandwith
of common disk based I/O devices can not keep up with the evolution of
computational power, the access to this data becomes the bottleneck in
many applications. MEMFS takes the approach to distribute \io data
among multiple dedicated remote servers on a user-level basis. It
stores files in the accumulated main memory of these I/O nodes and is
able to deliver this data with high bandwidth. We describe how MEMFS
manages a memory based distributed filesystem, how it stores data
among the participating I/O servers and how it assigns servers to
application clients. Results are given for a usage in a grid project
with high-bandwidth \wan connections.
presentation slides [PDF]
|
|
Collective CommunicationTime: 20th Sep 2006, 16:10 - 18:00:00 Location: K1/K2
|
Efficient Allgather for Regular SMP-Clusters
Jesper Larsson Träff
We show how to adapt and extend a well-known
allgather (all-to-all broadcast) algorithm to parallel
systems with a hierarchical communication system like clusters of SMP
nodes. For small problem sizes, the new algorithm requires a
logarithmic number of communication rounds in the number of SMP nodes,
and gracefully degrades towards a linear algorithm as
problem size increases. The algorithm has been used to implement the
MPI_Allgather collective operation of MPI in the MPI/SX
library. Performance measurements on a 72 node SX-8 system shows that
graceful degradation provides a smooth transition from logarithmic to
linear behavior, and significantly outperforms a standard, linear
algorithm. The performance of the latter is furthermore highly
sensitive to the distribution of MPI processes over the physical
processors.
presentation slides not available
|
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Amith Rajith Mamidala, Abhinav Vishnu, Dhabaleswar K Panda
MPI_Allgather is an important collective
operation in MPI. With the next generation systems going multi-core
the clusters deployed would enable a high process count per node. The
traditional implementations of Allgather use two separate channels,
network channel for communication across the nodes and shared memory
channel for intra-node communication. This approach prevents sharing
of communication buffers across the channels. This results in extra
copying of data with in a node yielding sub-optimal
performance. Especially for a collective involving large number of
processes this problem become more critical. Our approach eliminates
the extra copy costs by sharing the communication buffers for both
intra and inter node communication. Further, we optimize the
performance by allowing overlap of network operations with intra-node
shared memory copies. On a 32, 2-way node cluster, we observe an
improvement upto a factor of two for MPI_Allgather compared to the
original implementation.
presentation slides [PDF]
|
MPI Collective Algorithm Selection and Quadtree Encoding
Jelena Pjesivac-Grbovic, Graham E. Fagg, Thara Angskun, George Bosilca, Jack J. Dongarra
In this paper, we focus on MPI collective algorithm selection process
and explore the applicability of the quadtree encoding method to this
problem. During the algorithm selection process, a particular MPI
collective algorithm is selected based on the collective operation
parameters. We construct quadtrees with different properties from the
measured algorithm performance data and analyze the quality and
performance of decision functions generated from these trees. The
experimental data indicates that in some cases, the decision function
based on quadtree structure with a mean depth of 3 can incur as little
as a 5% performance penalty on average. The exact, experimentally
measured, decision function for all tested collectives could be fully
represented using quadtrees with a maximum of 6 levels. These results
indicate that quadtrees may be a feasible choice for both processing
of the performance data and automatic decision function generation.
presentation slides [PDF]
|
Parallel Prefix (Scan) Algorithms for MPI
Peter Sanders and Jesper L. Träff
We describe and experimentally compare three theoretically well-known
algorithms for the parallel prefix (or scan , in MPI terms) operation,
and give a presumably novel, doubly-pipelined implementation of the
in-order binary tree parallel prefix algorithm. Bidirectional
interconnects can benefit from this implementation. We present results
from a 32 node AMD Cluster with Myrinet 2000 and a 72-node SX-8
parallel vector system. On both systems, we observe improvements by
more than a factor two over the straight-forward binomial-tree
algorithm found in many MPI implementations. We also discuss adapting
the algorithms to clusters of SMP nodes.
presentation slides not available
|
|
Metacomputing and GridTime: 20th Sep 2006, 16:10 - 18:00:00 Location: S1/S2
|
Orchestration of distributed MPI-Applications in a UNICORE-based Grid with MetaMPICH and MetaScheduling
Boris Bierbaum, Carsten Clauss, Thomas Eickermann, Lidia Kirtchakova, Arnold Krechel, Stephan Springstubbe, Oliver Wä
ldrich, Wolfgang Ziegler
Running large MPI-applications with resource demands exceeding the
local site's cluster capacity could be distributed across a number of
clusters in a Grid instead, to satisfy the demand. However, there are
a number of drawbacks limiting the applicability of this approach:
communication paths between compute nodes of different clusters
usually provide lower bandwidth and higher latency than the cluster
internal ones, MPI libraries use dedicated I/O-nodes for inter-cluster
communication which become a bottleneck, missing tools for
co-ordinating the availability of the different clusters across
different administrative domains is another issue. Thus, to make the
Grid approach efficient several prerequisites must be in place. We
describe work recently completed in the VIOLA project that
successfully created and integrated these essential technologies:
MetaMPICH, user controlled QoS of clusters and interconnecting
network, a MetaScheduling service and the UNICORE integration.
presentation slides [PDF]
|
Running PVM Applications on Multidomain Clusters
Franco Frattolillo
PVM has been developed to enable PVM applications to run across
multidomain clusters made up by computing nodes belonging to
non-routable private networks, but connected to the Internet through
publicly addressable IP front-end nodes. However, ePVM cannot relieve
programmers of the classic burden tied to the problems related to the
deployment of PVM runtime libraries and program executables among
computational resources belonging to distinct administrative
domains. This paper presents ehPVM, a lightweight software
infrastructure which enables programmers to deploy ePVM applications
among computational resources belonging to multidomain clusters. This
way, ePVM programmers can run their applications without having to use
grid software toolkits or resources providers whose configurations
usually result in being tedious and time-consuming activities.
presentation slides [PDF]
|
The new Multidevice Architecture of MetaMPICH in the Context of other Approaches to Grid-enabled MPI
Boris Bierbaum, Carsten Clauss, Martin Pöppe, Stefan Lankes, Thomas Bemmerl
MetaMPICH is an MPI implementation which allows the coupling of
different computing resources to form a heterogeneous computing system
called a meta computer . Such a coupled system may consist of multiple
compute clusters, MPPs, and SMP servers, connected by different
networks like Ethernet, SCI, Myrinet, and Infiniband. There are
several other MPI libraries with similar goals available. We present
the three most important of them and contrast their features and
abilities to one another and to MetaMPICH.We especially highlight the
recent advances made to MetaMPICH, namely the development of the new
multidevice architecture for building a meta computer.
presentation slides [PDF]
|
Using an Enterprise Grid for Execution of MPI Parallel Applications Â$(H A(B Case Study
Adam Kai-Leung Wong and Andrzej M. Goscinski
A modern enterprise has not only a single cluster but a set of
geographically distributed clusters Â$(H t(Bhey could be used to form an
enterprise grid. In this paper we show based on our case study that
enterprise grids can be efficiently used as parallel computers to
carry out high-performance computing.
presentation slides [PDF]
|
|
|
|