HPC4U Fault Tolerant Resource Management: Internet Grid Migration Released

WEBWIRE – Monday, February 11, 2008

Brussels, February 11, 2008 – CETIC, an ICT research centre, announces that HPC4U, a European research project (FP6) active in Grid computing technologies, has released cluster middleware that provides fault tolerance for parallel applications and allows job migration over the Grid. The system lets users negotiate Service Level Agreements (SLAs) and then run parallel applications on a given site (e.g. CETIC, Belgium). Jobs are checkpointed regularly in an application-transparent manner. If a failure occurs (e.g. a node outage), the HPC4U cluster middleware can not only restart the job locally but also migrate it over the Grid to a remote site (e.g. University of Paderborn, Germany, or Fujitsu, France), where the computation restarts from the latest checkpoint. This mechanism prevents the loss of computation time and ensures SLA compliance even when resources fail.
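The checkpoint-and-migrate behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the HPC4U implementation: the names `Job`, `Site`, `run_on_site` and `run_with_failover`, the step-counting job, and the `fail_at` outage trigger are all hypothetical, and each site is tried only once (a local restart after repair is not modelled).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Toy job that counts steps; `checkpoint` holds the last saved step."""
    total_steps: int
    step: int = 0
    checkpoint: int = 0


@dataclass
class Site:
    """Hypothetical compute site; `fail_at` simulates a node outage at that step."""
    name: str
    available: bool = True
    fail_at: Optional[int] = None


def run_on_site(job: Job, site: Site, checkpoint_every: int = 10) -> bool:
    """Resume from the latest checkpoint; return True if the job finished."""
    job.step = job.checkpoint  # work done since the last checkpoint is lost
    while job.step < job.total_steps:
        if site.fail_at is not None and job.step == site.fail_at:
            site.available = False  # simulated outage
            return False
        job.step += 1
        if job.step % checkpoint_every == 0:
            job.checkpoint = job.step  # periodic checkpoint, invisible to the job
    return True


def run_with_failover(job: Job, sites: List[Site]) -> List[str]:
    """Try sites in order (local first, then remote) until the job completes."""
    visited = []
    for site in sites:
        if not site.available:
            continue
        visited.append(site.name)
        if run_on_site(job, site):
            return visited
    raise RuntimeError("no site could complete the job")
```

In this sketch, a job that fails at step 37 on the first site resumes on the second site from step 30, so only the seven steps since the last checkpoint are recomputed rather than the whole run.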
HPC4U Technology – Internet Grid Migration
HPC4U worked on the realization of an SLA-aware Grid fabric, which consists of multiple elements. An open-source resource management system, OpenCCS (http://www.openccs.eu), developed by the University of Paderborn, is the top-layer element. It is responsible for managing the cluster in general and serves as the master interface to upper-layer clients. Within HPC4U, the Technical University of Berlin integrated the resource management system with the Globus Toolkit 4 to set up a Grid-ready environment able to migrate jobs over the internet to available remote resources. This integration also allows Grid middleware components to negotiate Service Level Agreements (SLAs). The resource management system accepts only jobs whose SLA can be fulfilled in the current system condition. In particular, it is responsible for fulfilling all agreed SLAs, even in case of failures such as resource outages.
At the cluster level, the resource management system interacts with several subcomponents offering fault tolerance mechanisms. IBM's MetaCluster checkpointing subsystem provides process fault tolerance, and the Seanodes storage subsystem (VSM/Metanode and Exanodes) offers storage virtualisation coupled with fault tolerance mechanisms. The third subsystem, in charge of networking, consists of the Scali MPI libraries and the Dolphin SCI interconnect.
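The idea of accepting only jobs whose SLA can be fulfilled in the current system condition can be sketched as a simple admission check. The sketch below is illustrative and is not the OpenCCS or Globus Toolkit API: `SlaRequest`, `ClusterState`, `negotiate`, the SLA terms (node count, runtime, deadline) and the spare-node reserve are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class SlaRequest:
    """Hypothetical SLA terms: nodes needed, estimated runtime and deadline (hours)."""
    nodes: int
    runtime_h: float
    deadline_h: float


@dataclass
class ClusterState:
    """Hypothetical cluster snapshot: healthy nodes and nodes already committed."""
    healthy_nodes: int
    committed_nodes: int = 0
    spare_nodes: int = 1  # assumed reserve kept free to absorb a node outage

    def free_nodes(self) -> int:
        return self.healthy_nodes - self.committed_nodes - self.spare_nodes


def negotiate(request: SlaRequest, state: ClusterState) -> bool:
    """Accept the SLA only if it can be met in the current system condition."""
    if request.runtime_h > request.deadline_h:
        return False  # cannot finish in time even if started immediately
    if request.nodes > state.free_nodes():
        return False  # not enough capacity once the spare reserve is honoured
    state.committed_nodes += request.nodes  # reserve resources for the agreed SLA
    return True
```

Keeping a spare reserve out of the negotiable capacity is one possible way to leave room for honouring agreed SLAs after an outage; the source does not state how the actual system sizes such reserves.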
About HPC4U
Grid computing is an established technology in the academic sector, allowing transparent access to distributed resources. To attract commercial users as well, important properties such as reliability, transparency and Quality of Service (QoS) still need to be recognised as major requirements for the implementation of future Grids at a commercial level.
The main objective of the HPC4U project (Highly Predictable Cluster for Internet Grids) is to provide an application-transparent, software-only solution for a reliable Resource Management System. It allows the Grid to negotiate Service Level Agreements, and it provides mechanisms such as process and storage checkpointing to realise fault tolerance and to ensure adherence to the agreed SLAs. The HPC4U solution acts as an active Grid component, using available Grid resources to further improve its level of fault tolerance.
The HPC4U solution is a mix of open source and proprietary software delivered as three outcomes. The first outcome is an SLA-aware and Grid-enabled Resource Management System including SLA negotiation, multi-site SLA-aware scheduling, security, and interfaces for storage, checkpointing and networking support. It is multi-platform in nature and available as open source. The second outcome is a vertically integrated commercial product with proprietary Linux-specific developments for storage, networking and checkpointing. The third outcome is also a vertically integrated system, consisting of freeware components only. This outcome is ready to use and can be downloaded, like all other materials and sources, from the HPC4U website.
HPC4U partners are CETIC, University of Paderborn, Technical University of Berlin, University of Linkoping, Scali, Dolphin, Fujitsu, Seanodes and IBM.

For more information: http://www.hpc4u.eu

Related Links: www.hpc4u.eu; www.cetic.be


Keywords: grid; fault tolerance; SLA

Contact Information: Simon Alexandre, Project Manager, CETIC

This news content may be integrated into any legitimate news gathering and publishing effort. Linking is permitted.
