A proposal for an

Alliance for Computational Earth Science

ACES 

prepared by John Marshall and Chris Hill
Earth, Atmospheric and Planetary Sciences, MIT

Summary

A group of Earth scientists, engineers and computer scientists from the Schools of Science and Engineering propose to set up a virtual network of four "critical-mass" cluster computer systems together with a distributed network of smaller "feeder" systems. The resource would be used to tackle a number of key problems in Earth science using common computational techniques. The clusters will have a common design and could be used independently or as one. The project builds on previous collaborations between computer scientists, physical scientists and industry at MIT in the use of cluster computing for scientific computation. The shared computer resource could serve as a model of collaborative, cost-effective, high-performance computing at MIT.

A more detailed description of our goals and future plans can be found here.  A picture of the compute clusters and grid that we are planning to set up is here.

The group - the Alliance for Computational Earth Science (ACES) - have raised approximately $1M in support of the initiative. We are now actively seeking the support of MIT and industry to bring the project to fruition.

Key Thrusts

The intellectual foci of the Alliance for Computational Earth Science are:

·       Computational technologies for understanding the state of the planet and its evolution over time, both in the past and into the future.

o      Algorithms and systems for planetary scale phenomena, including physical, biological and chemical phenomena.

·       Computational technologies for monitoring the planet.

o      Real-time and historical synthesis of planetary scale in-situ and space-borne observations together with models. Continuous monitoring of the Earth system state and reconstruction of past records.

·       Computational technologies for educating students about planetary systems.

o      Combined laboratory-sensor-computation hands-on teaching program with scale-up to planetary problems.

Introduction

The Earth - its atmosphere, oceans, cryosphere, land surface and interior - is continually monitored by numerous arrays of sophisticated sensor networks. Space-borne and in-situ sensors yield a real-time view of the state of the entire planet that is unprecedented in accuracy and in raw detail. Making sense of this wealth of information and distilling clear, quantitative inferences from it is a formidable intellectual and computational challenge.

A close integration between measurement, modeling, computation and theory is required. Not only must forward models of the Earth system be developed that encode our understanding of the laws that govern the evolving planet, but we must also develop tools to quantify, and take account of, errors in these models. Moreover, computer models provide the natural conduit through which prior knowledge and theoretical understanding of the evolving system can be used to synthesize multiple data streams. The resulting combination of models constrained by observations provides a powerful tool to probe not only the workings of the Earth as it is today, but also how it may have been in the past and how it may evolve in the future.

A group of scientists from the School of Science and the School of Engineering at MIT have recently been meeting to compare notes on coordinated computational strategies in this broad area of model/data synthesis. It has become clear that, although the scientific focus varies widely between, say, seismic imaging and paleo-climate reconstructions, there are a number of common computational challenges and solutions that act as a unifying theme. One computational strategy that several groups at MIT have gravitated towards is the use of local computing clusters made up of commodity microprocessors. These systems can deliver high-performance, cost-effective, interactive compute resources to the fingertips of researchers and so accelerate innovation.

Our emerging alliance has identified a common but distributed computational approach. Our plan is to set up a virtual network of four "critical-mass" cluster systems together with a distributed network of smaller feeder systems. They would have a common design and could be used independently or as one.

Here we briefly:

·       set out our strategy
·       describe the proposed architecture of the distributed system
·       outline the benefits
·       summarize the costs.

Members of the alliance are listed here. The cluster systems presently being operated by them are here.
A technical blueprint of the proposed system can be found here.

By pooling resources, ACES has on the order of $1M of funds committed over the next three years to set up this system. We are now actively seeking further support from industry and MIT.

A strategy

A clear consensus has emerged from discussions amongst members of our alliance that now is an opportune time for a collective strategy which moves forward aggressively to deploy responsive high-performance computing clusters. This consensus builds on experiences in collaborative computational science efforts at MIT - for example
In all these efforts pioneering work has been boosted through the use of compute resources close to the desktops of individual researchers.

Several high-profile follow-on efforts are now underway that build on these and other successes. These include

In each of these projects, a balanced network of responsive high-performance cluster computing resources will play a significant role.

Planned system architecture

We do not wish to construct a traditional, centralized computing facility. Centralized facilities suffer from a lack of interactivity and so do little to encourage responsive feedback between leading science users and their applications (i.e. using the simulation tools as a numerical laboratory), nor can centralized facilities scale without limit.

Our plan is to establish a virtual network domain of four "critical-mass" cluster systems together with a distributed network of smaller feeder systems. A prototype, Hyades, which was set up in collaboration with Arvind of LCS, already exists in EAPS. Each "critical-mass" system will contain on the order of 256 processors, connected by a dedicated high-performance interconnect such as Myrinet. On one of these systems the high-performance interconnect will also link to a pair of very large memory (>32GB) servers designed to examine hybrid algorithms that mix abundant parallelism with large-memory-footprint sequential processing. The "critical-mass" systems will form the stable core of our facility. Appropriate storage (initially anticipated to be around 10TB per system) and on-loading and off-loading capabilities will be included in these systems.
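As a rough sizing check, the figures quoted above can be tallied in a few lines. This is only an illustrative sketch: the class and variable names are hypothetical, the per-system values are the approximate figures given in the text (about 256 processors and about 10TB of storage each), and the feeder systems and large-memory servers are left out because no sizes are stated for them.

```python
# Illustrative tally of the planned "critical-mass" systems.
# All names are hypothetical; the numbers are the approximate
# per-system figures quoted in the text, not a final specification.
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    name: str
    processors: int   # "of order 256" per system
    storage_tb: int   # "around 10TB" per system, initially


def aggregate(clusters):
    """Return (total processors, total storage in TB) across clusters."""
    total_procs = sum(c.processors for c in clusters)
    total_tb = sum(c.storage_tb for c in clusters)
    return total_procs, total_tb


# Four systems of a common design; the labels are placeholders.
CRITICAL_MASS = [ClusterSpec(f"cluster-{i}", 256, 10) for i in range(1, 5)]

if __name__ == "__main__":
    procs, tb = aggregate(CRITICAL_MASS)
    print(f"{procs} processors, {tb} TB across {len(CRITICAL_MASS)} systems")
```

Under these assumptions the four core systems together provide roughly 1000 processors and about 40TB of initial storage.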

Feeder systems, with smaller numbers of nodes, will also be included within the same domain as the "critical-mass" systems. The feeder systems will allow algorithmic and application development work to be prototyped at smaller scale, and will support the educational activities described below. They will also allow evaluation of newer test-bed technologies within the overall system. The net result, to an end user, will be a uniquely responsive high-performance "meta-computing" facility with significant capacity and a rich mix of resources. The facility will be operated in such a way that any idle compute cycles will become available to any member of the facility. This will maintain balanced access to the "critical-mass" systems for users ranging from high-end researchers to students participating in educational activities.
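The proposal does not specify how idle cycles would be shared, so the following is only one possible sketch of the idea. It assumes a hypothetical policy in which each member group is first granted the processors it owns (up to its demand) and any processors left idle are then lent, one at a time in round-robin order, to groups whose demand is still unmet. The function name, group names and numbers are all invented for illustration.

```python
# Hypothetical idle-cycle sharing policy (an assumption, not the
# proposal's stated mechanism): owners are served first, then idle
# processors are lent round-robin to groups with unmet demand.

def share_idle_cycles(owned, demand):
    """Return processors granted to each group.

    owned  -- dict mapping group name to processors it contributes
    demand -- dict mapping group name to processors it currently wants
    """
    # Step 1: each group uses its own processors, up to its demand.
    granted = {g: min(owned[g], demand.get(g, 0)) for g in owned}
    idle = sum(owned.values()) - sum(granted.values())

    # Step 2: lend idle processors one at a time, in a fixed order.
    unmet = {g: demand.get(g, 0) - granted[g] for g in sorted(owned)}
    while idle > 0 and any(unmet.values()):
        for g in unmet:
            if idle == 0:
                break
            if unmet[g] > 0:
                granted[g] += 1
                unmet[g] -= 1
                idle -= 1
    return granted


if __name__ == "__main__":
    owned = {"ocean": 256, "seismic": 256, "teaching": 32}
    demand = {"ocean": 400, "seismic": 100, "teaching": 64}
    print(share_idle_cycles(owned, demand))
```

In this made-up example the seismic group leaves 156 processors idle; the teaching group's full request of 64 is met and the remainder goes to the ocean group. A production facility would more likely rely on an existing batch scheduler than on hand-written logic of this kind.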

A technical blueprint of the proposed system is here.

Broader benefits

The project draws on nearly ten years of collaboration between researchers in EAPS and researchers in the Laboratory for Computer Science. These collaborations have involved interactions at the faculty, graduate and undergraduate levels. A fundamental assumption is that, as computational horsepower becomes increasingly ubiquitous, there is more and more advantage to creating inter-disciplinary links that bridge advanced science and advanced technology, particularly in the areas of complex system modeling and simulation.

The systems planned under this effort will be actively used in teaching projects that encompass both computer science and physical science. Courses within EAPS will make immediate use of the facility, as will the joint MIT/SMA 5505 course, making this activity of particular relevance to the Electrical Engineering and the Earth, Atmospheric and Planetary Sciences Departments.

For the wider MIT community, our collaborative could serve as a model for delivering responsive computing to undergraduate and graduate students throughout the Institute. Enacted on a widespread basis, the approach could be of significant benefit in attracting the best talent to MIT. Finally, many of the synthesis, simulation and other software products that have derived from our past collaborative efforts have been adopted by groups around the world. This process will continue under the plan laid out here.

Support sought

In order to construct a state-of-the-art cluster facility, we are approaching MIT and industry to complement the considerable federal resources we have already raised in support of this initiative. The collaborative has already committed $1M that will be spent over the next 3 to 5 years.

We are asking for support toward:

  1. the costs of physical infrastructure and wide-area networking
  2. compute nodes and interconnect hardware
  3. mass store
  4. seed funds that could make a partial contribution toward the costs of administering the system.

The cost breakdown for construction of a 1000-node distributed cluster is, in broad terms:

compute nodes $1M
fast interconnect $1M
networking $50K to $100K
storage $250K
manpower $250K
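The line items above can be summed to give an overall range. The following is a minimal sketch of that arithmetic only; the dictionary and function names are hypothetical, amounts are in thousands of US dollars, and each item is stored as a (low, high) pair so that the networking range carries through to the total.

```python
# Sum of the broad cost estimates listed above, in thousands of US dollars.
# Names are hypothetical; each entry is a (low, high) estimate.
COSTS_K = {
    "compute nodes": (1000, 1000),
    "fast interconnect": (1000, 1000),
    "networking": (50, 100),
    "storage": (250, 250),
    "manpower": (250, 250),
}


def total_range(costs):
    """Return (low, high) totals over all line items."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(COSTS_K)
    print(f"Estimated total: ${low}K to ${high}K")
```

On these figures the total comes to roughly $2.55M to $2.6M.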