A proposal for an
ACES
prepared by John Marshall and Chris Hill
Earth, Atmospheric and Planetary Sciences, MIT
A group of Earth scientists, engineers and computer scientists from the Schools of Science and Engineering proposes to set up a virtual network of four "critical-mass" cluster computer systems together with a distributed network of smaller "feeder" systems. The resource would be used to tackle a number of key problems in Earth science using common computational techniques. The clusters will have a common design and could be used independently or as one. The project builds on previous collaborations between computer scientists, physical scientists and industry at MIT in the use of cluster computing for scientific computation. The shared computer resource could serve as a model of collaborative, cost-effective, high-performance computing at MIT.
A more detailed description of our goals and future plans can be found here. A picture of the compute clusters and grid that we are planning to set up is here.
The group - the Alliance for Computational Earth Science (ACES) - has raised approximately $1M in support of the initiative. We are now actively seeking the support of MIT and industry to bring the project to fruition.
The intellectual foci of the alliance are:
· Computational technologies for understanding the state of the planet and its evolution over time, both in the past and into the future.
o Algorithms and systems for planetary scale phenomena, including physical, biological and chemical phenomena.
· Computational technologies for monitoring the planet.
o Real-time and historic synthesis of planetary scale in-situ and space-borne observations together with models. Continuous monitoring of the Earth system state, reconstructions of past records.
· Computational technologies for educating students about planetary systems.
o Combined laboratory-sensor-computation hands-on teaching program with scale-up to planetary problems.
The Earth - its atmosphere, oceans, cryosphere, land surface and interior - is continually monitored by numerous arrays of sophisticated sensor networks. Space-borne and in-situ sensors yield a real-time view of the state of the entire planet that is unprecedented in accuracy and in raw detail. Making sense of this wealth of information, and distilling clear, quantitative inferences from it, is a formidable intellectual and computational challenge.
A close integration between measurement, modeling, computation and theory is required. Not only must forward models of the Earth system be developed which encode our understanding of the laws that govern the evolving planet, but we must also develop tools to quantify, and take account of, errors in these models. Moreover, computer models provide the natural conduit through which prior knowledge and theoretical understanding of the evolving system can be used to synthesize multiple data streams. The resulting combination of models constrained by observations provides a powerful tool to probe not only the workings of the Earth as it is today, but also how it may have been in the past and how it may evolve in the future.
A group of scientists from the School of Science and the School of Engineering at MIT has recently been meeting to compare notes on coordinated computational strategies in this broad area of model/data synthesis. It has become clear that, although the scientific focus varies widely between, say, seismic imaging and paleo-climate reconstructions, there are a number of common computational challenges and solutions that act as a unifying theme. One computational strategy that several groups at MIT have gravitated towards is the use of local computing clusters made up of commodity microprocessors. These systems can deliver high-performance, cost-effective, interactive compute resources to the fingertips of researchers and so accelerate innovation.
Our emerging alliance has identified a common but distributed computational approach. Our plan is to set up a virtual network of four "critical-mass" cluster systems together with a distributed network of smaller feeder systems. They would have a common design and could be used independently or as one.
Here we briefly
set out our strategy
describe the proposed architecture of the distributed system
outline the benefits
summarize the costs.
Members of the alliance are listed here. The cluster systems presently being operated by them are here.
A technical blueprint of the proposed system can be found here.
By pooling resources, ACES has of order $1M of funds committed over the next three years to set up this system. We are now actively seeking further support from industry and MIT.
In all these efforts pioneering work has been boosted through the use of compute resources close to the desktops of individual researchers.
- John Marshall and Chris Hill have long-standing collaborations with LCS and industrial partners through the Xolas (with Sun), Pleiades (with Compaq) and Hyades (with Intel) projects. They have pioneered the use of cluster computing in atmosphere-ocean applications, in which the power of many commodity processors (PCs and workstations) is harnessed to carry out large parallel computations. In collaboration with LCS researchers Arvind and James Hoe, they have mapped out the technical landscape required for enabling high-performance, low-cost models of the fluid Earth.
- The ocean-atmosphere model, MITgcm, developed in MIT's Climate Modeling Initiative, has been designed to be compatible with advanced data synthesis applications, and is now being widely used here at MIT and by groups at other US universities and research institutions, at NASA laboratories and at overseas institutions. The model has been adopted as part of the ongoing ECCO project involving scientists at MIT, Scripps Institution of Oceanography and JPL. Several groups in our consortium are making use of local, cluster computing resources.
- The MIT Center for Global Change Science has been employing ensemble-based approaches to help put bounds on the probabilities of particular climate change scenarios. Many model runs are used to computationally sample a broad span of parameter space. The work employs a cluster system.
- The MIT geophysics group is active in high-accuracy monitoring, using GPS and InSAR, of plate movements and of crustal motions caused by strain accumulation and release on earthquake faults and by motions of magma in volcanic areas (see http://bowie.mit.edu/~tah/#_Research). These and other observations are compared to predictions of numerical models of stress and strain transfer through the earthquake cycle in order to place constraints on the physics of earthquakes. Numerical models are also used in simulations of mantle convection, plate motions, and mountain building on Earth and the other planets and their moons (http://www-geodyn.mit.edu/research.html).
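The ensemble-based approach mentioned in the list above can be illustrated with a minimal sketch: many independent model runs, one per parameter value, executed concurrently and then summarized as the fraction of runs exceeding a threshold. Everything here is an assumption made for illustration: `toy_response`, the forcing value, the parameter range and the threshold are hypothetical and are not taken from any ACES model.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_response(feedback):
    """Hypothetical stand-in for one model run: response = forcing / feedback."""
    forcing = 4.0  # illustrative value only, not from the proposal
    return forcing / feedback


def run_ensemble(params, max_workers=4):
    """Run one independent ensemble member per parameter value.

    Members do not communicate, so they can be farmed out to separate
    workers (threads here; cluster nodes in a real setting).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(toy_response, params))  # order matches params


def exceedance_fraction(responses, threshold):
    """Fraction of ensemble members whose response exceeds the threshold."""
    return sum(r > threshold for r in responses) / len(responses)


# Sample the (hypothetical) feedback parameter on a regular grid.
params = [0.8 + 0.1 * i for i in range(13)]
responses = run_ensemble(params)
fraction = exceedance_fraction(responses, threshold=3.0)
```

Because each member is independent, throughput in this pattern scales with the number of available processors, which is one reason such workloads suit clusters.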
Several high-profile follow-on efforts are now underway that build on these and other successes. In each of these projects a balanced network of responsive high-performance cluster computing resources is to play a significant role. The efforts include:
- The NASA-funded national "Earth System Modeling Framework (ESMF)" effort being led at MIT. ESMF seeks to provide low- and high-level software infrastructure on which future-generation Earth system models will be constructed. Responsive, cost-effective computing resources for modeling and monitoring the Earth system are a significant element of the project focus at MIT.
- The National Science Foundation funded project "Bio-complexity: Feedback between Ecosystems and the Climate" is applying ensemble based methods for determining parametric sensitivity of the climate system. These approaches require multiple concurrent parallel computations.
- The Information Technology Research project "An Ensemble Approach to Data Assimilation in Earth Sciences" is developing advanced methods for next generation Earth system assimilation systems. This project is planning to explore responsive hardware configurations based on a hybrid architecture of large processor cluster technologies tightly integrated with fewer processor, large-memory systems.
- Automatic Differentiation (AD): AD refers to source-to-source translation to synthesize code for computing partial derivatives from existing numerical code. ACES team members have been pioneers in the application of AD to ocean state estimation, where compositions of adjoints, backwards in time, are necessary to solve the problem. Another NSF-funded ITR project (the Adjoint Compiler Toolkit and Standards Project), which aims to develop next-generation adjoint compilers, is being led by ACES members.
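To make the idea of composing adjoints backwards in time concrete, the following is a minimal hand-written sketch of the kind of derivative code an AD tool would generate automatically. It is not output from any adjoint compiler and is not MITgcm code; the toy logistic model, the cost function and all names (`forward`, `cost`, `adjoint_gradient`) are assumptions chosen for illustration.

```python
def forward(x0, a, dt, nsteps):
    """Integrate the toy model x[n+1] = x[n] + dt*a*x[n]*(1 - x[n]).

    Returns the whole trajectory, since the adjoint sweep needs each state.
    """
    xs = [x0]
    for _ in range(nsteps):
        x = xs[-1]
        xs.append(x + dt * a * x * (1.0 - x))
    return xs


def cost(x0, a, dt, nsteps, obs):
    """Misfit between the final model state and a single observation."""
    return 0.5 * (forward(x0, a, dt, nsteps)[-1] - obs) ** 2


def adjoint_gradient(x0, a, dt, nsteps, obs):
    """Return (dJ/dx0, dJ/da) using one forward and one backward sweep."""
    xs = forward(x0, a, dt, nsteps)
    lam = xs[-1] - obs  # adjoint variable at the final time: dJ/dx[N]
    grad_a = 0.0
    for n in reversed(range(nsteps)):  # step backwards in time
        x = xs[n]
        # Here lam holds dJ/dx[n+1]; accumulate the sensitivity to a.
        grad_a += lam * dt * x * (1.0 - x)
        # Apply the transpose of the step's Jacobian d x[n+1] / d x[n].
        lam = lam * (1.0 + dt * a * (1.0 - 2.0 * x))
    return lam, grad_a


grad_x0, grad_a = adjoint_gradient(x0=0.1, a=1.5, dt=0.1, nsteps=20, obs=0.8)
```

A single backward sweep yields the gradient with respect to both the initial state and the parameter, at a cost comparable to one extra model run. A finite-difference check on `cost` is a simple way to validate such code.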
We do not wish to construct a traditional, centralized computing center. Centralized facilities suffer from a lack of interaction between leading science users and their applications, and so do little to encourage the responsive feedback that comes from using simulation tools as a numerical laboratory. Nor can centralized facilities scale in an unlimited fashion.
Our plan is to establish a virtual network domain of four "critical-mass" cluster systems together with a distributed network of smaller feeder systems. A prototype - Hyades - already exists in EAPS; it was set up in collaboration with Arvind of LCS. Each "critical-mass" system will contain of order 256 processors, connected by a dedicated high-performance interconnect such as Myrinet. On one of these systems the high-performance interconnect will also link to a pair of very large memory (>32 GB) servers, designed to examine hybrid algorithms that mix abundant parallelism with large-memory-footprint sequential processing. The "critical-mass" systems will form the stable core of our facility. Appropriate storage (initially anticipated to be around 10 TB per system) and on-loading and off-loading capabilities will be included in these systems.
Feeder systems, with smaller numbers of nodes, will also be included within the same domain as the "critical-mass" systems. The feeder systems will allow algorithmic and application development work to be prototyped at smaller scale, and will support the educational activities described below. They will also allow evaluation of newer test-bed technologies within the overall system. The net result, to an end user, will be a uniquely responsive high-performance "meta-computing" facility with significant capacity and a rich mix of resources. The facility will be operated in such a way that any idle compute cycles become available to any member of the facility. This will maintain balanced access to the "critical-mass" systems for users ranging from high-end researchers to students participating in educational activities.
A technical blueprint of the proposed system is here.
The project draws on experience of nearly ten years of collaboration between researchers in EAPS and researchers in the Laboratory for Computer Science. These collaborations have involved faculty, graduate and undergraduate level interactions. A fundamental assumption is that as computational horsepower becomes increasingly ubiquitous there is more and more advantage to creating inter-disciplinary links that bridge between advanced science and advanced technology, particularly in the areas of complex system modeling and simulation.
The systems planned under this effort will be actively used in teaching projects that encompass both computer science and physical science. Courses within EAPS will make immediate use of the facility, as will the joint MIT/SMA 5505 course, making this activity of particular relevance to the Electrical Engineering and the Earth, Atmospheric and Planetary Sciences departments.
For the wider MIT community, our collaborative could serve as a model for delivering responsive computing to undergraduate and graduate students throughout the Institute. Enacted on a widespread basis, the approach could be of significant benefit in attracting the best talent to MIT. Finally, many of the synthesis, simulation and other software products that have derived from our past collaborative efforts have been adopted by groups around the world. This process will continue under the plan laid out here.
In order to construct a state-of-the-art cluster facility, we are approaching MIT and industry to complement the considerable federal resources we have already raised in support of this initiative. The collaborative has already committed $1M that will be spent over the next 3 to 5 years.
We are asking for support toward:
The cost breakdown for construction of a 1000 node distributed cluster is, in broad terms:
| Item | Estimated cost |
| --- | --- |
| compute nodes | $1M |
| fast interconnect | $1M |
| networking | $50K to $100K |
| storage | $250K |
| manpower | $250K |
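The breakdown above can be totalled with a short calculation. This is a sketch under a stated assumption: that the $1M already committed by the collaborative would be applied entirely to these items, which the proposal does not say explicitly. The variable names are illustrative.

```python
# (low, high) estimates in US dollars, copied from the cost breakdown.
costs_usd = {
    "compute nodes": (1_000_000, 1_000_000),
    "fast interconnect": (1_000_000, 1_000_000),
    "networking": (50_000, 100_000),
    "storage": (250_000, 250_000),
    "manpower": (250_000, 250_000),
}

total_low = sum(low for low, _ in costs_usd.values())
total_high = sum(high for _, high in costs_usd.values())

# Assumption: all committed funds go toward the items listed above.
committed_usd = 1_000_000
gap_low = total_low - committed_usd
gap_high = total_high - committed_usd

print(f"Total: ${total_low:,} to ${total_high:,}")
print(f"Remaining after committed funds: ${gap_low:,} to ${gap_high:,}")
```

Under that assumption the total comes to roughly $2.55M to $2.6M, leaving about $1.55M to $1.6M to be raised.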