Reflections on science, technology, and computing — leavened by personal experience


I am seeking a post-doctoral research associate to join a National Science Foundation (NSF) project on high-performance computing system reliability and energy efficiency modeling. If you are interested, or you know someone who might be interested, please share information on the position.

Performance, delay, and parallelism at large scale: Little’s Law speaks to all of these issues and more. When performance optimization, reliability requirements, and energy management must be considered together, the resulting constrained optimization problems become dauntingly complex.
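For readers who want the relation itself: in steady state, Little’s Law says the average number of items in a system (L) equals the average arrival rate (λ) times the average time each item spends in the system (W), so L = λW. Below is a minimal Python sketch of how that ties throughput, delay, and parallelism together. The function names and the numbers are illustrative assumptions, not taken from the project.

```python
def required_concurrency(throughput_per_s: float, latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of operations in flight
    needed to sustain a given throughput at a given average latency."""
    if throughput_per_s < 0 or latency_s < 0:
        raise ValueError("throughput and latency must be non-negative")
    return throughput_per_s * latency_s


def sustained_throughput(concurrency: float, latency_s: float) -> float:
    """Little's Law rearranged, lambda = L / W: throughput achievable with a
    fixed number of operations in flight at a given average latency."""
    if concurrency < 0 or latency_s <= 0:
        raise ValueError("concurrency must be non-negative and latency positive")
    return concurrency / latency_s


if __name__ == "__main__":
    # Hypothetical service: 200 requests/s at 0.5 s average latency.
    print(required_concurrency(200.0, 0.5))   # operations in flight
    # Same latency, but only 100 requests may be outstanding at once.
    print(sustained_throughput(100.0, 0.5))   # requests per second
```

The practical reading is that, at fixed latency, the only way to raise throughput is to keep more operations in flight. That is one reason delay and parallelism are so tightly coupled at large scale.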

Cloud services now run on the largest computing systems ever built, with service reliability expectations far higher than those we place on scientific applications. I therefore believe that some lessons from cloud computing may apply to computational science applications.

Evolution or revolution: it’s the persistent question. Can we build reliable exascale systems by extrapolating from current technology, or will new approaches be required? There is no definitive answer, as almost any approach might be made to work at some level with enough heroic effort. The bigger question is what design would enable the most…