Tag: reliability

  • Postdoc Position: HPC System Resilience and Energy Efficiency

    I am seeking a post-doctoral research associate to join a National Science Foundation (NSF) project on high-performance computing system reliability and energy efficiency modeling. If you are interested, or you know someone who might be interested, please share information on the position.

  • Little’s Law in the Exascale Era

    Performance, delay and parallelism at large scale: Little’s Law speaks to all of these issues and more. When performance optimization, reliability requirements, and energy management are convolved, the constraint-based optimization problems become dauntingly complex.

  • Serial Dismay, Parallel Excitement

    Cloud services now operate on the largest computing systems we have ever built on this planet, with service reliability expectations far higher than what we demand from scientific applications. Thus, I also believe there are lessons from cloud computing that are potentially applicable to computational science applications.

  • When Petascale Is Just Too Slow

    Evolution or revolution, it’s the persistent question. Can we build reliable exascale systems from extrapolations of current technology or will new approaches be required? There is no definitive answer, as almost any approach might be made to work at some level with enough heroic effort. The bigger question is what design would enable the most…