HPC In An AI World


In 2023, Jack Dongarra, Dennis Gannon, and I published a warning in the Communications of the ACM (CACM) entitled HPC Forecast: Cloudy and Uncertain. The title was not rhetorical; it reflected an uncomfortable market reality: the center of gravity in computing had shifted from technical high-performance computing (HPC) to the hyperscalers and smartphone giants.

The article built on two earlier essays I wrote in 2022 and 2023: American Competitiveness: IT and HPC Futures – Follow the Money and the Talent and Computing Futures: Technical, Economic, and Geopolitical Challenges Ahead. All three argued the same point: HPC no longer sets the agenda for high-end computing. Others do. 

Since 2023, the shift has not merely continued; it has become seismic. Generative AI has reordered the computing landscape along three axes:

Capital. AI investment is now a game played in the hundreds of billions, a scale that challenges even the largest national governments.

Energy. AI data center demands are measured in gigawatts—comparable to the output of mid-sized nations and exposing the fragility and limits of aging power grids.

Market Capitalization. NVIDIA and the hyperscalers now hold multi-trillion-dollar seats at the geopolitical table.

The scientific and engineering community is no longer the primary driver of high-end silicon; we are an accidental passenger. To flourish, scientific computing must adapt to seven new realities, which we outline in a new article, Ride the Wave, Build the Future: Scientific Computing in an AI World, now submitted for publication. (Dennis and Jack have also posted it on their websites.)

N.B. I am deeply grateful to Jack and Dennis. Their thinking permeates both the original maxims and the new ones. Any errors of interpretation or emphasis are mine alone.

Looking Back: The 2023 Maxims

The following five principles defined our 2023 outlook. Let’s look at how they have aged in the face of the AI revolution.

Maxim One: Semiconductor constraints dictate new approaches. The classical realization of Moore’s Law—automatic, roughly 2X improvements in general-purpose performance every few years—has clearly ended. Semiconductor process improvements will continue, but frequency scaling has stalled, Dennard scaling has ended, and each new reduction in feature size delivers smaller, more expensive, and more specialized gains.
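For readers who want the physics behind that claim, the textbook first-order model of CMOS dynamic power (standard in the architecture literature, not something from our articles) makes the point compactly:

```latex
% First-order CMOS dynamic power model (textbook approximation)
% P_dyn: switching power, alpha: activity factor, C: switched capacitance,
% V: supply voltage, f: clock frequency
P_{\mathrm{dyn}} \approx \alpha\, C\, V^{2} f
% Under classical Dennard scaling, shrinking feature sizes reduced C and V
% enough that power density stayed roughly constant even as f rose.
% Once V could no longer be lowered (leakage and reliability limits),
% raising f meant raising power density, so clock frequencies flattened
% and the "free" performance gain of each process generation disappeared.
```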

Outcome: The cost of new semiconductor fabrication facilities, the rising complexity of EUV-based processes, and the engineering effort required for chiplet-based systems have all grown faster than even the most optimistic projections. Only a handful of firms and national consortia can afford to operate at the true frontier, and their business models are overwhelmingly dominated by either hyperscale cloud and AI customers or smartphone manufacturers.

Maxim Two: End-to-end hardware/software co-design is essential. Systems are best optimized when hardware, system software, algorithms, and applications evolve together. Instead of transistor scaling, what looks like “Moore’s Law” is increasingly simulated by architectural and packaging mechanisms: massive parallelism, accelerators, deep memory hierarchies, and multi-chip packages (aka “chiplets”).

Outcome: In generative AI and cloud computing, we have seen genuine co-design successes. Modern accelerators and AI ASICs are shaped in close dialogue with a small set of dominant workloads: transformer models, recommendation engines, cloud infrastructure accelerators, and related kernels. Every hyperscaler AI vendor has designed and fabricated custom solutions to improve performance, increase reliability, and decrease costs.  One need look no further than Google’s Tensor Processing Units (TPUs), Amazon’s Trainium, or Microsoft’s Maia chips to see those successes.  

In scientific computing, the story is more dire. Despite years of workshops and roadmaps, most scientific codes still run on hardware designed for other priorities.   Put another way, as development costs have risen and other markets have grown in economic importance, the HPC world is increasingly unable to compete; we in HPC are poor and our market is small.

Maxim Three: Prototyping at scale is required to test new ideas. Truly testing novel ideas requires integrated hardware and software with sufficient capability to validate design choices against application domain problems. Implicitly, this means accepting the risk of failure, drawing insights from that failure, and building lessons from those insights.

Outcome: Pre-production exascale nodes, AI-focused testbeds, and experimental racks with novel cooling, memory, or interconnect technologies have been deployed at a few leading HPC centers. However, these efforts remain scattered and, in many cases, closed or narrowly scoped. In contrast, the hyperscaler AI community regularly builds and tests new prototypes; while the U.S. Exascale Computing Project (ECP) was underway, Google built seven generations of its Tensor Processing Units (TPUs).

Maxim Four: The space of leading-edge HPC applications is far broader now than in the past. Data-intensive computing and AI applications, along with complex workflows, will become a large part of scientific computing.

Outcome: Today, technical computing infrastructure supports societal challenges writ large. Hybrid models – physics-based simulations and AI surrogates – are now common. The same computing systems that run multi-petabyte turbulence simulations also host workflows for wildfire prediction, urban climate adaptation, supply-chain resilience, and disease surveillance.

Maxim Five: Cloud economics changed the supply-chain ecosystem. The dominant computing markets are hyperscale cloud platforms and smartphones. Consequently, the biggest bets in semiconductor design and manufacturing will be placed on workloads that monetize user behavior and AI services at planetary scale.

Outcome: The market follows the money. Generative AI has become the dominant organizing principle of advanced computing. Hyperscale AI datacenters—built around foundation models, retrieval-augmented workflows, and AI agents—are now the largest and most aggressive consumers of cutting-edge silicon. Their scale and economics set the context within which all other high-end computing, including scientific HPC, now operates. Equally worrisome, this shift is increasingly manifest in an AI emphasis on reduced-precision arithmetic.

The Takeaway: Against the backdrop of the generative AI revolution, all five maxims have proved true. The fundamental unit of design is no longer the chip or even the node; it is a tightly integrated system-on-rack, with cooling, power delivery, and mechanical layout optimized accordingly.

2026 and Beyond

Much has changed since 2022-2023. Exascale systems are now in production, and the generative AI revolution is reshaping almost everything we have long held dear in advanced computing. In our new paper, Ride the Wave, Build the Future: Scientific Computing in an AI World, we describe seven new maxims that reflect the reality of high-performance computing in the brave new world of hyperscale generative AI. I encourage you to read the full paper for details, but here is a high-level summary, along with my comments on each of the new maxims.

New Maxim One: HPC is now synonymous with integrated numerical modeling and generative AI.  The “deductive world” of physics-based mathematical models, though still important, is increasingly being augmented with “inductive world” surrogate models created via generative AI.  These hybrid models allow more rapid exploration of rich parameter spaces, as long as the modeler remains cognizant of their domain of applicability.
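To make that pattern concrete, here is a minimal sketch of the deductive/inductive hybrid loop, assuming a toy one-parameter problem and a simple polynomial surrogate standing in for a trained neural or generative surrogate; all function names and constants are illustrative, not from our paper.

```python
# A minimal, hypothetical sketch of the hybrid-model pattern: a costly
# physics-based solver is sampled sparsely, a cheap data-driven surrogate is
# fit to those samples, and the surrogate is used for rapid parameter sweeps
# inside its region of validity. Everything here is an illustrative placeholder.

import numpy as np

def expensive_simulation(x: float) -> float:
    """Stand-in for a physics-based solver (e.g., a PDE integration).
    In practice each evaluation might take minutes to hours."""
    return np.sin(3.0 * x) * np.exp(-0.5 * x)

# 1. "Deductive" step: run the expensive model at a handful of design points.
train_x = np.linspace(0.0, 2.0, 12)
train_y = np.array([expensive_simulation(x) for x in train_x])

# 2. "Inductive" step: fit a cheap surrogate (here a cubic polynomial;
#    in real workflows often a neural network or Gaussian process).
coeffs = np.polyfit(train_x, train_y, deg=3)
surrogate = np.poly1d(coeffs)

# 3. Rapid exploration: sweep thousands of candidate parameters through the
#    surrogate at negligible cost, but only inside the sampled domain.
sweep_x = np.linspace(0.0, 2.0, 5000)
best = sweep_x[np.argmin(surrogate(sweep_x))]

# 4. Verification: confirm the surrogate's suggestion with the trusted solver,
#    keeping the modeler cognizant of the surrogate's domain of applicability.
print(f"surrogate minimum near x={best:.3f}, "
      f"simulation value={expensive_simulation(best):.4f}")
```

The same loop scales up in practice: trusted but expensive simulations generate training data, a learned surrogate sweeps the parameter space, and the trusted solver verifies the candidates the surrogate proposes.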

New Maxim Two:  Energy and data movement, not floating point operations, are the scarce resources. We are still using algorithmic frameworks designed for a world where arithmetic operations were expensive and moving bits and bytes was cheap. Although not free, arithmetic operations are increasingly a de minimis fraction of computing energy budgets.  Meanwhile, as system scales have grown at staggering rates, aggregate energy and cooling costs have grown commensurately. New models and new architectures are desperately needed if we are to build systems with higher performance that remain environmentally friendly and economically practical.
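As a back-of-the-envelope illustration of why this matters, the sketch below compares arithmetic energy with data-movement energy for a memory-bound kernel; the per-operation energy figures are order-of-magnitude assumptions that vary widely by process node and memory technology, not measurements.

```python
# Why data movement, not arithmetic, dominates the energy budget.
# The per-operation energies below are illustrative order-of-magnitude
# assumptions (they depend on process node, memory technology, and the
# distance data travels), not measured values.

PJ = 1e-12  # one picojoule, in joules

ENERGY_FLOP_FP64   = 20 * PJ     # assumed cost of one 64-bit floating-point op
ENERGY_DRAM_ACCESS = 2000 * PJ   # assumed cost of fetching one 64-bit word from DRAM

def kernel_energy(flops, dram_words):
    """Return (arithmetic energy, data-movement energy) in joules."""
    return flops * ENERGY_FLOP_FP64, dram_words * ENERGY_DRAM_ACCESS

# Example: a memory-bound, stencil-like kernel doing 2 flops per word streamed.
flops = 2e12   # 2 Tflop of useful arithmetic
words = 1e12   # 1 trillion 64-bit words moved from DRAM

e_math, e_move = kernel_energy(flops, words)
print(f"arithmetic: {e_math:8.1f} J   data movement: {e_move:8.1f} J   "
      f"ratio: {e_move / e_math:.0f}x")
# Under these assumptions, moving the data costs roughly 50x more energy than
# computing on it, which is why algorithms and architectures that minimize
# data motion matter more than raw flop counts.
```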

New Maxim Three: Benchmarks are mirrors, not levers.  Benchmarks rarely drive technical change.  Instead, they are snapshots of past and current reality, highlighting progress (or the lack thereof), but they have little power to influence strategic directions.  In a world dominated by sparse data, reduced floating point precision, and AI-driven workflows, aging benchmarks like the TOP500 list, which measures dense matrix floating point operations, are a look toward the past rather than the future.
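A quick arithmetic-intensity comparison shows why a dense-matrix ranking is a mirror of a different era; the traffic models below are simplified lower bounds I am assuming for illustration, not measured values.

```python
# Compare the arithmetic intensity (flops per byte of memory traffic) of a
# dense matrix multiply with a sparse matrix-vector product. The traffic
# estimates are idealized lower bounds, assumed here only for illustration.

def dense_matmul_intensity(n):
    """C = A * B with n x n double-precision matrices."""
    flops = 2.0 * n**3             # one multiply-add per inner-product term
    bytes_moved = 3.0 * n**2 * 8   # read A and B, write C once (ideal caching)
    return flops / bytes_moved

def sparse_matvec_intensity(nnz):
    """y = A * x for a sparse matrix with nnz nonzeros (CSR-like layout)."""
    flops = 2.0 * nnz              # one multiply-add per nonzero
    bytes_moved = nnz * (8 + 4)    # 8-byte value + 4-byte column index, at minimum
    return flops / bytes_moved

print(f"dense matmul (n=10000): {dense_matmul_intensity(10_000):6.1f} flops/byte")
print(f"sparse matvec:          {sparse_matvec_intensity(1_000_000):6.2f} flops/byte")
# A dense benchmark runs at hundreds of flops per byte and is limited by
# arithmetic; most real PDE, graph, and data-analysis kernels sit well below
# 1 flop/byte and are limited by memory bandwidth, which a dense-matrix
# ranking never exercises.
```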

New Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second. Winning systems are no longer built by picking components from a catalog; they are co-designed end-to-end. Such co-design at scale requires sustained funding and the ability to make bets on uncertain outcomes. It also means looking deeply at application requirements, then asking what hardware and software are needed to support those requirements. This is possible in the hyperscale AI world, but difficult in the fragmented funding world of high-performance computing. In HPC, we must pivot to funding sustained co-design ecosystems that bet on specific, high-impact scientific workflows.

New Maxim Five: Research requires prototyping at scale (and risking failure); otherwise it is procurement. A variant of our 2023 maxim: prototyping – testing new and novel ideas – means accepting the risk of failure; otherwise it is simply incremental development. Implicit in the notion of prototyping is the need to test multiple ideas, then harvest the ones with promise. Remember, a prototype that cannot fail has another name – it’s called a product.

New Maxim Six: Data and models are intellectual gold. The quality of foundation AI models ultimately rests on the volume and quality of the underlying training data. All too often in scientific computing, our gold is buried in disparate, multi-disciplinary datasets. This needs to change; we must build sustainable, multidisciplinary data fusion partnerships. Only with such partnerships can we construct the hybrid models needed to address critical societal problems.

New Maxim Seven: New collaborative models define 21st-century computing. Frontier AI+HPC has moved from the realm of research strategy to national geopolitical policy.  Advanced computing is now an instrument of national sovereignty and economic security. This has profound implications for how governments fund advanced computing research and development. Governments must now treat advanced computing as a strategic utility, requiring a scale of coordination and investment that rivals the Manhattan Project or the Apollo program.

A 21st Century Moonshot

The history of advanced computing is one of punctuated equilibria — periods of technological stability shattered by economic and technological shifts. From the early days of custom vector supercomputers through RISC-based shared memory multiprocessors to today’s accelerator-enabled massive clusters, the ecosystem has shifted dramatically.  We are amidst a tsunami of change – technologically, economically, and geopolitically.

The old models of government investment are still necessary, but they are no longer sufficient. We face profound technological challenges in energy efficiency and data movement.  It’s time – past time – for both new strategies and new technical innovations.  We must ride the wave and build the future.

[Image: Apollo astronaut on the Moon]

We need a moonshot that rebuilds our core computing infrastructure based on 21st century ideas, not just variants of those from the past century. It will not be for the timid, the nostalgic, or the underfunded. Bold ideas and new approaches never are.

