AI Today and Tomorrow Series #3: HPC and AI—When Worlds Converge/Collide

Welcome to the third entry in this series on AI. The first was an introduction and series overview; the second discussed the aspirational goal of artificial general intelligence (AGI). Now it’s time to zero in on another timely topic: HPC users’ reactions to the convergence of HPC and AI.
Much of this content is supported by our in-depth interviews at Intersect360 Research with HPC and AI leaders around the world. As I said in the intro column, the series doesn’t aim to be definitive. The goal is to lay out a range of current information and opinions on AI for the HPC-AI community to consider. It’s early and no one has the final take on AI. Comments are always welcome at steve@intersect360.com.
AI Relies Heavily on HPC Infrastructure and Talent
HPC and AI are symbiotes, creations locked in a tight, mutually beneficial relationship. Both live on a similar, HPC-derived infrastructure and continually exchange advances—siblings maintaining close contact.
- HPC infrastructure enables the AI community to develop sophisticated algorithms and models, accelerate training and perform rapid analysis in solo and collaborative environments.
- Shared infrastructure elements originating in HPC include standards-based clusters, message passing (MPI and derivatives), high-radix networking, and storage and cooling technologies, to name a few. MPI collective operations heavily used in AI training (e.g., MPI_Bcast, MPI_Allreduce, MPI_Scatterv/MPI_Gatherv) provide useful functions well beyond basic interprocessor communication (see the sketch after this list).
- But HPC’s greatest gift to AI is decades of experience with parallelism—especially useful now that Moore’s Law-driven progress in single-threaded processor performance has sharply decelerated.
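To make the infrastructure point concrete, here is a minimal sketch of how one of those collectives, MPI_Allreduce, underpins synchronous data-parallel AI training: every rank computes gradients on its own shard of data, then the collective sums and averages them. It assumes the mpi4py and NumPy packages, and the “gradients” are random placeholders standing in for a real backward pass; this is an illustration, not any particular framework’s implementation.

```python
# Minimal sketch: gradient averaging with MPI_Allreduce, the same
# collective pattern HPC codes have used for decades.
# Assumes mpi4py and NumPy are installed; gradients are placeholders.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank "computes" local gradients on its shard of the training data
# (random values stand in for a real backward pass).
local_grads = np.random.rand(1_000_000).astype(np.float32)

# Sum gradients across all ranks, then average: the core communication
# step of synchronous data-parallel training.
global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
global_grads /= size

if rank == 0:
    print(f"Averaged gradients across {size} ranks")
```

Run with something like `mpirun -np 4 python allreduce_sketch.py`; every rank ends up holding the same averaged gradient, which is the pattern distributed training frameworks reproduce, whether through MPI itself or through libraries that implement the same collectives, such as NCCL.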
The infrastructure overlap runs deep. Not long ago, a successful designer of interconnect networks for leadership-class supercomputers was hired by a hyperscale AI leader to redesign the company’s global network. I asked him how different the supercomputer and hyperscale development tasks are. He said: “Not much. The principles are the same.”
This anecdote illustrates another major HPC contribution to the mainstream AI world of cloud services providers, social media firms and other hyperscale companies: talented people who adapt needed elements of the HPC ecosystem to hyperscale environments. During the past decade, this talent migration has helped fuel the growth of the mainstream AI market, even as other talented people stayed put to advance leading-edge “frontier AI” within the HPC community.
HPC and Hyperscale AI: The Data Difference
Social media giants and other hyperscalers were in a natural position to get the AI ball rolling in a serious way. They had lots of readily available customer data for exploiting AI. In sharp contrast, some economically important HPC domains, such as healthcare, still struggle to collect enough usable, high-quality data to train large language models and extract new insights.
It’s no accident, for example, that UnitedHealth Group reportedly spent $500 million on a new facility in Cambridge, Massachusetts, where tech-driven subsidiary Optum Labs and partners including the Mayo Clinic and Johns Hopkins University can pool data resources and expertise to exploit frontier AI. The Optum collaborators now have access to usable (deidentified, HIPAA-compliant) data on more than 300 million patients and medical enrollees. An important aim is for HPC and AI to partner in precision medicine, by making it possible to quickly sift through millions of archived patient records to identify treatments that have had the best success for patients closely resembling the patient under investigation.
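As a rough illustration of that “find similar patients” idea (not Optum’s actual system), the sketch below computes cosine similarity between a query patient and synthetic numeric patient-feature vectors, then checks treatment outcomes among the closest matches. The data, feature count and cutoff of 100 matches are made-up placeholders; a production system would add de-identification controls, clinical feature engineering and approximate nearest-neighbor indexing to reach hundreds of millions of records.

```python
# Illustrative sketch of similarity-based patient matching.
# All data here is synthetic; this is not any real system's method.
import numpy as np

rng = np.random.default_rng(42)
n_patients, n_features = 100_000, 64
records = rng.standard_normal((n_patients, n_features)).astype(np.float32)
outcomes = rng.integers(0, 2, n_patients)  # 1 = treatment judged successful

# The patient under investigation, encoded with the same features.
query = rng.standard_normal(n_features).astype(np.float32)

# Cosine similarity between the query and every archived record.
records_norm = records / np.linalg.norm(records, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = records_norm @ query_norm

# Treatment success rate among the 100 most similar archived patients.
top = np.argsort(scores)[-100:]
print("Success rate among closest matches:", outcomes[top].mean())
```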
The pharmaceutical industry also has a shortage of usable data for some important purposes. One pharma exec told me that the supply of usable, high-quality data is “minuscule” compared with what’s really needed for precision medicine research. The data shortage issue extends to other economically important HPC-AI domains, such as manufacturing. Here, the shortage of usable data may be due to isolation in data silos (e.g., supply chains), lack of standardization, or simple scarcity.
This can have consequences for everything from HPC-supported product development to predictive maintenance and quality control.
Addressing the Data Shortage
The HPC-AI community is working to remedy the data shortage in multiple ways:
- A growing ecosystem of organizations is creating realistic synthetic data, which promises to expand data availability while providing better privacy protection and avoidance of bias.
- The community is developing better inferencing (guessing) ability. Bigger inferencing “brains” should produce desired models and solutions with less training data. It’s easier to train a human than a chimpanzee to “go to the nearest grocery store and bring back a quart of milk.”
- The recent DeepSeek news showed, among other things, that impressive AI results can be achieved with smaller, less-generalized (more domain-specific) models that require less training data—along with less time, money and energy use. Some experts argue that multiple small language models (SLMs) are likely to be more effective than one large language model (LLM).
Beneficial Convergence or Scary Collision?
Attitudes of HPC center directors and leading users toward the HPC-AI convergence differ greatly. All expect mainstream AI to have a powerful impact on HPC, but expectations range from confident optimism to varying degrees of pessimism.
The optimists point out that the HPC community has successfully managed challenging, ultimately beneficial shifts before, such as migrating apps from vector processors to x86 CPUs, moving from proprietary operating systems to Linux, and adding cloud computing to their environments. The community is already putting AI to good use and will adapt as needed, they say, even though the transition will require another major effort. More good things will come from this convergence. Some HPC sites are already far along in exploiting AI to support key applications.
The pessimists tend to fear the HPC-AI convergence as a collision, in which the large mainstream AI market overwhelms the smaller HPC market, forcing scientific researchers and other HPC users to do their work on processors and systems optimized for mainstream AI and not for advanced, physics-based simulation. There is reason for concern, although HPC users have had to turn to mainstream IT markets for technology in the past. As someone pointed out in a panel session on future processor architectures that I chaired at the recent EuroHPC Summit in Krakow, the HPC market has never been big enough financially to have its own processor and has had to borrow more economical processors from larger, mainstream IT markets, especially x86 CPUs and then GPUs.
Concerns That May Keep Optimists and Pessimists Up at Night
Here are things in the HPC-AI convergence that seem to concern optimists and pessimists alike:
- Inadequate Access to GPUs. GPUs have been in short supply. A concern is that the superior purchasing power of hyperscalers, the biggest customers for GPUs, may make it difficult for Nvidia, AMD and others to justify accepting orders from the HPC community.
- Pressure to Overbuy GPUs. Some HPC data center directors, especially in the government sector, told us that AI “hype” is so strong that their proposals for next-generation supercomputers had to be replete with mentions of AI. This later forced them to follow through and buy more GPUs, and fewer CPUs, than their user community needed.
- Difficulty Negotiating System Prices. More than one HPC data center director reported that, given the GPU shortage and the superior purchasing power of hyperscalers, vendors of GPU-centric HPC systems have become reluctant to enter into customary price negotiations with them.
- Continuing Availability of FP64. Some HPC data center directors say they’ve been unable to get assurance that FP64 units will be available for their next supercomputers several years from now. Double precision isn’t essential for many mainstream AI workloads, and vendors are developing smart algorithms and software emulators aimed at producing FP64-like results from computations run at lower or mixed precision (see the sketch after this list).
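Those emulation schemes are vendor-specific, but the general idea resembles classic mixed-precision iterative refinement: do the expensive arithmetic in lower precision, then apply cheap double-precision residual corrections to recover FP64-like accuracy. The NumPy sketch below is illustrative only; it is not any vendor’s actual method, and for simplicity it re-solves the low-precision system each iteration instead of reusing a factorization as a real code would.

```python
# Mixed-precision iterative refinement: solve in FP32, correct in FP64.
# Illustrative sketch only; not a vendor's FP64-emulation implementation.
import numpy as np

rng = np.random.default_rng(0)
n = 500
A64 = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b64 = rng.standard_normal(n)

# The "expensive" solve runs in single precision (stands in for
# low-precision hardware such as tensor cores).
A32 = A64.astype(np.float32)
x = np.linalg.solve(A32, b64.astype(np.float32)).astype(np.float64)

# A few cheap FP64 residual corrections pull the answer toward
# double-precision accuracy.
for _ in range(5):
    r = b64 - A64 @ x                                # residual in FP64
    dx = np.linalg.solve(A32, r.astype(np.float32))  # correction in FP32
    x += dx.astype(np.float64)

print("Relative residual:", np.linalg.norm(b64 - A64 @ x) / np.linalg.norm(b64))
```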
Preliminary Conclusion
It’s early in the game and already clear that AI is here to stay—not another “AI winter.” Similarly, nothing is going to stop the HPC-AI convergence. Even pessimists foresee strong benefits for the HPC community from this powerful trend. HPC users in government and academic settings are moving full speed ahead with AI research and innovation, while HPC-reliant industrial firms are predictably more cautious but already have applications in mind. Oil and gas majors, for example, are starting to apply AI in alternative energy research. The airline industry tells us AI won’t replace pilots in the foreseeable future, but with today’s global pilot shortage some cockpit tasks can probably be safely offloaded to AI. There are some real concerns as noted above, but most HPC community members we talk with believe that the HPC-AI convergence is inevitable, it will bring benefits and the HPC community will adapt to this shift as it has to prior transitions.
BigDATAwire contributing editor Steve Conway’s day job is as senior analyst with Intersect360 Research. Steve has closely tracked AI developments for over a decade, leading HPC and AI studies for government agencies around the world, co-authoring an AI primer for senior U.S. military leaders with the Johns Hopkins University Applied Physics Laboratory (JHUAPL), and speaking frequently on AI and related topics.
Related Items:
AI Today and Tomorrow Series #2: Artificial General Intelligence
Watch for New BigDATAwire Column: AI Today and Tomorrow