Published: Apr 22, 2025
Past, Present & Future of AI Compute

What if CUDA wasn’t the end of the story—but just the beginning?
At this Beyond CUDA Summit panel, legends from the early CUDA era, founders of next-gen AI hardware startups, and leaders from stealth-mode innovators shared an unfiltered, behind-the-scenes look at:
- The birth of CUDA and how it changed everything
- What’s really behind NVIDIA’s moat
- Where the next wave of compute is heading
- How LLMs, new accelerators, and dev tools are reshaping the future
Let’s dive into the insights, debates, and raw truth from a stacked panel featuring:
Greg Diamos, Nicholas Wilt, Micah Villmow, and Davor Capalija.
📜 CUDA’s Birth: From Research Project to Industry Standard
The story starts in 2005—Greg Diamos and Nicholas Wilt were there.
- CUDA began as a radical idea inside NVIDIA
- Greg’s team built Ocelot, one of the earliest tools that allowed CUDA to run across multiple architectures
- Nicholas authored The CUDA Handbook and architected the CUDA driver API, laying the groundwork for its portability and dominance
Their reflections?
CUDA wasn’t just a programming model. It was a strategic software-hardware bet—and it paid off.
💣 Why CUDA Became NVIDIA’s Moat
So what makes CUDA so hard to dethrone?
Four key ingredients:
- Software-Hardware Co-Design: you can't just copy CUDA; you have to recreate a synchronized hardware stack too.
- Peak Performance Across Workloads: libraries like CUTLASS and cuDNN squeeze every ounce of compute from every chip.
- Portability & Durability: from data centers to research clusters, CUDA runs everywhere and stays fast.
- Developer Ecosystem & Education: entire generations of engineers learned CUDA in school and took it with them into startups and enterprises.
🧭 The Future: Opportunities to Go Beyond CUDA
But as powerful as CUDA is, the panelists made one thing clear:
Its dominance has limits—and the window for innovation is open.
Here’s where they see opportunity:
1. Usability & Developer Experience
“New devs aren’t learning C++. We need Python-first, LLM-friendly, user-centric tools.” (A sketch of what that can look like appears after this list.)
2. New AI Models, New Requirements
Today's CUDA stack is tuned for tensor cores and dense matrix math. But future models (sparsity-based or attention-free architectures, for example) may need different optimizations.
3. Generalization Across Hardware
PyTorch, JAX, and compilers like TorchInductor are abstracting CUDA away. That abstraction is an opening for other hardware to compete; the second sketch below shows what it looks like in code.
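To make point 1 concrete, here's a minimal sketch of a Python-first GPU kernel, using Triton as one example of such a tool (our choice of illustration, not something the panel prescribed). It assumes a GPU build of PyTorch plus the triton package; the same source runs on NVIDIA and AMD backends.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")  # "cuda" also maps to ROCm GPUs in PyTorch
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

No C++, no vendor toolchain in the source: the kind of experience the panelists argued new developers now expect.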
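And for point 3, a minimal sketch of what that abstraction buys, assuming PyTorch 2.x. The model code below names no vendor API; torch.compile lowers it through TorchInductor to whatever backend is present.

```python
import torch

# Pick whatever accelerator PyTorch can see; the model code never changes.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 10),
).to(device)

# TorchInductor emits Triton kernels for GPUs and C++ for CPUs,
# so nothing here is tied to a specific vendor's stack.
compiled = torch.compile(model)

x = torch.randn(32, 512, device=device)
y = compiled(x)
print(y.shape)  # torch.Size([32, 10])
```

Because nothing in that code is CUDA-specific, a new accelerator only has to plug in at the compiler and runtime layer to run existing model code: exactly the competitive opening the panel described.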
💡 Key Insight: CUDA Is Now a Whole Stack, Not Just a Language
Today, CUDA includes:
- Tensor cores
- NVLink / NVSwitch
- cuDNN / CUTLASS
- Multi-node training fabric
- Compilers and runtime glue
As one panelist said:
“CUDA isn’t just a programming model anymore—it’s the kitchen sink.”
This complexity opens the door for cleaner, simpler alternatives that better serve new generations of developers and use cases.
📣 Panelist Takeaways
🎙 Micah Villmow:
“We don’t need another CUDA clone. We need a better experience for developers.”
🎙 Davor Capalija:
“The future isn’t singular. Diverse architectures and programming models will win.”
🎙 Greg Diamos:
“CUDA’s pillars—tensor cores, massive parallelism, NVLink—are nearing saturation. New models need new hardware.”
🎙 Nicholas Wilt:
“CUDA’s flexibility made NVIDIA bold. But even that won’t last forever.”
🔮 Final Thoughts: The Bridge to the Future
The panel ended with one big question from the audience:
How do we unify the software stack across the next wave of AI accelerators?
Answer: We don’t yet.
But LLM-based dev tools, Python-native workflows, and open compiler stacks are moving fast—and they might just build the bridge.
📺 Watch the Full Panel
👉 Past, Present & Future of AI Compute | Beyond CUDA Summit 2025
🚀 Run Efficient Models on AMD GPUs
Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.