Published: Apr 22, 2025
Past, Present & Future of AI Compute

What if CUDA wasn’t the end of the story—but just the beginning?
At this Beyond CUDA Summit panel, legends from the early CUDA era, founders of next-gen AI hardware startups, and leaders from stealth-mode innovators shared an unfiltered, behind-the-scenes look at:
- The birth of CUDA and how it changed everything
- What’s really behind NVIDIA’s moat
- Where the next wave of compute is heading
- How LLMs, new accelerators, and dev tools are reshaping the future
Let’s dive into the insights, debates, and raw truth from a stacked panel featuring:
Greg Diamos, Nicholas Wilt, Micah Villmow, and Davor Capalija.
📜 CUDA’s Birth: From Research Project to Industry Standard
The story starts in 2005—Greg Diamos and Nicholas Wilt were there.
- CUDA began as a radical idea inside NVIDIA
- Greg’s team built Ocelot, one of the earliest tools that allowed CUDA to run across multiple architectures
- Nicholas authored The CUDA Handbook and architected the CUDA driver API, laying the groundwork for its portability and dominance
Their reflections?
CUDA wasn’t just a programming model. It was a strategic software-hardware bet—and it paid off.
💣 Why CUDA Became NVIDIA’s Moat
So what makes CUDA so hard to dethrone?
Four key ingredients:
- Software-Hardware Co-Design: you can't just copy CUDA; you have to recreate a synchronized hardware stack too.
- Peak Performance Across Workloads: libraries like CUTLASS and cuDNN squeeze every ounce of compute from every chip.
- Portability & Durability: from data centers to research clusters, CUDA runs everywhere and stays fast.
- Developer Ecosystem & Education: entire generations of engineers learned CUDA in school and took it with them into startups and enterprises.
🧭 The Future: Opportunities to Go Beyond CUDA
But as powerful as CUDA is, the panelists made one thing clear:
Its dominance has limits—and the window for innovation is open.
Here’s where they see opportunity:
1. Usability & Developer Experience
“New devs aren’t learning C++. We need Python-first, LLM-friendly, user-centric tools.” (A sketch of what that can look like appears after this list.)
2. New AI Models, New Requirements
Today's CUDA stack is tuned for tensor cores and dense matrix math. But future models (sparsity-based or attention-free architectures, for example) may need different optimizations.
3. Generalization Across Hardware
PyTorch, JAX, and compilers like TorchInductor are abstracting CUDA away. That abstraction is an opening for other hardware to compete; the second sketch below shows what it looks like in code.
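To make point 1 concrete, here's a minimal sketch of a Python-first GPU kernel, using Triton as one example of such a tool (our choice of illustration, not something the panel prescribed). It assumes a GPU build of PyTorch plus the triton package; the same source runs on NVIDIA and AMD backends.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")  # "cuda" also maps to ROCm GPUs in PyTorch
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```

No C++, no vendor toolchain in the source: the kind of experience the panelists argued new developers now expect.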
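And for point 3, a minimal sketch of what that abstraction buys, assuming PyTorch 2.x. The model code below names no vendor API; torch.compile lowers it through TorchInductor to whatever backend is present.

```python
import torch

# Pick whatever accelerator PyTorch can see; the model code never changes.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 10),
).to(device)

# TorchInductor emits Triton kernels for GPUs and C++ for CPUs,
# so nothing here is tied to a specific vendor's stack.
compiled = torch.compile(model)

x = torch.randn(32, 512, device=device)
y = compiled(x)
print(y.shape)  # torch.Size([32, 10])
```

Because nothing in that code is CUDA-specific, a new accelerator only has to plug in at the compiler and runtime layer to run existing model code: exactly the competitive opening the panel described.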
💡 Key Insight: CUDA Is Now a Whole Stack, Not Just a Language
Today, CUDA includes:
- Tensor cores
- NVLink / NVSwitch
- cuDNN / CUTLASS
- Multi-node training fabric
- Compilers and runtime glue
As one panelist said:
“CUDA isn’t just a programming model anymore—it’s the kitchen sink.”
This complexity opens the door for cleaner, simpler alternatives that better serve new generations of developers and use cases.
📣 Panelist Takeaways
🎙 Micah Villmow:
“We don’t need another CUDA clone. We need a better experience for developers.”
🎙 Davor Capalija:
“The future isn’t singular. Diverse architectures and programming models will win.”
🎙 Greg Diamos:
“CUDA’s pillars—tensor cores, massive parallelism, NVLink—are nearing saturation. New models need new hardware.”
🎙 Nicholas Wilt:
“CUDA’s flexibility made NVIDIA bold. But even that won’t last forever.”
🔮 Final Thoughts: The Bridge to the Future
The panel ended with one big question from the audience:
How do we unify the software stack across the next wave of AI accelerators?
Answer: We don’t yet.
But LLM-based dev tools, Python-native workflows, and open compiler stacks are moving fast—and they might just build the bridge.
📺 Watch the Full Panel
👉 Past, Present & Future of AI Compute | Beyond CUDA Summit 2025
🚀 Run Efficient Models on AMD GPUs
Deploy your optimized models on TensorWave’s AMD-powered AI cloud—built for training, inference, and experimentation at scale on MI300X and MI325X GPUs.
About TensorWave
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
Ready to get started? Connect with a Sales Engineer.