Published: Aug 05, 2025

Beyond CUDA with Gregory Diamos: Why AMD Is Poised to Lead AI’s Next Era

What happens when one of CUDA’s original architects says it’s time to move beyond it?

In this episode of Beyond CUDA, host Jeff Tatarchuk (Co-Founder & Chief GPU Officer at TensorWave) sits down with Gregory Diamos, a legend in the GPU world whose fingerprints are on CUDA, MLPerf, and now, open-source large model training at ScalarLM.

From his early days at NVIDIA to building the world’s first large-scale CUDA cluster, Greg shares why the AI world’s dependence on CUDA might be nearing a tipping point. He explains why AMD is uniquely positioned to challenge the CUDA moat, why memory-rich GPUs change the economics of inference, and how projects like ScalarLM aim to make open, vendor-neutral training frameworks actually viable at scale.

Whether you’re a systems architect, model developer, or just CUDA-curious, this one’s for you.

Episode Breakdown: AMD, Open Ecosystems, and ScalarLM

🔧 CUDA’s True Origin Story

  • CUDA wasn’t supposed to be the endgame; it was designed to unlock a new class of computing.
  • Diamos worked on the original CUDA team and shares behind-the-scenes stories of building performance from scratch – shared memory, assembly kernels, and all.

🧱 Why CUDA Became the Moat

  • Performance was king, but flexibility was key.
  • CUDA scaled because it served everything from physics simulations to LLMs.
  • The moat isn’t the hardware. It’s the developer ecosystem, tooling, and inertia.

🚀 Why AMD Has a Shot

  • AMD’s MI300X (192GB HBM3) and MI325X (256GB HBM3e) GPUs have massive memory and strong architectural similarities to NVIDIA’s, making porting more realistic (quick math after this list).
  • ROCm has matured: Inference workloads “just work,” and Diamos notes AMD has “thousands of engineers” closing the remaining gaps.
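To make the memory point concrete, here’s some quick back-of-the-envelope math (a minimal Python sketch; the 2-bytes-per-parameter FP16 weights, decimal gigabytes, and 80% usable-memory headroom are illustrative assumptions, not vendor guidance). A 70B-parameter model needs roughly 140GB for weights alone, so it fits on a single 256GB card but has to be sharded across several 80GB cards:

```python
# Back-of-the-envelope memory math for LLM inference (illustrative only).
# Assumptions: FP16/BF16 weights at 2 bytes per parameter, decimal GB,
# and 80% of each card usable for weights (the rest reserved for KV cache).
import math

GB = 1e9

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return n_params * bytes_per_param / GB

def min_gpus(n_params: float, gpu_gb: float, headroom: float = 0.8) -> int:
    """Rough minimum GPU count to hold the weights with KV-cache headroom."""
    return math.ceil(weight_memory_gb(n_params) / (gpu_gb * headroom))

for model, params in [("70B model", 70e9), ("405B model", 405e9)]:
    for gpu, mem in [("256GB card", 256), ("80GB card", 80)]:
        print(f"{model} on a {gpu}: ~{weight_memory_gb(params):.0f}GB weights, "
              f"needs >= {min_gpus(params, mem)} GPU(s)")
```

That single-card fit is the economic lever: the fewer cards a replica needs, the less inter-GPU communication and orchestration overhead it pays for.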

🔁 The Training vs. Inference Split Is a Mirage

  • Conway’s Law (systems mirror the org structures that build them) drove the separation of training and inference stacks.
  • ScalarLM’s mission: unify them, and make open training at scale easy on AMD.

🛠 What’s ScalarLM Building?

  • A Megatron-style, vendor-neutral framework for training and inference on AMD GPUs.
  • Single-GPU to multi-node, with a roadmap to run 1,000-GPU LLaMA-style workloads in 50 lines of code (a sketch of that style follows below).
  • Built on top of vLLM, targeting simplicity and performance without lock-in.
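To picture that small-script style, here’s a minimal vendor-neutral sketch. To be clear, this is plain PyTorch under our own assumptions, not ScalarLM’s actual API: ROCm builds of PyTorch expose the same torch.cuda interface and back the nccl backend with RCCL, so one short script covers AMD or NVIDIA, single GPU or multi-GPU:

```python
# Minimal vendor-neutral training sketch (illustrative; not ScalarLM's API).
# ROCm builds of PyTorch expose the same torch.cuda interface, and the
# "nccl" backend maps to RCCL on AMD, so this script runs unmodified on
# Instinct or NVIDIA GPUs. Scale out with: torchrun --nproc_per_node=N train.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

VOCAB, DIM = 32000, 512  # toy stand-ins for a LLaMA-style config

def main():
    distributed = "RANK" in os.environ  # torchrun sets RANK/LOCAL_RANK
    if distributed:
        dist.init_process_group(backend="nccl")  # RCCL under ROCm
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A toy transformer stack standing in for a real LLaMA-style model.
    model = nn.Sequential(
        nn.Embedding(VOCAB, DIM),
        nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True),
        nn.Linear(DIM, VOCAB),
    ).to(device)
    if distributed:
        model = DDP(model, device_ids=[device.index])

    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for step in range(10):  # toy loop over random tokens
        tokens = torch.randint(0, VOCAB, (4, 128), device=device)
        logits = model(tokens)
        loss = nn.functional.cross_entropy(logits.view(-1, VOCAB), tokens.view(-1))
        loss.backward()
        opt.step()
        opt.zero_grad()

    if distributed:
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch it with torchrun --nproc_per_node=8 train.py on a single node; ScalarLM’s pitch is extending that same small-script feel across nodes, up to the 1,000-GPU scale mentioned above.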

TL;DR

  • CUDA’s moat is software, not silicon… and that moat is shrinking.
  • AMD’s hardware is ready (up to 256GB of HBM3e, strong inference performance).
  • ScalarLM is filling the software gap with an open-source training framework designed for AMD.

Unified training + inference stacks aren’t controversial; they’re efficient.

About TensorWave

TensorWave is the AMD GPU cloud purpose-built for performance. Powered exclusively by Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.