Published: Apr 22, 2025

Scale by Spectral Compute: Run Your CUDA Workloads Faster on Affordable AMD GPUs

When it comes to GPU computing, one name has dominated the landscape for over a decade: NVIDIA and its CUDA platform. But what if you could break free from vendor lock-in — without rewriting a single line of code?

At the 2025 Beyond CUDA Summit, Spectral Compute made waves by unveiling Scale, a revolutionary new technology that allows developers to run native CUDA applications on AMD GPUs — faster, cheaper, and without compromise.

By compiling directly to AMD hardware, Scale unlocks a future where AI builders, HPC engineers, and researchers can double their GPU options, cut costs, and accelerate performance — all while staying inside the familiar CUDA ecosystem they've relied on for years.

Here’s what you need to know about this groundbreaking announcement — and why it’s set to change the future of GPU computing.

🔧 The Problem: CUDA Lock-In and Limited Hardware Options

Spectral Compute's Michael Søndergaard opened by addressing the elephant in the room: CUDA is amazing, but it's also created a hardware monopoly.

Today, if you're building in CUDA, you're locked into NVIDIA GPUs — or forced to maintain multiple codebases to support alternatives like AMD.

This bottleneck has kept developers from fully tapping into new, faster, and more affordable hardware options.

Scale changes all that.

Introducing Scale: Native CUDA on AMD Without Emulation

Unlike translation layers or emulation systems, Scale compiles CUDA directly to AMD’s architecture — meaning:

  • Zero emulation overhead
  • Direct performance on MI300X, MI325X and other AMD GPUs
  • Stay inside the CUDA ecosystem you already know

Instead of rewriting thousands of lines of code, developers can recompile and run — just like switching between x86 and ARM on CPUs.
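To make that concrete, here's what an ordinary CUDA program looks like — this kernel is our own illustrative example, not code from the talk, and nothing in it is Scale-specific. That's precisely the point: source like this is what a compiler targeting AMD hardware can consume unchanged.

```cuda
#include <cuda_runtime.h>

// A plain CUDA kernel: element-wise vector addition.
// Nothing here is NVIDIA-specific at the source level, which is
// why a compiler targeting AMD can accept it as-is.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory keeps the host-side code short.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

With a conventional toolchain this file is built with nvcc for NVIDIA GPUs; under Scale's approach, the same source would instead be compiled straight to AMD machine code — analogous to rebuilding a C program for ARM instead of x86.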

🛠️ Full Compatibility Goals: CUDA Math, Runtime, and Driver APIs

Spectral Compute isn't stopping at basic functionality.

Their roadmap includes:

  • ~90% support for CUDA C Core Math APIs
  • ~70% support for CUDA Runtime and Driver APIs
  • Full NVCC C++ semantic support for seamless compilation

Even inline PTX assembly is supported — and in some cases it runs faster on AMD than on NVIDIA hardware itself.
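For context, inline PTX is NVIDIA's virtual assembly language embedded directly inside CUDA C++ — the snippet below is a minimal illustration of our own, not code from the presentation. Supporting it means Scale has to understand and re-lower NVIDIA's ISA for AMD hardware, not just compile CUDA C++ syntax:

```cuda
__device__ unsigned int add_via_ptx(unsigned int a, unsigned int b) {
    unsigned int result;
    // Inline PTX: NVIDIA's virtual ISA embedded in CUDA C++ source.
    // A compiler targeting AMD must translate this instruction into
    // equivalent AMD machine code, which is why PTX support is notable.
    asm("add.u32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}
```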

📊 Performance Gains: Scale vs HIP on AMD MI300X

Michael shared fresh benchmarks using the Rodinia Benchmark Suite:

  • Scale achieved up to 2x faster performance compared to HIP
  • Tests were run on RunPod.io using AMD’s MI300X GPUs
  • Performance wasn’t even the focus yet — initial priority was compatibility!

💡 Once optimization becomes the focus later this year, even bigger speedups are expected.

🧠 Warp Sizes, Inline Assembly, and Real Optimizations

One big technical hurdle: NVIDIA and AMD handle warp sizes differently.

  • NVIDIA: Warp size = 32
  • AMD: Warp size = 64

Scale smartly maps CUDA’s 32-thread assumptions onto AMD’s 64-thread warps, allowing most code to run without modification.
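One practical consequence: portable CUDA code should query the warp width rather than hardcode 32. The sketch below (our illustrative example, not from the talk) uses CUDA's built-in `warpSize` device variable, which reports the native width on whatever hardware the kernel runs on:

```cuda
__global__ void warpInfo(int* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // warpSize is a built-in device variable: 32 on NVIDIA GPUs,
    // 64 on AMD CDNA GPUs such as the MI300X. Deriving lane and
    // warp indices from it keeps the code correct on both.
    int lane = tid % warpSize;   // position within the warp
    int warp = tid / warpSize;   // which warp this thread belongs to
    out[tid] = warp * 1000 + lane;
}
```

Code that instead hardcodes 32 is exactly what Scale's warp-mapping layer has to paper over on AMD's 64-wide hardware.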

Even better?
Spectral Compute is improving its inline PTX handling — sometimes generating better AMD machine code from PTX than NVIDIA's own toolchain produces for its hardware.

🌎 Open Collaboration: How You Can Get Involved

Spectral Compute is building this technology openly and wants the community involved:

  • Help test Scale and report issues
  • Contribute to open-source reimplementations of CUDA libraries
  • Extend the ROCm ecosystem together

👉 Join their Discord to collaborate, share feedback, or even send “hate mail” if something isn’t working. (Yes, they literally asked for it.)

💬 Michael's message: "We want to target all the hardware — not just some of it."

Scale is a game-changer for AI, HPC, and scientific computing — making high-performance compute more accessible, flexible, and cost-effective than ever before.

📺 Watch the Full Talk 👉 Watch Michael Søndergaard’s Presentation at Beyond CUDA Summit

🚀 Deploy Faster with AMD GPUs 👉 Build, train, and infer at scale using AMD-powered AI infrastructure through TensorWave.

About TensorWave

TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.

Ready to get started? Connect with a Sales Engineer.