AMD unwraps Instinct MI500 boasting 1,000X more performance versus MI300X — setting the stage for the era of YottaFLOPS data centers
Next-generation CDNA 6 architecture on track for 2027.
The compute demands of AI data centers are set to increase dramatically, from around 100 ZettaFLOPS today to more than 10 YottaFLOPS* over the next five years, roughly a 100-fold increase, according to AMD. To stay relevant, hardware makers must therefore raise the performance of their products across the full stack every year. AMD intends to keep pace: during the company's CES keynote, chief executive Lisa Su announced the Instinct MI500X-series AI and HPC GPUs due in 2027.
"Demand for compute is growing faster than ever," said Lisa Su, chief executive of AMD. "Meeting that demand means continuing to push the envelope on performance far beyond where we are today. MI400 was the major inflection point in terms of delivering leadership training across all workloads, inference, and scientific computing. We are not stopping there. Development of our next-generation MI500-series is well underway. With MI500, we take another major leap on performance. It is built on our next gen CDNA 6 architecture [and] manufactured on 2nm process technology and uses higher speed HBM4E memory."
AMD's Instinct MI500X-series accelerators are set to be based on the CDNA 6 architecture (no UDNA yet?), with their compute chiplets made on one of TSMC's N2-series fabrication processes (2nm-class). AMD says that its Instinct MI500X GPUs will offer up to 1,000 times higher AI performance than the Instinct MI300X accelerator from late 2023, but it does not define the exact comparison metrics.
"With the launch of MI500 in 2027, we are on track to deliver 1000 times increase in AI performance over the last four years, making more powerful AI accessible to all," added Su.
Achieving a 1,000X performance increase in four years would be a major achievement, though we should keep in mind that between the Instinct MI300X and the Instinct MI500 there is a three-generation instruction set architecture (ISA) gap (CDNA 3 => CDNA 6), a three-generation memory gap (HBM3 => HBM4E), the addition of FP4 and other low-precision formats, faster scale-up interconnects, and possibly a PCIe 6.0 connection to the host CPU.
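For a sense of how such a multiplier might be assembled, here is a back-of-the-envelope sketch. Every factor below is an illustrative assumption rather than an AMD figure; the point is only that several independent generational gains would have to compound almost perfectly to reach three orders of magnitude.

```python
# Back-of-the-envelope decomposition of AMD's 1,000X claim.
# All factors below are illustrative assumptions, not AMD figures.
factors = {
    "FP16 -> FP4 low-precision math": 4.0,    # 4x more ops per cycle at 1/4 the width
    "structured sparsity": 2.0,               # typical 2:4 sparsity speedup
    "HBM3 -> HBM4E bandwidth": 4.0,           # assumed per-GPU bandwidth growth
    "CDNA 3 -> CDNA 6 compute density": 8.0,  # assumed ISA + N2-node density gains
    "larger scale-up domain": 4.0,            # assumed rack-scale vs. 8-GPU node
}

total = 1.0
for name, gain in factors.items():
    total *= gain
    print(f"{name:36s} x{gain:4.1f}  (cumulative: {total:,.0f}x)")
# Cumulative: 1,024x -- in the ballpark of AMD's claim, but only if every
# factor is fully realized and they multiply cleanly, which real-world
# bottlenecks rarely allow.
```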
Nonetheless, the Instinct MI500 will be an all-new generation of AMD's AI and HPC GPUs with major architectural improvements, which likely include substantially higher tensor/matrix-compute density, tighter integration between compute and memory, and significantly improved performance-per-watt, perhaps achieved through a combination of ISA enhancements and TSMC's N2P fabrication process.
*One YottaFLOPS equals 1,000 ZettaFLOPS, or one million ExaFLOPS.
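As a quick sanity check on the demand forecast, the unit arithmetic works out as follows (no data beyond the figures quoted above):

```python
ZETTA = 10**21  # FLOPS in one ZettaFLOPS
YOTTA = 10**24  # FLOPS in one YottaFLOPS
EXA   = 10**18  # FLOPS in one ExaFLOPS

today  = 100 * ZETTA  # ~100 ZettaFLOPS of AI compute today, per AMD
future = 10 * YOTTA   # ~10 YottaFLOPS projected within five years

print(future / today)  # 100.0 -> the ~100-fold growth AMD cites
print(future / EXA)    # 10,000,000.0 -> 10 million ExaFLOPS
```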

usertests, quoting the article:
> Achieving a 1,000X performance increase in four years would be a major achievement, though we should keep in mind that between the Instinct MI300X and the Instinct MI500 there is a three-generation instruction set architecture (ISA) gap (CDNA 3 => CDNA 6), a three-generation memory gap (HBM3 => HBM4E), the addition of FP4 and other low-precision formats, faster scale-up interconnects, and possibly a PCIe 6.0 connection to the host CPU.

That's pretty unfathomable marketing. I have to imagine it's some edge case or something that couldn't run well in lower memory capacity, with lower precision added.
edzieba, quoting the article:
> AMD says that its Instinct MI500X GPUs will offer up to 1,000 times higher AI performance than the Instinct MI300X accelerator from late 2023, but it does not define the exact comparison metrics.

Presumably Bungholiomarks. Anything else would hardly be considered a reputable performance metric!
emerth:
I'm thinking 1000x the FP4 perf compared to MI300 FP16 perf. That or AMD is implementing 2-bit FP.
bit_user:
emerth said:
> I'm thinking 1000x the FP4 perf compared to MI300 FP16 perf. That or AMD is implementing 2-bit FP.

Well, AMD claims the MI300X had up to 5.22 POPS of sparse int8 performance. I wonder if they're comparing theoretical MI500X performance on something like BFP4 vs. the actual achieved performance of the MI300X. Even then, 1000x seems like quite a stretch. I could probably believe 100x, though.
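To put rough numbers on bit_user's point: the sketch below takes AMD's published 5.22 POPS figure at face value and uses assumed precision multipliers (illustrative, not AMD data) to show how much of the 1,000x would have to come from something other than dropping precision.

```python
# AMD's published MI300X peak: ~5.22 POPS (peta-ops/s) of sparse INT8.
mi300x_peak_pops = 5.22

# Taking the 1,000x claim literally against that peak:
implied_mi500x_pops = mi300x_peak_pops * 1000  # 5,220 POPS = ~5.2 ExaOPS per GPU

# Assumed precision multipliers (illustrative, not AMD figures):
int8_to_fp4 = 2.0  # halving operand width usually doubles matrix throughput
fp16_to_fp4 = 4.0  # if the baseline were FP16 rather than INT8

print(f"Implied per-GPU peak: {implied_mi500x_pops:,.0f} POPS "
      f"(~{implied_mi500x_pops / 1000:.1f} ExaOPS)")
print(f"Residual factor needed beyond INT8->FP4: {1000 / int8_to_fp4:.0f}x")
print(f"Residual factor needed beyond FP16->FP4: {1000 / fp16_to_fp4:.0f}x")
```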
qwertymac93:
The 1000x claim is probably for whole system performance, not just a single card. Taking into account interconnect advancements and larger addressable memory, 1000x seems possible, if unfair in the real world. You'd never use these systems beyond what they are clearly bottlenecked.
bit_user:
qwertymac93 said:
> The 1000x claim is probably for whole system performance, not just a single card. Taking into account interconnect advancements and larger addressable memory, 1000x seems possible,

System performance is determined by your biggest bottleneck. It really doesn't matter what else you do, as that bottleneck will be the limiting factor.

As such, improvements in memory bandwidth, compute, and I/O are never multiplicative. One of them will be the limiting factor, and how much you improved on whatever was previously your bottleneck is what determines system performance.

qwertymac93 said:
> You'd never use these systems beyond what they are clearly bottlenecked.

This statement doesn't make sense. The bottleneck fundamentally constrains actual use. You cannot push it beyond what the bottleneck allows.
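A minimal sketch of the bottleneck argument above, with made-up subsystem figures (none of these are AMD numbers): end-to-end throughput tracks the slowest stage, so headline per-subsystem gains do not multiply.

```python
def system_throughput(compute, mem_bw, interconnect):
    """Achievable throughput is capped by the slowest subsystem (arbitrary units)."""
    return min(compute, mem_bw, interconnect)

# Made-up baseline and next-gen subsystem capabilities:
old = system_throughput(compute=10.0, mem_bw=4.0, interconnect=6.0)    # bottleneck: memory
new = system_throughput(compute=80.0, mem_bw=16.0, interconnect=24.0)  # bottleneck: still memory

print(f"Real speedup: {new / old:.0f}x")  # 4x, set entirely by the memory improvement
print(f"Naive product of per-subsystem gains: {(80/10) * (16/4) * (24/6):.0f}x")  # 128x, overstated
```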
qwertymac93:
bit_user said:
> This statement doesn't make sense. The bottleneck fundamentally constrains actual use. You cannot push it beyond what the bottleneck allows.

Sure you can. You can run a 10GB model on an 8GB card and have part of the model paged in system RAM. And it'll run way slower, since the bottleneck will have shifted. And then next year you can run the same model on a slightly faster card with 12GB of VRAM and claim a 100x improvement. 😏 That's what I was trying to imply here. Not that an actual customer would do that, but what customers do in the real world and what manufacturers claim aren't always aligned, are they?
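qwertymac93's scenario is easy to put numbers on. A toy model with invented throughputs, assuming weights that overflow VRAM must stream over a much slower host link: a card only 1.2x faster shows a 120x "improvement" once the model fits.

```python
MODEL_GB = 10  # size of the model being benchmarked

def tokens_per_s(vram_gb, gpu_speed):
    """Invented inference throughput: collapses when the model spills out of VRAM."""
    if MODEL_GB <= vram_gb:
        return gpu_speed * 1000        # whole model resident in VRAM
    return gpu_speed * 1000 * 0.01     # overflow paged over the host link: ~100x slower

old = tokens_per_s(vram_gb=8, gpu_speed=1.0)   # paged -> 10 tokens/s
new = tokens_per_s(vram_gb=12, gpu_speed=1.2)  # resident -> 1,200 tokens/s
print(f"Claimed speedup: {new / old:.0f}x")    # 120x from a card only 1.2x faster
```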
bit_user:
qwertymac93 said:
> You can run a 10GB model on an 8GB card and have part of the model paged in system RAM. And it'll run way slower, since the bottleneck will have shifted. And then next year you can run the same model on a slightly faster card with 12GB of VRAM and claim a 100x improvement. 😏

I'll accept that it could be something like that.

I wonder if the slide deck has been published. If so, it might contain some end notes which provide more insight into that number. Without more information, we can't really say any more.
DS426:
Hmm, more die space for tensor/matrix cores makes things trickier when talking about the transition to UDNA. Looks like we won't see it any earlier than 2028, especially if RDNA5 launches in 2027H2. I'm good with that, though, as some ISA improvement over RDNA4 combined with a bigger die (or dies) could produce a real beast. AMD just has to figure out what memory to employ that's both performant and not overly expensive (and available at all, lol).