Meta's new MTIA lineup joins hyperscalers' unified push for dedicated inference chips, as companies build custom AI silicon to reduce their reliance on Nvidia
Google, AWS, Microsoft, and Meta have all independently reached the same conclusion.
Meta announced four successive generations of its custom Meta Training and Inference Accelerator (MTIA) chips on March 11: the MTIA 300, 400, 450, and 500, all scheduled for deployment over the next two years. Meta described the chips as progressively optimized for AI inference workloads, on the premise that HBM bandwidth is the binding constraint on inference.
Coming two weeks after Meta disclosed a long-term AI infrastructure agreement with AMD, the announcement puts Meta alongside Google, AWS, and Microsoft, each of which has spent the last few years building and scaling custom silicon programs for accelerated AI workloads. Will this emerging class of chips put a dent in Nvidia's stranglehold on the AI chip industry?
An inference case against GPUs
In a technical blog post published alongside the announcement, Meta described HBM bandwidth as the most important factor affecting AI inference performance, adding that mainstream chips, built for large-scale pre-training, are applied less cost-effectively to inference workloads.
“We doubled HBM bandwidth from MTIA 400 to 450, making it much higher than that of existing leading commercial products,” the post reads. The MTIA 500 then increases HBM bandwidth by an additional 50% over the MTIA 450. Both chips are optimized primarily for AI inference but can also handle other workloads, including training as a secondary use case.
The MTIA 300 is already in production for ranking and recommendations training. Meanwhile, the MTIA 400, which features a 72-accelerator scale-up domain, has completed lab testing and is on the path to data center deployment. The 450 and 500 are scheduled for mass deployment in early 2027 and later that year, respectively.
Across the full 300-to-500 progression, HBM bandwidth increases 4.5 times and compute FLOPs increase 25 times. The MTIA 500 also offers up to 80% more HBM capacity than the 450.
According to Meta, the chips use a modular chiplet architecture that allows the MTIA 400, 450, and 500 to share the same chassis, rack, and network infrastructure. That compatibility means each new chip generation drops into the existing physical footprint without requiring new data center buildouts, the mechanism Meta cited for its roughly six-month development cadence, well faster than the industry's typical one-to-two-year cycle. “More importantly,” the post adds, “we have deployed hundreds of thousands of MTIA chips in production, onboarded numerous internal production models, and tested MTIA with large language models (LLMs) like Llama.”
| | MTIA 300 | MTIA 400 | MTIA 450 | MTIA 500 |
| --- | --- | --- | --- | --- |
| Workload Focus | R&R Training | General | AI Inference | AI Inference |
| Module TDP | 800 W | 1,200 W | 1,400 W | 1,700 W |
| HBM Bandwidth | 6.1 TB/s | 9.2 TB/s | 18.4 TB/s | 27.6 TB/s |
| HBM Capacity | 216 GB | 288 GB | 288 GB | 384-512 GB |
| MX4 Performance | - | 12 PFLOPS | 21 PFLOPS | 30 PFLOPS |
| FP8/MX8 Performance | 1.2 PFLOPS | 6 PFLOPS | 7 PFLOPS | 10 PFLOPS |
| BF16 Performance | 0.6 PFLOPS | 3 PFLOPS | 3.5 PFLOPS | 5 PFLOPS |
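The headline scaling claims line up with the spec table. A minimal sanity-check sketch in Python, with one assumption on our part: since the MTIA 300 has no MX4 mode, the 25x compute figure is taken as the 300's FP8/MX8 peak against the 500's MX4 peak, which Meta does not explicitly confirm.

```python
# Sanity-check the generational-scaling claims against the spec table.
# Bandwidth in TB/s; "peak" compute in PFLOPS, taken as each chip's
# densest listed format (FP8/MX8 for the MTIA 300, MX4 for later parts;
# that pairing is our assumption, not Meta's stated basis for the 25x).
specs = {
    "MTIA 300": {"hbm_bw_tbs": 6.1, "peak_pflops": 1.2},
    "MTIA 400": {"hbm_bw_tbs": 9.2, "peak_pflops": 12.0},
    "MTIA 450": {"hbm_bw_tbs": 18.4, "peak_pflops": 21.0},
    "MTIA 500": {"hbm_bw_tbs": 27.6, "peak_pflops": 30.0},
}

bw_gain = specs["MTIA 500"]["hbm_bw_tbs"] / specs["MTIA 300"]["hbm_bw_tbs"]
flops_gain = specs["MTIA 500"]["peak_pflops"] / specs["MTIA 300"]["peak_pflops"]
step_450_to_500 = specs["MTIA 500"]["hbm_bw_tbs"] / specs["MTIA 450"]["hbm_bw_tbs"]

print(f"HBM bandwidth, 300 -> 500: {bw_gain:.1f}x")               # ~4.5x, as claimed
print(f"Peak compute, 300 -> 500: {flops_gain:.0f}x")             # 25x, as claimed
print(f"HBM bandwidth, 450 -> 500: +{(step_450_to_500 - 1):.0%}") # +50%
```

The 450-to-500 step also reproduces the "additional 50%" bandwidth increase Meta quotes, so the table and the prose are internally consistent.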
Google, AWS, and Microsoft
Google announced Ironwood, its seventh-generation TPU, at Google Cloud Next in April 2025; the company described it as the first TPU purpose-built for inference and the beginning of an “age of inference,” distinct from the training-first era that preceded it. Ironwood delivers 192 GB of HBM3E per chip at 7.37 TB/s of memory bandwidth, per Google's published specifications, and scales to configurations of up to 9,216 AI accelerators.
Then, in December at re:Invent, AWS announced Trainium3, a 3nm chip with 144 GB of HBM3E per chip at 4.9 TB/s of bandwidth; a single Trainium3 UltraServer connects 144 chips. AWS has also maintained a separate Inferentia line of inference-only chips since 2019. Meanwhile, Microsoft introduced its Maia 200, built on TSMC's 3nm process for inference workloads, calling it its “most efficient inference system.”
Broadcom is what connects the dots across many of these programs, having had a hand in building both Google’s TPUs (as Google’s silicon integrator) and Meta’s MTIA family. Meta described the MTIA chips as being developed “in close partnership with” Broadcom, saying that the company “has remained and will continue” to be a key partner in Meta’s AI infrastructure strategy.
Broadcom also notably secured an agreement back in October to help OpenAI build 10 GW of custom ASICs, with deployments beginning as early as this year. If nothing else, the role that Broadcom now plays across competing hyperscaler programs reflects both how capital-intensive custom silicon development is and how consistent the underlying architectural requirements have become.
The convergence extends to software stacks: Meta built MTIA natively on PyTorch, vLLM, and Triton; Google added beta TPU support to vLLM; and AWS runs its Neuron SDK across PyTorch, TensorFlow, and JAX. These shared inference-serving frameworks ultimately determine how easily production workloads can port between chips, and that portability is what will make the economics of switching away from CUDA-locked Nvidia silicon credible at scale.
Nvidia retains training
None of this changes Nvidia’s position in large-scale pre-training. Frontier model development still overwhelmingly runs on high-end GPU clusters, and Nvidia’s Blackwell is the current standard for that workload. Meta itself operates large Nvidia GPU clusters alongside MTIA deployments, and its February 2026 AMD agreement adds further GPU capacity to a portfolio that already spans multiple silicon vendors.
Instead, what we’re seeing is workload segmentation, whereby custom silicon takes high-volume, predictable inference workloads and GPUs retain training. MTIA 450 and 500 are designed to cover AI inference production through 2027, while Google, AWS, and Microsoft have each made equivalent commitments on their own timelines.
At the point where inference represents the bulk of AI compute cycles, hyperscalers appear to have collectively decided that paying a premium for GPUs to run those workloads is no longer financially sound.
