At ISSCC 2026, a sign of things to come.
Rebellions, a South Korea-based firm that designs AI inference accelerators, recently presented details of its multi-chiplet Rebel 100 AI accelerator, which employs Universal Chiplet Interconnect Express (UCIe) technology, at the International Solid-State Circuits Conference (ISSCC). The processor is one of the industry's first multi-chiplet designs to rely on UCIe-A interconnects to stitch four chiplets together.
Multi-chiplet designs constitute the next generation of advanced AI and HPC hardware, as performance requirements outpace what process-node advances alone can deliver. Major processor and GPU makers such as AMD, Intel, and Nvidia have embraced the benefits of multi-chiplet architectures, and their latest offerings adopt the approach wholesale.
UCIe, the industry-standard interface for connecting chiplets, is designed to provide high-bandwidth, low-latency communication between dies. So far, however, adoption of the standard has been slow, which makes the ISSCC 2026 paper from Rebellions all the more valuable.
Meet Rebel 100: A 2 PFLOPS quad-chiplet accelerator
The Rebellions Rebel 100 is a quad-chiplet AI accelerator engineered for large language model (LLM) inference. It uses a multi-chiplet architecture to optimize die yield and efficiency, aiming to strike the right balance between cost and throughput.
The Rebel 100 system-in-package (SiP) consists of four 320 mm² neural processing unit (NPU) dies, each paired with a 12-Hi, 36 GB HBM3E memory stack (for 144 GB of HBM3E per package) and linked to one another in a mesh topology. The NPU dies are fabricated on Samsung's performance-optimized SF4X node and assembled using Samsung's I-CubeS (CoWoS-S-class) advanced packaging with a silicon interposer. For power-integrity reasons, the SiP also features four integrated silicon capacitor (ISC) dies, which serve mechanical purposes as well.
The chiplets are interconnected using a UCIe-Advanced die-to-die interface running at 16 GT/s and providing 4 TB/s of aggregate bandwidth. The link achieves roughly 11 ns of FDI-to-FDI (Flit-Aware Die-to-Die Interface) latency and carries memory load-store semantics seamlessly across chiplets, allowing the SiP to function as one unified processor rather than a cluster of discrete dies.
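As a sanity check on those figures, here is a back-of-the-envelope calculation. The lane count is an assumption (a standard UCIe advanced-package module is 64 lanes wide; the paper's exact configuration is not public), as is counting both directions toward the 4 TB/s aggregate:

```python
# Back-of-the-envelope check on the UCIe-A figures in the article.
# Assumptions (not from the paper): 64 lanes per standard UCIe
# advanced-package module, and the 4 TB/s aggregate counts both directions.

LANES_PER_MODULE = 64     # standard UCIe-A module width (assumption)
GTPS_PER_LANE = 16        # 16 GT/s per lane, per the article

# One direction of one module: 64 lanes x 16 Gbit/s = 1024 Gbit/s = 128 GB/s
gbytes_per_module = LANES_PER_MODULE * GTPS_PER_LANE / 8

# Modules needed to reach the quoted 4 TB/s aggregate, bidirectional
aggregate_gbytes = 4 * 1024
modules_needed = aggregate_gbytes / (2 * gbytes_per_module)

print(f"{gbytes_per_module:.0f} GB/s per module per direction")
print(f"~{modules_needed:.0f} UCIe-A modules for a 4 TB/s aggregate")
```

Under those assumptions, the quoted aggregate works out to a plausible handful of modules per die edge, though Rebellions has not broken the number down publicly.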
On the system side, Rebel 100 connects to host devices through a pair of PCIe 5.0 x16 links that support SR-IOV and peer-to-peer functionality.
A single Rebel 100 SiP is capable of 2 FP8 PFLOPS or 1 FP16 PFLOPS of compute (without sparsity) at 600 W, matching what Nvidia's H200 delivers at 700 W. Rebellions also claims the unit achieves 56.8 tokens per second on Llama 3.3 70B with single-batch 2k/2k input/output sequences, though these are the vendor's own numbers, not independently verified results. In any case, the primary goal of this piece is to uncover the mechanics behind one of the first multi-chiplet UCIe-based AI accelerators.
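For context, a quick arithmetic check shows what 56.8 tokens per second implies for the 2k-token output phase (decode time only; prefill is ignored in this rough estimate):

```python
# Decode-time implication of the vendor's 56.8 TPS claim for a 2k-token
# output (decode phase only; prefill time is ignored in this rough check).
output_tokens = 2000
tokens_per_second = 56.8          # vendor-reported, single batch
decode_seconds = output_tokens / tokens_per_second
print(f"~{decode_seconds:.1f} s to generate 2,000 tokens")
```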
The company positions its Rebel 100 quad-chiplet package as a building block for cross-node and rack-scale systems capable of supporting trillion-parameter models and million-token contexts. It remains unclear whether Rebellions intends to build larger SiPs from the current chiplets, but it certainly envisions its partners building scale-up and scale-out clusters containing anywhere from dozens to tens of thousands of these accelerators.
Data flow in the Rebel 100
Each chiplet integrates two Neural Core Clusters, each packing eight neural cores and 32 MB of shared memory. According to the ISSCC paper, the shared memory is partitioned into 16 slices with an aggregate bandwidth of 64 TB/s, and the chiplet contains 64 routers that form an 8×4 fine-grained mesh topology with three logically separate channels: Data (D), Request (R), and Control (C). In total, each SiP carries 256 MB of scratchpad memory (at 128 TB/s).
The on-chip 2D network-on-chip (NoC) uses a straightforward XY routing scheme, so packets first travel along one axis and then the other, with turn restrictions applied to avoid deadlocks. Arbitration within routers is managed via a weighted round-robin process, so data from diverse origins is processed equitably while maintaining flexible priority levels. The quality-of-service weights can be modified at runtime to make the system favor certain traffic types depending on whether the workload is compute-heavy or memory-intensive.
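The XY scheme described above can be sketched in a few lines. This is a generic textbook model of dimension-order routing, not Rebellions' actual router logic, and it omits the arbitration and flow control a real NoC needs:

```python
# Illustrative sketch of dimension-order (XY) routing on a 2D mesh:
# a packet first resolves the X coordinate, then the Y coordinate.
# Banning Y-to-X turns in this way is what rules out routing deadlock.

def xy_next_hop(cur, dst):
    """Return the next (x, y) hop from cur toward dst under XY routing."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # resolve the X dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                       # only then move along Y
        return (cx, cy + (1 if dy > cy else -1))
    return cur                         # already at destination

def route(src, dst):
    """Full XY path from src to dst, endpoints included."""
    path, cur = [src], src
    while cur != dst:
        cur = xy_next_hop(cur, dst)
        path.append(cur)
    return path

print(route((0, 0), (3, 2)))
# X first: (0,0)->(1,0)->(2,0)->(3,0), then Y: ->(3,1)->(3,2)
```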
The 2D NoC fabric within each chiplet logically extends across UCIe, so the entire quad-chiplet system-in-package functions as a single large mesh-connected processor. Given the low chiplet-to-chiplet (or rather, FDI-to-FDI) latency, this greatly simplifies life for software developers. Curiously, each chiplet incorporates three UCIe-A interfaces for flexibility (or perhaps redundancy), and the complete arrangement scales to 256 routers across the whole mesh, so it remains unclear whether Rebellions could build accelerators with more than four chiplets on the existing architecture.
While the UCIe 1.0 specification supports CXL.io, CXL.mem, and CXL.cache protocol mappings alongside PCIe 6.0, these are optional configurations rather than mandatory requirements. The specification also permits proprietary streaming and memory-semantics protocols, which is exactly the route Rebellions took with the Rebel 100.
Rebellions built a fairly aggressive data-movement engine to keep its quad-chiplet design fed. Each NPU die integrates a configurable DMA subsystem with eight execution engines that can pull data from local HBM3E, remote HBM3E located on another chiplet, or from distributed shared memory. Bandwidth per DMA can reach up to 2.6 TB/s, which is arguably enough for an inference-focused accelerator. At the same time, to stop specific jobs from depriving others of resources, the firm established task-specific QoS mechanisms aimed at lowering long-tail latency and preventing bottlenecks when multiple workloads operate at once.
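The task-specific QoS idea can be illustrated with a smooth weighted round-robin arbiter, the same family of scheme the NoC routers use: each requester accrues credit equal to its weight every round, the highest credit wins, and the winner pays back the total. All requester names and weights below are hypothetical:

```python
# Smooth weighted round-robin arbitration sketch. Higher-weight requesters
# win proportionally more grant slots, but no requester is starved.
# Requester names and weights are hypothetical, not from the paper.
from collections import Counter

def weighted_rr_schedule(weights, slots):
    """Grant `slots` turns in proportion to each requester's weight."""
    credit = {name: 0 for name in weights}
    grants = []
    for _ in range(slots):
        for name in credit:
            credit[name] += weights[name]          # accrue credit
        winner = max(credit, key=credit.get)       # highest credit wins
        credit[winner] -= sum(weights.values())    # pay back total weight
        grants.append(winner)
    return grants

grants = weighted_rr_schedule({"compute_dma": 3, "hbm_prefetch": 2, "ctrl": 1}, 12)
print(Counter(grants))   # grants land in a 3:2:1 ratio
```

Because the weights are just dictionary values here, retuning them at runtime, as the article says the hardware allows, is a one-line change.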
Managing work across four chiplets demands precise coordination. Rather than relying on a dedicated scheduling unit, Rebellions built coordination controllers into every NPU. Each chiplet carries a specialized hardware synchronization manager with hardwired control logic that orchestrates operations across dies, working either under central oversight or in a more autonomous fashion. The design deliberately avoids naive point-to-point interactions and cross-component dependencies in order to minimize superfluous traffic and management overhead and to maintain high utilization across the various processing stages of LLM inference.
To improve the reliability of its die-to-die interface, Rebellions added, on top of standard UCIe functionality, multiple loopback modes, transaction-level tracking, and channel-level diagnostics, all intended to simplify validation and fault isolation in a multi-die package during debugging. For enterprise deployments, Rebellions also introduced a configurable mode that leverages these capabilities to trade a small amount of speed for better MTBF and MTTF figures and thus peak availability, which is vital for large AI clusters where uptime matters more than a slight throughput gain.
A new way to supply energy
The Rebel 100 accelerator is specified for a 600 W thermal design power (TDP), but brief power spikes, which occur when many neural cores activate at once, can exceed that rating by a factor of two. Because these currents rise quickly and sharply, they cause voltage droops, posing a serious power-integrity challenge for the quad-chiplet AI accelerator.
To mitigate this, Rebellions implemented a hardware staggering technique that offsets start times of neural cores instead of activating them simultaneously, which smooths current ramps and reduces supply noise. Measurements show that synchronized switching produces steep current spikes and noticeable voltage disturbance, whereas staggered activation results in gentler transitions and a more stable power rail, according to Rebellions. Supplementary control circuitry actively restricts the instruction throughput during brief intervals to further minimize abrupt workload shifts inside an individual chiplet and across multiple dies.
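A toy model shows why staggering helps. Assuming each core's current ramps linearly from 0 to 1 A over a few timesteps (purely illustrative numbers, not Rebellions' data), the peak per-step current change drops sharply once start times are offset:

```python
# Toy model: n_cores each ramp their current from 0 to 1 A over ramp_steps
# timesteps; `stagger` offsets each core's start time. Compares the peak
# per-step current change (a proxy for di/dt) with and without staggering.

def current_profile(n_cores, ramp_steps, stagger):
    """Total current draw over time for staggered linear core ramps."""
    horizon = stagger * (n_cores - 1) + ramp_steps + 1
    total = [0.0] * horizon
    for core in range(n_cores):
        start = core * stagger
        for t in range(horizon):
            elapsed = t - start
            if elapsed >= ramp_steps:
                total[t] += 1.0                    # core fully on
            elif elapsed > 0:
                total[t] += elapsed / ramp_steps   # core still ramping
    return total

def peak_step(profile):
    """Largest one-step current increase in the profile."""
    return max(b - a for a, b in zip(profile, profile[1:]))

sync = current_profile(8, 4, 0)   # all eight cores start together
stag = current_profile(8, 4, 4)   # each core starts four steps later
print(peak_step(sync), peak_step(stag))   # staggering cuts the peak 8x
```

The trade-off is latency: the staggered schedule takes longer to reach full power, which is why the hardware also throttles instruction issue rather than relying on staggering alone.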
Memory traffic adds another layer of stress. HBM3E bursts can be just as demanding as compute surges, which puts extra strain on the power delivery network. To reinforce it, Rebellions added dedicated integrated silicon capacitor (ISC) dies that embed distributed capacitance across the VDD rails to serve both the NPU and the HBM3E PHY. This approach further dampens voltage oscillations and lowers impedance peaks compared to a design without ISC dies.
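The benefit of added on-package capacitance follows from the ideal-capacitor impedance formula, Z = 1/(2πfC): more distributed capacitance means a lower-impedance shunt path at the transient frequencies of interest. The values below are illustrative, not from the paper:

```python
# Ideal-capacitor impedance: Z = 1 / (2*pi*f*C). More distributed
# capacitance on the rail means lower impedance at transient frequencies.
# The 100 nF / 100 MHz values are illustrative, not from the paper.
import math

def cap_impedance_ohms(c_farads, f_hz):
    """Impedance magnitude of an ideal capacitor at frequency f_hz."""
    return 1.0 / (2 * math.pi * f_hz * c_farads)

z = cap_impedance_ohms(100e-9, 100e6)   # 100 nF at a 100 MHz transient
print(f"{z * 1000:.1f} mOhm shunt impedance")
```

Real capacitors also have parasitic inductance and resistance, which is part of why placing the ISC dies inside the package, close to the loads, matters as much as the raw capacitance.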
A blueprint for multi-chiplet designs
With its debut Rebel 100 multi-chiplet AI inference accelerator, Rebellions has matched the performance of Nvidia's H200 while consuming less power, even if it uses considerably more silicon. The bigger achievement for the company is that the Rebel 100 SiP stands as one of the market's first multi-chiplet accelerators to employ UCIe-A interconnects.
Instead of building two large reticle-size dies, Rebellions opted for a quad-chiplet design with four 320 mm² dies that are much easier to develop and yield, especially given Samsung's pellicle-less approach to EUV, which does not particularly favor large dies. To make the quad-chiplet configuration work smoothly, Rebellions created a built-in 2D mesh network-on-chip that logically extends beyond each chiplet's boundary over UCIe, so the complete quad-chiplet system-in-package behaves like one large mesh-connected processor.
To enhance its architecture further, Rebellions bypassed conventional CXL-based protocols in favor of its proprietary configurable DMA subsystem and synchronization managers. To guarantee power stability, it used a hardware staggering method that levels out current surges and reduces supply noise, and it incorporated integrated silicon capacitor (ISC) dies to damp voltage variations and lower impedance peaks.
While not using the UCIe 1.0 specification to its full extent, the Rebel 100 represents a good example of a multi-chiplet design that relies on industry-standard interconnection while still using proprietary techniques to maximize performance and optimize the power of the system-in-package.
