LaMoSys3.5D: Heterogeneous 3.5D-IC for LLM Inference
- LaMoSys3.5D is a scalable 3.5D integrated circuit architecture that uses heterogeneous chiplets, including compute-rich PCs and bandwidth-rich DCs, for efficient LLM inference.
- It employs a hardware/software co-design with innovative D³ dataflow and mesh-aware PE mapping to balance compute and memory demands in both prefill and decode phases.
- The system delivers over 60% token/W improvement and 4–9× latency reduction relative to state-of-the-art GPU and prior 3D-DRAM inference systems.
LaMoSys3.5D is a scalable 3.5D integrated circuit (IC) architecture designed to optimize LLM inference serving through a heterogeneous composition of 3D-DRAM chiplets on a 2.5D silicon interposer. This platform employs a hardware/software co-design paradigm to balance the compute-intensive prefill and the bandwidth-intensive decode phases, providing improved throughput-per-watt and significantly lower latency relative to state-of-the-art GPU and prior 3D-DRAM-based inference systems (Wang et al., 9 Dec 2025).
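The contrast between a compute-intensive prefill and a bandwidth-intensive decode can be illustrated with a rough arithmetic-intensity estimate for a single weight GEMM. This is a minimal sketch, not taken from the paper: the hidden size of 4096, the 2048-token prompt, and the 2-byte element size are illustrative assumptions, and `gemm_arithmetic_intensity` is a hypothetical helper.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) GEMM, moving each operand once."""
    flops = 2 * m * k * n                                # one multiply and one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)   # read inputs, read weights, write outputs
    return flops / traffic


HIDDEN = 4096  # hypothetical model width (assumption, not from the source)

# Prefill: all prompt tokens share one pass over the weights.
prefill = gemm_arithmetic_intensity(m=2048, k=HIDDEN, n=HIDDEN)
# Decode: one token per step, so the weights are re-read for very little compute.
decode = gemm_arithmetic_intensity(m=1, k=HIDDEN, n=HIDDEN)

print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions prefill lands near 1024 FLOPs/byte while decode stays near 1 FLOP/byte, which is the gap that motivates pairing compute-rich and bandwidth-rich chiplets.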
1. Heterogeneous 3.5D-IC Architecture
LaMoSys3.5D organizes heterogeneous chiplets into a grid, leveraging two principal types:
- Prefill-optimized Chiplets (PC): These are compute-rich, containing higher counts of processing elements (PEs) and fewer DRAM layers.
- Decode-optimized Chiplets (DC): Bandwidth- and capacity-rich, these chiplets utilize more DRAM layers, wider through-silicon-via (TSV) interfaces, larger on-die DRAM capacity, and fewer PEs.
A representative system instantiates a or chiplet grid, partitioning PCs and DCs to spatially align with the distinct traffic profiles of model weights, activations, and the key/value (KV) cache. The chiplets are interconnected by four AIB-2.0 PHY ports, each supporting up to 200 GB/s, yielding 800 GB/s aggregate chiplet interconnect bandwidth. Internally, each chiplet features a mesh network-on-chip (NoC), with flit widths attuned to DRAM channel widths.
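To give a feel for what the interconnect figures imply, the following sketch computes the aggregate bandwidth and a back-of-the-envelope KV-cache handoff time between chiplets. The port count and per-port bandwidth come from the text; the model dimensions (32 layers, 2048 tokens, KV width 4096, 2-byte elements) and the helper names are illustrative assumptions, and the estimate assumes peak link bandwidth with no protocol overhead.

```python
PORTS_PER_CHIPLET = 4    # from the text: four AIB-2.0 PHY ports
PORT_BW_GBPS = 200.0     # from the text: up to 200 GB/s per port


def aggregate_bw_gbps(ports: int = PORTS_PER_CHIPLET, port_bw: float = PORT_BW_GBPS) -> float:
    """Peak aggregate chiplet interconnect bandwidth in GB/s."""
    return ports * port_bw


def kv_cache_bytes(layers: int, seq_len: int, kv_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of keys plus values for one sequence (hypothetical model shape)."""
    return 2 * layers * seq_len * kv_dim * bytes_per_elem


def transfer_ms(num_bytes: int, bw_gbps: float) -> float:
    """Ideal transfer time in milliseconds at the given peak bandwidth."""
    return num_bytes / (bw_gbps * 1e9) * 1e3


kv = kv_cache_bytes(layers=32, seq_len=2048, kv_dim=4096)   # assumed dimensions
print(f"aggregate: {aggregate_bw_gbps():.0f} GB/s")
print(f"KV handoff over one port:  {transfer_ms(kv, PORT_BW_GBPS):.2f} ms")
print(f"KV handoff over all ports: {transfer_ms(kv, aggregate_bw_gbps()):.2f} ms")
```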
Each chiplet stacks DRAM atop a 7 nm logic die via hybrid bonding; each DRAM layer is organized into banks served by wide TSV interfaces. DRAM command scheduling uses a closed-page, first-come-first-served (FCFS) policy with layer-interleaved bank distribution to mask activation and transfer timing (timing parameters per JESD79-4).
2. Dataflow and Parallelization Co-Design
2.1 Intra-PE Dataflow: Direct-DRAM-Delivery (D³)
LaMoSys3.5D introduces the "D³" intra-PE dataflow where GEMM computations 1 are tiled by 2 and assigned a reuse policy 3 subject to on-PE SRAM buffer size 4. Feasibility conditions are imposed (e.g., input-reuse: 5). The D³ mechanism enables selective tiles (such as 6 weights) to be directly streamed from 3D-DRAM, with critical tiles staged in SRAM. Analytical cost modeling for a given tile configuration estimates execution cycles as follows:
7
An exhaustive search across the design space yields the (nearly) optimal tile and reuse policy.
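The search loop can be sketched as follows. The paper's cost equation is not reproduced here, so `cost_cycles` is a stand-in assumption: it takes the larger of compute cycles and DRAM-transfer cycles, with DRAM traffic depending on which operand is kept in SRAM. The reuse-policy names, the SRAM footprint formula, the tile candidates, and the hardware parameters are all hypothetical.

```python
from itertools import product
from math import ceil


def cost_cycles(M, N, K, tm, tn, tk, reuse, macs_per_cycle, dram_bytes_per_cycle, elem_bytes=2):
    """Assumed cost model: max(compute cycles, DRAM-transfer cycles)."""
    compute = M * N * K / macs_per_cycle
    m_blk, n_blk, k_blk = ceil(M / tm), ceil(N / tn), ceil(K / tk)
    if reuse == "input":     # input tile held in SRAM; weights re-streamed per row block
        a, w, c = M * K, K * N * m_blk, M * N * (2 * k_blk - 1)
    elif reuse == "weight":  # weight tile held in SRAM; inputs re-read per column block
        a, w, c = M * K * n_blk, K * N, M * N * (2 * k_blk - 1)
    else:                    # "output": partial sums stay in SRAM until complete
        a, w, c = M * K * n_blk, K * N * m_blk, M * N
    dram = elem_bytes * (a + w + c) / dram_bytes_per_cycle
    return max(compute, dram)


def sram_bytes(tm, tn, tk, elem_bytes=2):
    """Assumed footprint: one input, one weight, and one output tile."""
    return elem_bytes * (tm * tk + tk * tn + tm * tn)


def search(M, N, K, sram_limit, tile_options=(16, 32, 64, 128, 256), **hw):
    """Enumerate every (TM, TN, TK, reuse) choice and keep the cheapest feasible one."""
    best = None
    policies = ("input", "weight", "output")
    for tm, tn, tk, reuse in product(tile_options, tile_options, tile_options, policies):
        tm, tn, tk = min(tm, M), min(tn, N), min(tk, K)   # tiles cannot exceed the problem
        if sram_bytes(tm, tn, tk) > sram_limit:
            continue                                      # violates the SRAM feasibility condition
        cycles = cost_cycles(M, N, K, tm, tn, tk, reuse, **hw)
        if best is None or cycles < best[0]:
            best = (cycles, (tm, tn, tk, reuse))
    return best


# Decode-like GEMM (small batch) on a hypothetical PE.
print(search(16, 4096, 4096, sram_limit=128 * 1024,
             macs_per_cycle=1024, dram_bytes_per_cycle=256))
```

The space here is small enough to enumerate fully, which is why an exhaustive sweep is practical; the result is only as good as the cost model behind it.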
2.2 PE Mapping and Scheduling
Parallelization employs tensor-parallel (TP), pipeline-parallel (PP), and optional data-parallel (DP) schemes. TP groups of PEs are determined per pipeline stage 8 via a mixed-integer linear program (MILP) to minimize both all-reduce diameter and inter-group distances:
9
Simulated annealing is used for PP stage placement to further minimize stage latency and KV handoff costs. A dynamic ORCA-style scheduler overlaps communication and adjacent compute to maximize throughput for dominant transformer computations (QKV projections, output projections, feedforward layers).
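A minimal simulated-annealing sketch for pipeline-stage placement is shown below. It is not the paper's formulation: the objective is simplified to the total Manhattan distance between consecutive stages on a mesh (a proxy for KV handoff cost), and the function names, linear cooling schedule, and move set are assumptions.

```python
import math
import random


def chain_cost(placement):
    """Sum of Manhattan distances between consecutive pipeline stages."""
    return sum(abs(r1 - r2) + abs(c1 - c2)
               for (r1, c1), (r2, c2) in zip(placement, placement[1:]))


def anneal_placement(num_stages, rows, cols, steps=20000, t0=2.0, seed=0):
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(slots)
    placement, free = slots[:num_stages], slots[num_stages:]  # one stage per mesh slot
    cost = chain_cost(placement)
    best, best_cost = list(placement), cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9                 # linear cooling schedule
        i = rng.randrange(num_stages)
        use_free = bool(free) and rng.random() < 0.5
        if use_free:                                          # move stage i to an empty slot
            j = rng.randrange(len(free))
            placement[i], free[j] = free[j], placement[i]
        else:                                                 # swap two stages
            j = rng.randrange(num_stages)
            placement[i], placement[j] = placement[j], placement[i]
        new_cost = chain_cost(placement)
        # Accept improvements always, worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(placement), cost
        elif use_free:                                        # rejected: undo the move
            placement[i], free[j] = free[j], placement[i]
        else:                                                 # rejected: undo the swap
            placement[i], placement[j] = placement[j], placement[i]
    return best, best_cost


placement, cost = anneal_placement(num_stages=8, rows=4, cols=4)
print(cost, placement)
```

A real objective would also weight stage latency and link contention; the same accept/reject loop applies once `chain_cost` is replaced by a richer cost function.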
3. Thermal-Aware Modeling and Hierarchical Design Space Exploration
3.1 Thermal Modeling
A compact electrical circuit model represents thermal resistances (0) in the 3.5D stack, covering DRAM layers, logic, and cooling path. The composite temperature is given by:
1
where leakage power 2 increases roughly exponentially, and DRAM refresh interval 3 halves every 4C increase above 5C, reducing accessible bandwidth. Transient temperature evolution is simulated using ATSim3D, and DRAM refresh penalties are dynamically accounted for in memory performance.
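The coupling between temperature, leakage, and refresh can be sketched with a lumped one-node model. Everything numeric here is an assumption for illustration, since the paper's coefficients are not reproduced above: the ambient temperature, leakage coefficient, the 85 °C threshold, the 10 °C halving step, and the 7.8 µs / 0.35 µs refresh timings are placeholders, and the halving is applied continuously rather than in discrete steps.

```python
import math


def steady_state_temp(p_dynamic_w, r_total_c_per_w, t_ambient_c=45.0,
                      p_leak_ref_w=5.0, t_ref_c=25.0, leak_coeff=0.02, iters=100):
    """Fixed-point solve of T = T_amb + R_total * (P_dyn + P_leak(T))."""
    temp = t_ambient_c
    for _ in range(iters):
        # Leakage grows roughly exponentially with temperature.
        p_leak = p_leak_ref_w * math.exp(leak_coeff * (temp - t_ref_c))
        new_temp = t_ambient_c + r_total_c_per_w * (p_dynamic_w + p_leak)
        if abs(new_temp - temp) < 1e-6:
            return new_temp
        temp = new_temp
    return temp  # not converged within iters (possible thermal runaway)


def refresh_interval_us(temp_c, base_us=7.8, threshold_c=85.0, step_c=10.0):
    """Refresh interval that halves for every step_c degrees above threshold_c."""
    if temp_c <= threshold_c:
        return base_us
    return base_us / 2 ** ((temp_c - threshold_c) / step_c)


def available_bw_fraction(temp_c, t_rfc_us=0.35, **refresh_kwargs):
    """Fraction of time the DRAM is not busy refreshing."""
    return 1.0 - t_rfc_us / refresh_interval_us(temp_c, **refresh_kwargs)


t = steady_state_temp(p_dynamic_w=20.0, r_total_c_per_w=1.0)
print(f"steady-state temperature: {t:.1f} C")
for temp_c in (80.0, 95.0, 105.0):
    print(temp_c, f"{available_bw_fraction(temp_c):.3f}")
```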
3.2 Hierarchical Design Space Exploration (DSE)
At the chiplet level, the design space is spanned over 6, 7, 8, number of cores, base-SA partition sizes, and SRAM capacities. Bayesian optimization synthesizes Pareto-efficient solutions for compute (TFLOPS), bandwidth (TB/s), and memory capacity (GB). At the system level, permutations of PC/DC chiplet ratios are evaluated to maximize throughput-per-watt, subject to strict service-level objectives (SLOs), power, thermal, area, and capacity constraints such as:
9
Simulation-based event-driven loops converge on designs satisfying these multidimensional criteria.
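The chiplet-level step ends with a set of Pareto-efficient candidates. The sketch below shows only that filtering step over three maximized objectives; the candidate values are made up, `ChipletPoint` is a hypothetical record type, and the Bayesian optimizer that would propose the candidates is not modeled.

```python
from typing import List, NamedTuple


class ChipletPoint(NamedTuple):
    name: str
    tflops: float          # compute
    bandwidth_tbps: float  # memory bandwidth
    capacity_gb: float     # memory capacity


def dominates(a: ChipletPoint, b: ChipletPoint) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    at_least = (a.tflops >= b.tflops and a.bandwidth_tbps >= b.bandwidth_tbps
                and a.capacity_gb >= b.capacity_gb)
    strictly = (a.tflops > b.tflops or a.bandwidth_tbps > b.bandwidth_tbps
                or a.capacity_gb > b.capacity_gb)
    return at_least and strictly


def pareto_front(points: List[ChipletPoint]) -> List[ChipletPoint]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = [                       # illustrative values only
    ChipletPoint("pc_a", 100.0, 2.0, 8.0),
    ChipletPoint("pc_b", 90.0, 2.0, 8.0),
    ChipletPoint("dc_a", 40.0, 6.0, 24.0),
    ChipletPoint("dc_b", 40.0, 5.0, 24.0),
    ChipletPoint("mid", 70.0, 4.0, 16.0),
]
front = pareto_front(candidates)
print([p.name for p in front])
```

The system-level step would then draw PC and DC designs from this front and evaluate their ratios against the SLO, power, thermal, area, and capacity constraints.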
4. Quantitative Performance Analysis
Comprehensive benchmarking of LaMoSys3.5D highlights substantial improvements over baseline and prior platforms:
| Metric | LaMoSys3.5D | DGX-A100 / 3D Baselines |
|---|---|---|
| Throughput-per-Watt | 0.75 tokens/s/W | 0.46 tokens/s/W (+62% LaMoSys) |
| Prefill TTFT (time to first token) | 2.13× DGX-A100 | - |
| Decode TBT (time between tokens) | up to 17× faster | - |
| End-to-end latency (geo-mean) | 4.87× lower than single-3D | TETRIS: 2.99×, 3D-LC: 3.02×, 3D-TokSIM: 8.58× |
| Decode EDP (batch=16, len=2048) | D³ baseline | Token-stationary: +13%, ARU: +9%, TETRIS: +1% |
For prefill (compute-bound), performance deltas are minor (0), while decode (memory-bound) demonstrates D³'s efficiency. Under benchmarked traces, PC maximum temperatures reach 1C, DC reach 2C; a 3C temperature rise reduces DRAM bandwidth by 4 and increases logic leakage by 5.
5. Design Principles and Practical Guidelines
Key principles distilled from LaMoSys3.5D's evaluation include:
- PD-disaggregation on 3.5D-ICs: Strategic co-location of compute-rich PCs and bandwidth- and capacity-rich DCs aligns the hardware with the prefill and decode workloads, respectively.
- Short-wide base-SAs + D³ flow: Systolic arrays are subdivided for high GEMV utility at low batch, permitting direct weight streaming from DRAM.
- Exhaustive intra-PE mapping + mesh-aware PE placement: Full (TM, TN, TK, RU) enumeration and mesh-centric MILP/SA grouping minimize communication overheads.
- Coupled thermal-performance modeling: Integrating transient thermal simulation with refresh-aware memory models exposes bandwidth erosion at elevated stack temperatures early enough to design around it; core counts per chiplet must be limited (6 cores/PE) to respect thermal budgets.
- Hierarchical and constraint-driven DSE: Chiplet vs. system design objectives are cleanly partitioned and jointly solved under QoS, area, power, and capacity constraints.
6. Significance and Impact on Inference Serving
LaMoSys3.5D demonstrates, for the first time, a platform in which a 3.5D-IC with heterogeneous 3D-DRAM chiplets, DRAM-native dataflow, mesh-aware parallel mapping, and early thermal modeling collectively realizes a more than 60% increase in token/W and a 4–9× latency reduction relative to the best-known GPU- and 3D-DRAM-based inference architectures. This represents a significant evolution in balancing architectural specialization with end-to-end LLM serving efficiency (Wang et al., 9 Dec 2025).