
ROMA Accelerator for On-Device LLMs

Updated 16 March 2026
  • ROMA is a read-only-memory-based accelerator that integrates quantized base-model weights in high-density B-ROM for efficient on-device LLM inference.
  • It features a heterogeneous memory and compute architecture with a fused-cell design to deliver over 20,000 tokens per second at approximately 1 mJ per token.
  • The accelerator’s on-chip ROM and SRAM organization supports scalable LoRA adaptations and minimal quantization loss, optimizing performance for edge deployments.

ROMA is a Read-Only-Memory-based Accelerator developed for QLoRA-based on-device LLMs. It leverages a hybrid storage hierarchy, employing high-density B-ROM macros for stable, quantized base-model weights, and SRAM for dynamic, writable data such as LoRA adapters, key-value (KV) cache, and activation buffers. ROMA enables the entire quantized base model to reside on-chip, facilitating rapid inference with over 20,000 tokens per second generation speed and energy efficiency on the order of 1 mJ per token, all without external memory requirements. The architecture further incorporates a fused-cell design, maximizing area efficiency through integrated compute and storage. This approach addresses edge deployment needs for LLMs by balancing performance, storage density, and flexibility for on-device adaptation (Wang et al., 17 Mar 2025).

1. Hardware Architecture and Computational Dataflow

ROMA’s architecture centers on a heterogeneous memory and compute system (Fig. 3). The storage hierarchy consists of 1.86 GB of on-chip B-ROM for quantized base-model weights (supporting 4-bit 3B and 2-bit 8B LLaMA models) and 304 MB of on-chip SRAM, with 288 MB allocated for LoRA weights and KV cache, and 16 MB for intermediate activations.

The compute array is organized as a 17×16 2-D mesh comprising matrix- and vector-processing units:

  • The middle row contains a 1×16 vector unit (handling element-wise ops, reductions, shuffles).
  • The top and bottom 8×16 rows are matrix units, split into H-Units (SRAM-backed for LoRA weights and KV cache in FP8/FP16) and L-Units (ROM-backed for quantized dot-products).
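The mesh layout described above can be written down as a small lookup, which makes the unit counts easy to check. This is only an illustrative sketch: the 0-indexed row numbering and the names `ROWS`, `COLS`, `VECTOR_ROW`, and `unit_kind` are assumptions, and the sketch does not model how H-Units and L-Units are placed within the matrix rows, since the text does not say.

```python
from collections import Counter

ROWS, COLS = 17, 16   # mesh dimensions stated in the text
VECTOR_ROW = 8        # middle row of 17 rows, 0-indexed (assumed numbering)


def unit_kind(row: int) -> str:
    """Return the kind of processing unit found in a given mesh row."""
    if not 0 <= row < ROWS:
        raise ValueError(f"row {row} is outside the {ROWS}-row mesh")
    return "vector" if row == VECTOR_ROW else "matrix"


# Build the full mesh and count how many units of each kind it contains.
mesh = [[unit_kind(r) for _ in range(COLS)] for r in range(ROWS)]
counts = Counter(kind for row in mesh for kind in row)
print(dict(counts))
```

Under these assumptions the two 8×16 matrix regions contribute 256 matrix units and the middle row contributes 16 vector units.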

Inference proceeds by fetching quantized weights in groups of 128 from ROM (each group associated with a shared FP16 scale $s$ and 2-bit zero point $z$), followed by alignment of FP16 input activations to shared exponents. The low-precision dot-product is evaluated as

$$\text{res} = \left(\mathrm{DotProduct}(\mathbf{value}, \mathbf{w}) - \mathrm{vsum} \cdot z\right) \cdot s \cdot 2^{(\text{max\_exp} - \mathrm{Bias})}$$
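The expression above can be sketched in plain Python to show why the zero point is subtracted once per group rather than once per weight. This is a minimal sketch under stated assumptions, not the accelerator's actual datapath: `vsum` is taken to be the sum of the aligned activation mantissas, the shared-exponent alignment is modelled as simple block floating point, and `mant_bits` stands in for the `Bias` term. The function names and the 10-bit mantissa width are hypothetical.

```python
import math
import random

GROUP = 128  # weights per quantization group, as stated in the text


def align_activations(x, mant_bits=10):
    """Align activations to one shared exponent (block floating point).

    Returns integer mantissas and a shift such that x[i] ~= mant[i] * 2**shift.
    The shift plays the role of (max_exp - Bias) in the formula.
    """
    nonzero = [v for v in x if v != 0.0]
    max_exp = max(math.frexp(v)[1] for v in nonzero) if nonzero else 0
    shift = max_exp - mant_bits
    mant = [round(v / 2.0 ** shift) for v in x]
    return mant, shift


def group_dot(mant, shift, q, s, z):
    """(DotProduct(value, w) - vsum * z) * s * 2**shift for one weight group."""
    acc = sum(m * w for m, w in zip(mant, q))  # integer multiply-accumulate
    vsum = sum(mant)                           # one sum shared by the group
    return (acc - vsum * z) * s * 2.0 ** shift


def reference_dot(mant, shift, q, s, z):
    """Same result computed by dequantizing every weight first."""
    return sum((m * 2.0 ** shift) * ((w - z) * s) for m, w in zip(mant, q))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(GROUP)]
q = [random.randint(0, 15) for _ in range(GROUP)]  # 4-bit weight codes
mant, shift = align_activations(x)
fast = group_dot(mant, shift, q, s=0.02, z=2)
slow = reference_dot(mant, shift, q, s=0.02, z=2)
print(fast, slow)
```

The two results agree up to floating-point rounding because the sum over $\mathrm{value}_i (w_i - z)$ equals $\mathrm{DotProduct}(\mathbf{value}, \mathbf{w}) - z \cdot \mathrm{vsum}$, so the scale and zero-point corrections only need to be applied once per group.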

The H-Unit subsequently applies the LoRA rank-$r$ correction,

$$W_{\mathrm{eff}} = W_{\mathrm{base}} + A B^T$$

where $A$ and $B$ are FP8 matrices of rank $r$, and outputs are written to on-chip SRAM KV cache for subsequent autoregressive decoding (Wang et al., 17 Mar 2025).
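The LoRA correction can be applied without ever forming $A B^T$, which is what makes a split between ROM-backed base weights and SRAM-backed adapters natural. The sketch below assumes $A$ has shape $d_\text{out} \times r$ and $B$ has shape $d_\text{in} \times r$, and uses plain Python lists in place of FP8 hardware arithmetic. The helper names are hypothetical and say nothing about how ROMA schedules the two paths.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def lora_forward(W_base, A, B, x):
    """y = W_base x + A (B^T x): two small matvecs instead of a merged matrix."""
    base = matvec(W_base, x)                       # frozen base path
    low_rank = matvec(A, matvec(transpose(B), x))  # rank-r adapter path
    return [b + l for b, l in zip(base, low_rank)]


def merged_forward(W_base, A, B, x):
    """Reference: build W_eff = W_base + A B^T explicitly, then multiply."""
    AB_t = [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]
    W_eff = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W_base, AB_t)]
    return matvec(W_eff, x)


W_base = [[1, 2, 3], [4, 5, 6]]  # d_out = 2, d_in = 3
A = [[1], [2]]                   # d_out x r, with r = 1
B = [[1], [0], [-1]]             # d_in x r
x = [1, 2, 3]
print(lora_forward(W_base, A, B, x), merged_forward(W_base, A, B, x))
```

Both paths give the same output, but the split form keeps the base weights untouched, so only $A$ and $B$ need to live in writable memory.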

2. B-ROM Cell Structure and Physical Integration

ROMA introduces the B-ROM (Block-ROM) macro for density-optimized ROM storage (Fig. 5). Unlike conventional ROM arrays, which require one transistor per bit, B-ROM groups every four addresses into a block and generates 16 candidate output patterns per block using a compact combinational (CGen) circuit, drastically reducing per-block transistor requirements. The total transistor count for a $D \times W$ array becomes approximately $(D/4) \cdot (W + \text{NUM}_\text{CGEN})$, which, when $W \gg \text{NUM}_\text{CGEN}$, yields a $\sim 40\%$ area reduction versus standard compiler-generated ROM.
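The transistor-count expressions for the two ROM styles can be compared directly. The sketch below only evaluates the formulas given in the text; the array size and the value of `num_cgen` are made-up illustration values, and the resulting count ratio is not the same quantity as the area reduction quoted above, since area also depends on wiring and layout.

```python
def conventional_rom_transistors(depth: int, width: int) -> int:
    """One transistor per bit for a depth x width array."""
    return depth * width


def brom_transistors(depth: int, width: int, num_cgen: int, block: int = 4) -> int:
    """Approximate B-ROM count: (D / 4) * (W + NUM_CGEN), per the text."""
    if depth % block:
        raise ValueError("depth must be a multiple of the block size")
    return (depth // block) * (width + num_cgen)


# Hypothetical array: 1024 addresses x 256 bits, with 16 CGen transistors.
D, W, NUM_CGEN = 1024, 256, 16
conv = conventional_rom_transistors(D, W)
brom = brom_transistors(D, W, NUM_CGEN)
print(conv, brom, brom / conv)
```

As `width` grows relative to `num_cgen`, the ratio approaches 1/4, which is the limit implied by grouping four addresses per block.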

Circuit-level benefits include reduced bitline capacitance $C_\text{BL}$ and efficient access time,

$$t_\text{access} \approx t_\text{dec} + t_\text{CGen} + t_\text{bitline} \cdot C_\text{BL} / I_\text{read}$$

operable at 500 MHz in TSMC 7 nm.

A key physical integration feature is the "Fused-Cell" (Fig. 6), which tightly co-locates a B-ROM block beneath or above a compute processing element (PE) in a single standard cell footprint, optimizing both metal and base-layer utilization and reducing area overhead by an additional 10–15% beyond B-ROM alone (Wang et al., 17 Mar 2025).

3. On-Chip Storage Hierarchy and Model Capacity

The on-chip memory organization supports practical hosting of sizable quantized LLMs:

| Storage Type | Capacity | Usage Details |
| --- | --- | --- |
| ROM (B-ROM) | 1.86 GB | Quantized base-model weights (4b 3B or 2b 8B LLaMA) |
| SRAM | 288 MB | LoRA weights (FP8) + KV cache (FP16), up to 4K tokens |
| SRAM | 16 MB | Intermediate activation buffers |

This configuration fits a 3B LLaMA model at 4-bit quantization (3×10⁹ weights × 4 bits ≈ 1.5 GB plus overhead), or an 8B LLaMA at 2 bits (8×10⁹ weights × 2 bits ≈ 2 GB, facilitated by B-ROM compaction). The 288 MB SRAM supports LoRA modules up to rank 64 per layer, and KV caches for up to 4,000 tokens at a total embedding dimension of 8,192 (4,096 for keys, 4,096 for values) with 2 bytes per entry (Wang et al., 17 Mar 2025).
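The weight-footprint arithmetic is easy to reproduce. The sketch below counts only raw weight bits and ignores scales, zero points, and any other per-group overhead. The function name and the round parameter counts (3×10⁹ and 8×10⁹) are simplifying assumptions.

```python
GB = 10 ** 9   # decimal gigabyte
GIB = 2 ** 30  # binary gibibyte


def weight_bytes(num_weights: int, bits_per_weight: int) -> float:
    """Raw storage for quantized weights, excluding scale/zero-point overhead."""
    return num_weights * bits_per_weight / 8


for name, n, bits in [("3B @ 4-bit", 3 * 10 ** 9, 4), ("8B @ 2-bit", 8 * 10 ** 9, 2)]:
    b = weight_bytes(n, bits)
    print(f"{name}: {b / GB:.2f} GB = {b / GIB:.2f} GiB")
```

One observation, offered as a guess rather than something the source states: the 2-bit 8B footprint of 2×10⁹ bytes is about 1.86 when expressed in binary gibibytes, so the quoted 1.86 GB ROM capacity may be using binary units.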

Area efficiency is derived from the formula $A_\text{cell, B-ROM} \approx 0.33 \times A_\text{cell, SRAM}$, with the effective bit-cell area in fused layout reaching $0.25\times$ the SRAM bit-cell area.

4. Performance Metrics and Energy Efficiency

At 500 MHz (TSMC 7 nm), ROMA delivers the following performance:

| Scenario (4b-3B, rank-16 LoRA) | Time-to-First-Token | Decoding Tokens/s (no cache) | Power | Energy/token |
| --- | --- | --- | --- | --- |
| 256-token prompt | 5.6 ms | | | |
| 4K-token prompt | 140.2 ms | 31.8K (3B), 24.1K (8B) | 33.1 W | 1.04 mJ |

Decoding throughput remains above 10K tokens/s at full cache utilization. Power consumption is 33.1 W and energy efficiency is 1.04 mJ/token (4b-3B). Compared to baseline edge hardware (i5-1135G7 at 6.8 tokens/s, RTX 4090 at 219 tokens/s), ROMA is ∼2,785× and ∼70× faster than the CPU and GPU, respectively, and offers approximately 1,000–2,000× improved energy per token (CPU: 4.12 J/token; GPU: 2.05 J/token) (Wang et al., 17 Mar 2025).
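The headline figures can be cross-checked with a few lines of arithmetic, using energy per token = power / throughput. The sketch below uses only numbers quoted in this section. It takes the peak 31.8K tokens/s decoding rate as the operating point, which is an assumption, so the ratios it prints need not match the rounded multipliers quoted above, which may have been measured at a different operating point.

```python
def energy_per_token_mj(power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token in millijoules."""
    return power_w / tokens_per_s * 1e3


ROMA_POWER_W = 33.1
ROMA_TOKENS_PER_S = 31_800  # peak 4b-3B decoding rate (assumed operating point)

# Baselines quoted in the text: (tokens/s, joules per token).
BASELINES = {"i5-1135G7": (6.8, 4.12), "RTX 4090": (219.0, 2.05)}

roma_mj = energy_per_token_mj(ROMA_POWER_W, ROMA_TOKENS_PER_S)
print(f"ROMA energy/token: {roma_mj:.2f} mJ")

ratios = {}
for name, (tps, joules) in BASELINES.items():
    speedup = ROMA_TOKENS_PER_S / tps
    energy_ratio = joules * 1e3 / roma_mj
    ratios[name] = (speedup, energy_ratio)
    print(f"{name}: {speedup:.0f}x throughput, {energy_ratio:.0f}x energy/token")
```

The first line printed reproduces the 1.04 mJ/token figure from 33.1 W and 31.8K tokens/s, which suggests the energy number was derived at the peak decoding rate.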

Quantization impacts are minimal: the 2-bit 8B model, despite having more than twice the parameter count, is only 20–30% slower than the 4-bit 3B (24.1K versus 31.8K tokens/s), with accuracy loss under 1% for standard downstream benchmarks.

5. Implementation, Silicon Evaluation, and Scalability

Fabricated in TSMC 7 nm, the total die area is 503.7 mm² with 33.1 W power at 500 MHz. Memory breakdown is 1.86 GB ROM (B-ROM) plus 304 MB SRAM. Area and power distribution are approximately: 45% area/40% power for the ROM+compute L-Units, 30% area/25% power for the SRAM+compute H-Units, with the remainder for vector units and router resources.

ROMA scales with LoRA rank (up to 64) and KV cache length. With 256 MB SRAM, a 3B model at rank 16 can buffer ≈3.8K tokens (Fig. 10). Future scalability is supported by further B-ROM miniaturization (e.g., grouping 8 addresses per block), or migration to advanced process nodes (e.g., 5 nm), potentially enabling on-chip storage of 13B or larger models given the same 1.5 GB ROM budget. Mixed quantization (2/4/8-bit) is natively supported through B-ROM layout and minor alignment logic modifications (Wang et al., 17 Mar 2025).

6. Context and Significance

ROMA’s central innovation is the deep integration of storage and compute for on-device LLM inference, specifically tailored to QLoRA’s model structure: a fixed, quantized base augmented with adaptable LoRA modules. The B-ROM macro, combined with fused-cell physical design, enables high on-chip storage density for frozen, quantized model parameters, obviating off-chip DRAM and streamlining power and area consumption. This architecture meets the stringent latency, throughput, and privacy requirements of edge inference scenarios, marking a substantive advance in hardware/software co-design for resource-constrained LLM deployment (Wang et al., 17 Mar 2025).

References (1)
