RoMa v2: QLoRA Accelerator & Vision Matcher
- The paper introduces a ROM-based accelerator with hybrid ultra-dense ROM and SRAM, employing fused integer MAC operations to achieve up to 40K tokens/s for on-device LLM inference.
- The paper presents a dense feature matcher that uses a two-stage transformer pipeline with frozen DINOv3 features, delivering state-of-the-art accuracy and speed in geometric vision tasks.
- Both systems leverage domain-specific innovations such as quantization, multi-chip tiling, and CUDA optimizations to significantly boost efficiency, robustness, and performance.
RoMa v2 refers to two distinct yet high-impact systems: (1) a hardware accelerator for on-device QLoRA LLM inference with innovations in ROM-based storage (Wang et al., 17 Mar 2025), and (2) a state-of-the-art dense feature matcher in computer vision employing advanced transformer architectures, DINOv3 features, and two-stage matching paradigms (Edstedt et al., 19 Nov 2025). Both introduce domain-specific algorithmic and architectural improvements aimed at maximizing efficiency, robustness, and performance in their respective applications.
1. ROMA-based Accelerator for QLoRA LLMs
1.1 Hybrid Storage and Compute Microarchitecture
RoMa v2 builds on the ROMA accelerator, leveraging a dual-memory hierarchy: ultra-dense Read-Only Memory (ROM) for static, quantized base model weights, and SRAM for mutable LoRA adapters, attention KV cache, and intermediate results. The ROM, exploiting ≈3× the density of SRAM in TSMC 7nm technology, enables on-chip storage of up to 1.86 GB (sufficient for an entire 4-bit 3B or 2-bit 8B LLaMA model), eliminating dependency on external DRAM. SRAM (304 MB in the reference design) supports up to 4K tokens in the cache with rank-64 adapters.
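The on-chip capacity claim is easy to sanity-check with a back-of-envelope footprint estimate. The sketch below assumes the blockwise quantization scheme described in Section 1.3 (128-element blocks, FP16 scale, UINT2 zero-point) and ignores embeddings and other non-quantized tensors:

```python
def quantized_model_bytes(n_params, bits, block=128, scale_bits=16, zp_bits=2):
    """Back-of-envelope footprint of a blockwise-quantized model."""
    weight_bits = n_params * bits
    n_blocks = n_params // block
    meta_bits = n_blocks * (scale_bits + zp_bits)  # per-block scale + zero-point
    return (weight_bits + meta_bits) / 8

gb = quantized_model_bytes(3e9, 4) / 2**30
print(f"{gb:.2f} GiB")  # prints 1.45 GiB -- comfortably under the 1.86 GB ROM budget
```

The ≈0.1 GiB of per-block metadata on top of the raw 1.5 GB of 4-bit weights is why the ROM is sized at 1.86 GB rather than exactly at the weight payload.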
Processing is organized as a 2-D array (17×16) of tiles, each comprising:
- L-Unit (low-precision fused cell): Holds quantized base weights in B-ROM; performs 2/4-bit integer multiply–accumulate (MAC).
- H-Unit (high-precision): SRAM-based; operates on LoRA/adapter and KV data using FP8/FP16 matrix–vector operations.
Element-wise operations, exponent alignment, reductions, and permutations are handled by central vector units in FP16. The fused cell structure places B-ROM and compute within the same standard-cell area, minimizing wire length and improving metal–transistor utilization, yielding ≈10% die area reduction over macro-based layouts.
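Functionally, each tile implements the QLoRA forward split: the L-Unit evaluates the frozen quantized base path while the H-Unit evaluates the low-rank adapter path. A minimal NumPy sketch of that split (dimensions, rank, and the alpha/r scaling below are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16.0

W = rng.standard_normal((d, d)).astype(np.float32)         # L-Unit: frozen base weights (ROM)
A = 0.01 * rng.standard_normal((r, d)).astype(np.float32)  # H-Unit: LoRA down-projection (SRAM)
B = np.zeros((d, r), dtype=np.float32)                     # H-Unit: LoRA up-projection, zero-init
x = rng.standard_normal(d).astype(np.float32)

# base path (integer MAC in hardware) + adapter path (FP8/FP16 in hardware)
y = W @ x + (alpha / r) * (B @ (A @ x))
```

With the standard zero-initialized up-projection, the adapter path starts as a no-op, which is why the two paths can be evaluated in separate memory/compute domains and summed late.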
1.2 B-ROM Circuit and Area Efficiency
Block-ROM (B-ROM) introduces hierarchical address decoding, reducing transistor count and switching activity relative to conventional ROM. Address lines are grouped into blocks (k=4 typical), each with a candidate generator (CGen) precomputing all possible outputs, followed by a block multiplexer and a global OR tree. This organization yields about 40% area savings and proportionally reduced static and dynamic power compared to standard ROM.
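A functional (not circuit-level) model of the hierarchical decode, assuming the low k address bits select among a block's CGen candidates and the high bits select the block:

```python
def brom_read(table, addr, k=4):
    """Hierarchical B-ROM addressing: high bits pick a block, the block's
    candidate generator (CGen) exposes all 2**k words, a local mux picks one."""
    block = addr >> k                       # high address bits -> block select
    offset = addr & ((1 << k) - 1)          # low address bits -> candidate select
    candidates = table[block << k:(block + 1) << k]  # CGen outputs for this block
    return candidates[offset]

rom = list(range(256))  # toy ROM contents
```

The read semantics are unchanged versus a flat decode; the area and power savings come from sharing decode transistors across the 2**k words of each block.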
B-ROM area:
- $A_{\text{B-ROM}} \approx D \cdot W \cdot a_{\text{tr}}$, where $D$ is the depth, $W$ the word width, and $a_{\text{tr}}$ the effective per-transistor area; hierarchical decoding reduces the transistor count entering this product.
1.3 Quantization and Integer MAC Fusion
Weights are blockwise quantized (2- or 4-bit) with per-block (128 element) scale (FP16) and zero-point (UINT2 for INT4, UINT1 for INT2). Activations are FP16, converted to shared-exponent integers. The MAC operation is performed fully in the integer domain:
Within L-Units, the dot product is evaluated as
$\sum_i w_i x_i \approx 2^{e_x} \sum_{b} s_b \Big( \sum_{i \in b} q_i\, a_i - z_b \sum_{i \in b} a_i \Big)$,
where $q_i$ are the quantized weights, $s_b$ and $z_b$ the per-block scale and zero-point, $a_i$ the integer activations, and $e_x$ their shared exponent.
Fusing B-ROM OR-plane outputs into the multiplier inputs eliminates a pipeline stage (saving ≈1 clock cycle, ≈0.5 pJ/op).
1.4 Performance Metrics and Scalability
RoMa v2 achieves the following representative metrics:
| Model | ROM (GB) | SRAM (MB) | TTFT (ms, 256 tokens) | Peak Decode (tokens/s) | Area (mm²) | Power (W) |
|---|---|---|---|---|---|---|
| 4-bit 3B LLaMA | 1.86 | 304 | 5.6 | 31.8K (≥20K sustained) | 503.7 | 33.1 |
| 2-bit 8B LLaMA | - | 304 | 6.9 | 24.1K | - | - |
KV cache is scalable to >8K tokens via SRAM up-sizing (e.g., 512 MB for extended cache) with <5% throughput penalty.
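The cache-scaling claim follows from a simple footprint formula; the grouped-query 3B-class configuration below is a hypothetical example, not taken from the paper:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """FP16 KV-cache footprint: one K and one V tensor per layer."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# hypothetical grouped-query 3B-class config
mib_4k = kv_cache_bytes(4096, layers=26, kv_heads=4, head_dim=128) / 2**20
mib_8k = kv_cache_bytes(8192, layers=26, kv_heads=4, head_dim=128) / 2**20
print(mib_4k, mib_8k)  # 208 MiB at 4K tokens; doubling the context doubles the cache
```

Since the cache grows strictly linearly in token count, extending from 4K to >8K tokens is purely an SRAM-sizing question, consistent with the <5% throughput penalty quoted above.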
1.5 Next-Generation Enhancements
Key RoMa v2 advances include:
- Multi-chip tiling across B-ROM segments for 13B–30B LLMs, using high-speed inter-die interconnects with <1 ns bond-wire latency.
- Hierarchical ROM: hot subset in SRAM; cold in ultra-dense ROM.
- Finer quantization: INT3 with per-row zero points; mixed signed/unsigned quantization for attention layers.
- DVFS: Dynamic scaling to 0.6V for idle/short stages, up to 50% dynamic power savings.
- Programmable B-ROM block-size CGen: layout-time optimization for model size.
- FinFET MUXes reduce local MUX area by ≈20%.
- Monolithic 3D Integration: B-ROM metal layers stacked atop compute, enabling area savings (≈8%) and shorter routing.
Projected figures: ≈450 mm² area for an 8B model, ≤25 W at 500 MHz, >40K tokens/s for 3B, >30K tokens/s for 8B, and a >10K-token KV cache under DVFS. RoMa v2 thus achieves >2× better area efficiency, >1.5× better energy efficiency, and >25% higher throughput relative to ROMA (Wang et al., 17 Mar 2025).
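The DVFS saving follows directly from the CMOS dynamic-power relation P_dyn = C·V²·f. Assuming a hypothetical 0.75 V nominal rail and a 20% clock reduction when dropping to 0.6 V:

```python
def dyn_power_ratio(v_new, v_nom, f_new, f_nom):
    """CMOS dynamic power P = C * V**2 * f; switched capacitance C cancels in the ratio."""
    return (v_new / v_nom) ** 2 * (f_new / f_nom)

# hypothetical: 0.75 V nominal -> 0.6 V, with a 20% clock reduction
r = dyn_power_ratio(0.6, 0.75, 0.8, 1.0)
print(f"dynamic power scaled to {r:.0%}")  # ~51% of nominal, i.e. roughly the cited 50% savings
```

The quadratic voltage term dominates, which is why even a modest rail drop during idle or short decode stages recovers most of the quoted savings.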
2. RoMa v2 Dense Feature Matching Network
2.1 Two-Stage Detector-Free Matching Architecture
RoMa v2 refines dense matching by employing a "coarse-then-refine" pipeline:
- Coarse Stage: A multi-view transformer, initialized from frozen DINOv3 ViT-L features (layers 11, 17, …), processes concatenated sequences projected to 768 dimensions. The transformer (ViT-B scale: 12 layers, 12 heads) alternates between global and frame-wise self-attention with normalized rotary position encoding. Coarse similarity matrices are constructed via temperature-scaled cosine similarity, feeding an NLL-style auxiliary loss.
- Refinement Stage: Three UNet-like CNN refiners (strides 4, 2, 1) process VGG19 features, the upsampled coarse warp, displacement fields, and custom CUDA-computed local correlation patches. Each outputs a sub-pixel displacement, a confidence update, and a 2×2 Cholesky-parameterized precision matrix.
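The coarse-stage similarity computation reduces to a temperature-scaled softmax over cosine similarities. A minimal NumPy sketch (the temperature value is illustrative, not the paper's):

```python
import numpy as np

def coarse_match(fa, fb, tau=0.05):
    """Temperature-scaled cosine-similarity matching between two sets of
    descriptors (rows of fa, fb); returns a row-stochastic match distribution."""
    fa = fa / np.linalg.norm(fa, axis=1, keepdims=True)
    fb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
    S = fa @ fb.T / tau                       # temperature-scaled similarity matrix
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)         # row-wise softmax
    return P

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 32))
P = coarse_match(f, f)  # matching a feature map against itself
```

Matching a map against itself makes the diagonal dominate, which is the sanity check the NLL-style auxiliary loss enforces on true correspondences.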
2.2 Custom Losses
- Matcher Loss: a negative log-likelihood over the temperature-scaled coarse similarity matrix, i.e.
  $\mathcal{L}_{\text{match}} = -\sum_i \log \frac{\exp(S_{i j^*(i)}/\tau)}{\sum_k \exp(S_{ik}/\tau)}$,
  where $j^*(i)$ is the ground-truth correspondence of position $i$.
- Refiner Loss (per stride $s$): $\mathcal{L}_{\text{ref}}^{(s)} = \mathcal{L}_{\text{NLL}}^{(s)} + \mathcal{L}_{\text{conf}}^{(s)}$,
  where $\mathcal{L}_{\text{NLL}}^{(s)}$ is the NLL of the ground-truth displacement under $\mathcal{N}(\hat{\Delta}_s, \Lambda_s^{-1})$ and $\mathcal{L}_{\text{conf}}^{(s)}$ supervises the confidence update.
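Because the precision matrix is produced through its Cholesky factor, the NLL term is cheap and numerically stable to evaluate. A sketch of the 2-D case (a reference computation, not the authors' loss code):

```python
import numpy as np

def gaussian_nll(residual, L):
    """NLL of a 2-D residual under N(0, Lambda^{-1}) with precision
    Lambda = L @ L.T, L lower-triangular (Cholesky parameterization).
    log det(Lambda) = 2 * sum(log(diag(L))), so no determinant is needed."""
    z = L.T @ residual                          # whitened residual
    return 0.5 * float(z @ z) - np.sum(np.log(np.diag(L))) + np.log(2 * np.pi)

L = np.array([[1.0, 0.0], [0.3, 2.0]])          # predicted Cholesky factor
r = np.array([0.5, -0.2])                       # displacement error
nll = gaussian_nll(r, L)
```

Parameterizing by the Cholesky factor guarantees the predicted precision is positive definite for any network output, which is presumably why the refiners emit L rather than the precision matrix itself.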
Ablations demonstrate the centrality of the NLL term, precision head, and frozen DINOv3 features for performance and robustness.
2.3 Training Data and Augmentation
RoMa v2 is trained on a mixture of nine datasets spanning both wide and small baselines, hand-balanced via per-dataset mixture weights and thresholds: MegaDepth, AerialMD, BlendedMVS, HyperSim, TartanAir v2, Map-Free, ScanNet++ v2, FlyingThings3D, and VKITTI2/UnrealStereo4k. Augmentations consist of horizontal flip, grayscale conversion, hue/brightness jitter, and random translation.
2.4 End-to-End Pipeline and CUDA Optimization
Inference sequence:
- Extract DINOv3 features at full resolution.
- Run the coarse matcher at ¼ resolution.
- Upsample predicted warp/confidence to full resolution.
- Apply refiners at strides 4, 2, 1.
The custom CUDA kernel computes local patch correlations on the fly with constant extra memory rather than materializing a global cost volume, reducing memory usage in the high-resolution refiners by ≈15%. End-to-end throughput reaches 30.9 image pairs/s on an H200 GPU (batch = 8), outperforming RoMa (18.5 pairs/s) while being significantly more memory efficient than UFM (4.8 GB vs. 16.2 GB peak).
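The kernel's semantics can be stated precisely with a NumPy reference implementation; the CUDA version computes the same quantity without the padded intermediate, and the radius below is a hypothetical value:

```python
import numpy as np

def local_correlation(fa, fb, radius=2):
    """Reference version of the local correlation: for every pixel, correlate
    its descriptor in fa with a (2r+1)^2 neighborhood in fb (zero-padded),
    instead of building an all-pairs cost volume."""
    C, H, W = fa.shape
    k = 2 * radius + 1
    out = np.zeros((k * k, H, W), dtype=fa.dtype)
    pad = np.pad(fb, ((0, 0), (radius, radius), (radius, radius)))
    offsets = [(dy, dx) for dy in range(k) for dx in range(k)]
    for idx, (dy, dx) in enumerate(offsets):
        out[idx] = (fa * pad[:, dy:dy + H, dx:dx + W]).sum(axis=0)  # per-pixel dot product
    return out

fa = np.ones((8, 4, 4), dtype=np.float32)       # toy C=8 feature map
corr = local_correlation(fa, fa, radius=1)
```

The output is (2r+1)² correlation channels per pixel, which is exactly what the refiners consume; restricting the search to a local window is what keeps the memory cost constant in image size.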
2.5 DINOv3 Backbone and Generalization
RoMa v2 uses a frozen DINOv3 ViT-L backbone for initial feature extraction, preserving shape bias and generalizability. Empirical results show substantial improvement: MegaDepth linear-probe EPE drops from 27.1 (DINOv2) to 19.0 (DINOv3), and robustness increases from 77.0% to 86.4%. Freezing also mitigates sub-pixel bias and overfitting to appearance shifts, further stabilized by an exponential moving average over parameter snapshots.
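The stabilizing EMA over parameter snapshots is the standard exponential-moving-average update (the decay value here is a typical choice, not the paper's):

```python
def ema_update(ema, params, decay=0.999):
    """One EMA step over a dict of parameter snapshots (decay assumed)."""
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}

# toy example: averaging a single scalar parameter toward a new value
ema = {"w": 1.0}
for step in range(3):
    ema = ema_update(ema, {"w": 0.0})
```

Evaluation then uses the averaged weights rather than the last iterate, smoothing out the step-to-step appearance-dependent noise mentioned above.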
2.6 Experimental Results and Benchmarks
RoMa v2 sets new state-of-the-art benchmarks:
- Relative pose estimation (MegaDepth-1500 AUC@5°/10°/20°): 62.8/77.0/86.6 (on par with best 3D methods)
- ScanNet-1500: 33.6/56.2/73.8
- Dense matching EPE (pixels): TA-WB 13.82 (UFM 15.85, RoMa 60.61); MegaDepth 1.47 (UFM 2.34)
- SatAst AUC@10px: 37.0% (RoMa 23.5%, UFM 1.8%)
Key ablation findings include a +30 PCK gain from the NLL term and +20 AUC@1° with predicted covariance-weighted refinement.
3. Comparative Summary of RoMa v2 Variants
| Domain | Main Innovation | Core Technology | Peak Efficiency | Open-source |
|---|---|---|---|---|
| LLM hardware | B-ROM, hybrid memory, DVFS | On-device QLoRA ASIC | >40K tokens/s, ≤25 W | - |
| Vision matcher | Multi-view transformer, CUDA, frozen DINOv3 | Dense 2-stage matching | SOTA accuracy & speed | Yes |
While sharing nomenclature, each RoMa v2 project targets distinct challenges: scalable LLM inference on edge hardware (Wang et al., 17 Mar 2025) and robust, accurate feature matching in geometric computer vision (Edstedt et al., 19 Nov 2025). Both demonstrate the impact of architectural modularity, efficient memory hierarchy, and integration of large-scale pretrained models tailored to their respective deployment constraints and performance targets.