Universal Modality-Injection Projectors
- Universal Modality-Injection Projectors are methods to fuse heterogeneous sensor inputs into language-aligned representations using iterative, coarse-to-fine cross-attention.
- UMIP employs tailored encoders and pre-aligned CLIP embeddings to standardize sensor features, significantly boosting QA and captioning performance.
- Empirical results on MM-Fi and XRF55 benchmarks show that UMIP enhances multimodal recognition accuracy and reasoning by effectively addressing sensor heterogeneity.
Universal Modality-Injection Projectors (UMIP) are core components in multisensory LLMs, enabling the integration of rare and heterogeneous sensing modalities into language-grounded perception and reasoning. Introduced as part of HoloLLM, UMIP addresses two critical challenges in multimodal AI: the scarcity of aligned modality-text pairs for uncommon sensors, and the heterogeneity of physical signal representations (LiDAR, infrared, mmWave radar, WiFi, etc.). UMIP enhances pre-aligned modality embeddings with modality-specific, fine-grained features through a coarse-to-fine cross-attention mechanism. This results in representations that are jointly aligned to language space and deeply infused with each sensor’s unique spatial and temporal signature, supporting robust reasoning in complex and privacy-sensitive environments (Zhou et al., 23 May 2025).
1. Functional Overview and Context
UMIP functions as a "projector" module within HoloLLM, forming a bridge between pre-aligned shallow embeddings (obtained using CLIP encoders) and the fine-grained, high-resolution features derived from frozen, modality-specific tailored encoders. Coarse CLIP embeddings serve as initial queries, while tailored encoder features are injected as cross-attention keys and values. This architecture enables adaptation to diverse sensor input types while maintaining alignment with an LLM’s semantic space.
UMIP iteratively refines these initial queries, integrating information drawn from each sensor's spatial and temporal details. The output is a compact sequence of refined query tokens that, after a linear projection to the target embedding dimension, is suitable for direct ingestion by an LLM such as LLaMA2-7B.
2. Internal Architecture
UMIP layers are structured around transformer-style blocks (eight in HoloLLM), each performing three sequential operations:
- Self-Attention: Operates on the current query matrix.
- Coarse-to-Fine Cross-Attention: Projects the self-attended queries onto fine-grained keys and values derived from tailored encoders.
- Feed-Forward Network (FFN): Updates token representations with a two-layer MLP and GELU activation.
Inputs include:
- Raw sensor data $x_m$ for each modality $m$.
- CLIP-derived embeddings $E_m \in \mathbb{R}^{N_q \times d}$, obtained from the frozen CLIP encoder.
- Tailored encoder features, flattened into a token sequence $F_m \in \mathbb{R}^{L_m \times d}$.
- Coarse query initialization: $Q_0 = E_m$.
Each UMIP block applies standard multi-head attention:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The iterative block structure is as follows (with $F_m$ the flattened tailored-encoder features serving as keys and values):

$$Q_i' = Q_{i-1} + \mathrm{SelfAttn}(Q_{i-1}), \qquad Q_i'' = Q_i' + \mathrm{CrossAttn}(Q_i', F_m, F_m), \qquad Q_i = Q_i'' + \mathrm{FFN}(Q_i'')$$

After $N$ blocks, $Q_N$ is linearly projected to a dimension compatible with the target LLM.
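The block structure above can be sketched in PyTorch. This is a minimal illustration, not the HoloLLM implementation: class names (`UMIPBlock`, `UMIP`), the pre-norm residual layout, head count, and default dimensions are all assumptions.

```python
import torch
import torch.nn as nn


class UMIPBlock(nn.Module):
    """One UMIP-style block: self-attention -> coarse-to-fine cross-attention -> FFN."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        # Two-layer MLP with GELU activation, as described in the text.
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # q:  (B, N_q, d) coarse CLIP-aligned queries
        # kv: (B, L_m, d) fine-grained tailored-encoder features (keys/values)
        h = self.n1(q)
        q = q + self.self_attn(h, h, h)[0]
        q = q + self.cross_attn(self.n2(q), kv, kv)[0]  # inject fine-grained features
        return q + self.ffn(self.n3(q))


class UMIP(nn.Module):
    """Stack of UMIP blocks plus a linear projection into the LLM's token space."""

    def __init__(self, d: int = 512, d_llm: int = 4096, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(UMIPBlock(d) for _ in range(n_blocks))
        self.proj = nn.Linear(d, d_llm)  # map refined tokens to the LLM dimension

    def forward(self, clip_emb: torch.Tensor, tailored: torch.Tensor) -> torch.Tensor:
        q = clip_emb  # coarse query initialization Q_0 = E_m
        for blk in self.blocks:
            q = blk(q, tailored)
        return self.proj(q)
```

Feeding coarse queries of shape `(batch, N_q, d)` and tailored features of shape `(batch, L_m, d)` yields LLM-ready tokens of shape `(batch, N_q, d_llm)`.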
3. Sensor Heterogeneity and Dimension Matching
UMIP explicitly addresses heterogeneity in sensor representations by employing:
- Universal CLIP Pre-Alignment: All modalities, regardless of native signal characteristics, are pre-aligned into a shared embedding space using a frozen CLIP encoder, facilitating zero-shot bootstrapping with no extra data.
- Tailored Encoders: Sensor-specific encoders (e.g., ResNet18 for vision/depth/IR, PointNet for LiDAR/radar, temporal ResNet for RFID, MetaFi for WiFi) extract distinctive features for each modality, which are then MLP-projected to standardized dimensionality .
- Dynamic Query Injection: Queries attend to keys/values from all sensors through cross-attention, adaptively fusing fine-grained spatial/temporal information with language-aligned context.
This design enables the fusion of multimodal features in a unified, text-aligned space, supporting complex human-centric reasoning tasks across variable environments.
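The dimension-matching step can be illustrated with per-modality MLP projections into a shared width `d`. The class name, the native feature widths, and the two-layer MLP shape are illustrative assumptions; only the pattern (heterogeneous encoder outputs mapped to one standardized dimensionality) follows the text.

```python
import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Project each modality's tailored-encoder features to a shared width d."""

    def __init__(self, native_dims: dict, d: int = 512):
        super().__init__()
        # One small MLP per modality, mapping its native feature width to d.
        self.mlps = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, d), nn.GELU(), nn.Linear(d, d))
            for name, dim in native_dims.items()
        })

    def forward(self, feats: dict) -> dict:
        # feats[name]: (B, L_m, native_dim) flattened tailored-encoder features
        return {name: self.mlps[name](x) for name, x in feats.items()}
```

After this step, every modality presents a `(batch, L_m, d)` token sequence, so a single cross-attention interface can consume any of them.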
4. Training Regimen and Objectives
Training follows a two-stage process:
| Stage | Components tuned | Objective(s) | Losses |
|---|---|---|---|
| 1 | Tailored encoders (one per modality) | Modality classification | Classification loss |
| 2 | UMIP + LLM | Action classification, QA, language modeling | Language-modeling + classification losses |
No alignment-specific losses are directly applied to UMIP tokens. UMIP parameters are tuned only during the second stage, relying solely on standard language modeling and classification objectives, with all encoder backbones kept frozen after stage 1 (Zhou et al., 23 May 2025).
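The staged freezing can be sketched as a small helper that toggles which parameters receive gradients. The function name and grouping are illustrative; the freeze/tune split per stage follows the table above.

```python
import torch
import torch.nn as nn


def configure_stage(stage: int, encoders: nn.Module, umip: nn.Module, llm: nn.Module) -> None:
    """Set trainability for the two-stage regimen (sketch).

    Stage 1: only the tailored encoders are tuned (modality classification).
    Stage 2: encoders frozen; UMIP and the LLM receive gradients.
    """
    for p in encoders.parameters():
        p.requires_grad_(stage == 1)
    for module in (umip, llm):
        for p in module.parameters():
            p.requires_grad_(stage == 2)
```

An optimizer built from `filter(lambda p: p.requires_grad, ...)` then naturally picks up only the active stage's parameters.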
5. Empirical Impact
UMIP’s contribution to language-grounded sensing is substantiated by ablation on MM-Fi and XRF55 benchmarks. On MM-Fi, the addition of tailored encoders raised recognition accuracy from 10.3% to 58.5% and QA accuracy to 46.6%. Further incorporation of UMIP improved QA to 56.4% and captioning from 21.9 to 22.6. On XRF55, UMIP improved QA by 1.0 percentage point over the tailored-encoder baseline (to 12.8%). This demonstrates that UMIP sharpens the alignment between sensor features and downstream language-guided reasoning—particularly boosting QA performance by up to 10 percentage points in complex environments.
6. Design Choices and Ablation Insights
Key architectural parameters and their effects include:
- Number of Transformer Blocks ($N$): Empirically set to 8 for optimal QA accuracy; values between 6 and 10 balance computational cost and performance.
- Query Token Count ($N_q$): Larger for high-resolution modalities (e.g., 256 tokens for LiDAR/WiFi, 64 for vision/depth/IR). Using fewer than 32 tokens degrades QA by ∼5 percentage points.
- Fusion Strategy: Sequential cross-attention—having each modality's features injected serially—proves superior to concatenating all features into a single key/value pool (which reduces QA by 3 percentage points).
These findings suggest that careful calibration of block depth and token granularity is essential for optimizing the coarse-to-fine integration process and overall reasoning capability.
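The sequential fusion strategy favored by the ablation can be sketched as follows. The class name and shared cross-attention module are assumptions; the key point is that each modality's features are injected serially rather than concatenated into one key/value pool.

```python
import torch
import torch.nn as nn


class SequentialFusion(nn.Module):
    """Inject each modality's fine-grained features one at a time via cross-attention."""

    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, q: torch.Tensor, feats: list) -> torch.Tensor:
        # Serial injection: queries attend to one modality's keys/values per step,
        # instead of attending to a single concatenated pool of all modalities.
        for f in feats:
            q = q + self.cross(q, f, f)[0]
        return q
```

A concatenation baseline would instead call the cross-attention once on `torch.cat(feats, dim=1)`; per the ablation, that variant loses about 3 QA percentage points.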
7. Significance and Broader Implications
UMIP demonstrates that iterative, cross-modal attention between pre-aligned and high-resolution modality features can overcome the data scarcity and heterogeneity issues intrinsic to multisensory reasoning. By aligning diverse sensor signatures with LLM-proximal representation, UMIP establishes a robust architectural pattern for real-world, language-informed embodied intelligence in environments hampered by occlusions, variable lighting, or privacy constraints. A plausible implication is the extensibility of the UMIP architecture to other emerging modalities or deployment scenarios where conventional vision-LLMs would be insufficient (Zhou et al., 23 May 2025).