Viper-F1: Hybrid State-Space Vision-Language Model
- Viper-F1 is a hybrid state-space vision-language model that replaces quadratic cross-attention with token-grid correlation and structured state-space dynamics for efficient multimodal fusion.
- It integrates a SigLIP-based Vision Transformer, a CoSSM module with FiLM conditioning, and a liquid LLM decoder for streamlined multimodal processing.
- Empirical evaluations demonstrate competitive accuracy and reduced latency, making it well-suited for real-time applications in robotics and embedded systems.
Viper-F1 is a hybrid state-space vision-LLM designed for fine-grained multimodal understanding with high computational efficiency. It departs from standard Transformer-based architectures by replacing quadratic-complexity cross-attention mechanisms with structured liquid state-space dynamics and lightweight cross-modal fusion. Viper-F1 integrates a SigLIP-based Vision Transformer (ViT) encoder, a Cross-modal State-Space Modulator (CoSSM) utilizing token-grid correlations and Feature-wise Linear Modulation (FiLM), and a liquid-based LLM decoder to yield accurate, detail-oriented text generation given visual-textual prompts. The resulting model enables linear-time inference and excels in tasks requiring precise visual grounding, particularly in resource-constrained deployments such as robotics and embedded intelligent devices (Trinh et al., 14 Nov 2025).
1. Architecture and Model Components
Viper-F1 comprises three principal modules:
- Grid-level Vision Encoder: Employs a SigLIP-based Vision Transformer to produce spatial grid-level embeddings of the input image.
- Cross-modal State-Space Modulator (CoSSM): Fuses token representations with grid-level visual features using a single top-$k$ token-grid correlation, FiLM-based conditioning, and a structured state-space sequence model. This module implements linear-time sequential modeling and minimizes reliance on expensive pairwise attention.
- Unified Liquid-based LLM Decoder: Takes the fused multimodal features and generates text autoregressively, aligned with standard LLM workflows.
The architecture omits Transformer-style cross-attention, replacing it with a one-time computation of token-grid correlations, two FiLM layers for context injection, and a state-space model for temporal (sequential) interactions (Trinh et al., 14 Nov 2025).
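A minimal sketch of how the three stages might compose is given below. The module names, shapes, and the toy fusion step are illustrative placeholders (the paper does not publish reference code); the real SigLIP encoder, CoSSM, and liquid LLM decoder are far richer than these stand-ins.

```python
import torch
import torch.nn as nn

class ToyCoSSM(nn.Module):
    """Placeholder fusion: pools the visual grids and injects them into the
    token stream with a FiLM-style scale/shift (Sections 2-3 sketch the real module)."""
    def __init__(self, d):
        super().__init__()
        self.to_gamma = nn.Linear(d, d)
        self.to_beta = nn.Linear(d, d)

    def forward(self, tokens, grids):            # tokens: (B, N_t, d), grids: (B, N_v, d)
        ctx = grids.mean(dim=1, keepdim=True)    # (B, 1, d) pooled visual context
        return self.to_gamma(ctx) * tokens + self.to_beta(ctx)

class ViperF1Sketch(nn.Module):
    """Illustrative three-stage composition: grid encoder -> CoSSM -> decoder."""
    def __init__(self, d=768, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Identity()      # stand-in for the SigLIP ViT grid encoder
        self.cossm = ToyCoSSM(d)                 # stand-in for the CoSSM fusion module
        self.decoder_head = nn.Linear(d, vocab)  # stand-in for the liquid LLM decoder

    def forward(self, image_grids, token_embeds):
        grids = self.vision_encoder(image_grids)    # (B, N_v, d) grid-level embeddings
        fused = self.cossm(token_embeds, grids)     # (B, N_t, d) cross-modal fusion
        return self.decoder_head(fused)             # (B, N_t, vocab) next-token logits
```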
2. Liquid State-Space Dynamics and Sequence Modeling
Viper-F1’s CoSSM module employs the S4 variant of Structured State-Space Models (SSMs), where the continuous-time state dynamics are defined by

$$\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t).$$

Upon temporal discretization with step size $\Delta$, the update equations become

$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k.$$

This formulation allows efficient 1-D convolutional processing with kernels $\bar{K} = (\bar{C}\bar{B},\; \bar{C}\bar{A}\bar{B},\; \ldots,\; \bar{C}\bar{A}^{L-1}\bar{B})$, so that $y = \bar{K} * u$. In practice, the SSM block propagates information in time at $O(L)$ (recurrent) or $O(L \log L)$ (convolutional) cost in sequence length $L$, depending on the parameterization and implementation. All matrices and FiLM parameters are learned end-to-end (Trinh et al., 14 Nov 2025).
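The following toy snippet illustrates the equivalence between the recurrent and convolutional views of a discretized SSM. The matrices, sizes, and naive kernel/convolution loops are purely didactic assumptions, not the S4 parameterization used in the paper (which computes the kernel efficiently rather than by explicit matrix powers).

```python
import torch

def ssm_scan(A_bar, B_bar, C_bar, u):
    """Recurrent view: x_k = A_bar x_{k-1} + B_bar u_k,  y_k = C_bar x_k.
    A_bar: (N, N), B_bar: (N, 1), C_bar: (1, N), u: (L,) single input channel."""
    N, L = A_bar.shape[0], u.shape[0]
    x, ys = torch.zeros(N), []
    for k in range(L):
        x = A_bar @ x + B_bar.squeeze(-1) * u[k]
        ys.append(C_bar @ x)
    return torch.stack(ys).squeeze(-1)          # (L,)

def ssm_kernel(A_bar, B_bar, C_bar, L):
    """Convolutional view: K_bar = (C_bar B_bar, C_bar A_bar B_bar, ..., C_bar A_bar^{L-1} B_bar)."""
    kernel, AkB = [], B_bar.clone()
    for _ in range(L):
        kernel.append((C_bar @ AkB).squeeze())
        AkB = A_bar @ AkB
    return torch.stack(kernel)                  # (L,)

# The two views agree: y = K_bar * u (causal 1-D convolution).
N, L = 4, 8
A_bar = 0.9 * torch.eye(N)                      # toy stable state matrix
B_bar, C_bar = torch.randn(N, 1), torch.randn(1, N)
u = torch.randn(L)
y_rec = ssm_scan(A_bar, B_bar, C_bar, u)
K = ssm_kernel(A_bar, B_bar, C_bar, L)
y_conv = torch.stack([sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(L)])
assert torch.allclose(y_rec, y_conv, atol=1e-4)
```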
The SSM choice impacts performance, with S4 outperforming S4D and Mamba in ablation studies: S4 achieves a benchmark average of 47.3, compared to S4D’s 46.4 and Mamba’s 40.3 [(Trinh et al., 14 Nov 2025), Table 11].
3. Cross-modal Token-Grid Correlation and FiLM Fusion
The token-grid correlation module projects text token features and grid-level visual features into a shared latent space, splitting them into multiple heads for multi-headed modeling. A scaled dot-product correlation is computed between each textual token and every visual grid; for each token, only the top-$k$ most correlated grids are retained, yielding a sparse attention map. These maps are averaged across heads, and per-token visual context vectors are computed as weighted sums of the retained grid features.
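A minimal sketch of this top-$k$ token-grid correlation is shown below. The head count, projection layers, and softmax normalization of the retained scores are illustrative assumptions consistent with the prose description, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def token_grid_context(text, grids, w_t, w_g, num_heads=4, top_k=8):
    """text: (B, N_t, d), grids: (B, N_v, d) -> per-token visual context: (B, N_t, d)."""
    B, N_t, d = text.shape
    N_v, d_h = grids.shape[1], d // num_heads
    # project both modalities into a shared latent space and split into heads
    q = w_t(text).view(B, N_t, num_heads, d_h).transpose(1, 2)    # (B, H, N_t, d_h)
    k = w_g(grids).view(B, N_v, num_heads, d_h).transpose(1, 2)   # (B, H, N_v, d_h)
    # scaled dot-product correlation between tokens and visual grids
    corr = q @ k.transpose(-2, -1) / d_h ** 0.5                   # (B, H, N_t, N_v)
    corr = corr.mean(dim=1)                                       # average across heads
    # keep only the top-k correlated grids per token (sparse attention map)
    vals, idx = corr.topk(top_k, dim=-1)                          # (B, N_t, k)
    weights = torch.softmax(vals, dim=-1)                         # normalize retained scores
    # gather the selected grid features and form per-token visual context vectors
    gathered = torch.gather(
        grids.unsqueeze(1).expand(B, N_t, N_v, d), 2,
        idx.unsqueeze(-1).expand(B, N_t, top_k, d))               # (B, N_t, k, d)
    return (weights.unsqueeze(-1) * gathered).sum(dim=2)          # (B, N_t, d)

# usage sketch with toy sizes
d = 64
w_t, w_g = nn.Linear(d, d), nn.Linear(d, d)
ctx = token_grid_context(torch.randn(2, 16, d), torch.randn(2, 49, d), w_t, w_g)
```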
FiLM conditioning is applied twice:
- FiLM-in (pre-SSM): Modulates the token stream prior to state-space modeling, with scale and shift parameters learned from the visual context vectors.
- FiLM-out (post-SSM): Further modulates the SSM output, with scale and shift parameters likewise conditioned on the visual context.
Final multimodal fusion applies a residual feed-forward network (FFN) and layer normalization, followed by global pooling to generate a compact multimodal embedding (Trinh et al., 14 Nov 2025).
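A minimal sketch of the two FiLM stages around the SSM block follows. The shapes, the mean pooling, and the `ssm`/`ffn`/`norm` placeholders are assumptions for illustration; any sequence model mapping `(B, N_t, d) -> (B, N_t, d)` can stand in for the S4 block of Section 2.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(ctx) * x + beta(ctx)."""
    def __init__(self, d_ctx, d_feat):
        super().__init__()
        self.to_gamma = nn.Linear(d_ctx, d_feat)
        self.to_beta = nn.Linear(d_ctx, d_feat)

    def forward(self, x, ctx):                  # x: (B, N_t, d), ctx: (B, N_t, d_ctx) or (B, 1, d_ctx)
        return self.to_gamma(ctx) * x + self.to_beta(ctx)

def cossm_fusion(tokens, vis_ctx, film_in, film_out, ssm, ffn, norm):
    """FiLM-in -> SSM -> FiLM-out -> residual FFN + LayerNorm -> pooled embedding."""
    h = film_in(tokens, vis_ctx)                # pre-SSM conditioning on visual context
    h = ssm(h)                                  # linear-time sequential modeling
    h = film_out(h, vis_ctx)                    # post-SSM conditioning
    h = norm(h + ffn(h))                        # residual feed-forward + layer normalization
    return h.mean(dim=1)                        # global pooling -> compact multimodal embedding
```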
4. Computational Efficiency and Complexity
The principal bottleneck in classical vision-LLMs is the quadratic cost of stacking cross-modal Transformer attention, which requires $O(N_t \cdot N_v)$ computation and memory per layer for $N_t$ text tokens and $N_v$ visual grids. In contrast, Viper-F1's CoSSM computes the token-grid correlation once, followed by efficient state-space modeling over the token sequence. The total complexity is near-linear in sequence length for small fixed $k$, representing significant practical efficiency gains.
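As a rough worked comparison under assumed sizes (illustrative numbers, not the paper's configuration or measurements), the snippet below counts correlation-score and state-update entries for stacked cross-attention versus a single correlation plus per-layer linear SSM updates.

```python
# Illustrative operation counts under assumed sizes.
N_t, N_v, layers = 256, 576, 24            # text tokens, visual grids, fused layers

cross_attention = layers * N_t * N_v       # quadratic cross-modal scores in every layer
cossm = N_t * N_v + layers * N_t           # one token-grid correlation + linear SSM updates

print(cross_attention, cossm)              # 3538944 vs 153600 entries
```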
Measured on an NVIDIA H100, Viper-F1 achieves 40.08 ms latency and 46.67 tokens/s throughput. By comparison, MobileVLM-V2 records 46.04 ms/42.86 tokens/s, and SmolVLM 2 achieves 65.72 ms/37.80 tokens/s [(Trinh et al., 14 Nov 2025), Table 9].
5. Empirical Performance and Ablations
Viper-F1 (0.8B parameters) has been evaluated on multiple multimodal understanding benchmarks:
| Benchmark | Score |
|---|---|
| VQAv2 | 76.6 |
| POPE | 69.4 |
| AI2D | 46.2 |
| MMMU_val | 26.4 |
| MME | 1376.2 |
| SQA_Image | 56.7 |
| MMB_dev | 64.6 |
On VQAv2, Viper-F1 achieves the highest accuracy among models sized 0.3–9B parameters. Ablation studies show that CoSSM fusion substantially outperforms alternative fusion connectors such as 'Prepend' and 'Cross-attend'. Visual-grounding performance improves steadily as the top-$k$ grid-selection budget increases, up to a saturation point. Qualitative analysis indicates that Viper-F1 distinguishes subtle visual details—such as search-bar text and chart labels—more reliably than other efficient VLMs (Trinh et al., 14 Nov 2025).
6. Strengths, Limitations, and Future Directions
Strengths of Viper-F1 include linear-time inference that supports real-time operation on robotics and embedded platforms, improved visual grounding through token-grid–based fusion, and competitive accuracy relative to much larger models.
Limitations include restriction to single-image inputs; extending CoSSM to video or multi-image streams remains an open challenge. Additionally, there is a trade-off between the top-$k$ grid-selection budget and grid granularity, with increased detail potentially introducing noise. A plausible implication is that optimal grid-partition strategies and adaptive selection might further improve robustness on dense visual tasks.
Future work may focus on dynamic grid partitioning and temporal state-space modulation for video-based understanding (Trinh et al., 14 Nov 2025).