
Viper-F1: Hybrid State-Space Vision-Language Model

Updated 21 November 2025
  • Viper-F1 is a hybrid state-space vision-language model that replaces quadratic cross-attention with efficient token-grid correlation and structured state-space dynamics.
  • It integrates a SigLIP-based Vision Transformer, a CoSSM module with FiLM conditioning, and a liquid LLM decoder for streamlined multimodal processing.
  • Empirical evaluations demonstrate competitive accuracy and reduced latency, making it well-suited for real-time applications in robotics and embedded systems.

Viper-F1 is a hybrid state-space vision-LLM designed for fine-grained multimodal understanding with high computational efficiency. It departs from standard Transformer-based architectures by replacing quadratic-complexity cross-attention mechanisms with structured liquid state-space dynamics and lightweight cross-modal fusion. Viper-F1 integrates a SigLIP-based Vision Transformer (ViT) encoder, a Cross-modal State-Space Modulator (CoSSM) utilizing token-grid correlations and Feature-wise Linear Modulation (FiLM), and a liquid-based LLM decoder to yield accurate, detail-oriented text generation given visual-textual prompts. The resulting model enables linear-time inference and excels in tasks requiring precise visual grounding, particularly in resource-constrained deployments such as robotics and embedded intelligent devices (Trinh et al., 14 Nov 2025).

1. Architecture and Model Components

Viper-F1 comprises three principal modules:

  1. Grid-level Vision Encoder: Employs a SigLIP-based Vision Transformer to generate $G$ spatial embeddings, denoted $X_v \in \mathbb{R}^{G \times D_v}$.
  2. Cross-modal State-Space Modulator (CoSSM): Fuses token representations $X_t \in \mathbb{R}^{T \times D_t}$ with grid-level visual features using a single top-$k$ token-grid correlation, FiLM-based conditioning, and a structured state-space sequence model. This module implements linear-time sequential modeling and minimizes the reliance on expensive pairwise attention.
  3. Unified Liquid-based LLM Decoder: Takes the fused multimodal features $X_{mm}$ and generates text autoregressively, aligned with standard LLM workflows.

The architecture omits Transformer-style $O((T + G)^2)$ cross-attention, instead replacing it with a one-time computation of token-grid correlations, two FiLM layers for context injection, and a state-space model for temporal (sequential) interactions (Trinh et al., 14 Nov 2025).
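As a shape-level sketch, the three-stage pipeline can be outlined as follows. The dimensions, module internals, and toy fusion below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative shape-level sketch of the Viper-F1 pipeline.
# G grids, T tokens, and the widths D_v/D_t are assumed values.
G, T, D_v, D_t = 64, 16, 384, 256

def vision_encoder(image):
    """Stand-in for the SigLIP ViT encoder: returns G grid embeddings."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((G, D_v))       # X_v in R^{G x D_v}

def cossm(X_t, X_v):
    """Stand-in for CoSSM: fuses text tokens with grid-level features."""
    W_v = np.zeros((D_v, D_t))                 # toy projection to D_t
    np.fill_diagonal(W_v, 1.0)
    context = X_v @ W_v                        # (G, D_t)
    return X_t + context.mean(axis=0)          # toy fusion, shape (T, D_t)

X_v = vision_encoder(image=None)
X_t = np.zeros((T, D_t))                       # token embeddings
X_mm = cossm(X_t, X_v)                         # fused features for the decoder
assert X_mm.shape == (T, D_t)
```

The fused features $X_{mm}$ would then be consumed autoregressively by the liquid LLM decoder.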

2. Liquid State-Space Dynamics and Sequence Modeling

Viper-F1’s CoSSM module employs the S4 variant of Structured State-Space Models (SSMs), where the continuous-time state dynamics are defined by: $\frac{d h(t)}{dt} = A h(t) + B u(t), \quad y(t) = C h(t)$. Upon temporal discretization, the update equations become: $h_t = \bar{A} h_{t-1} + \bar{B} u_t, \quad y_t = \bar{C} h_t$. This formulation allows efficient 1-D convolutional processing with kernel $K_t = \bar{C} \bar{A}^{t} \bar{B}$. In practice, the SSM block propagates information in $O(T\, f(T))$ time, with $f(T) \in \{1, \log T\}$ depending on the parameterization and implementation. All state matrices and FiLM parameters are learned end-to-end (Trinh et al., 14 Nov 2025).
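The equivalence between the discretized recurrence and the convolutional form can be checked numerically. The matrices below are random stand-ins for illustration, not S4's structured (HiPPO-initialized) parameters:

```python
import numpy as np

# Check that the recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t
# matches a 1-D convolution with kernel K_t = C A^t B.
rng = np.random.default_rng(0)
N, T = 4, 8                                # state size, sequence length
A = 0.5 * rng.standard_normal((N, N))      # scaled down for stability
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(T)

# Recurrent form, starting from h_{-1} = 0
h = np.zeros((N, 1))
y_rec = []
for t in range(T):
    h = A @ h + B * u[t]
    y_rec.append(float(C @ h))

# Convolutional form: y_t = sum_{s<=t} K_{t-s} u_s
K = np.array([float(C @ np.linalg.matrix_power(A, t) @ B) for t in range(T)])
y_conv = [sum(K[t - s] * u[s] for s in range(t + 1)) for t in range(T)]

assert np.allclose(y_rec, y_conv)
```

The convolutional view is what makes $O(T\,f(T))$ training-time processing possible, since the kernel can be applied with FFT-based convolution.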

The SSM choice impacts performance, with S4 outperforming S4D and Mamba in ablation studies: S4 achieves a benchmark average of 47.3, compared to S4D’s 46.4 and Mamba’s 40.3 [(Trinh et al., 14 Nov 2025), Table 11].

3. Cross-modal Token-Grid Correlation and FiLM Fusion

The token-grid correlation module projects text and visual features into a shared latent space: $X_t' = W_t X_t, \quad X_v' = W_v X_v$. Features are split into $H$ heads for multi-headed modeling, and the scaled dot-product correlation is computed as $S_{mm} = \mathrm{softmax}\left(\frac{X_t^H (X_v^H)^\top}{\sqrt{D_H}}\right)$. For each textual token, only the top-$k$ correlated visual grids are retained, yielding a sparse attention map $\hat{S}$. These maps are averaged across heads, and the visual context vectors $c$ are computed as $c = \sum_{g=1}^{G} \hat{S}_g X_v'$.
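A minimal single-head sketch of this top-$k$ correlation, assuming already-projected features and renormalization of the surviving weights (both simplifications for illustration):

```python
import numpy as np

# Single-head top-k token-grid correlation sketch; sizes are assumed.
rng = np.random.default_rng(0)
T, G, D = 5, 12, 16                        # tokens, grids, shared width
k = 4
X_t = rng.standard_normal((T, D))          # projected text tokens  X_t'
X_v = rng.standard_normal((G, D))          # projected grid features X_v'

S = X_t @ X_v.T / np.sqrt(D)               # scaled dot products, (T, G)
S = np.exp(S - S.max(axis=1, keepdims=True))
S = S / S.sum(axis=1, keepdims=True)       # row-wise softmax

# Keep only the top-k grids per token and renormalize the kept weights
S_hat = np.zeros_like(S)
for i in range(T):
    top = np.argsort(S[i])[-k:]
    S_hat[i, top] = S[i, top] / S[i, top].sum()

c = S_hat @ X_v                            # per-token visual context, (T, D)
assert c.shape == (T, D)
```

Each token thus attends to at most $k$ grid cells, which is what keeps the fusion sparse and cheap.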

FiLM conditioning is applied twice:

  • FiLM-in (pre-SSM): Modulates the token stream prior to state-space modeling, with $\gamma_{in}, \beta_{in}$ learned from the context $c$.
  • FiLM-out (post-SSM): Further modulates the SSM output with $\gamma_{out}, \beta_{out}$, also conditioned on $c$.

Final multimodal fusion applies a residual feed-forward network (FFN) and layer normalization, followed by global pooling to generate the compact embedding $z$ (Trinh et al., 14 Nov 2025).
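The double FiLM scheme can be sketched as follows; the linear maps producing $(\gamma, \beta)$ and the placeholder running-mean "SSM" are illustrative stand-ins, not the paper's modules:

```python
import numpy as np

# Sketch of FiLM-in -> SSM -> FiLM-out conditioning on a visual context c.
rng = np.random.default_rng(0)
T, D = 5, 16
x = rng.standard_normal((T, D))            # token stream
c = rng.standard_normal(D)                 # pooled visual context

def film_params(c, seed):
    """Toy linear maps producing (gamma, beta) from the context vector."""
    r = np.random.default_rng(seed)
    W_g, W_b = r.standard_normal((2, D, D))
    return c @ W_g, c @ W_b

def ssm(x):
    """Placeholder for the S4 block: a causal running mean over time."""
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

g_in, b_in = film_params(c, 1)
g_out, b_out = film_params(c, 2)
h = ssm(g_in * x + b_in)                   # FiLM-in, then sequential model
y = g_out * h + b_out                      # FiLM-out
assert y.shape == (T, D)
```

Conditioning both before and after the sequence model lets the visual context shape what the SSM reads in and how its output is interpreted.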

4. Computational Efficiency and Complexity

The principal bottleneck in classical vision-language models is the quadratic cost of stacked cross-modal Transformer attention, which requires $O(T\,G\,D_t)$ computation and $O(T\,G)$ memory per layer. In contrast, Viper-F1’s CoSSM computes the token-grid correlation once, followed by efficient state-space modeling: $O(T\,G\,D_t)$ for the correlation and $O(T\,D_t^2 + T\,D_t\,f(T))$ for the SSM. The total complexity $O(T\,G\,D_t + T\,D_t^2 + T\,D_t\,f(T))$ is near-linear for small fixed $k$, representing significant practical efficiency gains.
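A back-of-envelope comparison under assumed sizes (all values hypothetical, $f(T)=1$ case) illustrates the gap between stacked cross-attention and the one-time correlation plus SSM cost:

```python
# Rough operation counts; T tokens, G grids, width D, L layers are assumed.
T, G, D, L = 128, 576, 768, 24

cross_attn = L * T * G * D                 # O(T*G*D) per layer, stacked L times
cossm      = T * G * D + T * D * D         # one correlation + SSM pass

print(cross_attn // cossm)                 # rough speedup factor
```

Under these toy numbers the one-time fusion is roughly an order of magnitude cheaper; the real gap depends on the actual layer counts and widths.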

Measured on an NVIDIA H100, Viper-F1 achieves 40.08 ms latency and 46.67 tokens/s throughput. By comparison, MobileVLM-V2 records 46.04 ms/42.86 tokens/s, and SmolVLM 2 achieves 65.72 ms/37.80 tokens/s [(Trinh et al., 14 Nov 2025), Table 9].

5. Empirical Performance and Ablations

Viper-F1 (0.8B parameters) has been evaluated on multiple multimodal understanding benchmarks:

Benchmark Score
VQAv2 76.6
POPE 69.4
AI2D 46.2
MMMU_val 26.4
MME^p 1376.2
SQA_Image 56.7
MMB_dev 64.6

On VQAv2, Viper-F1 achieves the highest accuracy among models in the 0.3–9B parameter range. Ablation studies demonstrate that CoSSM fusion substantially outperforms alternative fusion connectors such as 'Prepend' and 'Cross-attend'. Visual-grounding performance improves steadily as the top-$k$ grid selection increases up to $k=4$. Qualitative analysis indicates that Viper-F1 distinguishes subtle visual details, such as search-bar text and chart labels, more reliably than other efficient VLMs (Trinh et al., 14 Nov 2025).

6. Strengths, Limitations, and Future Directions

Strengths of Viper-F1 include linear-time inference that supports real-time operation on robotics and embedded platforms, improved visual grounding through token-grid–based fusion, and competitive accuracy relative to much larger models.

Limitations include the restriction to single-image inputs; extending CoSSM to video or multi-image streams remains an open challenge. There is also a trade-off between top-$k$ grid selection and granularity, since increased detail can introduce noise. A plausible implication is that better grid-partition strategies and adaptive $k$ selection might further improve robustness on dense visual tasks.

Future work may focus on dynamic grid partitioning and temporal state-space modulation for video-based understanding (Trinh et al., 14 Nov 2025).

References

  1. Trinh et al., "Viper-F1: Hybrid State-Space Vision-Language Model," 14 Nov 2025.