MIVA: Modular Image-to-Video Adapter
- MIVA is a parameter-efficient framework that extends pretrained image models into video by injecting lightweight, task-specific temporal adapters.
- It employs bottleneck, spatio-temporal, and cross-modal modules to enable temporal dynamics, motion control, and cross-frame attention mechanisms.
- MIVA drastically reduces training overhead while maintaining high performance across tasks like video retrieval, emotion recognition, and image-to-video generation.
A Modular Image-to-Video Adapter (MIVA) is a parameter-efficient technique for extending pretrained image models—often frozen Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), or diffusion backbones—into the video domain by injecting lightweight, task-specific plug-in modules. MIVA frameworks preserve the original backbone parameters while equipping them with temporal modeling, spatio-temporal reasoning, motion control, and cross-modal alignment via specialized adapters. This approach drastically reduces training and storage overhead compared to full fine-tuning while retaining or exceeding state-of-the-art performance on video understanding and generation tasks across text-video retrieval, video emotion recognition, video conversation, and image animation domains (Jin et al., 2023, Guo et al., 2023, Gowda et al., 2024, Liu et al., 2023, Li et al., 23 Dec 2025, Pan et al., 2022, Zhu et al., 30 Sep 2025).
1. Core Architectural Principles
MIVA operates by augmenting frozen image encoders (e.g., ViT, ResNet, DINOv2, CLIP, latent U-Nets) with small, trainable modules inserted at strategic locations—typically after feed-forward networks, before attention blocks, or in parallel branching paths. These adapters enable the image model to process video clips (sequences of frames) by introducing temporal dynamics through several key design patterns:
- Bottleneck Adapter Modules: Bottleneck adapters down-project high-dimensional features (d → d', with d' ≪ d), apply lightweight fusion or temporal operations, and up-project back to the original dimension (d' → d), with residual addition for stability (Jin et al., 2023, Gowda et al., 2024, Pan et al., 2022).
- Temporal Adaptation Modules: Adapters capture temporal dependencies by combining frame-level features via transformers, 1D temporal convolutions, Dynamic Dilated 3D Convolutions (DConv3D), or cross-frame attention, often producing per-frame calibrated upsampling weights (Jin et al., 2023, Gowda et al., 2024, Li et al., 23 Dec 2025, Guo et al., 2023).
- Cross-Modality Tying (CMT): Parameter sharing is enforced across branches (vision and text) to promote modality alignment, for example via low-rank shared factors (Jin et al., 2023).
- Branching Temporal Structures: Parallel temporal branches are attached alongside visual attention blocks, selectively gated and trained for video-specific cues (Liu et al., 2023).
- Attention-based Motion Injection: In generative pipelines, adapters inject retrieved or learned motion priors into diffusion models through dedicated cross-attention layers, conditioning video synthesis on both appearance and motion (Zhu et al., 30 Sep 2025, Li et al., 23 Dec 2025, Guo et al., 2023).
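The bottleneck pattern described above can be sketched in a few lines of NumPy; the fusion op (a ReLU placeholder), the dimensions, and the zero-initialized up-projection are illustrative assumptions, not any specific paper's implementation:

```python
import numpy as np

def bottleneck_adapter(x, W_down, W_up, fuse=lambda h: np.maximum(h, 0.0)):
    """Residual bottleneck adapter: down-project (d -> d'), apply a
    lightweight fusion op (ReLU placeholder here), up-project (d' -> d),
    and add the input back as a residual."""
    h = fuse(x @ W_down)        # (T, d) -> (T, d')
    return x + h @ W_up         # (T, d') -> (T, d), residual addition

rng = np.random.default_rng(0)
T, d, d_prime = 8, 768, 64      # frames, feature dim, bottleneck dim
x = rng.standard_normal((T, d))
W_down = 0.02 * rng.standard_normal((d, d_prime))
W_up = np.zeros((d_prime, d))   # zero-init: the adapter starts as an identity map
y = bottleneck_adapter(x, W_down, W_up)
```

Zero-initializing the up-projection is a common trick so that training starts from the frozen backbone's original behavior.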
2. Mathematical Formulation and Implementation
The typical MIVA pipeline involves the following formalism:
- Adapter Block:
Adapter(x) = x + W_up · φ(W_down · x),
where W_down ∈ ℝ^{d'×d}, W_up ∈ ℝ^{d×d'}, d' ≪ d, and φ is a lightweight transformer/attention mechanism (Jin et al., 2023).
- Temporal Fusion:
z = x + W_up · DConv3D(W_down · x),
where W_down and W_up are projection matrices, and DConv3D is a dynamic dilated 3D convolution, applied after down-projection (Gowda et al., 2024).
- Cross-Frame Attention (diffusion-based):
Out_i = Softmax(Q_i K_ref^T / √d) · V_ref,
where queries Q_i from frame i attend to keys and values derived from reference frames (e.g., the conditioning image). Dynamic weighting of residuals via an MLP ensures time-dependent control (Li et al., 23 Dec 2025).
- Retrieval-Augmented Motion:
Motion features from similar videos are extracted, adapted with a causal transformer, and injected as adapters into a frozen U-Net:
Attn(Q, K_m, V_m) = Softmax(Q K_m^T / √d) · V_m,
where K_m and V_m come from M, the adapted motion tokens (Zhu et al., 30 Sep 2025).
- Loss Functions:
  - Contrastive InfoNCE for retrieval tasks:
    L_NCE = −(1/B) Σ_i log [ exp(s(v_i, t_i)/τ) / Σ_j exp(s(v_i, t_j)/τ) ]
    (Jin et al., 2023, Liu et al., 2023)
  - Denoising Score Matching for video generation:
    L_DSM = E_{x_0, ε, t} [ ‖ε − ε_θ(x_t, t, c)‖² ]
Empirical ablations consistently demonstrate that MIVA can operate with only 1–8% trainable parameters relative to backbone size, yet deliver results on par with full fine-tuning (Jin et al., 2023, Gowda et al., 2024, Pan et al., 2022).
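The InfoNCE objective used for the retrieval tasks above can be sketched directly from a batch similarity matrix; the temperature value and one-directional (text-to-video) formulation here are illustrative choices, not tied to any one paper's settings:

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """One-directional InfoNCE over a similarity matrix sim[i, j] = s(v_i, t_j);
    matched pairs sit on the diagonal."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

# A perfectly aligned batch scores a far lower loss than an uninformative one.
aligned = info_nce(np.eye(4))          # strong diagonal similarities
uniform = info_nce(np.zeros((4, 4)))   # no preference: loss = log(batch size)
```

In practice the loss is usually symmetrized by averaging the row-wise (text-to-video) and column-wise (video-to-text) directions.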
3. Specializations: Adapter Variants and Temporal Mechanisms
MIVA encompasses a spectrum of adapter implementations tailored for distinct backbone architectures and tasks:
- Spatio-Temporal Adapters: Incorporate 3D convolutions (e.g., Dynamic Dilated 3D Conv) or depth-wise 3D convolutions (DWConv3D) for simultaneous spatial and temporal information fusion (Gowda et al., 2024, Pan et al., 2022).
- Temporal Convolution Adapters: Use 1D convs along the temporal axis for temporal aggregation at low parameter cost (Gowda et al., 2024).
- Transformer-based Temporal Modules: Employ lightweight temporal self-attention, either within contextual branches (e.g., temporal transformer block, Motion Context Transformer in retrieval-based methods), or per-patch across all frames (Jin et al., 2023, Liu et al., 2023, Zhu et al., 30 Sep 2025).
- Motion-Control Adapters for Diffusion Models: Implement cross-frame attention branches for identity preservation (I2V-Adapter), LoRA-based motion fusion (Few-Shot MIVA), and retrieval-augmented injection for zero-shot transfer (MotionRAG) (Guo et al., 2023, Li et al., 23 Dec 2025, Zhu et al., 30 Sep 2025).
A tabular summary of principal adapter types is presented below:
| Adapter Type | Temporal Mechanism | Principal Backbone |
|---|---|---|
| Bottleneck Adapter | Light Transformer/Attention | CLIP/ViT/ResNet |
| Spatio-temporal Adapter | DConv3D, DWConv3D | ViT, ResNet |
| Branching Temporal Adapter | Parallel Transformer blocks | CLIP (late layers) |
| Cross-frame Attention | Attention over frame set | U-Net (Diffusion) |
| Motion-Adapter Injection | Attention + retrieval priors | Diffusion U-Net |
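The cross-frame attention row of the table can be sketched as scaled dot-product attention in which every frame queries a shared reference frame; the shapes and the single-reference-frame choice are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_frame_attention(Q, K_ref, V_ref):
    """Each frame's queries attend to keys/values of a reference frame,
    propagating its appearance (identity) across the clip.
    Q: (T, N, d) queries per frame; K_ref, V_ref: (N, d) reference tokens."""
    d = Q.shape[-1]
    A = softmax(Q @ K_ref.T / np.sqrt(d))   # (T, N, N) attention weights
    return A @ V_ref                        # (T, N, d)

rng = np.random.default_rng(1)
T, N, d = 4, 5, 8
Q = rng.standard_normal((T, N, d))
K_ref = rng.standard_normal((N, d))
V_ref = rng.standard_normal((N, d))
out = cross_frame_attention(Q, K_ref, V_ref)
```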
4. Training Methodologies and Efficiency
MIVA frameworks keep the pretrained image backbone frozen and optimize only the adapters via supervised or contrastive objectives. Key details include:
- Hyperparameters: AdamW optimization is standard, with learning rates tuned per task and batch sizes of 1–128 depending on dataset and hardware (Jin et al., 2023, Gowda et al., 2024, Guo et al., 2023, Pan et al., 2022).
- Few-Shot Adaptation: Some variants, notably Few-Shot MIVA (Li et al., 23 Dec 2025), are designed for rapid training (∼10 videos, single consumer GPU, 3% parameter overhead), with residual fusion enabling compositional motion control.
- Data and Augmentation: For retrieval and recognition, datasets like MSR-VTT, MSVD, LSMDC, DiDemo, ActivityNet, DFEW, FERV39K, MAFW are standard (Jin et al., 2023, Gowda et al., 2024, Liu et al., 2023). Video generation adapters train on WebVid-10M, in-house high-res clips, or bespoke few-shot collections (Guo et al., 2023, Li et al., 23 Dec 2025, Zhu et al., 30 Sep 2025).
- Masking Strategies: Asymmetric token masking (masking 70% of spatial-temporal tokens in temporal branches) accelerates training and decreases resource consumption (Liu et al., 2023).
- Parameter Efficiency: Adapter parameterizations range from 1–8% of the backbone; for example, CLIP-ViT-B/16 with d=768, L=12, adapter bottleneck d'=64, yields 2.4% overhead (Jin et al., 2023); spatio-temporal adapters on ViT use ∼1 M tunable parameters versus 86 M in full fine-tuning (Gowda et al., 2024).
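The parameter-overhead arithmetic can be checked with a back-of-envelope count. The sketch below tallies only the two projection matrices per bottleneck adapter, giving roughly 1.4% for the CLIP-ViT-B/16 configuration above; the cited 2.4% figure presumably also counts the fusion and cross-modality components, so treat this as a lower bound:

```python
def bottleneck_adapter_params(d, d_prime, layers):
    """Parameters of the two projection matrices in each bottleneck adapter
    (biases and any fusion-module weights ignored)."""
    return 2 * d * d_prime * layers

backbone = 86_000_000                             # ViT-B/16, roughly 86 M parameters
added = bottleneck_adapter_params(768, 64, 12)    # one adapter per transformer layer
overhead = added / backbone                       # projections alone: about 1.4%
```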
5. Applications across Video Tasks
MIVA modules generalize effectively across video-related tasks:
- Video Text Retrieval: MV-Adapter achieves state-of-the-art recall rates on five benchmarks with only 2.4% of parameters trained, matching or exceeding full fine-tuning and outperforming other PETL baselines (Jin et al., 2023).
- Video Emotion Recognition: MIVA plugged into ResNet or ViT yields top performance on DFEW, FERV39K, MAFW, with ∼1 M parameters and outperforming heavier video models (Gowda et al., 2024).
- Video Conversation and QA: Branching temporal adapters (BT-Adapter) facilitate zero-shot and instruction-tuned multimodal dialogue with substantial GPU/resource savings (Liu et al., 2023).
- Image-to-Video Generation: MIVA, I2V-Adapter, and retrieval-augmented MotionRAG inject identity and motion priors into frozen video diffusion models, supporting high-fidelity generation, user control (motion compositionality), and zero-shot domain adaptation via database swap (Guo et al., 2023, Li et al., 23 Dec 2025, Zhu et al., 30 Sep 2025).
- Generalization and Extensibility: Plug-in adapters enable flexible adaptation for downstream tasks—video action recognition, object segmentation, QA, captioning—by substituting classifier heads or integrating cross-attention adapters (Gowda et al., 2024).
Empirical results are summarized below for retrieval and recognition tasks:
| Benchmark | Method | Params (%) | T2V R@1 | T2V R@Sum | V2T R@1 | V2T R@Sum | UAR (DFEW) | WAR (DFEW) |
|---|---|---|---|---|---|---|---|---|
| MSR-VTT | Full tune | 100 | 45.0 | 200.2 | 45.3 | 202.4 | - | - |
| MSR-VTT | MV-Adapter | 2.4 | 46.2 | 202.1 | 47.2 | 205.9 | - | - |
| DFEW | MIVA (ViT) | 1.0 | - | - | - | - | 61.5 | 74.3 |
| DFEW | FE-Adapter | 7.7 | - | - | - | - | 60.9 | 73.7 |
6. Recent Extensions and Domain Adaptation
Contemporary work has advanced MIVA toward retrieval-augmented and compositional frameworks:
- MotionRAG integrates k-NN video retrieval, motion resampling, causal motion-context transformers, and motion attention adapters, enabling in-context learning of motion priors and zero-shot domain transfer—requiring only a database update for new domains (Zhu et al., 30 Sep 2025).
- Few-Shot Compositional MIVA enables modular motion patterns, compositionality at inference, and mask-guided control, handling domains with scarce labeled video by training adapter banks (∼10 samples/motion) (Li et al., 23 Dec 2025).
- I2V-Adapter and Frame Similarity Priors facilitate fine-grained identity preservation and motion-stability trade-offs for diffusion-based image animation, maintaining compatibility with community personalization and control tools (Guo et al., 2023).
Suggested customizations involve scaling adapter depth, dilation, cross-modal pooling, LoRA reparameterization for further efficiency, and domain-specific encoder tuning for sharper priors (Zhu et al., 30 Sep 2025, Gowda et al., 2024).
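The retrieval step in a MotionRAG-style pipeline reduces to a nearest-neighbour lookup over precomputed motion embeddings; the function and array names below are hypothetical sketches, not MotionRAG's actual API:

```python
import numpy as np

def retrieve_motion(query, db_keys, db_motions, k=2):
    """Cosine-similarity k-NN over a motion-feature database, returning the
    indices and motion features of the k closest reference videos."""
    q = query / np.linalg.norm(query)
    keys = db_keys / np.linalg.norm(db_keys, axis=1, keepdims=True)
    idx = np.argsort(-(keys @ q))[:k]      # top-k by cosine similarity
    return idx, db_motions[idx]

rng = np.random.default_rng(2)
db_keys = rng.standard_normal((100, 32))          # per-video retrieval embeddings
db_motions = rng.standard_normal((100, 16, 32))   # motion tokens per video
query = db_keys[7] + 0.01 * rng.standard_normal(32)   # near-duplicate of entry 7
idx, motions = retrieve_motion(query, db_keys, db_motions)
```

Because domain adaptation only requires swapping `db_keys`/`db_motions`, no adapter retraining is needed for new motion domains.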
7. Limitations and Future Research
While MIVA architectures dramatically improve efficiency and flexibility, several limitations persist:
- Adapter Placement: Empirical results favor pre-attention injection for temporal fusion, but optimal placement may depend on downstream application or backbone architecture (Gowda et al., 2024, Pan et al., 2022).
- Temporal Scope: Most MIVA variants are constrained by the base model's temporal window (e.g., 16–32 frames); extension to longer sequences requires architectural scaling.
- Motion Diversity: For image-to-video generation, motion priors retrieved or learned may not cover all domain-specific patterns without large, diverse databases or combinatorial adapter fusion.
- Adapter Scaling: Trade-offs exist between bottleneck size, temporal kernel shape, adapter depth, and performance; future work may explore conditional gating or lightweight self-attention within adapters (Pan et al., 2022).
- Integration with Vision-LLMs: Extending MIVA techniques for unsupervised or cross-modal video-text pre-training remains an open direction (Liu et al., 2023, Zhu et al., 30 Sep 2025).
MIVA constitutes a unifying framework for parameter-efficient cross-modal transfer from images to videos, driving advances in both discriminative and generative video AI methodologies with broad adaptability across architectures, tasks, and domains.