PandaGPT: Modular Multimodal LLM

Updated 2 June 2026

PandaGPT is a modular multimodal LLM that integrates visual, audio, and sensor data using frozen pretrained encoders and lightweight connectors.
It uses image–text instruction fine-tuning to enable zero-shot cross-modal reasoning and emergent capabilities across diverse modalities.
Its architecture offers strong compositional features but also presents supply-chain and backdoor security risks via its vulnerable connector module.

PandaGPT is a modular multimodal LLM (MLLM) paradigm that unifies visual, auditory, and several sensor modalities with strong LLMs to enable holistic, instruction-following reasoning across modalities. The canonical design uses frozen pretrained encoders, lightweight projection modules, and an LLM with minimal trainable parameters. Its architecture and alignment techniques enable emergent cross-modal capabilities, while its compositional modularity introduces significant supply-chain and security implications within the current MLLM landscape (Su et al., 2023, Wang et al., 8 May 2026, Liu et al., 2024).

1. Model Architecture and Parameterization

PandaGPT assembles three primary modules to enable simultaneous multimodal understanding and instruction following:

Multimodal Encoder (ImageBind): All input forms (image, audio, video, depth, thermal, IMU) are embedded into a shared $d_e$ -dimensional latent space via pretrained, frozen ImageBind encoders; for ImageBind, $d_e = 1024$ .
Connector/Projection Head: A lightweight, trainable linear or MLP module $f_{\rm conn}$ (typically a matrix $W \in \mathbb{R}^{d_{\text{LLM}} \times d_e}$) aligns encoder outputs to the LLM input space: $z = f_{\rm conn}(h)$, where $h$ is the ImageBind embedding. For Vicuna-13B, $d_{\text{LLM}} = 5120$ (4096 for Vicuna-7B).
LLM (Vicuna): A predominantly frozen model (e.g., Vicuna-13B), sometimes augmented with trainable LoRA adapters inserted in the attention layers (≈0.4% of model parameters). Only the connector and LoRA weights are updated during supervised fine-tuning (Su et al., 2023).
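To make the LoRA adapter idea concrete, here is a minimal NumPy sketch of the standard low-rank update applied to one frozen linear layer. The layer sizes, rank, and scaling factor `alpha` are toy values chosen for illustration and are not taken from PandaGPT's configuration.

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha):
    """Apply a frozen linear layer plus a low-rank LoRA update.

    x: (d_in,) input; w_frozen: (d_out, d_in) frozen weight;
    a: (r, d_in) and b: (d_out, r) are the trainable LoRA factors.
    """
    r = a.shape[0]
    return w_frozen @ x + (alpha / r) * (b @ (a @ x))

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 4, 8.0   # toy sizes, all hypothetical

w_frozen = rng.normal(size=(d_out, d_in))      # never updated
a = rng.normal(scale=0.01, size=(rank, d_in))  # small random init
b = np.zeros((d_out, rank))                    # zero init: no change at start
x = rng.normal(size=d_in)

y0 = lora_forward(x, w_frozen, a, b, alpha)

# Only a and b would receive gradients during fine-tuning.
trainable = a.size + b.size
fraction = trainable / w_frozen.size
```

With `b` initialized to zero the adapted layer starts out identical to the frozen one, and the trainable share shrinks as the layer width grows relative to the rank, which is why adapters of this kind can stay well under one percent of a large model.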

The overall inference pipeline embeds the raw input using ImageBind, projects the embedding via $f_{\rm conn}$, and prepends the resulting vector to the text prompt for the LLM. The ImageBind encoders and the base LLM weights remain frozen, which makes adaptation efficient and supports strong transfer.
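The embed, project, and prepend steps can be sketched as follows. This is an illustration only: `encode` is a hypothetical stand-in that returns a random vector rather than a real ImageBind call, the connector is assumed to be a single matrix without bias, and the LLM width is a toy value.

```python
import numpy as np

D_E = 1024    # ImageBind embedding width, as stated in the text
D_LLM = 64    # toy LLM hidden width; a real model uses its own hidden size

rng = np.random.default_rng(0)

def encode(raw_input):
    """Hypothetical stand-in for the frozen multimodal encoder."""
    # A real encoder maps pixels or waveforms to a D_E vector.
    return rng.normal(size=D_E)

def project(h, w):
    """Linear connector f_conn: (D_E,) -> (D_LLM,)."""
    return w @ h

def build_llm_input(z, text_token_embeddings):
    """Prepend the projected vector as one extra position before the prompt."""
    return np.vstack([z[None, :], text_token_embeddings])

w = rng.normal(scale=0.02, size=(D_LLM, D_E))      # trainable connector
prompt_embeddings = rng.normal(size=(5, D_LLM))    # 5 prompt tokens (toy)

h = encode("image.jpg")
z = project(h, w)
llm_input = build_llm_input(z, prompt_embeddings)  # shape (6, D_LLM)
```

The whole multimodal input occupies a single prefix position here, which is also why grounding is coarse: the LLM sees one global vector per input rather than per-region or per-segment features.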

2. Training Regimen and Data

PandaGPT's core training protocol uses supervised next-token prediction over multi-turn, image–text dialog pairs. The key aspects include:

Data: The model is trained on approximately 160,000 image–text instruction–response pairs sourced from instruction-following datasets such as LLaVA and MiniGPT-4. Each training instance is a multi-turn dialog paired with a single image. No explicit audio–text, video–text, or other cross-modal aligned pairs are used during training (Su et al., 2023).
Objective: Standard autoregressive cross-entropy loss on assistant response tokens, conditioning on the projected multimodal embedding and text context; explicitly,

$\mathcal{L}(W, \theta_{\rm LoRA}) = - \sum_{i=1}^n \sum_{t=1}^{T_i} \log p_\phi(y_{i, t} \mid x_i, y_{i,<t}, z)$

where $z = f_{\rm conn}(h)$ is the projected embedding, and $\phi$ includes both connector and LoRA parameters.

Compute: Training runs on eight A100 GPUs for ≈7 hours across two epochs, using per-GPU mini-batches, the AdamW optimizer, and a learning rate with linear decay to zero (Su et al., 2023).

No contrastive, alignment, or reinforcement learning objectives are involved; all cross-modal generalization is inherited via the pretrained ImageBind latent space.
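The training objective in this section, a summed negative log-likelihood restricted to assistant response tokens, can be sketched in NumPy as below. The function name, the toy logits, and the mask layout are illustrative assumptions; in a real run the logits would come from the LLM conditioned on the projected embedding and the dialog context.

```python
import numpy as np

def masked_next_token_loss(logits, targets, assistant_mask):
    """Summed negative log-likelihood over assistant tokens only.

    logits: (T, V) scores per position; targets: (T,) token ids;
    assistant_mask: (T,) with 1 where the target is a response token.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * assistant_mask).sum())

# Toy example: 4 positions, vocabulary of 3, uniform logits.
logits = np.zeros((4, 3))
targets = np.array([0, 2, 1, 1])
mask = np.array([0, 0, 1, 1])   # only the last two targets are assistant tokens
loss = masked_next_token_loss(logits, targets, mask)
```

With uniform logits each token costs $\log 3$, so the two unmasked positions contribute $2 \log 3$ in total, while the prompt positions contribute nothing.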

3. Cross-Modal Capabilities and Limitations

PandaGPT demonstrates a range of qualitative instruction-following capabilities across multiple modalities despite being trained solely on image–text data:

Image/Video QA: Accurate generation of scene descriptions, object identification, and event reasoning over stills and videos.
Audio Grounding: Generation of narratives and response to audio-only prompts, including description and classification (e.g., "barking dogs," "gunshots") (Su et al., 2023).
Multimodal Composition: Prompts combining disparate modalities ("describe this image and sound together" or "image + audio") produce text referencing both sensory streams—enabled by unified latent representations.
Zero-Shot Generalization: With no direct finetuning or paired data, the model extends to depth maps, thermal images, and IMU sensor readings for classification and natural language generation.
Limitations: Known deficiencies include hallucinated content, coarse grounding due to global (not region/time-specific) embeddings, and a lack of generative capacity for non-text outputs (Su et al., 2023).

Emergent behaviors are attributed to the shared latent space imposed by the pretrained multimodal encoder, allowing for composition and semantic transfer in the LLM.
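One way to picture multimodal composition is to assume that the embeddings of two inputs are summed in the shared latent space before projection. That combination rule is an assumption made for this sketch, and the vectors below are random stand-ins rather than real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_llm = 1024, 64            # d_llm is a toy value

w = rng.normal(scale=0.02, size=(d_llm, d_e))   # linear connector
h_image = rng.normal(size=d_e)   # stand-in for an image embedding
h_audio = rng.normal(size=d_e)   # stand-in for an audio embedding

# Compose in the shared latent space, then project once.
z_joint = w @ (h_image + h_audio)

# Because the connector is linear, this equals projecting each input
# separately and summing the results.
z_sum = w @ h_image + w @ h_audio
```

Under a linear connector the two routes agree, so composition in the encoder space carries over unchanged to the LLM input space. A nonlinear MLP connector would not have this property.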

4. Security: Connector Backdoor Risks

PandaGPT's modular, connector-based design exposes a distinctive attack surface:

Connector Threat Surface: The projection head ($f_{\rm conn}$), which holds only a small fraction of the backbone's parameter count, is a high-leverage supply-chain attack point. Poisoning only the connector can implant a latent-space backdoor that is reachable by inputs from any modality (Wang et al., 8 May 2026).
Cross-Modal Backdoor Attack: Poisoning the connector with a small set of seed and augmented samples from one modality establishes a latent anchor. An adversary can then apply input-side optimization (e.g., PGD) to an input from any other modality, steering its embedding toward that anchor and reliably invoking the malicious response.

Empirical Findings:
- Attack success rate (ASR) is high for image-only triggers and for cross-modal triggers (image → audio/text), with similarly high rates for audio/text triggers.
- Utility on clean inputs is preserved, with only a small BLEU-4 drop and low clean leakage.
- Existing defenses (fine-tuning, pruning, input transformation) are largely ineffective without substantial clean utility degradation (Wang et al., 8 May 2026).

Implication: The use of shared latent spaces and lightweight connectors, while facilitating modularity and cross-modal generalization, fundamentally enables such cross-modal backdoor pathways.
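For readers who want a formal picture of the input-side optimization described in this section, a generic targeted embedding-matching objective can be written as below. The notation is introduced here for illustration and is not taken from the cited work: $E_m$ is the frozen encoder for modality $m$, $x$ a benign input, $\delta$ a perturbation bounded by a budget $\epsilon$, and $c$ the latent anchor.

```latex
\delta^\star = \arg\min_{\|\delta\|_\infty \le \epsilon}
  \left\| f_{\rm conn}\big(E_m(x + \delta)\big) - c \right\|_2^2
```

A PGD-style solver would alternate gradient steps on $\delta$ with projection back onto the $\epsilon$-ball. The objective depends on the input only through its embedding, which is why any modality that reaches the shared latent space can in principle reach the anchor.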

5. PANDA: Plug-in Preference Adaptation for Domain Specialization

The PANDA methodology presents a general framework for non-gradient, domain-specific alignment of LLM-based agents, directly applicable to architectures such as PandaGPT (Liu et al., 2024):

Insight Pool Construction: For each expert query, extract the top-$k$ preference pairs from the expert (e.g., RoBERTa or Flan-T5), and prompt the frozen LLM to generate an "insight": a natural-language rationale for why the expert prefers one candidate in the pair over the other.
Embedding and Retrieval: Each entry in the insight pool is indexed by an embedding. During inference, the insights nearest to the query are appended as context to condition the LLM's output.
Inference Prompt: The prompt places the retrieved insights alongside the current query, so the LLM answers with the expert's rationales in context.

Strengths: PANDA is entirely tuning-free and works with closed-source LLMs, leveraging expert rationales as in-context prompts rather than updating weights.
Performance: Empirically, PANDA-augmented LLMs can surpass both vanilla LLMs and, in some tasks, the domain expert itself on interactive decision making (ScienceWorld) and text classification (TweetEval).
Limitations: Performance depends on insight retrieval quality and LLM instruction-following strength; scaling insight pools may necessitate more efficient retrieval schemes (Liu et al., 2024).
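The retrieve-then-prompt loop can be sketched as follows. Everything here is hypothetical scaffolding: `embed` is a toy hashed bag-of-words vector standing in for a real sentence encoder, the pool is keyed by the query that produced each insight (an assumption), and the prompt template and example insights are invented for illustration.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    """Toy hashed bag-of-words embedding (stand-in for a real encoder)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, pool, k=2):
    """Return the k insights whose stored queries best match `query`."""
    q = embed(query)
    scores = [float(q @ embed(stored_query)) for stored_query, _ in pool]
    top = np.argsort(scores)[::-1][:k]   # highest cosine similarity first
    return [pool[i][1] for i in top]

def build_prompt(query, insights):
    """Place retrieved insights ahead of the task (hypothetical template)."""
    lines = ["Insights from a domain expert:"]
    lines += [f"- {text}" for text in insights]
    lines += ["", f"Task: {query}"]
    return "\n".join(lines)

# Each entry pairs a past query with the insight generated for it.
pool = [
    ("boil water in the kitchen", "Prefer heating on the stove over waiting."),
    ("classify tweet sentiment", "Prefer labels backed by explicit emotion words."),
    ("melt ice in the kitchen", "Prefer moving the ice to a heat source first."),
]
query = "boil water"
prompt = build_prompt(query, retrieve(query, pool, k=1))
```

The model weights never change in this loop; adaptation happens entirely through what is placed in the context window. The linear scan over the pool is also where the scaling concern noted above would show up first.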

Editor's term: "PandaGPT (PANDA-style GPT)" is sometimes used for GPT/LLM models enhanced with the PANDA non-parametric adaptation mechanism.

6. System Limitations and Areas for Extension

Limitations arise in both the core PandaGPT-style design and its security posture:

Grounding Granularity: Reliance on a global per-modality embedding restricts fine-grained attention to specific image regions or audio segments.
Lack of Full-Multimodal Outputs: As currently implemented, the architecture generates text only and cannot produce outputs in other modalities (e.g., images, audio) (Su et al., 2023).
Benchmarks: No rigorous, quantitative compositional or cross-modal understanding benchmarks are yet established; most results are qualitative.
Security: Connector-focused attacks can render the model vulnerable independently of encoder or LLM integrity.
Extension Pathways: Authors suggest integration of additional aligned multimodal–text pairs, finer-grained embedding mechanisms (cross-modal attention, multiple embeddings), full-multimodal output decoders, improved security auditing on connectors, and systematic benchmarking as future work (Su et al., 2023, Wang et al., 8 May 2026, Liu et al., 2024).

7. Recommendations for Deployment and Research

To deploy PandaGPT and similar systems securely and effectively, several measures are advised:

Connector Supply-Chain Hardening: Employ cryptographic signing, provenance checks, and representation auditing for third-party connectors (Wang et al., 8 May 2026).
Representation-Level Regularization: During alignment, discourage excessively narrow latent basins to thwart stable backdoor centroids.
Insight Retrieval Efficiency: For PANDA-style adaptation, design retrieval and caching strategies to maintain scalability as insight pools grow (Liu et al., 2024).
Comprehensive Benchmarks: Build qualitative and quantitative evaluation suites for cross-modal instruction following and compositional reasoning.
Routine Re-alignment: Apply periodic retraining on clean, diverse data to reduce drift and minimize latent-space pockets susceptible to attack.
Ensemble or Randomized Connectors: To raise the bar for successful exploitation, use multiple connector instantiations or inject stochasticity.
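As one small piece of the provenance checks recommended above, a deployment script could refuse to load a connector whose file digest does not match a pinned value. The sketch below uses only the Python standard library; the file name and the way the pinned digest is obtained are hypothetical. A digest pin detects tampering but does not replace cryptographic signing or representation auditing.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_connector(path, expected_sha256):
    """Return True only if the file's digest matches the pinned value."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.lower())

# Demo with a throwaway file standing in for connector weights.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "connector.bin")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"trusted connector weights")
    pinned = sha256_of_file(path)   # in practice, published by a trusted source
    ok_before = verify_connector(path, pinned)

    with open(path, "ab") as f:     # simulate tampering
        f.write(b"\x00")
    ok_after = verify_connector(path, pinned)
```

The check passes for the untouched file and fails after a single appended byte. Pinning the digest somewhere the connector's distributor cannot modify is what gives the check its value.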

This constellation of engineering, algorithmic, and security practices remains vital as PandaGPT-style MLLMs are adopted for increasingly complex, real-world multimodal applications.