
Instruction-conditioned Q-Former (IQF)

Updated 6 October 2025
  • The paper introduces IQF, which incorporates natural language instructions via FiLM modulation and query-based cross-attention for effective multi-modal integration.
  • IQF aligns sensory data and semantic priors using cosine similarity loss, achieving state-of-the-art performance across multiple datasets.
  • The architecture supports rapid transfer learning and cross-domain generalization, making it scalable for neurotechnology, computer vision, and other applications.

The Instruction-conditioned Q-Former (IQF) is an advanced transformer-based architecture designed to integrate natural language instructions into neural and multi-modal representation spaces. IQF is used as a query-based cross-attention module that aligns physiological signals (such as EEG), visual features, or other modality embeddings with semantic priors derived from task-level or programmatic instructions. By bridging sensory data and language semantics, IQF enables adaptable, instruction-tuned reasoning and decoding in neurotechnology, computer vision, and beyond.

1. Instruction-conditioning Mechanism and Architecture

IQF incorporates explicit task instructions by processing natural language inputs through a frozen language encoder, such as BERT or SBERT, yielding instruction embeddings ($e_\text{ins}$). These embeddings are injected into modality-specific tokens (such as EEG feature sequences or image patch embeddings) using Feature-wise Linear Modulation (FiLM). The modulation parameters are computed as:

$$(\gamma, \beta) = \tanh\left(W_{\gamma\beta}\, e_\text{ins} + b_{\gamma\beta}\right)$$

$$\tilde{m} = \gamma \odot m + \beta$$

where $m$ is the pre-modulation token sequence, $\gamma$ and $\beta$ are the FiLM parameters, and $\odot$ denotes element-wise multiplication.
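
A minimal PyTorch sketch of this FiLM step, following the formulation above; the module name, dimensions, and the choice of a single shared linear projection for both $\gamma$ and $\beta$ are illustrative assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

class InstructionFiLM(nn.Module):
    """Applies instruction-conditioned FiLM to a modality token sequence."""

    def __init__(self, instr_dim: int, token_dim: int):
        super().__init__()
        # W_{gamma beta} and b_{gamma beta}: one affine map producing both
        # FiLM parameter vectors, split along the last dimension.
        self.to_gamma_beta = nn.Linear(instr_dim, 2 * token_dim)

    def forward(self, m: torch.Tensor, e_ins: torch.Tensor) -> torch.Tensor:
        # m: (batch, seq_len, token_dim) modality tokens
        # e_ins: (batch, instr_dim) frozen-encoder instruction embedding
        gamma, beta = torch.tanh(self.to_gamma_beta(e_ins)).chunk(2, dim=-1)
        # Broadcast over the sequence: m_tilde = gamma * m + beta
        return gamma.unsqueeze(1) * m + beta.unsqueeze(1)

# Example: condition 64 EEG feature tokens on a 768-d SBERT embedding
film = InstructionFiLM(instr_dim=768, token_dim=256)
m_tilde = film(torch.randn(8, 64, 256), torch.randn(8, 768))
```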

A core component is query-based cross-attention, wherein a bank of learnable query vectors ($Q_0 \in \mathbb{R}^{N_q \times d}$) interacts with the conditioned tokens:

$$Q = \text{softmax}\left(\frac{Q_0 W_Q\, (\tilde{m} W_K)^\top}{\sqrt{d}}\right) (\tilde{m} W_V)$$

with $W_Q$, $W_K$, and $W_V$ as projection matrices, and $d$ as the key dimension. The cross-attention operation projects high-dimensional sequence representations into latent query subspaces, facilitating extraction of task-relevant features.

The attended outputs are aggregated, typically via MLPs, yielding a task-adapted summary ($h$) of the signal that is infused with instruction-dependent semantic content.
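
The sketch below combines the latent-query cross-attention and the MLP aggregation into one module. It is a single-head reading of the equation above; the head count, the mean-pooling over queries, and the MLP shape are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQueryAttention(nn.Module):
    """Single-head query-based cross-attention with MLP aggregation."""

    def __init__(self, n_queries: int, dim: int):
        super().__init__()
        # Learnable query bank Q_0 in R^{N_q x d}
        self.q0 = nn.Parameter(torch.randn(n_queries, dim) / dim**0.5)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, m_tilde: torch.Tensor) -> torch.Tensor:
        # m_tilde: (batch, seq_len, dim) FiLM-conditioned tokens
        q = self.w_q(self.q0)                         # (N_q, d)
        k, v = self.w_k(m_tilde), self.w_v(m_tilde)   # (batch, seq, d)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1]**0.5, dim=-1)
        queries = attn @ v                            # (batch, N_q, d)
        # Aggregate the attended queries into a task-adapted summary h
        return self.mlp(queries.mean(dim=1))          # (batch, d)

# Example: reduce 64 conditioned tokens to one 256-d summary via 16 queries
xattn = LatentQueryAttention(n_queries=16, dim=256)
h = xattn(torch.randn(8, 64, 256))  # e.g. the output of the FiLM step
```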

2. Semantic Alignment and Loss Formulation

After obtaining $h$, the IQF aligns these refined representations with language-based class prototypes. For each class $c$, a prototype embedding $e_\text{tgt}^{(c)}$ is extracted using the same language encoder. The alignment objective employs cosine similarity:

$$\mathcal{L}_\text{align} = \frac{1}{|\mathcal{C}|} \sum_{c\in\mathcal{C}} \left[1 - \cos\left(h, e_\text{tgt}^{(c)}\right)\right] \cdot \mathbb{I}[y=c]$$

where $\mathbb{I}[y=c]$ is the indicator for the target label. This enforces that modality representations and target instructions/labels are mapped to a common, semantically coherent space, conferring robustness and transferability across subjects and datasets.
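
Read per sample, the indicator selects only the target class prototype, so the loss reduces to $1 - \cos(h, e_\text{tgt}^{(y)})$. A hypothetical PyTorch rendering, with batch-mean averaging standing in for the $1/|\mathcal{C}|$ normalization:

```python
import torch
import torch.nn.functional as F

def alignment_loss(h: torch.Tensor, prototypes: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
    # h: (batch, dim) task-adapted summaries from the IQF
    # prototypes: (num_classes, dim) language-encoder class embeddings
    # labels: (batch,) integer target class ids
    target = prototypes[labels]  # indicator I[y=c] picks one prototype each
    return (1.0 - F.cosine_similarity(h, target, dim=-1)).mean()

# Example: 8 summaries, 5 classes, 256-d embedding space
h = torch.randn(8, 256, requires_grad=True)
prototypes = torch.randn(5, 256)  # would come from the frozen encoder
loss = alignment_loss(h, prototypes, torch.randint(0, 5, (8,)))
loss.backward()  # gradients flow into h (and prototypes, if not frozen)
```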

3. Performance Across Modalities and Benchmarks

IQF has demonstrated state-of-the-art performance in EEG-language alignment within the foundation model framework ELASTIQ (Jiang et al., 29 Sep 2025). Applied to 20 datasets spanning motor imagery, emotion recognition, SSVEP, covert speech, and healthcare, the IQF-based approach yielded:

  • Macro-accuracy: 66.78%
  • Cohen’s Kappa: 53.91%
  • Best results on 14/20 datasets and all five task categories

These outcomes establish that explicit instruction-conditioning via IQF improves decoding robustness, accelerates convergence, and delivers consistent gains in both within-domain and cross-domain scenarios.

In visual-language tasks, Q-Former modules with instruction-conditioning or parameter-efficient tuning (PEFT, AdaLoRA) efficiently align multi-modal features for reasoning and generation (Kim et al., 12 Oct 2024, Fang et al., 2023). Instruction-tuned Q-Formers achieve competitive benchmark performance at a fraction of the trainable parameter cost, highlighting the method’s suitability for scalable, resource-constrained deployment.

4. Cross-domain Adaptivity and Generalization

IQF's architecture enables rapid transfer learning and generalization across tasks by infusing task-level priors. This is achieved through semantic guidance, in which instruction embeddings modulate the representation space, and through latent queries that adaptively filter the features most relevant to the current task. Empirical analyses demonstrate robustness to inter-subject variability and ready cross-task adaptation, supporting universal models that require minimal retraining when transitioning to new cognitive, clinical, or sensory domains.

A plausible implication is that broadening the instruction-conditioned paradigm within Q-Former-like modules could accelerate multimodal AI systems, allowing for seamless switching and aggregation of task behaviors based solely on natural language directives.

5. Comparative Evaluation and Baseline Analysis

The efficacy of IQF is quantified against alternative reward functions and baseline architectures. For controlled program generation with symbolic instruction (e.g., visual scene synthesis), instruction-conditioned agent policies outperform fixed pixel-based L2 reward baselines in metrics such as Inception Score (1.39 vs. 1.22), Fréchet Inception Distance (259.7 vs. 283.5), correctness, and diversity (Agrawal et al., 2018). In multi-modal alignment tasks, instruction-conditioned Q-Formers favor self-attention layers for perceptual reasoning and allocate resources to feed-forward layers for complex textual reasoning, as revealed by AdaLoRA’s dynamic budget reallocation (Kim et al., 12 Oct 2024).

This suggests that modular architectures with explicit policy and reward conditioning are essential for capturing diverse goal distributions and naturally variable outputs in instruction-driven generative and reasoning settings.

| Model/System | Modality | Instruction Channel | Semantic Alignment Loss |
|---|---|---|---|
| IQF (ELASTIQ) | EEG | BERT/SBERT | $\mathcal{L}_\text{align}$ (cosine) |
| Q-Former (InstructBLIP) | Vision/Text | Prompt/LLM | Task-specific (cross-entropy) |
| InstructSeq | Vision/Text | RoBERTa/LLM | Token-wise cross-entropy |

6. Applications and Forward-looking Implications

IQF introduces semantic task guidance into neural and multi-modal foundation models, enabling:

  • Adaptive BCIs, where instructions steer brain-state decoding for user-centric device control and neurofeedback.
  • Robust multimodal fusion, extendable to other physiological signals for foundation models in clinical and assistive domains.
  • General-purpose reasoning and generative systems in computer vision, leveraging instruction-driven output generation (e.g., segmentation, captioning) without task-specific retraining (Fang et al., 2023).
  • Enhanced interpretability, enabling neurophysiological mapping that remains anchored to semantic labels accessible in natural language.
  • Transfer learning, facilitating universal models that generalize across cognitive tasks and cross-population evaluations.

A plausible implication is that further refinement of IQF and related instruction-conditioned architectures will yield flexible, efficient, and modular AI systems for neuroscience, clinical applications, and broad multimodal reasoning, making semantic guidance a foundational principle for future models.
