Language-Conditioned Dynamic Conv Head (LDCH)
- LDCH generates convolution parameters dynamically from language embeddings, substantially enhancing spatial decoding in navigation and manipulation tasks.
- LDCH is a neural module that adapts visual feature extraction by conditioning convolution filters on natural language, facilitating fine-grained, context-aware processing.
- Empirical results demonstrate that LDCH boosts performance, with marked improvements in navigation success rates and grasp detection accuracy compared to static filter baselines.
A Language-Conditioned Dynamic Convolution Head (LDCH) is a neural module that generates and applies dynamic convolutional kernels conditioned on natural language input, enabling fine-grained and context-sensitive integration of linguistic cues into visual or multimodal representations. LDCH architectures have been successfully deployed in embodied vision-and-language navigation (Landi et al., 2019), in manipulation tasks such as language-guided grasp detection (Jiang et al., 24 Dec 2025), and as a component of mixed-attention Transformer language models (Jiang et al., 2020). The common innovation is the generation of convolution parameters as a function of a sentence or instruction embedding, allowing the model to adapt its spatial reasoning or local feature decoding on a per-query basis.
1. Mathematical Formulation and Architectural Instantiations
The LDCH is a parametrized convolution layer in which some or all convolution parameters are produced dynamically in response to embedded natural language context. The core mechanism maps a language-derived feature—typically an encoded instruction vector or sentence embedding—to a set of convolution weights via a filter-generation network or mixture module. The generated dynamic convolution is then applied to visual or multimodal feature tensors.
In "Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters" (Landi et al., 2019), the LDCH operates as follows:
- A language query is generated via attention over the sequence of LSTM-encoded instruction embeddings.
- A filter-generator network maps this query to a bank of K dynamic convolutional filters.
- The dynamic filters are convolved with the panoramic image feature map, yielding one response map per filter.
- Each response map is normalized to maintain scale consistency.
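The generate-then-convolve steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: all dimensions are arbitrary, filters are taken as 1×1 for simplicity, the tanh squashing is one plausible choice of nonlinearity, and names such as `generate_filters` and `W_gen` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, H, W = 64, 4, 7, 7  # feature channels, number of filters, spatial grid

def generate_filters(q, W_gen):
    """Map an instruction query vector to K L2-normalized 1x1 dynamic filters."""
    f = np.tanh(W_gen @ q).reshape(K, d)           # fully-connected map + squashing
    f /= np.linalg.norm(f, axis=1, keepdims=True)  # L2 normalization per filter
    return f

def dynamic_conv_1x1(f, X):
    """Convolve K 1x1 filters with a (d, H, W) feature map -> (K, H, W) responses."""
    return np.einsum('kc,chw->khw', f, X)

q = rng.standard_normal(d)                   # attention-pooled instruction query
W_gen = rng.standard_normal((K * d, d)) * 0.1
X = rng.standard_normal((d, H, W))           # panoramic image feature grid

filters = generate_filters(q, W_gen)
responses = dynamic_conv_1x1(filters, X)
print(responses.shape)                       # (4, 7, 7): one response map per filter
```

Because each filter is L2-normalized, the magnitude of a response map reflects alignment between the instruction and local visual features rather than filter scale.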
In "Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation" (Jiang et al., 24 Dec 2025), each per-task head (mask, quality, angle, width) maintains a discrete bank of learnable expert convolution kernels. The sentence embedding is mapped via an MLP to softmax-derived mixture weights, producing a per-sample dynamic kernel as the mixture-weighted sum of the expert kernels.
The dynamic convolution is applied via grouped-convolution over expanded (and sliced) feature tensors.
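The expert-mixture step can be sketched in NumPy as follows; the sizes are illustrative (not the paper's), the gating MLP is reduced to a single linear map, and names like `mix_kernel` are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d_lang, N, C_out, C_in, k = 32, 4, 8, 8, 3  # illustrative sizes, not the paper's

# Static expert bank (one bank per task head) shared across all queries
experts_W = rng.standard_normal((N, C_out, C_in, k, k)) * 0.1
experts_b = rng.standard_normal((N, C_out)) * 0.1
W_gate = rng.standard_normal((N, d_lang)) * 0.1  # stand-in for the gating MLP

def mix_kernel(s):
    """Sentence embedding -> softmax mixture -> per-sample dynamic kernel/bias."""
    z = W_gate @ s
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()                               # softmax gate over experts
    W_dyn = np.tensordot(alpha, experts_W, axes=1)     # (C_out, C_in, k, k)
    b_dyn = alpha @ experts_b                          # (C_out,)
    return alpha, W_dyn, b_dyn

s = rng.standard_normal(d_lang)   # sentence embedding for one query
alpha, W_dyn, b_dyn = mix_kernel(s)
# In practice, the per-sample kernels of a batch are applied together as one
# grouped convolution over the expanded feature tensor, as described above.
```

Since `W_dyn` is a convex combination of the experts, every kernel entry stays within the elementwise range spanned by the expert bank.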
2. Integration into Multimodal Policy and Perception Pipelines
LDCH modules have been integrated as critical components in both closed-loop policy networks and hierarchical perception systems.
In low-level embodied navigation (Landi et al., 2019), the language-conditioned dynamic response maps are flattened and fed, together with a one-hot encoding of the previous action, into a policy LSTM whose output parameterizes distributions over fine-grained navigation actions (turn left/right, adjust elevation, move forward, stop). In language-guided grasp detection (LGGD) for robotic manipulation (Jiang et al., 24 Dec 2025), the LDCH serves as the coarse prediction head, providing instruction-adaptive spatial masks and grasping-affordance maps that are then refined by later modules.
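The navigation-side interface (flattened dynamic responses concatenated with a one-hot previous action, driving a policy LSTM) can be sketched as a single step in NumPy; all sizes, the weight initializations, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM cell step; gate order: input, forget, cell, output."""
    z = Wx @ x + Wh @ h + b
    H = h.size
    i, f = sigmoid(z[:H]), sigmoid(z[H:2*H])
    g, o = np.tanh(z[2*H:3*H]), sigmoid(z[3*H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

K, Hg, Wg, A, Hs = 4, 7, 7, 6, 32            # filters, grid, #actions, hidden size
responses = rng.standard_normal((K, Hg, Wg)) # language-conditioned response maps
prev_action = np.eye(A)[0]                   # one-hot previous action
x = np.concatenate([responses.ravel(), prev_action])  # policy input

Wx = rng.standard_normal((4 * Hs, x.size)) * 0.05
Wh = rng.standard_normal((4 * Hs, Hs)) * 0.05
b = np.zeros(4 * Hs)
W_out = rng.standard_normal((A, Hs)) * 0.1

h, c = np.zeros(Hs), np.zeros(Hs)
h, c = lstm_step(x, h, c, Wx, Wh, b)
logits = W_out @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # distribution over {left, right, up, down, forward, stop}
```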
The following table summarizes the primary pipeline interfaces for LDCH in both domains:
| Application Domain | LDCH Input Modalities | LDCH Output Targets |
|---|---|---|
| Embodied Navigation | Panoramic visual grid, language instruction | Dynamic response maps → policy LSTM |
| Language-guided Manipulation | Fused CLIP features, sentence embedding | Mask, grasp quality, angle, width maps |
In both cases, gradients from downstream losses (navigation success, mask/grasp loss) flow back through the dynamic convolution head and into the textual and visual representations, ensuring that language-visual alignment evolves toward task-specific supervision signals.
3. Dynamic Filter Generation and Learned Mixture Mechanisms
The technical approaches to dynamic kernel synthesis differ by architecture but share core principles:
- Direct Conditioning: In navigation (Landi et al., 2019), a fully-connected layer followed by a nonlinearity and L2 normalization directly maps the attention-extracted instruction state to dynamic kernel weights. A fixed, low-dimensional kernel bank stabilizes capacity.
- Expert Mixture: In manipulation (Jiang et al., 24 Dec 2025), an MLP-gated softmax produces mixture weights α over expert kernels per task head, supporting both efficient specialization and reuse. The dynamic kernel and bias are weighted sums of the expert parameters: W_dyn = Σ_i α_i W_i and b_dyn = Σ_i α_i b_i.
This supports efficient, query-dependent spatial decoding with a small set of underlying expert kernels.
Interpretation of this mechanism suggests two benefits: instruction adaptivity (the ability to attend specifically to language-relevant spatial features) and efficient specialization (a static expert bank is sufficient for a wide array of language-driven decodings) (Jiang et al., 24 Dec 2025).
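To make the efficient-specialization argument concrete, the following back-of-the-envelope comparison (our numbers, not drawn from either paper) contrasts the parameter cost of directly generating every kernel weight per query against maintaining a small expert bank plus a gating head.

```python
# Illustrative parameter-count comparison; all sizes are our own assumptions.
d_lang, C_in, C_out, k = 512, 64, 64, 3
kernel_params = C_out * C_in * k * k        # one 3x3 conv kernel: 36,864 weights

# Direct conditioning: a fully-connected generator emits the whole kernel
direct_generator = d_lang * kernel_params   # ~18.9M parameters

# Expert mixture: N static kernels + a small linear gating head
N = 8
expert_bank = N * kernel_params + d_lang * N  # ~0.30M parameters

print(direct_generator, expert_bank)
```

Under these assumptions the expert mixture needs roughly 60× fewer parameters, while still producing a distinct effective kernel per query through the gate.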
4. Training Procedures, Supervision, and Regularization
Empirical LDCH training regimes adhere to end-to-end backpropagation, with supervision and optimization tailored to domain objectives.
- Navigation (Landi et al., 2019): Cross-entropy loss over low-level actions, Adam optimizer with batch size 128, and dropout applied at multiple points. Language guidance is strictly implicit; instruction gradients flow from action losses back through all modules.
- Manipulation (Jiang et al., 24 Dec 2025): Losses comprise a weighted binary cross-entropy term (mask) and Smooth L1 terms (grasp parameters) applied to the coarse LDCH predictions, summed with the downstream refinement loss. Optimization uses AdamW with cosine annealing and weight decay 0.06.
No auxiliary loss is required by the LDCH; the main task loss suffices because the module is fully differentiable and gradients reach all LDCH parameters.
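The two manipulation-side loss terms can be sketched in NumPy; the positive-class weight and beta threshold are illustrative defaults, not values reported in the paper.

```python
import numpy as np

def weighted_bce(p, y, w_pos=2.0, eps=1e-7):
    """Weighted binary cross-entropy for mask supervision (w_pos is illustrative)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(w_pos * y * np.log(p) + (1 - y) * np.log(1 - p))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for grasp-parameter regression: quadratic near zero, linear beyond beta."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta))

mask_loss = weighted_bce(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
grasp_loss = smooth_l1(np.array([0.5, 2.0]), np.zeros(2))
coarse_loss = mask_loss + grasp_loss  # summed with the refinement loss downstream
```

For example, `smooth_l1` gives 0.125 at an error of 0.5 (quadratic regime) but only 1.5 at an error of 2.0 (linear regime), limiting the influence of outlier grasp annotations.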
5. Empirical Impact and Ablation Results
Empirical studies across navigation and manipulation substantiate the effect of dynamic, language-conditioned convolution compared to static filter baselines:
- Navigation (Landi et al., 2019): LDCH yields substantial improvements in success rate, navigation error, and oracle success rate:
- Val-Seen: Success Rate increases by 14.5 points (38.6 % → 53.1 %)
- Val-Unseen: Success Rate increases by 9.8 points (21.8 % → 31.6 %)
- With large-scale data augmentation, the LDCH achieves 35.0 % SR and 31.0 % SPL on the official test-unseen set, outperforming prior low-level action approaches by approximately +15 and +13 points, respectively.
- Manipulation (Jiang et al., 24 Dec 2025): On the OCID-VLG dataset, direct replacement of static coarse decoders with LDCH yields absolute gains:
- Intersection over Union (IoU) increases by 0.85 points (82.29 % → 83.14 %)
- Jaccard index at best threshold (J@1) increases by 0.83 points (84.53 % → 85.36 %)
- These improvements manifest as sharper spatial mask prediction and better alignment of grasp parameters to diverse natural language prompts in both synthetic and real-world evaluation.
6. LDCH Variants and Context within Dynamic Convolution
Not all modules performing language-conditioned dynamic convolution are designated "LDCH". A closely related mechanism is the span-based dynamic convolution in ConvBERT (Jiang et al., 2020), which generates per-position convolutional kernels for local span modeling in language: element-wise products of learned query and "span-aware" key projections are passed through a learned kernel synthesizer. In ConvBERT, these dynamic convolution heads are integrated with standard self-attention heads in a mixed-attention block, demonstrating efficiency gains (≈25 % reduction in per-layer compute), cost-effective pre-training, and robust downstream task performance.
While ConvBERT’s dynamic convolution is not language-conditioned in the multimodal sense, the technical lineage overlaps with LDCH in its method of generating convolution parameters dynamically from context.
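A simplified NumPy sketch of span-based dynamic convolution follows. It compresses ConvBERT's design: in particular, the span-aware key is approximated here by a plain linear projection rather than a depthwise convolution, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, k = 10, 16, 3   # sequence length, hidden size, kernel (span) size

X  = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, d)) * 0.1
Wk = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1
Wg = rng.standard_normal((d, k)) * 0.1   # learned kernel synthesizer

Q, Ks, V = X @ Wq, X @ Wk, X @ Wv        # Ks stands in for the span-aware key

logits = (Q * Ks) @ Wg                   # (T, k): per-position kernel logits
kern = np.exp(logits - logits.max(axis=1, keepdims=True))
kern /= kern.sum(axis=1, keepdims=True)  # softmax-normalized dynamic kernels

pad = k // 2
Vp = np.pad(V, ((pad, pad), (0, 0)))     # zero-pad the sequence ends
out = np.stack([kern[t] @ Vp[t:t + k] for t in range(T)])  # (T, d)
```

Each position thus receives its own softmax-normalized kernel over a local span of values, instead of one kernel shared across all positions as in a static convolution.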
7. Interpretation, Specialization Efficiency, and Applicability
The introduction of LDCH modules in both navigation and manipulation establishes new functional capacities for instruction-sensitive visual grounding:
- The ability to modulate feature decoding per instruction allows systems to discriminate between highly distinct queries (e.g., "pick the red mug" vs. "grab the silver spoon") by assembling bespoke convolutional filters.
- By leveraging a compact set of learnable expert kernels (as in (Jiang et al., 24 Dec 2025)), models realize a balance between specialization for task diversity and parameter efficiency.
- The flexible integration of LDCH into upstream pipelines renders it compatible with cross-modal fusion backbones (e.g., CLIP, FiLM, attention LSTMs) and with downstream decoders (LSTM policies, grouped convolutions).
A plausible implication is that the LDCH framework establishes a generalizable template for language-conditioned modulation of spatial representations, applicable across domains where flexible, adaptive visual feature decoding is required under natural language supervision.