Feature Contrastive Decoding (FCD)

Updated 2 July 2026

Feature Contrastive Decoding (FCD) is a module that combines super-resolution diffusion and contrastive decoding to produce precise, structured textual descriptions from pedestrian image patches.
It employs a multimodal LLM pipeline with dynamic prompt guidance and region-level token alignment to suppress hallucinated content and mitigate semantic drift.
By integrating techniques like patch token extraction, diffusion-based denoising, and contrastive token scoring, FCD enhances zero-shot pedestrian retrieval in open-world scenarios.

Feature Contrastive Decoding (FCD) is a module designed to generate high-fidelity, structured textual descriptions of pedestrian images in the context of open-world, zero-shot interactive Text-based Pedestrian Retrieval (TPR). Developed as a core component of the FitPro framework, FCD integrates structure-aware super-resolution with prompt-guided, contrastive decoding in a multimodal LLM pipeline to suppress hallucinated content and mitigate semantic drift, particularly in data-sparse or zero-shot regimes (Luo et al., 20 Sep 2025).

1. Problem Setting and Functional Role

FCD operates in a dialog-based, zero-shot TPR pipeline where the aim is to locate a target pedestrian within a gallery of uncropped scenes $I=\{I_1,\dots,I_N\}$, given an initial user query $Q_0$ in natural language. In each retrieval round, regions containing candidate pedestrians are detected (e.g., via YOLO), and for each region, FCD is applied to:

Denoise and up-sample the raw crop to reconstruct a high-resolution, de-blurred pedestrian image patch $I_{\mathrm{opt}}$.
Extract visual patch tokens for use as multimodal input.
Generate structured, concise textual descriptions $Y$ from $I_{\mathrm{opt}}$, guided by dynamic prompts and region-level visual alignment.
Penalize output tokens whose visual attention correlates with irrelevant (background) patches, thereby reducing hallucinated or semantically drifting content.

This structured output is consumed by downstream FitPro modules: Incremental Semantic Mining (ISM) and Query-aware Hierarchical Retrieval (QHR).

2. Mathematical Foundation

FCD consists of a super-resolution and diffusion branch, visual token extraction, dynamic prompt construction, and a contrastive decoding mechanism within an LLM.

2.1 Super-resolution and Diffusion

Given a raw, possibly blurred pedestrian crop $I$, a lightweight visual encoder produces both shallow features $F_0\in\mathbb{R}^{h_0\times w_0\times d_0}$ and deep features $F_d\in\mathbb{R}^{h_d\times w_d\times d_d}$. These are concatenated and up-sampled via a small convolutional head $H_{\mathrm{rec}}$, producing the initial high-resolution image $I_{sr}=H_{\mathrm{rec}}(F_0\|F_d)\in\mathbb{R}^{H\times W\times C_n}$, where the up-sampling factor is 4.
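The shape bookkeeping of this step can be illustrated with a minimal NumPy sketch. It is not the FitPro implementation: the nearest-neighbour resize used to align the two feature grids, the 1x1 convolution (written as a matrix product), the pixel-shuffle up-sampler, and all names and sizes are assumptions standing in for the unspecified convolutional head.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour resize of an (h, w, c) tensor by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def pixel_shuffle(x, r):
    """Rearrange (h, w, r*r*c) into (h*r, w*r, c)."""
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out).transpose(0, 2, 1, 3, 4)
    return x.reshape(h * r, w * r, c_out)

def reconstruct(f0, fd, weight, r=4):
    """Hypothetical stand-in for H_rec: align grids, concatenate, 1x1 conv, upsample by r."""
    factor = f0.shape[0] // fd.shape[0]          # assumes an integer ratio between grids
    fused = np.concatenate([f0, upsample_nearest(fd, factor)], axis=-1)
    return pixel_shuffle(fused @ weight, r)      # (h0*r, w0*r, channels)

rng = np.random.default_rng(0)
f0 = rng.normal(size=(8, 4, 16))                 # shallow features (illustrative sizes)
fd = rng.normal(size=(4, 2, 32))                 # deep features (illustrative sizes)
weight = rng.normal(size=(16 + 32, 3 * 4 * 4))   # maps fused channels to r*r*3 channels
i_sr = reconstruct(f0, fd, weight)
print(i_sr.shape)
```

With the sizes chosen here, the 8x4 shallow grid becomes a 32x16 three-channel image, matching the factor-4 up-sampling stated in the text.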

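A conditional diffusion process then refines $I_{sr}$, with $x_T = I_{sr}$ as the input to a DDIM-based denoising process parameterized by a pretrained tiny U-Net $\epsilon_\theta$ conditioned on a structural prior map $S$. At each timestep $t$ in $T$ steps, the update, written here in the standard deterministic DDIM form with cumulative noise schedule $\bar\alpha_t$, is

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t, S)}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t, S),$$

yielding $I_{\mathrm{opt}} = x_0$.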
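A deterministic DDIM refinement loop of this kind can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the module's code: `toy_eps_model` is a hypothetical stand-in for the pretrained U-Net, and the schedule endpoints and step count are illustrative values that the source does not specify.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule; the endpoint values are illustrative assumptions."""
    return np.linspace(beta_start, beta_end, num_steps)

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move that estimate to the previous, less noisy level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def ddim_refine(x_start, prior, eps_model, num_steps=10):
    """Run the reverse process starting from x_start (here the super-resolved image)."""
    alpha_bar = np.cumprod(1.0 - linear_beta_schedule(num_steps))
    x = x_start
    for t in range(num_steps - 1, -1, -1):
        alpha_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = eps_model(x, t, prior)             # noise estimate conditioned on the prior
        x = ddim_step(x, eps, alpha_bar[t], alpha_bar_prev)
    return x

def toy_eps_model(x, t, prior):
    """Hypothetical denoiser: treats the deviation from the prior map as noise."""
    return x - prior
```

The final step uses a cumulative schedule value of 1, so the last update returns the model's clean-image estimate directly.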

2.2 Visual Token Extraction and Projection

The refined image $I_{\mathrm{opt}}$ yields $N$ patch embeddings $\{v_i\}_{i=1}^{N}$ via a frozen vision backbone (e.g., ALBEF's ViT), each of dimensionality $d_v$. These are projected into the multimodal LLM input space of dimensionality $d_{\mathrm{llm}}$ by a learnable linear transformation:

$$\tilde{v}_i = W_p\, v_i, \quad W_p\in\mathbb{R}^{d_{\mathrm{llm}}\times d_v}, \quad i=1,\dots,N.$$
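The projection amounts to one matrix product per image. The sketch below uses random arrays in place of real ViT outputs; the patch count of 196 follows the example in Section 5, while the embedding sizes and the name `project_patches` are illustrative assumptions.

```python
import numpy as np

def project_patches(patch_embeddings, weight):
    """Project (N, d_v) patch embeddings into the LLM input space, giving (N, d_llm)."""
    return patch_embeddings @ weight.T

rng = np.random.default_rng(0)
num_patches, d_v, d_llm = 196, 768, 1024         # d_v and d_llm are illustrative sizes
patches = rng.normal(size=(num_patches, d_v))    # stand-in for frozen backbone outputs
weight = rng.normal(size=(d_llm, d_v)) * 0.02    # learnable projection matrix
tokens = project_patches(patches, weight)
print(tokens.shape)  # (196, 1024)
```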

2.3 Prompting Scheme

Input to the LLM consists of a concatenation

$$X = [\,P_{\mathrm{ins}};\ \tilde{V};\ P_{\mathrm{fmt}}\,]$$

with $\tilde{V}$ the projected patch tokens of Section 2.2, $P_{\mathrm{ins}}$ a high-level instruction such as "Describe the pedestrian focusing on head, torso, legs, and accessories," and $P_{\mathrm{fmt}}$ instantiating the desired output structure (e.g., "HEAD: ...; UPPER-BODY: ...; LOWER-BODY: ...; ACCESSORIES: ...").
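At the text level, the prompt can be assembled from a field list, as in the sketch below. The function name, the `<image>` placeholder marking where visual tokens are spliced in, the ordering of the three parts, and the "Answer in the format" wording are assumptions for illustration; only the instruction sentence and the field names come from the text.

```python
FIELDS = ["HEAD", "UPPER-BODY", "LOWER-BODY", "ACCESSORIES"]

def build_prompt(fields=FIELDS, image_placeholder="<image>"):
    """Compose instruction, visual-token placeholder, and output-structure template."""
    instruction = "Describe the pedestrian focusing on head, torso, legs, and accessories."
    template = "; ".join(f"{name}: ..." for name in fields)
    return f"{instruction}\n{image_placeholder}\nAnswer in the format: {template}"

print(build_prompt())
```

Passing a different field list changes the template without touching the instruction, which is one simple way to vary the requested structure between retrieval rounds.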

2.4 Feature Contrastive Decoding Objective

During text generation, for each step $t$, the hidden state $h_t(y)$ of a candidate token $y$ is evaluated for alignment with $v^{+}$ (foreground/target) and $v^{-}$ (background) patch features via cosine similarity:

$$s_t(y) = \cos\big(h_t(y), v^{+}\big) - \cos\big(h_t(y), v^{-}\big)$$

The token selection logit $\ell_t(y)$ is augmented with this contrastive score:

$$\tilde{\ell}_t(y) = \ell_t(y) + \lambda\, s_t(y)$$

where $\lambda$ is a tunable hyper-parameter. Greedy or stochastic decoding proceeds until an EOS is reached.
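The scoring rule can be illustrated with a small NumPy sketch. Mean-pooling the foreground and background patch features into single vectors, and having one hidden vector per candidate token, are assumptions made here for concreteness; the function names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def contrastive_score(hidden, fg_patches, bg_patches):
    """Similarity to the mean foreground feature minus similarity to the mean background feature."""
    return cosine(hidden, fg_patches.mean(axis=0)) - cosine(hidden, bg_patches.mean(axis=0))

def augment_logits(logits, candidate_hiddens, fg_patches, bg_patches, weight):
    """Add the weighted contrastive score of each candidate token to its logit."""
    scores = np.array([contrastive_score(h, fg_patches, bg_patches) for h in candidate_hiddens])
    return logits + weight * scores

# Toy case: candidate 0 aligns with the foreground, candidate 1 with the background.
fg = np.array([[1.0, 0.0], [1.0, 0.0]])
bg = np.array([[0.0, 1.0], [0.0, 1.0]])
hiddens = np.array([[1.0, 0.0], [0.0, 1.0]])
augmented = augment_logits(np.zeros(2), hiddens, fg, bg, weight=0.5)
print(int(np.argmax(augmented)))  # 0
```

Both candidates start with equal logits, so the contrast term alone decides the choice in favour of the foreground-aligned token.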

3. Architectural Choices and Hyper-parameters

The FCD module is instantiated as follows (Luo et al., 20 Sep 2025):

Diffusion denoising: Pretrained "ID-Blau" tiny U-Net; fixed number of DDIM steps, linear $\beta$ schedule.
Reconstruction upsampling: $H_{\mathrm{rec}}$ with upsample factor 4.
Patch grid: $14\times 14$ ($N=196$) for ViT backbone.
Visual-to-LLM projection: learnable linear layer (Section 2.2).
Multimodal LLM: LLaVA backbone, beam size 1, length penalty 1.0.
Contrast weight: tuned on a held-out set.
Structural prior: extracted using an edge detector and a pose estimator; used only as the conditional input for diffusion.
Prompt templates: See Section 2.3 for example phrasing.

No fine-tuning is performed for the LLM or diffusion U-Net in zero-shot inference. If training of individual components is desired, pixel-wise MSE and standard diffusion reconstruction objectives are adopted.
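The two objectives named above can be written compactly in NumPy. The sketch assumes that the pixel-wise MSE compares a reconstructed image with a sharp target and that the diffusion objective is the common noise-prediction loss; neither pairing is spelled out in the text, and `eps_model` is a hypothetical callable standing in for the denoiser.

```python
import numpy as np

def mse_loss(pred, target):
    """Pixel-wise mean squared error."""
    return float(np.mean((pred - target) ** 2))

def noise_prediction_loss(eps_model, x0, prior, alpha_bar_t, t, rng):
    """Noise a clean image to level t and score the model's noise estimate."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise
    return mse_loss(eps_model(x_t, t, prior), noise)
```

A denoiser that recovers the injected noise exactly drives the second loss to zero, which gives a convenient sanity check when wiring up a training loop.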

4. Inference Procedure

At inference time, the module follows the procedure summarized in the table below:

Step	Operation	Output
Visual encoding	Encode crop $I$ into $F_0$, $F_d$	Feature tensors
Reconstruction	$I_{sr}=H_{\mathrm{rec}}(F_0\|F_d)$	Super-resolved image
Diffusion denoising	$I_{sr}\to I_{\mathrm{opt}}$ via reverse DDIM (Section 2.1)	Denoised patch
Patch embedding and projection	Obtain projected patch tokens from $I_{\mathrm{opt}}$	Patch tokens
LLM prompt construction	Concatenate instruction prompt, patch tokens, and structure template	LLM input
Contrastive decoding loop	Select the token maximizing the contrast-augmented logit at each step	Structured text $Y$

During decoding, the region-level attention of candidate tokens is evaluated continuously; tokens whose attention shifts toward background receive a negative contrastive score, suppressing their probability. Tokens grounded in pedestrian-relevant content (e.g., clothing color, accessory presence) are reinforced.
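The decoding loop itself can be sketched independently of any particular model. In the snippet below, `step_fn` and `score_fn` are hypothetical callables that return, for the tokens generated so far, the LLM logits and the per-candidate contrastive scores; the toy functions and the three-token vocabulary exist only to make the example runnable.

```python
import numpy as np

def greedy_contrastive_decode(step_fn, score_fn, weight, eos_id, max_len=32):
    """Greedy decoding over logits augmented with weighted contrastive scores."""
    tokens = []
    for _ in range(max_len):
        logits = step_fn(tokens)                 # shape (vocab,)
        scores = score_fn(tokens)                # shape (vocab,), positive = foreground-aligned
        next_id = int(np.argmax(logits + weight * scores))
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

# Toy vocabulary: 0 = grounded token, 1 = background-driven token, 2 = EOS.
def toy_step(tokens):
    return np.array([1.0, 1.2, 0.0]) if len(tokens) < 2 else np.array([0.0, 0.0, 5.0])

def toy_score(tokens):
    return np.array([1.0, -1.0, 0.0])

print(greedy_contrastive_decode(toy_step, toy_score, weight=0.5, eos_id=2))  # [0, 0]
print(greedy_contrastive_decode(toy_step, toy_score, weight=0.0, eos_id=2))  # [1, 1]
```

With the contrast weight set to zero the slightly higher raw logit of the background-driven token wins; a positive weight flips the choice to the grounded token.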

5. Application Example and Output

Consider a blurred, low-resolution pedestrian image where the subject carries a black backpack. FCD first reconstructs a denoised, sharp representation $I_{\mathrm{opt}}$ emphasizing fine structures such as straps or caps. The vision encoder extracts 196 patch tokens retaining this detail. Using structured prompt scaffolding, the LLM, through region-contrastive decoding, generates a detailed description such as:

HEAD: short brown hair; UPPER-BODY: wearing a green hoodie; LOWER-BODY: dark blue jeans; ACCESSORIES: black backpack with two silver zippers.

This output provides semantically rich and visually faithful attributes for downstream retrieval.
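Because the description follows a fixed "FIELD: value;" layout, a downstream consumer can split it into attributes with a few lines of Python. The parser below is a hypothetical helper for illustration, not part of FitPro.

```python
def parse_description(text):
    """Split a 'FIELD: value; FIELD: value.' description into a dict."""
    fields = {}
    for part in text.strip().rstrip(".").split(";"):
        if ":" not in part:
            continue                             # skip fragments without a field label
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields

example = ("HEAD: short brown hair; UPPER-BODY: wearing a green hoodie; "
           "LOWER-BODY: dark blue jeans; ACCESSORIES: black backpack with two silver zippers.")
attributes = parse_description(example)
print(attributes["ACCESSORIES"])  # black backpack with two silver zippers
```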

6. Significance and Perspectives

By integrating super-resolution diffusion mechanisms with prompt-guided, region-level contrastive decoding, FCD addresses two major limitations in open-world zero-shot retrieval scenarios: semantic drift and the generation of off-target content. Explicitly grounding textual tokens in visual evidence reduces hallucination and improves the semantic fidelity of image-to-text mappings. This enables robust, structured description extraction from challenging surveillance and open-world datasets, and makes FCD a critical precursor for modules such as ISM and QHR in the FitPro framework (Luo et al., 20 Sep 2025).

A plausible implication is that the feature-contrastive approach in FCD could be extended to additional retrieval and grounding tasks (e.g., fine-grained object annotation) where assurance of visual faithfulness and cross-modal consistency is required, particularly when labeled data is scarce.


References

Luo et al. (2025). FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World.
