Feature Contrastive Decoding (FCD)
- Feature Contrastive Decoding (FCD) is a module that combines super-resolution diffusion and contrastive decoding to produce precise, structured textual descriptions from pedestrian image patches.
- It employs a multimodal LLM pipeline with dynamic prompt guidance and region-level token alignment to suppress hallucinated content and mitigate semantic drift.
- By integrating techniques like patch token extraction, diffusion-based denoising, and contrastive token scoring, FCD enhances zero-shot pedestrian retrieval in open-world scenarios.
Feature Contrastive Decoding (FCD) is a module designed to generate high-fidelity, structured textual descriptions of pedestrian images in the context of open-world, zero-shot interactive Text-based Pedestrian Retrieval (TPR). Developed as a core component of the FitPro framework, FCD integrates structure-aware super-resolution with prompt-guided, contrastive decoding in a multimodal LLM pipeline to suppress hallucinated content and mitigate semantic drift, particularly in data-sparse or zero-shot regimes (Luo et al., 20 Sep 2025).
1. Problem Setting and Functional Role
FCD operates in a dialog-based, zero-shot TPR pipeline where the aim is to locate a target pedestrian within a gallery of uncropped scenes, given an initial user query in natural language. In each retrieval round, regions containing candidate pedestrians are detected (e.g., via YOLO), and for each region, FCD is applied to:
- Denoise and up-sample the raw crop to reconstruct a high-resolution, de-blurred pedestrian image patch.
- Extract visual patch tokens for use as multimodal input.
- Generate structured, concise textual descriptions from the reconstructed patch, guided by dynamic prompts and region-level visual alignment.
- Penalize output tokens whose visual attention correlates with irrelevant (background) patches, thereby reducing hallucinated or semantically drifting content.
This structured output is consumed by downstream FitPro modules: Incremental Semantic Mining (ISM) and Query-aware Hierarchical Retrieval (QHR).
2. Mathematical Foundation
FCD consists of a super-resolution and diffusion branch, visual token extraction, dynamic prompt construction, and a contrastive decoding mechanism within an LLM.
2.1 Super-resolution and Diffusion
Given a raw, possibly blurred pedestrian crop, a lightweight visual encoder produces both shallow features and deep features. These are concatenated and up-sampled via a small convolutional head, producing the initial high-resolution image, where the up-sampling factor is 4.
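A conditional diffusion process then refines the initial high-resolution image, which serves as the starting state $x_T$ of a DDIM-based denoising process parameterized by a pretrained tiny U-Net $\epsilon_\theta$ conditioned on a structural prior map $S$. At each timestep $t$ of the $T$ reverse steps, the update is
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t, S)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t, S),$$
where $\bar{\alpha}_t$ is the cumulative noise schedule, yielding the refined patch $x_0$.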
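The reverse process described above can be sketched in a few lines of NumPy. This is a minimal illustration of a deterministic DDIM update (the eta = 0 case), not the FitPro implementation: the function names, the `eps_model(x, t, prior)` call signature, and the linear beta range (1e-4 to 2e-2, a common default) are assumptions made for the example.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Move that estimate to the previous, less noisy timestep.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def linear_beta_abar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta) for a linear beta schedule (assumed range)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_refine(x_T, eps_model, prior, abar):
    """Run the reverse chain from x_T down to x_0.

    eps_model is a hypothetical callable standing in for the conditioned U-Net.
    """
    x = x_T
    for t in range(len(abar) - 1, -1, -1):
        abar_prev = abar[t - 1] if t > 0 else 1.0
        x = ddim_step(x, eps_model(x, t, prior), abar[t], abar_prev)
    return x
```

With a perfect noise predictor the chain recovers the clean image exactly, which makes a convenient sanity check before plugging in a real denoiser.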
2.2 Visual Token Extraction and Projection
The refined image yields $N$ patch embeddings $\{v_i\}_{i=1}^{N}$ via a frozen vision backbone (e.g., ALBEF's ViT), each of dimensionality $d_v$. These are projected into the multimodal LLM input space by a learnable linear transformation $W_p$:
$$z_i = W_p\, v_i, \qquad i = 1, \dots, N.$$
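To make the token shapes concrete, the sketch below splits an image into a regular patch grid and applies a single linear map to every patch vector. It is only a stand-in: raw pixel patches replace the frozen ViT embeddings, and the 224-pixel input, 16-pixel patch size, and output width of 32 are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel), then flatten.
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def project(tokens, weight):
    """Apply one linear map to every row: z_i = W v_i."""
    return tokens @ weight.T

image = np.zeros((224, 224, 3))
tokens = patchify(image)              # shape (196, 768)
weight = np.zeros((32, 768))          # hypothetical projection matrix
llm_tokens = project(tokens, weight)  # shape (196, 32)
```

A 224 × 224 input with 16-pixel patches gives a 14 × 14 grid, that is, 196 tokens.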
2.3 Prompting Scheme
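Input to the LLM consists of a concatenation
$$X = [\, P_{\text{inst}};\ z_1, \dots, z_N;\ P_{\text{struct}} \,]$$
with $P_{\text{inst}}$ a high-level instruction such as "Describe the pedestrian focusing on head, torso, legs, and accessories," $z_1, \dots, z_N$ the projected visual patch tokens, and $P_{\text{struct}}$ instantiating the desired output structure (e.g., "HEAD: ...; UPPER-BODY: ...; LOWER-BODY: ...; ACCESSORIES: ...").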
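A text-side view of this prompt can be assembled as follows. The helper name, the `<image>` placeholder marking where visual tokens would be spliced in, the ordering of the three parts, and the "Answer in the form" wording are assumptions for illustration; only the instruction and the field names come from the example phrasing above.

```python
FIELDS = ("HEAD", "UPPER-BODY", "LOWER-BODY", "ACCESSORIES")

def build_prompt(instruction, fields=FIELDS, image_placeholder="<image>"):
    """Join the instruction, an image-token placeholder, and the output template."""
    template = "; ".join(f"{name}: ..." for name in fields)
    return f"{instruction}\n{image_placeholder}\nAnswer in the form: {template}"

prompt = build_prompt(
    "Describe the pedestrian focusing on head, torso, legs, and accessories."
)
```

Keeping the field list in one tuple means the same names can be reused later to validate or parse the generated description.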
2.4 Feature Contrastive Decoding Objective
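During text generation, for each step $t$ and candidate token $y$, the associated hidden state $h_t^{(y)}$ is evaluated for alignment with $v^{+}$ (foreground/target) and $v^{-}$ (background) patch features via cosine similarity:
$$s_t(y) = \cos\!\big(h_t^{(y)}, v^{+}\big) - \cos\!\big(h_t^{(y)}, v^{-}\big).$$
The token selection logit is augmented with this contrastive score:
$$\tilde{\ell}_t(y) = \ell_t(y) + \lambda\, s_t(y),$$
where $\ell_t(y)$ is the original logit and $\lambda$ is a tunable hyper-parameter. Greedy or stochastic decoding proceeds until an EOS token is reached.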
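The scoring rule can be illustrated with a small NumPy sketch. Several choices here are assumptions rather than details from the source: foreground and background patches are mean-pooled into one vector each, every candidate token has its own state vector, and the weight 0.5 is an arbitrary illustrative value.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_scores(cand_states, fg_tokens, bg_tokens):
    """Foreground-minus-background cosine similarity for each candidate state."""
    fg = fg_tokens.mean(axis=0)  # pooled foreground feature (assumed pooling)
    bg = bg_tokens.mean(axis=0)  # pooled background feature (assumed pooling)
    return np.array([cosine(h, fg) - cosine(h, bg) for h in cand_states])

def adjust_logits(logits, cand_states, fg_tokens, bg_tokens, lam=0.5):
    """Add the weighted contrastive score to each candidate's logit."""
    return logits + lam * contrastive_scores(cand_states, fg_tokens, bg_tokens)
```

A candidate whose state points toward background features gets a negative score and loses probability mass, while a foreground-aligned candidate is boosted, even if its raw logit was slightly lower.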
3. Architectural Choices and Hyper-parameters
The FCD module is instantiated as follows (Luo et al., 20 Sep 2025):
- Diffusion denoising: Pretrained "ID-Blau" tiny U-Net; a fixed number $T$ of DDIM steps, linear $\beta$ schedule.
- Reconstruction upsampling: convolutional head with up-sampling factor 4.
- Patch grid: $14 \times 14$ ($N = 196$) for the ViT backbone.
- Visual-to-LLM projection: learnable linear map $W_p$ from the backbone embedding dimension to the LLM input dimension.
- Multimodal LLM: LLaVA backbone, beam size 1, length penalty 1.0.
- Contrast weight: $\lambda$ (tuned on a held-out set).
- Structural prior: map $S$ extracted using an edge detector and a pose estimator; used only as conditional input for diffusion.
- Prompt templates: See Section 2.3 for example phrasing.
No fine-tuning is performed for the LLM or the diffusion U-Net in zero-shot inference. If any components are to be trained, pixel-wise MSE and standard diffusion reconstruction objectives are adopted.
4. Inference Procedure
At inference time, the module follows the procedure summarized in the table below:
| Step | Operation | Output |
|---|---|---|
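| Visual encoding | Encode the raw crop with the lightweight visual encoder (shallow and deep features) | Feature tensors |
| Reconstruction | Concatenate the features and up-sample by a factor of 4 with the convolutional head | Super-resolved image |
| Diffusion denoising | Reverse DDIM steps conditioned on the structural prior (see Section 2.1) | Denoised patch |
| Patch embedding and projection | Obtain patch embeddings from the frozen backbone and project them into the LLM input space | Patch tokens |
| LLM prompt construction | Concatenate the instruction, patch tokens, and output-structure template | LLM input |
| Contrastive decoding loop | At each step, select the token maximizing the contrast-augmented logit | Structured text |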
During decoding, the region-level attention of candidate tokens is evaluated continuously; tokens whose attention shifts toward background receive a negative contrastive score, suppressing their probability. Tokens grounded in pedestrian-relevant content (e.g., clothing color, accessory presence) are reinforced.
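The decoding loop in the last table row can be sketched as a greedy search over contrast-augmented logits. The callables `step_fn` and `score_fn` are hypothetical stand-ins for the multimodal LLM and the region-level alignment scorer, and the toy vocabulary exists only to show the effect of the contrast weight.

```python
import numpy as np

def greedy_contrastive_decode(step_fn, score_fn, eos_id, lam=0.5, max_len=32):
    """Greedy decoding over logits augmented with a per-token contrastive score.

    step_fn(prefix) returns logits over the vocabulary;
    score_fn(prefix, token_id) returns that candidate's contrastive score.
    """
    prefix = []
    for _ in range(max_len):
        logits = np.asarray(step_fn(prefix), dtype=float)
        scores = np.array([score_fn(prefix, tok) for tok in range(len(logits))])
        token = int(np.argmax(logits + lam * scores))
        if token == eos_id:
            break
        prefix.append(token)
    return prefix

# Toy vocabulary: 0 = grounded attribute, 1 = background object, 2 = EOS.
def toy_step(prefix):
    return [1.0, 1.2, 0.0] if not prefix else [0.0, 0.0, 5.0]

def toy_score(prefix, token_id):
    return {0: 1.0, 1: -1.0, 2: 0.0}[token_id]

with_contrast = greedy_contrastive_decode(toy_step, toy_score, eos_id=2, lam=0.5)
without_contrast = greedy_contrastive_decode(toy_step, toy_score, eos_id=2, lam=0.0)
```

In this toy setup the background token has the higher raw logit, so plain greedy decoding picks it, whereas the contrast term flips the choice to the grounded token.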
5. Application Example and Output
Consider a blurred, low-resolution pedestrian image where the subject carries a black backpack. FCD first reconstructs a denoised, sharp representation emphasizing fine structures such as straps or caps. The vision encoder extracts 196 patch tokens retaining this detail. Using structured prompt scaffolding, the LLM, through region-contrastive decoding, generates a detailed description such as:
HEAD: short brown hair; UPPER-BODY: wearing a green hoodie; LOWER-BODY: dark blue jeans; ACCESSORIES: black backpack with two silver zippers.
This output provides semantically rich and visually faithful attributes for downstream retrieval.
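Because the description follows a fixed "FIELD: value" layout, a downstream consumer can turn it into a dictionary with a few lines of string handling. The parser below is a hypothetical helper, not part of FitPro, and assumes that attribute values contain neither semicolons nor colons.

```python
def parse_description(text):
    """Split 'KEY: value; KEY: value' text into a dictionary of attributes."""
    fields = {}
    for part in text.split(";"):
        key, sep, value = part.partition(":")
        if sep:  # skip fragments that carry no 'KEY:' prefix
            fields[key.strip()] = value.strip().rstrip(".")
    return fields

example = (
    "HEAD: short brown hair; UPPER-BODY: wearing a green hoodie; "
    "LOWER-BODY: dark blue jeans; ACCESSORIES: black backpack with two silver zippers."
)
attributes = parse_description(example)
```

Each body-part field can then be matched or indexed separately during retrieval.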
6. Significance and Perspectives
By integrating super-resolution diffusion mechanisms with prompt-guided, region-level contrastive decoding, FCD addresses major limitations in open-world zero-shot retrieval scenarios, notably semantic drift and generation of off-target content. The explicit grounding of textual tokens in visual evidence reduces hallucination and improves the semantic fidelity of image-to-text mappings. This enables robust, structured description extraction from challenging surveillance and open-world datasets, functioning as a critical precursor for modules such as ISM and QHR in the FitPro framework (Luo et al., 20 Sep 2025).
A plausible implication is that the feature-contrastive approach in FCD could be extended to additional retrieval and grounding tasks (e.g., fine-grained object annotation) where assurance of visual faithfulness and cross-modal consistency is required, particularly when labeled data is scarce.