
DIR-TIR Framework Overview

Updated 24 November 2025
  • DIR-TIR is a framework employing iterative, multi-stage refinement to bridge semantic or modality gaps in text-image retrieval, TIR tracking, and image denoising.
  • It integrates dialog-based human interaction, adversarial domain adaptation, and dual-domain loss optimization to deliver enhanced retrieval and tracking performance.
  • Empirical results show improved Hits@10 (+8–12%), robust TIR tracking gains, and superior denoising metrics (PSNR 27.97 dB, SSIM 0.8594) without retraining core models.

The DIR-TIR ("Dialog-Iterative Refinement" and "Domain-Iterative Refinement," see below) framework encompasses a family of methods targeting two distinct technical domains: (1) human-in-the-loop text-to-image retrieval using dialog and iterative refinements (Zhen et al., 18 Nov 2025), and (2) progressive domain adaptation and denoising for thermal infrared (TIR) vision (notably in tracking and image restoration) (Li et al., 28 Jul 2024, Rhee et al., 30 Jul 2025). Despite domain specificity, all approaches share an iterative, multi-stage refinement protocol that incrementally reduces semantic or modality gaps, either via dialog with a user or unsupervised learning across data modalities.

1. Dialog-Iterative Refinement for Text-to-Image Retrieval

DIR-TIR for text-to-image retrieval (Zhen et al., 18 Nov 2025) is a plug-and-play wrapper for modern vision–language retrieval systems (e.g., BLIP, CLIP) that introduces a turn-based, user-interactive refinement process. Each iteration alternates between two modules: the Dialog Refiner Module (DRM) and the Image Refiner Module (IRM). Over up to 10 dialogue turns, DRM poses targeted follow-up questions to the user, evolving the descriptive state of the target image, while IRM synthesizes candidate images from the updated prompt and prompts the user to correct perceptual discrepancies.

At each turn $k$, DRM updates the dialogue state $d_k$ by (a) generating new questions with a transformer LLM, (b) simulating user answers via a vision-LLM, and (c) refining the textual description. In parallel, IRM generates an image $P_k$ from the prompt $G_k$ (derived from $d_k$), obtains user feedback on discrepancies, and updates the image prompt to $G_{k+1}$. Both modules perform independent retrievals: DRM ranks gallery images by cosine similarity between their embeddings and $d_k$, while IRM ranks by similarity between gallery images and the embedding of $P_k$. The system merges the top-7 DRM and top-3 IRM retrievals (the empirically optimal ratio), terminating upon a Hit@10 or after 10 turns.
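To make the loop concrete, the following is a minimal sketch of the merge-and-terminate logic described above. All helpers (`embed_text`, `embed_image`, `synthesize`, `ask_user`) are hypothetical stand-ins for the paper's LLM, vision-LLM, and image-generation components (user answers are simulated in the reported experiments); only the cosine ranking, the 7:3 merge, and the Hit@10 stopping rule follow the text.

```python
import numpy as np

def cosine_rank(query_emb, gallery_embs, k):
    """Rank gallery items by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

def dir_tir_retrieve(gallery_embs, embed_text, embed_image, ask_user,
                     synthesize, target_id, max_turns=10, n_drm=7, n_irm=3):
    """Alternate DRM (dialog) and IRM (image) refinement until Hit@10."""
    dialog_state = ask_user("Describe the image you are looking for.")
    prompt = dialog_state
    merged = []
    for turn in range(max_turns):
        # DRM: refine the textual state with a follow-up question.
        dialog_state += " " + ask_user("What else distinguishes the target?")
        drm_hits = cosine_rank(embed_text(dialog_state), gallery_embs, n_drm)
        # IRM: synthesize a candidate image and retrieve by visual similarity.
        candidate = synthesize(prompt)
        irm_hits = cosine_rank(embed_image(candidate), gallery_embs, n_irm)
        prompt = dialog_state  # fold dialog feedback into the next image prompt
        # Merge top-7 DRM and top-3 IRM results, deduplicated, order preserved.
        merged = list(dict.fromkeys(list(drm_hits) + list(irm_hits)))[:10]
        if target_id in merged:  # Hit@10: terminate early
            return merged, turn + 1
    return merged, max_turns
```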

This dialog-driven loop systematically closes semantic and visual gaps, achieving significant gains: Recall@10 and Hits@10 improve by +8–12 percentage points over zero-shot and fine-tuned baselines at turn 5 across COCO, VisDial, and Flickr30k datasets.

2. DIR-TIR for Domain Adaptation in Thermal Infrared Tracking

In the thermal infrared tracking context, DIR-TIR, alternatively termed Progressive Domain Adaptation for TIR Tracking (PDAT), addresses the pronounced domain shift between RGB-trained trackers and unlabelled TIR data (Li et al., 28 Jul 2024). PDAT is constructed atop a standard Siamese tracker (SiamCAR with ResNet-50) and couples two domain-adaptation modules in a coarse-to-fine synergistic schedule:

  • Adversarial-based Global Domain Adaptation (AGDA): Employs a style discriminator with a gradient reversal layer and self-attention transformers to enforce style invariance at the feature level. Training alternates least-squares GAN losses for the generator (feature extractor) and discriminator, encouraging RGB and TIR features to align globally.
  • Clustering-based Subdomain Adaptation (CSDA): After AGDA, features are cross-correlated and clustered with K-means (the number of clusters $C$ chosen by the Silhouette criterion). Within each subdomain cluster, a Local Maximum Mean Discrepancy (LMMD) loss aligns source and target features. A minimal sketch of both alignment losses follows this list.
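The sketch below assumes generic feature tensors; the discriminator is a tiny MLP standing in for the paper's self-attention discriminator, and the gradient reversal layer folds the alternating generator/discriminator objectives into a single backward pass. The LMMD term is shown for one subdomain; CSDA would sum it over all clusters.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam going backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class StyleDiscriminator(nn.Module):
    """Tiny MLP stand-in for the self-attention style discriminator."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))

def agda_loss(disc, rgb_feat, tir_feat, lam=1.0):
    """Least-squares GAN objective: real label 1 for RGB, fake label 0 for TIR."""
    return (((disc(rgb_feat, lam) - 1.0) ** 2).mean()
            + (disc(tir_feat, lam) ** 2).mean())

def rbf(a, b, sigma=1.0):
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def lmmd_one_cluster(src, tgt, src_w, tgt_w, sigma=1.0):
    """LMMD term for a single subdomain; src_w/tgt_w are per-sample cluster
    weights normalized to sum to one."""
    return (src_w @ rbf(src, src, sigma) @ src_w
            - 2.0 * src_w @ rbf(src, tgt, sigma) @ tgt_w
            + tgt_w @ rbf(tgt, tgt, sigma) @ tgt_w)
```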

Training is performed on template–search pairs pseudo-labeled using the Segment Anything Model (SAM), yielding over 10 million pairs from a 1.48M-frame TIR corpus. No manual TIR annotation is required.
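A hedged sketch of how SAM masks might be turned into template–search crops: the `segment_anything` calls below are the library's public API, but the pair-construction details (minimum area, context factor, channel replication for single-channel TIR frames) are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load SAM once; the checkpoint path is illustrative.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_gen = SamAutomaticMaskGenerator(sam)

def pseudo_label_pairs(frame, min_area=500, context=2.0):
    """Turn SAM masks into (template, search) crops for Siamese training."""
    if frame.ndim == 2:  # single-channel TIR: replicate to 3 channels (assumed)
        frame = np.stack([frame] * 3, axis=-1)
    pairs = []
    for m in mask_gen.generate(frame):      # frame: HxWx3 uint8 array
        if m["area"] < min_area:
            continue
        x, y, w, h = m["bbox"]              # XYWH box around the mask
        template = frame[y:y + h, x:x + w]
        # Search region: the same box enlarged by a context factor.
        cx, cy = x + w / 2, y + h / 2
        sw, sh = w * context, h * context
        x0, y0 = max(0, int(cx - sw / 2)), max(0, int(cy - sh / 2))
        x1 = min(frame.shape[1], int(cx + sw / 2))
        y1 = min(frame.shape[0], int(cy + sh / 2))
        pairs.append((template, frame[y0:y1, x0:x1]))
    return pairs
```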

Benchmarks across five TIR tracking datasets (LSOTB-TIR100/120, PTB-TIR, VTUAV, VOT-TIR2017) confirm substantial success-rate improvements (+5.6–7.4 percentage points) over vanilla SiamCAR and show that domain adaptation, rather than model capacity, drives state-of-the-art performance.

3. DIR-TIR for Diffusion-Based TIR Image Denoising

A third instantiation, TIR-Diffusion ("DIR-TIR"), applies iterative dual-domain constraints to image denoising for single-frame TIR restoration (Rhee et al., 30 Jul 2025). The DIR-TIR denoising pipeline builds atop a pretrained Stable Diffusion model, enhancing it with:

  • Latent-space denoising: The noisy TIR input $x_{\text{noisy}}\in\mathbb{R}^{H\times W\times 3}$ is encoded via a VAE to $z_{\text{noisy}}$, which serves both as the noise target for the diffusion model and as model conditioning.
  • Wavelet-domain penalization: Parallel to latent MSE/SSIM loss, high-frequency wavelet/DTCWT subbands of the output and ground truth are regularized with strong MSE penalties, prioritizing edge and texture fidelity.
  • Cascaded refinement: An optional second diffusion model further refines pixel-level output using an MSE+LPIPS (perceptual) loss.

The multi-objective loss is:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{latent}}\mathcal{L}_{\text{latent}} + \lambda_{\text{DWT}}\mathcal{L}_{\text{DWT}} + \lambda_{\text{DTCWT}}\mathcal{L}_{\text{DTCWT}}$$

with specific weights ($\lambda_{\text{latent}}=1$, $\lambda_{\text{DWT}}=100$, or $\lambda_{\text{DTCWT}}=100$) to balance loss magnitudes.
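A minimal sketch of this weighting, assuming a one-level Haar DWT as a stand-in for the paper's DWT/DTCWT transforms (which in practice would come from a wavelet library): `z_pred`/`z_target` are latent tensors from the VAE encoder and `x_pred`/`x_gt` are decoded images, with only the high-frequency subbands entering the heavily weighted term.

```python
import torch
import torch.nn.functional as F

def haar_subbands(x):
    """One-level 2D Haar DWT via grouped conv (H, W assumed even).
    Returns the LL subband and the stacked LH/HL/HH high-frequency bands."""
    n, c, h, w = x.shape
    k = 0.5 * torch.tensor([
        [[1.,  1.], [ 1.,  1.]],   # LL
        [[1.,  1.], [-1., -1.]],   # LH
        [[1., -1.], [ 1., -1.]],   # HL
        [[1., -1.], [-1.,  1.]],   # HH
    ], device=x.device).unsqueeze(1)          # (4, 1, 2, 2)
    k = k.repeat(c, 1, 1, 1)                  # one filter bank per channel
    out = F.conv2d(x, k, stride=2, groups=c)  # (n, 4c, h/2, w/2)
    out = out.view(n, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1:]        # LL, (LH, HL, HH)

def dual_domain_loss(z_pred, z_target, x_pred, x_gt,
                     lam_latent=1.0, lam_dwt=100.0):
    """Latent-space MSE plus heavily weighted high-frequency wavelet MSE."""
    latent = F.mse_loss(z_pred, z_target)
    _, hf_pred = haar_subbands(x_pred)
    _, hf_gt = haar_subbands(x_gt)
    return lam_latent * latent + lam_dwt * F.mse_loss(hf_pred, hf_gt)
```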

Experimental benchmarks show that DIR-TIR achieves PSNR = 27.97 dB, SSIM = 0.8594, LPIPS = 0.156, and FID = 40.47 on standard real TIR denoising sets, outperforming previous models. The model also exhibits strong zero-shot generalization to unseen TIR data, aided by Fieldscale normalization for domain adaptation.

4. Comparison of Iterative Refinement Protocols

All DIR-TIR instantiations implement iterative, multi-stage refinement. In text-to-image retrieval (Zhen et al., 18 Nov 2025), this takes the form of alternated human-in-the-loop dialog and image-based feedback cycles; in TIR tracking (Li et al., 28 Jul 2024), coarse-to-fine domain adaptation is realized via adversarial learning followed by subdomain clustering and alignment; in image denoising (Rhee et al., 30 Jul 2025), dual-domain loss scheduling and cascaded denoising stages progressively enhance the output.

Variant             | Domain                    | Iteration Type             | Data/Interaction
DIR-TIR (Retrieval) | Vision–language retrieval | Human-in-the-loop          | Text dialog, image feedback
PDAT (TIR Tracking) | TIR object tracking       | Unsupervised, model-based  | Pseudo-labeled TIR, RGB pairs
DIR-TIR (Denoising) | TIR image denoising       | Dual-domain loss, cascaded | Noisy–clean TIR pairs

This iterative framework allows progressive information injection or alignment, closing large modality or semantic gaps that standard single-turn or one-stage methods leave unresolved.

5. Empirical Performance and Practical Considerations

Quantitative and ablation studies across all DIR-TIR-based works show systematic gains:

  • In interactive retrieval (Zhen et al., 18 Nov 2025), hybrid DRM/IRM selection at a 7:3 ratio achieves best trade-offs, delivering cumulative Hits@10 gains of +8–12 points at turn 5.
  • For TIR tracking (Li et al., 28 Jul 2024), both AGDA and CSDA contribute to the gains; SAM-generated pseudo-labeling is the most impactful component and is fast to run.
  • Denoising with DIR-TIR (Rhee et al., 30 Jul 2025) excels on both pixel-level and perceptual metrics, avoids the artifacts seen in GAN-based or plain pixel-MSE approaches, and meets real-world deployment requirements (e.g., a reduced number of diffusion steps for real-time operation).

No fine-tuning or retraining of core inference models is required for either retrieval or tracking versions; all adaptation is performed by interaction or domain-alignment modules. In denoising, only the U-Net and conditioning heads are updated; the pretrained VAE remains fixed.
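In code, that training split amounts to freezing the VAE and handing only the remaining modules to the optimizer. A sketch with placeholder modules follows, since the actual model classes are not specified here.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the pretrained VAE, denoising U-Net,
# and conditioning head (the real modules are model-specific).
vae, unet, cond_head = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)

# Freeze the VAE: no gradients, inference behavior.
vae.requires_grad_(False)
vae.eval()

# Optimize only the U-Net and conditioning-head parameters.
optimizer = torch.optim.AdamW(
    list(unet.parameters()) + list(cond_head.parameters()), lr=1e-5)
```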

6. Limitations, Extensions, and Future Directions

For dialog-based retrieval, simulated user responses currently replace real human interaction, which could diverge from practical deployment scenarios. In domain adaptation for tracking, online updating of subdomain clusters and unification of image-level adaptation remain open avenues. DIR-TIR denoising may hallucinate plausible but inexact structures in extremely low-contrast TIR regions; perceptual refinement can occasionally trade away exactness for realism.

Planned extensions include synthesis of image-level and feature-level adaptation for TIR style alignment, hierarchical multi-modal tracking, and improved online cluster management. The pragmatic aspect is emphasized: many methods are designed for immediate applicability (no retraining, real-time feasibility) and robust zero-shot operation across unknown data domains. This suggests that DIR-TIR frameworks function as powerful annotation-free, progressive solution templates for a variety of cross-modal, interactive, or low-data vision tasks.
