DIR-TIR Framework Overview
- DIR-TIR is a framework employing iterative, multi-stage refinement to bridge semantic or modality gaps in text-to-image retrieval, thermal infrared (TIR) tracking, and TIR image denoising.
- It integrates dialog-based human interaction, adversarial domain adaptation, and dual-domain loss optimization to deliver enhanced retrieval and tracking performance.
- Empirical results show improved Hits@10 (+8–12%), robust TIR tracking gains, and superior denoising metrics (PSNR 27.97 dB, SSIM 0.8594) without retraining core models.
The DIR-TIR ("Dialog-Iterative Refinement" and "Domain-Iterative Refinement," see below) framework encompasses a family of methods targeting two distinct technical domains: (1) human-in-the-loop text-to-image retrieval using dialog and iterative refinements (Zhen et al., 18 Nov 2025), and (2) progressive domain adaptation and denoising for thermal infrared (TIR) vision (notably in tracking and image restoration) (Li et al., 28 Jul 2024, Rhee et al., 30 Jul 2025). Despite domain specificity, all approaches share an iterative, multi-stage refinement protocol that incrementally reduces semantic or modality gaps, either via dialog with a user or unsupervised learning across data modalities.
1. Dialog-Iterative Refinement for Text-to-Image Retrieval
DIR-TIR for text-to-image retrieval (Zhen et al., 18 Nov 2025) is a plug-and-play wrapper for modern vision–language retrieval systems (e.g., BLIP, CLIP) that introduces a turn-based, user-interactive refinement process. Each iteration alternates between two modules: the Dialog Refiner Module (DRM) and the Image Refiner Module (IRM). Over up to 10 dialogue turns, DRM poses targeted follow-up questions to the user, evolving the descriptive state of the target image, while IRM synthesizes candidate images from the updated prompt and prompts the user to correct perceptual discrepancies.
At each turn, DRM updates the dialogue state by (a) generating new questions with a transformer LLM, (b) simulating user answers via a vision-LLM, and (c) refining the textual description. In parallel, IRM generates an image from the current prompt (derived from the refined description), obtains user feedback on discrepancies, and updates the prompt for the next turn. Both modules perform independent retrievals: DRM ranks gallery images by cosine similarity between their embeddings and the embedding of the refined description, while IRM ranks gallery images by similarity to the embedding of the generated image. The system merges the top-$7$ DRM and top-$3$ IRM retrievals (the empirically optimal ratio), terminating upon a Hit@10 or after 10 turns.
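The loop below is a minimal sketch of this turn-based protocol. The callables (`ask`, `answer`, `refine_text`, `gen_image`, `embed_text`, `embed_image`) are hypothetical placeholders standing in for the LLM, vision-LLM, image generator, and retrieval encoders; this is not the authors' implementation.

```python
import numpy as np

def cosine_sim(query, gallery):
    """Cosine similarity between one query vector and an (N, d) gallery matrix."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

def dir_tir_session(gallery_embs, target_idx, ask, answer, refine_text,
                    gen_image, embed_text, embed_image, init_caption,
                    max_turns=10, k_drm=7, k_irm=3):
    """Alternate DRM/IRM refinement until the target enters the merged top-10."""
    description = init_caption      # DRM's evolving textual state
    image_prompt = init_caption     # IRM's evolving generation prompt
    merged = []
    for turn in range(max_turns):
        # DRM: question -> (simulated) answer -> refined description -> retrieval
        q = ask(description)
        a = answer(q)
        description = refine_text(description, q, a)
        drm_rank = np.argsort(-cosine_sim(embed_text(description), gallery_embs))

        # IRM: synthesize a candidate image, fold feedback into the prompt, retrieve
        candidate = gen_image(image_prompt)
        image_prompt = refine_text(image_prompt, q, a)
        irm_rank = np.argsort(-cosine_sim(embed_image(candidate), gallery_embs))

        # Merge top-7 DRM with top-3 novel IRM candidates and test Hit@10
        top_drm = list(drm_rank[:k_drm])
        top_irm = [i for i in irm_rank if i not in top_drm][:k_irm]
        merged = top_drm + top_irm
        if target_idx in merged:
            return turn + 1, merged   # hit within the 10 retrieved images
    return None, merged               # no hit within the turn budget
```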
This dialog-driven loop systematically closes semantic and visual gaps, achieving significant gains: Recall@10 and Hits@10 improve by +8–12 percentage points over zero-shot and fine-tuned baselines at turn 5 across COCO, VisDial, and Flickr30k datasets.
2. DIR-TIR for Domain Adaptation in Thermal Infrared Tracking
In the thermal infrared tracking context, DIR-TIR, alternatively termed Progressive Domain Adaptation for TIR Tracking (PDAT), addresses the pronounced domain shift between RGB-trained trackers and unlabelled TIR data (Li et al., 28 Jul 2024). PDAT is constructed atop a standard Siamese tracker (SiamCAR with ResNet-50) and couples two domain-adaptation modules in a coarse-to-fine synergistic schedule:
- Adversarial-based Global Domain Adaptation (AGDA): Employs a style discriminator with a gradient reversal layer and self-attention transformers to enforce style invariance at the feature level. The training objective alternates least-squares GAN losses for the generator (feature extractor) and the discriminator, encouraging RGB and TIR features to align globally (a minimal sketch of this gradient-reversal step follows the list).
- Clustering-based Subdomain Adaptation (CSDA): After AGDA, features are cross-correlated and clustered (with K-means; chosen by Silhouette criterion). Within each subdomain cluster, a Local Maximum Mean Discrepancy (LMMD) loss aligns source and target features.
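As a rough illustration of the AGDA step, the sketch below implements a gradient reversal layer with a least-squares domain loss in PyTorch. The `StyleDiscriminator` is a plain pooled MLP standing in for the paper's self-attention discriminator, and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class StyleDiscriminator(nn.Module):
    """Toy stand-in: pools backbone features and predicts the source domain."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, feat):
        return self.net(feat)

def agda_loss(backbone, disc, rgb_batch, tir_batch, lam=1.0):
    """Least-squares adversarial alignment: RGB -> label 1, TIR -> label 0.
    The gradient reversal layer drives the backbone to confuse the
    discriminator, aligning RGB and TIR feature styles globally."""
    f_rgb = backbone(rgb_batch)     # (B, C, H, W) backbone features
    f_tir = backbone(tir_batch)
    p_rgb = disc(grad_reverse(f_rgb, lam))
    p_tir = disc(grad_reverse(f_tir, lam))
    return 0.5 * ((p_rgb - 1).pow(2).mean() + p_tir.pow(2).mean())
```

The subsequent CSDA stage would cluster the cross-correlated features with K-means and add an LMMD term per cluster; that step is omitted here for brevity.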
Training is performed on template–search pairs pseudo-labeled using the Segment Anything Model (SAM), yielding over 10 million pairs from a 1.48M-frame TIR corpus. No manual TIR annotation is required.
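The snippet below is a hedged sketch of such annotation-free pseudo-labeling, assuming the standard `segment_anything` package and an illustrative checkpoint path; the exact mask filtering rules and crop geometry used by PDAT may differ.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Illustrative checkpoint path; any official SAM checkpoint works the same way.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_gen = SamAutomaticMaskGenerator(sam)

def pseudo_boxes(tir_frame_rgb, min_area=400):
    """Run SAM on a TIR frame (replicated to 3 channels) and keep sizeable masks.
    Returns (x, y, w, h) pseudo ground-truth boxes; no manual labels required."""
    masks = mask_gen.generate(tir_frame_rgb)          # list of mask dicts
    return [m["bbox"] for m in masks if m["area"] >= min_area]

def template_search_pair(frame_t, frame_t_plus_k, box, context=2.0):
    """Crop a template around a pseudo box in frame t and a larger search
    window at the same location in frame t+k (simple fixed-window variant)."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0

    def crop(img, size):
        x0, y0 = int(max(cx - size / 2, 0)), int(max(cy - size / 2, 0))
        return img[y0:y0 + int(size), x0:x0 + int(size)]

    return crop(frame_t, max(w, h)), crop(frame_t_plus_k, max(w, h) * context)
```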
Benchmarks across five TIR tracking datasets (LSOTB-TIR100/120, PTB-TIR, VTUAV, VOT-TIR2017) confirm substantial success-rate improvements (+5.6–7.4 percentage points) over vanilla SiamCAR and indicate that domain adaptation, rather than model capacity, drives SOTA performance.
3. DIR-TIR for Diffusion-Based TIR Image Denoising
A third instantiation, TIR-Diffusion ("DIR-TIR"), applies iterative dual-domain constraints to image denoising for single-frame TIR restoration (Rhee et al., 30 Jul 2025). The DIR-TIR denoising pipeline builds atop a pretrained Stable Diffusion model, enhancing it with:
- Latent-space denoising: The noisy TIR input is encoded by a VAE into a latent representation, which serves both as the noise target for the diffusion model and as model conditioning.
- Wavelet-domain penalization: Parallel to latent MSE/SSIM loss, high-frequency wavelet/DTCWT subbands of the output and ground truth are regularized with strong MSE penalties, prioritizing edge and texture fidelity.
- Cascaded refinement: An optional second diffusion model further refines pixel-level output using an MSE+LPIPS (perceptual) loss.
The multi-objective loss combines these terms:

$$\mathcal{L} \;=\; \lambda_{\mathrm{lat}}\,\mathcal{L}_{\mathrm{latent}} \;+\; \lambda_{\mathrm{wav}}\,\mathcal{L}_{\mathrm{wavelet}} \;+\; \lambda_{\mathrm{perc}}\,\mathcal{L}_{\mathrm{perceptual}},$$

where the weights $\lambda_{\mathrm{lat}}$, $\lambda_{\mathrm{wav}}$, and $\lambda_{\mathrm{perc}}$ are set to balance the magnitudes of the individual losses.
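The snippet below sketches one way to compute such a dual-domain objective, using the `pytorch_wavelets` package as one possible DTCWT implementation. The SSIM term of the latent loss is omitted for brevity, the weights shown are illustrative rather than the published values, and `lpips_fn` (e.g., `lpips.LPIPS(net='alex')`) is only needed for the cascaded perceptual stage; inputs are assumed to be (N, C, H, W) tensors.

```python
import torch.nn.functional as F
from pytorch_wavelets import DTCWTForward   # one possible DTCWT implementation

dtcwt = DTCWTForward(J=3)   # 3-level dual-tree complex wavelet transform

def dual_domain_loss(pred_latent, tgt_latent, pred_img, tgt_img,
                     lpips_fn=None, w_lat=1.0, w_wav=1.0, w_perc=1.0):
    """Latent MSE + high-frequency DTCWT MSE + optional perceptual term."""
    # Latent-domain fidelity against the VAE encoding of the clean target
    l_lat = F.mse_loss(pred_latent, tgt_latent)

    # Wavelet-domain fidelity on high-frequency subbands only (edges, texture)
    _, yh_pred = dtcwt(pred_img)
    _, yh_tgt = dtcwt(tgt_img)
    l_wav = sum(F.mse_loss(p, t) for p, t in zip(yh_pred, yh_tgt)) / len(yh_pred)

    # Optional perceptual term used by the cascaded pixel-level refinement stage
    l_perc = lpips_fn(pred_img, tgt_img).mean() if lpips_fn is not None else 0.0

    return w_lat * l_lat + w_wav * l_wav + w_perc * l_perc
```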
Experimental benchmarks show that DIR-TIR achieves PSNR = 27.97 dB, SSIM = 0.8594, LPIPS = 0.156, and FID = 40.47 on standard real TIR denoising sets, outperforming previous models. The model also exhibits strong zero-shot generalization to unseen TIR data, aided by Fieldscale normalization for domain adaptation.
4. Comparison of Iterative Refinement Protocols
All DIR-TIR instantiations implement iterative, multi-stage refinement. In text-to-image retrieval (Zhen et al., 18 Nov 2025), this takes the form of alternated human-in-the-loop dialog and image-based feedback cycles; in TIR tracking (Li et al., 28 Jul 2024), coarse-to-fine domain adaptation is realized via adversarial learning followed by subdomain clustering and alignment; in image denoising (Rhee et al., 30 Jul 2025), dual-domain loss scheduling and cascaded denoising stages progressively enhance the output.
| Variant | Domain | Iteration Type | Data/Interaction |
|---|---|---|---|
| DIR-TIR (Retrieval) | Vision–language retrieval | Human-in-the-loop | Text dialog, image feedback |
| PDAT (TIR Tracking) | TIR object tracking | Unsupervised, model-based | Pseudo-labeled TIR, RGB pairs |
| DIR-TIR (Denoising) | TIR image denoising | Dual-domain loss, cascaded | Noisy–clean TIR pairs |
This iterative framework allows progressive information injection or alignment, closing large modality or semantic gaps that standard single-turn or one-stage methods leave unresolved.
5. Empirical Performance and Practical Considerations
Quantitative and ablation studies across all DIR-TIR-based works show systematic gains:
- In interactive retrieval (Zhen et al., 18 Nov 2025), hybrid DRM/IRM selection at a 7:3 ratio yields the best trade-off, delivering cumulative Hits@10 gains of +8–12 points at turn 5.
- For TIR tracking (Li et al., 28 Jul 2024), both AGDA and CSDA contribute to the gains; SAM-generated pseudo-labeling is the single most impactful component and is fast to produce.
- Denoising with DIR-TIR (Rhee et al., 30 Jul 2025) excels on pixel and perceptual metrics, avoiding artifacts seen in GAN-based or standard pixel-MSE approaches, and matches requirements for real-world deployment (e.g., reduction of diffusion steps for real-time operation).
No fine-tuning or retraining of the core inference models is required for either the retrieval or the tracking variant; all adaptation is performed by the interaction or domain-alignment modules. In denoising, only the U-Net and conditioning heads are updated; the pretrained VAE remains fixed.
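A minimal sketch of that parameter-freezing setup, assuming the Hugging Face `diffusers` API and a generic Stable Diffusion 1.5 checkpoint (the paper's conditioning heads are not shown):

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

# Illustrative checkpoint; any Stable Diffusion model exposes these subfolders.
model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# Pretrained VAE stays fixed; only the U-Net (plus conditioning heads) is trained.
vae.requires_grad_(False)
vae.eval()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

@torch.no_grad()
def encode_to_latent(images):
    """Encode noisy TIR frames (3-channel, scaled to [-1, 1]) with the frozen VAE;
    0.18215 is Stable Diffusion's standard latent scaling factor."""
    return vae.encode(images).latent_dist.sample() * 0.18215
```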
6. Limitations, Extensions, and Future Directions
For dialog-based retrieval, simulated user responses currently replace real human interaction, which could diverge from practical deployment scenarios. In domain adaptation for tracking, online updating of subdomain clusters and unification of image-level adaptation remain open avenues. DIR-TIR denoising may hallucinate plausible but inexact structures in extremely low-contrast TIR regions; perceptual refinement can occasionally trade away exactness for realism.
Planned extensions include synthesis of image-level and feature-level adaptation for TIR style alignment, hierarchical multi-modal tracking, and improved online cluster management. The pragmatic emphasis is notable: many of the methods are designed for immediate applicability (no retraining, real-time feasibility) and robust zero-shot operation across unknown data domains. This suggests that DIR-TIR frameworks function as powerful, annotation-free, progressive solution templates for a variety of cross-modal, interactive, or low-data vision tasks.
7. References
- "DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval" (Zhen et al., 18 Nov 2025)
- "Progressive Domain Adaptation for Thermal Infrared Object Tracking" (Li et al., 28 Jul 2024)
- "TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization" (Rhee et al., 30 Jul 2025)