IntrinsicReal: Real-World Albedo Adaptation
- The paper introduces a dual pseudo-labeling strategy that combines absolute confidence and relative preference ranking to effectively bridge the domain gap.
- It proposes a two-phase adaptation pipeline that iteratively refines the model using classifier-guided updates and a direct preference optimization loss.
- Empirical results demonstrate significant improvements in PSNR, SSIM, and MSE, validating the method's robustness over existing approaches.
Estimating albedo, or intrinsic image decomposition, from single RGB images in real-world settings presents a significant challenge due to the absence of paired real images with ground-truth albedo and the large domain gap between synthetic training data and complex real-world conditions. Recent diffusion-prior-based methods such as IntrinsicAnything have achieved high accuracy when trained on large synthetic datasets (e.g., Objaverse) but tend to generalize suboptimally to real images. IntrinsicReal is a domain adaptation framework specifically designed to bridge this synthetic–real gap, combining dual pseudo-labeling (absolute and relative, inspired by human evaluation heuristics) with a two-phase adaptation pipeline for robust, state-of-the-art real-world albedo estimation.
1. Dual Pseudo-Labeling for Real-World Adaptation
IntrinsicReal introduces a dual pseudo-labeling strategy to enable adaptation without access to real-world ground-truth albedo:
- Pseudo-Labeling with an Absolute Confidence Threshold: An IR-Classifier, trained on synthetic paired data, assigns quality scores to albedo predictions. An absolute threshold is used for hard-positive selection (e.g., albedo with score ≥0.99 as “positive pseudo-label”) and hard-negative rejection (e.g., score ≤0.3 as “negative pseudo-label”). These pseudo-pairs (RGB, selected albedo) drive the initial round of IR-Model fine-tuning.
- Pseudo-Labeling by Relative Preference Ranking: When absolute confidence is ambiguous, IntrinsicReal shifts to human-inspired relative assessment. For each object, it compares albedo outputs from different model iterations (e.g., x₀, x₁, x_final), using the classifier’s scores to select pairs where the final output outperforms the earlier one. This set of winner–loser pairs guides further adaptation using a preference-based loss.
This two-pronged approach mimics human evaluation: absolute decisions for clearly good/bad outputs and relative ranking when uncertainty exists.
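The selection logic above can be sketched in a few lines. The threshold values follow the paper's examples (score ≥0.99 positive, ≤0.3 negative); the function names and the optional `margin` parameter are illustrative, not the paper's API.

```python
# Dual pseudo-labeling sketch: absolute confidence first, relative ranking
# as a fallback when the classifier score is ambiguous.

POS_THRESHOLD = 0.99  # hard-positive selection (paper's example value)
NEG_THRESHOLD = 0.3   # hard-negative rejection (paper's example value)

def absolute_pseudo_label(score):
    """Hard-positive / hard-negative selection by absolute confidence."""
    if score >= POS_THRESHOLD:
        return "positive"
    if score <= NEG_THRESHOLD:
        return "negative"
    return None  # ambiguous: defer to relative preference ranking

def relative_pseudo_label(score_early, score_final, margin=0.0):
    """Winner-loser pair when a later iteration beats an earlier one."""
    if score_final > score_early + margin:
        return ("final", "early")  # (winner, loser)
    return None
```

An ambiguous sample (e.g., score 0.6) returns `None` from the absolute rule and is instead handled by comparing outputs across model iterations.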
2. Two-Phase Adaptation Pipeline
IntrinsicReal operationalizes its pseudo-labeling via a sequential two-phase architecture:
Phase 1: Iterative Classifier-Model Adaptation
- The IR-Model is initialized from a synthetic-trained checkpoint (IntrinsicAnything).
- The IR-Classifier is pre-trained with paired (albedo, shading) data from synthetic sources, using the Lambertian composition relation to ensure consistency.
- Manually annotated real albedo outputs seed initial positive and negative sets.
- For real-world images, high-confidence pseudo-labels (from the classifier) are used for supervised fine-tuning of the IR-Model. The positive/negative sets are iteratively rectified and used to re-train or update the classifier, forming a bootstrapped feedback loop between model and classifier.
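The Phase 1 feedback loop can be sketched as below. The paper does not publish this interface; `phase1_adaptation`, the method names on `model` and `classifier`, and the stub classes are hypothetical stand-ins for the IR-Model and IR-Classifier.

```python
# Illustrative sketch of the Phase 1 bootstrapping loop between the
# IR-Model and the IR-Classifier (all interfaces are assumed, not real).

def phase1_adaptation(model, classifier, real_images, rounds=3,
                      pos_thr=0.99, neg_thr=0.3):
    positives, negatives = [], []
    for _ in range(rounds):
        # Score each real image's predicted albedo with the classifier.
        for img in real_images:
            albedo = model.predict(img)
            score = classifier.score(img, albedo)
            if score >= pos_thr:
                positives.append((img, albedo))  # hard-positive pseudo-label
            elif score <= neg_thr:
                negatives.append((img, albedo))  # hard-negative pseudo-label
        # Fine-tune the model on high-confidence pseudo-pairs, then rectify
        # the positive/negative sets and update the classifier in turn.
        model.finetune(positives)
        classifier.update(positives, negatives)
    return model, classifier

# Toy demonstration with stand-in objects (scores are just the input values):
class _StubModel:
    def predict(self, img):
        return img  # identity: treat the "image" value as its albedo
    def finetune(self, pairs):
        self.num_pairs = len(pairs)

class _StubClassifier:
    def score(self, img, albedo):
        return albedo  # treat the albedo value itself as the score
    def update(self, positives, negatives):
        self.set_sizes = (len(positives), len(negatives))

model, clf = phase1_adaptation(_StubModel(), _StubClassifier(),
                               real_images=[0.995, 0.2, 0.6], rounds=2)
```

After two rounds on this toy input, the high-confidence sample (0.995) and the low-confidence sample (0.2) each enter their set once per round, while the ambiguous sample (0.6) is left for Phase 2.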
Phase 2: Classifier-Guided Relative Preference Adaptation
- Once Phase 1 converges, the model may still produce suboptimal outputs that hard confidence thresholds fail to capture.
- For each real image, albedo outputs from multiple training iterations are compared; if the most recent improves on a prior version (as rated by the classifier), the pair is used for direct preference optimization (DPO).
- The DPO loss follows the standard direct preference optimization objective:

  L_DPO = −E[ log σ( β ( log π_θ(x_w)/π_ref(x_w) − log π_θ(x_l)/π_ref(x_l) ) ) ]

  where x_w and x_l are the preferred and dispreferred model outputs (possibly with Gaussian noise augmentation), π_θ is the adapting model, π_ref is the pre-trained reference model, σ is the sigmoid function, and β controls the strength of the preference constraint.
- This phase exploits the finer ranking information present in the classifier, improving the adaptation beyond coarse thresholding.
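A minimal scalar sketch of the standard DPO objective described above; in practice the log-probabilities would come from the diffusion model's denoising objective, but here they are plain numbers, and the argument order and `beta` default are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one winner-loser pair.

    logp_* are log-probabilities under the adapting model, ref_logp_*
    under the frozen reference model; beta scales how strongly the
    model is pushed toward the winner relative to the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the model's preference matches the reference exactly, the margin is zero and the loss is log 2; preferring the winner more than the reference does drives the loss below that baseline.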
3. Bridging the Synthetic–Real Domain Gap
IntrinsicReal directly addresses several core domain-adaptation challenges:
- Absence of Real-World Paired Supervision: By leveraging the classifier for both positive and negative pseudo-label selection, IntrinsicReal generates high-confidence supervision, circumventing the need for real ground-truth.
- Domain Gap in Lighting, Texture, and Object Properties: The iterative adaptation (Phase 1) and the joint classifier-model bootstrapping enable progressive correction of domain-induced artifacts, while Phase 2 DPO capitalizes on hard-to-capture real-world nuances.
- Robustness to Classifier Ambiguity: The combination of absolute and relative pseudo-labeling ensures that both highly confident and ambiguous cases are handled for maximal adaptation.
This hybrid approach ensures that only reliable predictions are used to update the model and minimizes the risk of error accumulation that is common in self-training scenarios.
4. Empirical Results and Performance Analysis
IntrinsicReal demonstrates substantial improvements over existing methods (including IntrinsicAnything and RGB-X) on both synthetic (MIT Intrinsic Dataset) and real-world (MVImgNet) benchmarks:
| Method | PSNR (↑) | SSIM (↑) | MSE (↓) |
|---|---|---|---|
| IntrinsicAnything | 15.284 | 0.716 | 0.035 |
| RGB-X | 16.420 | 0.728 | 0.029 |
| IntrinsicReal | 17.449 | 0.758 | 0.024 |
Qualitative results show that IntrinsicAnything trained on synthetic data overproduces black or dark albedo for real-world objects with complex appearance (e.g., metallic, glossy, or heavily shadowed surfaces). In contrast, IntrinsicReal restores plausible albedo, accurately recovering fine details under challenging lighting, occlusion, or highlight conditions.
Ablation studies confirm that each component—the absolute pseudo-labeling, iterative rectification, and DPO-based fine-tuning—contributes measurably to domain adaptation. User studies on MVImgNet further corroborate expert preference for IntrinsicReal reconstructions.
5. Detailed Methodological Steps and Mathematical Tools
The design incorporates physically motivated decompositions, classifier-driven supervisory signal, and state-of-the-art preference optimization:
- Lambertian Synthesis for Synthetic Pretraining: The Lambertian composition relation I = A ⊙ S links albedo and shading, providing a ground-truth signal for initial classifier training.
- Classifier Score Thresholding: High-precision thresholds (e.g., score ≥0.99 for positive pseudo-labels, ≤0.3 for negatives) filter outputs for supervised adaptation.
- Rectification and Joint-Update: The mutually reinforcing training of the classifier and the model refines pseudo-label reliability after each adaptation pass.
- Relative Ranking and DPO Loss: For images where absolute confidence is low, winner–loser pairs drive optimization via the DPO loss, prioritizing improvements in outputs that matter most for realistic appearance. The DPO is formally defined to pull the model’s prediction towards the winner and away from the loser relative to the reference prediction.
- Iterative Bootstrapping: The whole pipeline proceeds through multiple adaptation rounds, successively narrowing the synthetic–real gap.
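The Lambertian composition used for synthetic supervision can be expressed as a simple consistency check; the function names and array shapes below are illustrative, assuming albedo and shading are given as same-shaped arrays.

```python
import numpy as np

def lambertian_compose(albedo, shading):
    """I = A ⊙ S: element-wise product of albedo and shading."""
    return albedo * shading

def composition_residual(image, albedo, shading):
    """Mean absolute deviation from the Lambertian relation I = A ⊙ S."""
    return float(np.mean(np.abs(image - lambertian_compose(albedo, shading))))

# Consistent synthetic triple: the residual vanishes exactly.
A = np.full((2, 2), 0.5)
S = np.full((2, 2), 0.8)
I = lambertian_compose(A, S)
```

A nonzero residual signals an inconsistent (albedo, shading, image) triple, which is the kind of ground-truth-backed signal the classifier pretraining can exploit.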
6. Challenges Addressed by IntrinsicReal
IntrinsicReal systematically tackles primary roadblocks in real-world intrinsic decomposition:
- No Ground-Truth Albedo: Pseudo-labeling and DPO supply supervision when no reference is available.
- Domain Shift: Fine-tuning on classifier-qualified outputs, plus iterative adaptation, brings the model’s predictions closer to real-world statistics.
- Subtle Output Quality Differences: Relative ranking captures marginal but meaningful quality distinctions overlooked by absolute confidence alone.
These methodological choices underpin the observed gains in PSNR, SSIM, and MSE, as well as the perceptual realism of the decompositions.
7. Significance and Future Directions
IntrinsicReal establishes that careful fusion of absolute and relative pseudo-labeling, when operationalized through an iterative, classifier-bootstrapped, and DPO-fine-tuned pipeline, yields superior adaptation of synthetic intrinsic image decomposition models to real-world data. The precision and flexibility of this framework make it applicable to other domain adaptation tasks in vision where paired ground-truth is absent and domain gaps are pronounced.
Extensions may include adaptation to specialized object classes, incorporation of multi-modal priors, or further automation of pseudo-label generation. The methodology points toward a general paradigm for leveraging synthetic data and proxy supervision to bridge the realism gap in high-level image understanding tasks.