Embodied Domain Adaptation for Object Detection (EDAOD)
- EDAOD is a framework that facilitates source-free domain adaptation for open-vocabulary object detection in dynamic, real-world indoor settings.
- It employs a Mean Teacher framework, temporal clustering, and multi-threshold fusion to robustly address sequential domain shifts and diverse environment layouts.
- Experimental results on benchmarks like iTHOR and Matterport3D demonstrate its potential for lifelong robotic perception in privacy-sensitive applications.
Embodied Domain Adaptation for Object Detection (EDAOD) is a framework and benchmark designed to facilitate robust, continual object detection for mobile robots and embodied agents as they operate in dynamic, real-world indoor environments such as homes and laboratories. Unlike classical or open-vocabulary object detection, EDAOD specifically addresses the combined challenges of domain shift, sequential environmental changes, object diversity, and the lack of access to labeled source data—conditions characteristic of real deployment scenarios in robotics and embodied AI.
1. Problem Setting and Motivation
EDAOD focuses on the source-free domain adaptation (SFDA) setting, where adaptation to the target environment is performed without retaining any source domain data. This is motivated by privacy and proprietary concerns common in domestic and sensitive environments. EDAOD targets open-vocabulary object detection (OVOD), where the set of object categories is potentially unbounded and subject to change, and where models must adapt to:
- Changeable lighting, rearranged or newly introduced objects, and diverse layouts,
- Sequential, egocentric input streams reflecting the agent’s movement and experience,
- Environments in which only the target domain data (typically unlabeled) is available during adaptation.
The EDAOD benchmark and methodology were introduced to enable zero-shot detection performance gains under such continual, real-world domain shifts (2506.21860).
2. Technical Methodology
EDAOD presents a comprehensive SFDA pipeline for OVOD that couples unsupervised adaptation with temporal and contrastive learning, implemented as follows:
Mean Teacher Framework for OVOD
EDAOD builds upon the Mean Teacher paradigm, initializing both student and teacher detection models from a pre-trained open-vocabulary detector (Detic). The teacher is updated as an exponential moving average (EMA) of the student parameters:

$$\theta_t \leftarrow \alpha\,\theta_t + (1 - \alpha)\,\theta_s$$

where $\theta_t$ and $\theta_s$ denote the teacher and student parameters and $\alpha$ is the EMA momentum.
Pseudo-labels are generated by the teacher on weakly augmented inputs, and the student is trained using both these pseudo-labels and strongly augmented views. The detection loss for each batch aggregates RPN, classification, and regression objectives over all stages of Cascade R-CNN.
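As a minimal sketch, the EMA update above can be written as follows (plain dicts of floats stand in for model state dicts, and the `alpha` value is illustrative):

```python
def ema_update(teacher, student, alpha=0.999):
    """EMA update: teacher <- alpha * teacher + (1 - alpha) * student.

    Parameters are plain dicts of floats here; in practice they would be
    the models' parameter tensors.
    """
    for name in teacher:
        teacher[name] = alpha * teacher[name] + (1 - alpha) * student[name]
    return teacher
```

Because `alpha` is close to 1, the teacher evolves slowly and smooths out noisy student updates, which is what makes its pseudo-labels comparatively stable.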
Temporal Clustering of Pseudo-Labels
Instead of standard per-frame pseudo-label selection, EDAOD leverages temporal consistency across sequential frames. Bounding boxes for object proposals are tracked using a spatio-appearance similarity score:

$$s_{ij} = \frac{\mathrm{IoU}(b_i, b_j)}{\lVert f_i - f_j \rVert}$$

where the numerator is the intersection over union between bounding boxes $b_i$ and $b_j$, and the denominator is the distance between their feature embeddings $f_i$ and $f_j$. Assignment is performed using the Hungarian algorithm, keeping matches whose score exceeds a threshold $\delta$.
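The tracking step can be sketched as follows, assuming the score form described above (IoU divided by feature-embedding distance); the helper names and threshold value are illustrative, and SciPy's `linear_sum_assignment` supplies the Hungarian matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def match_tracks(prev_boxes, prev_feats, cur_boxes, cur_feats, thresh=0.5):
    """Associate proposals across frames by IoU / feature-embedding distance,
    then solve the assignment with the Hungarian algorithm."""
    score = np.zeros((len(prev_boxes), len(cur_boxes)))
    for i in range(len(prev_boxes)):
        for j in range(len(cur_boxes)):
            dist = np.linalg.norm(prev_feats[i] - cur_feats[j]) + 1e-8
            score[i, j] = iou(prev_boxes[i], cur_boxes[j]) / dist
    rows, cols = linear_sum_assignment(-score)  # negate to maximize similarity
    return [(i, j) for i, j in zip(rows, cols) if score[i, j] >= thresh]
```

Thresholding after the assignment discards low-confidence matches so that new objects start fresh tracks rather than being forced into existing ones.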
Cluster merging across time is performed using feature similarity at segment boundaries, with clusters of the same object being merged if their boundary features exceed a similarity threshold. Each resulting cluster represents a temporally consistent object identity, and the logits of all its proposals are averaged to obtain a more reliable pseudo-label:

$$\hat{z} = \frac{1}{|C|} \sum_{k \in C} z_k$$

where $C$ is the set of proposals in a cluster and $z_k$ are their predicted class logits.
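A minimal sketch of this cluster-level pseudo-label (averaging per-proposal logits, with the class taken as the argmax of the mean; the function name is illustrative):

```python
import numpy as np

def cluster_pseudo_label(cluster_logits):
    """Average the class logits of all proposals in one temporal cluster.

    The pseudo-label class is the argmax of the mean logits, which is more
    stable than any single frame's prediction.
    """
    mean_logits = np.mean(np.stack(cluster_logits), axis=0)
    return mean_logits, int(np.argmax(mean_logits))
```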
Multi-Scale Threshold Fusion
Recognizing that any single clustering threshold $\delta$ trades off between under- and over-clustering, EDAOD runs the full adaptation pipeline with multiple thresholds $\{\delta_1, \dots, \delta_K\}$ and forms an ensemble by averaging the resulting teacher model weights:

$$\bar{\theta}_t = \frac{1}{K} \sum_{k=1}^{K} \theta_t^{(k)}$$
This fusion integrates models tuned for both high precision and high recall pseudo-labels, improving adaptation robustness across unfamiliar and variable domains.
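The weight-space fusion can be sketched as a uniform average over state dicts (plain dicts of floats here; in practice these would be the adapted teachers' parameter tensors):

```python
def fuse_teacher_weights(state_dicts):
    """Uniformly average teacher weights adapted under different
    clustering thresholds into a single fused model."""
    fused = {}
    for name in state_dicts[0]:
        fused[name] = sum(sd[name] for sd in state_dicts) / len(state_dicts)
    return fused
```

Averaging in weight space (rather than ensembling predictions) keeps inference cost identical to a single model.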
InfoNCE Contrastive Consistency
Within each object cluster, EDAOD applies an InfoNCE-style contrastive loss to enforce feature consistency among detections tracked as the same object across time and viewpoints:

$$\mathcal{L}_{con} = -\frac{1}{|P|} \sum_{p \in P} \log \frac{\exp(\mathrm{sim}(q, f_p)/\tau)}{\sum_{k \in P \cup N} \exp(\mathrm{sim}(q, f_k)/\tau)}$$

where $q$ is the query feature (teacher), $f_p$ for $p \in P$ are its cluster-mate positive features (student), $f_k$ for $k \in N$ are negatives, and $\tau$ is the temperature.
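A minimal NumPy sketch of such an InfoNCE loss over one cluster (cosine similarity and a denominator over all positives and negatives are assumptions of this sketch; the `tau` value is illustrative):

```python
import numpy as np

def info_nce(query, positives, negatives, tau=0.07):
    """InfoNCE over one object cluster: the query (teacher feature) should
    score higher against its cluster-mate positives (student features)
    than against negatives from other clusters."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    pos = np.array([cos(query, p) for p in positives]) / tau
    neg = np.array([cos(query, n) for n in negatives]) / tau
    log_denom = np.log(np.exp(np.concatenate([pos, neg])).sum())
    # mean over positives of the negative log softmax probability
    return float(np.mean(log_denom - pos))
```

The loss approaches zero when positives dominate the softmax and grows as negatives become as similar to the query as the positives.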
A KL divergence loss ensures overall feature distribution alignment between teacher and student:

$$\mathcal{L}_{KL} = D_{KL}\!\left(p_t \,\|\, p_s\right)$$

where $p_t$ and $p_s$ are the teacher and student proposal feature distributions.
The total loss is:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{KL}$$

where $\mathcal{L}_{det}$ is the Mean Teacher detection loss and $\lambda_1, \lambda_2$ are weighting coefficients.
3. Benchmarking and Evaluation Protocols
The EDAOD benchmark evaluates the pipeline across challenging scenarios:
- Same trajectory adaptation: Standard SFDA, i.e., adapt and evaluate on the same room and configuration.
- Next layout adaptation: After adapting to one configuration of a room, test transfer to a new layout of the same room.
- Continual learning: Sequentially adapt and generalize across an entire suite of layouts, monitoring for catastrophic forgetting and cumulative detection improvement.
Datasets include iTHOR (photorealistic simulated homes with 122 object categories) and Matterport3D (real-world multi-room buildings with varying lighting and layouts and diverse furniture and objects). The evaluation metrics are mean Average Precision (mAP) and AP.
4. Experimental Results
EDAOD consistently outperforms strong baselines (Mean Teacher, MemCLR, IRG-SFDA, UDAc-SFDA) in all settings:
- On iTHOR's Kitchen CL scenario: EDAOD achieves 36.03 AP vs. 29.87 (Mean Teacher) and 28.42 (source-only).
- On MP3D-A (real data): EDAOD reaches 22.57 AP vs. 20.26 (source-only) and 18.39 (Mean Teacher).
Ablation studies show each technique—temporal clustering, contrastive learning, and multi-threshold fusion—provides measurable improvements.
Robot experiments in a real laboratory setting under low-light conditions confirm that adaptation nearly doubles the average number of detections per class, substantiating EDAOD's value in non-simulated environments with challenging lighting.
5. Technical Formulations and Key Components
| Component | Description |
|---|---|
| Mean Teacher loss ($\mathcal{L}_{det}$) | Consistency between student and teacher under augmentations |
| Temporal clustering | Instance tracks over time for robust pseudo-labels |
| Multi-scale threshold fusion | Ensemble of models from multiple cluster similarity thresholds |
| InfoNCE contrastive loss ($\mathcal{L}_{con}$) | Enforces temporal and viewpoint-invariant consistency within each object identity cluster |
| KL divergence ($\mathcal{L}_{KL}$) | Aligns higher-level proposal feature distributions between student and teacher |
Relevant hyperparameters include the EMA momentum $\alpha$, the contrastive temperature $\tau$, and the instance fusion thresholds $\{\delta_k\}$.
6. Applications, Impact, and Future Directions
EDAOD is directly applicable to:
- Service and domestic robotics: Enables robots to adapt object detectors in the field without revisiting or storing proprietary training datasets, responding to household layout rearrangements and evolving inventories.
- Lifelong perception systems: As agents traverse ever-changing environments, EDAOD’s ability to generalize across new rooms and lighting enables continual, cumulative learning.
- Privacy-sensitive settings: Effective source-free adaptation aligns with privacy and data-ownership requirements by obviating the need to share training data.
This suggests that EDAOD's temporal, open-vocabulary, and source-free approach is well suited for embodied AI, addressing both practical and technical barriers to robust perception in the wild. Future directions include integrating active perception policies, and exploring meta-learning or memory-augmented extensions that support even faster adaptation to novel domains.
7. Summary Table: Core Components and Benefits
| Technique | Role in EDAOD Pipeline | Impact on Embodied Detection |
|---|---|---|
| Temporal clustering | Aggregates object instances across frames | Stabilizes pseudo-labels; exploits sequence information |
| Multi-threshold fusion | Balances precision and recall in clustering | Robustness to new domain statistics |
| Contrastive loss | Maximizes intra-cluster feature consistency | Improves viewpoint/lighting invariance |
| Source-free Mean Teacher | Enables unsupervised continual adaptation | Adaptation without access to source data |
Conclusion
EDAOD represents a landmark methodology and benchmark for robust, sequential, source-free adaptation of object detectors in embodied, open-world settings. By leveraging temporal sequence cues, multi-scale thresholding, and contrastive learning atop an OVOD backbone—all with only target domain data—EDAOD establishes a new baseline for continual, zero-shot, and privacy-preserving object detection in dynamic real environments.