Diminished Reality in XR

Updated 11 May 2026

Diminished Reality is the selective removal or attenuation of physical stimuli to simplify perception and lower information density.
DR employs techniques like deep learning-based inpainting, transformer models, and sensor fusion to achieve real-time performance.
Applications span surgical visualization, automotive safety, and mobile AR, enhancing user cognition and situational awareness.

Diminished Reality (DR) is a paradigm within the mediated and mixed reality spectrum defined by the selective removal, attenuation, or occlusion of real-world stimuli from the user’s perceptual scene. Unlike Augmented Reality, which adds information, or Virtual Reality, which replaces reality altogether, DR “subtracts” physical objects, visual clutter, or sensory input via computational means. It is quantitatively characterized as a reduction in information density (entropy) and/or perceptual frustum, resulting in a mediated environment with decreased cognitive demands on the user. DR has become a central research theme in eXtended Reality (XR), pervasive computing, medical visualization, and human–machine interaction.

1. Foundational Principles and Theoretical Formalizations

DR is formally situated within the virtuality continuum and the taxonomy of mediated realities. In Mann’s framework, DR occupies the region of the Atoms–Bits–Genes (physical–virtual–social) XR-cube near the origin (α ≈ 0, β ≈ 0), indicating minimized physical and virtual stimuli (Mann et al., 2024). The key information-theoretic concept is the blending entropy $E_b$, defined as:

$E_{b} = \int_{F} \rho(x)\,dx$

where $\rho(x)$ is the local information (entropy) density at point $x$ and $F$ is the perceptual frustum, that is, the region in space that the user must integrate for the current task. DR operations selectively reduce $\rho(x)$ in specific regions $R \subseteq F$ (via removal, suppression, or inpainting). In the limiting case where the diminished region contributes no residual information, the overall entropy reduction is $\Delta E_b = \int_{R} \rho_{\mathrm{real}}(x)\,dx$, with $\rho_{\mathrm{real}}$ the density of the unmodified scene (Tiefenbacher et al., 2016).
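
The two integrals above can be approximated numerically on a discretized frustum. The following is a minimal sketch, not taken from the cited works: it assumes $\rho(x)$ is sampled on a regular 2D grid with constant cell area, and the names `blending_entropy`, `entropy_reduction`, `diminish`, and the `residual` parameter are hypothetical, introduced only for illustration.

```python
def blending_entropy(rho, cell_area=1.0):
    """Riemann-sum approximation of E_b: density samples times cell area."""
    return sum(sum(row) for row in rho) * cell_area


def entropy_reduction(rho, mask, residual=0.0, cell_area=1.0):
    """Approximate Delta E_b over the region R (cells where mask is True).

    residual is the assumed fraction of density left after diminishing;
    residual=0.0 models complete removal.
    """
    removed = sum(
        (1.0 - residual) * value
        for rho_row, mask_row in zip(rho, mask)
        for value, in_region in zip(rho_row, mask_row)
        if in_region
    )
    return removed * cell_area


def diminish(rho, mask, residual=0.0):
    """Return a new density grid with the masked cells attenuated."""
    return [
        [residual * value if in_region else value
         for value, in_region in zip(rho_row, mask_row)]
        for rho_row, mask_row in zip(rho, mask)
    ]


# Toy density grid: the centre cell is a high-information distractor.
rho = [[1.0, 2.0, 1.0],
       [0.5, 4.0, 0.5],
       [1.0, 2.0, 1.0]]
mask = [[False, False, False],
        [False, True, False],
        [False, False, False]]

e_before = blending_entropy(rho)
delta = entropy_reduction(rho, mask)
e_after = blending_entropy(diminish(rho, mask))
print(e_before, delta, e_after)  # 13.0 4.0 9.0
```

With `residual=0.0` the sketch reproduces the identity $E_b^{\mathrm{after}} = E_b^{\mathrm{before}} - \Delta E_b$; a non-zero `residual` would model attenuation rather than full removal.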

The regime of “pure” DR corresponds to maximal attenuation, bordering sensory deprivation (float tanks, blackout hoods), while “partial” DR makes use of selective object removal or suppression (e.g., hiding distracting signage, muting non-essential noise) (Mann et al., 2024).

2. Computational Methods and System Architectures

Most DR systems comprise object segmentation, region selection, and generative or analytic inpainting/inference. The leading methodologies include:

Learning-Based RGB(-D) Inpainting: DeepDR jointly inpaints color and depth, fusing RGB and geometry through a shared SPADE (Spatially-Adaptive Denormalization) decoder, gated convolutions, and a recurrent ConvLSTM state for temporal consistency. Its losses span adversarial, perceptual, style, depth gradient, segmentation, and temporal coherence terms. The system reaches frame rates of ∼225 fps and outperforms prior work on LPIPS, FID, MAE, and RMSE metrics (Gsaxner et al., 2023).
Zero-Shot Video Inpainting: Optimised ProPainter adapts transformer-based, recurrent flow-completion networks for temporal coherence in surgical video DR, employing multi-frame deformable convolutions and mask-guided sparse attention, with measured improvement on W-FID, W-MAE, PSNR, and LPIPS against image-inpainting and diffusion baselines (Li et al., 2024).
Volumetric Multi-Layer Rendering: Medical DR systems synthesize layered views (foreground, recovered background, and tomographic images) via offline or GPU-accelerated TSDF fusion, multi-pass raytracing, and depth-difference masking. Foreground occluders are located via per-pixel depth deviation from a background model; second-run raytracing reconstructs occluded background, enabling adjustable transparency and compositing (Habert et al., 2017).
Block-Based and Temporal Models: The half-diminished reality approach builds a stable background model from fixed camera input using block-wise stability detectors, ring-buffer delay, and per-pixel α-blending, yielding synchronized transparent overlays of e.g. hands manipulating objects (Okumoto et al., 2017).
Real-Time Edge/Cloud-Offloaded DR: For wearable/mobile DR, an end-to-end client–edge pipeline offloads instance segmentation, inpainting (e.g., Poisson or U-Net), and 3D pose estimation to a server, maintaining sub-33 ms latencies suitable for 30 fps interactive use (Ke et al., 2023). Object substitution is realized by mapping preserved pose to virtual replacements after region removal.
Automotive DR Pipelines: MIRAGE implements a modular Unity pipeline with YOLO11s-seg for real-time object segmentation, DepthAnything for per-pixel depth estimation, MI-GAN-based inpainting, and post-processing for compositing diminished visualizations (e.g., in a vehicle’s HMD), at up to 38.9 fps on commodity GPUs (Jansen et al., 27 Jan 2026).
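
The depth-difference masking and adjustable-transparency compositing mentioned for the volumetric and half-diminished approaches can be illustrated on toy data. This is a simplified sketch under stated assumptions rather than either paper's implementation: images are small grayscale grids, the recovered background is assumed to be already available, and `foreground_mask`, `composite`, `threshold`, and `alpha` are hypothetical names.

```python
def foreground_mask(depth, background_depth, threshold=0.05):
    """Flag pixels that lie closer to the camera than the background model.

    A pixel counts as an occluder when its depth is smaller than the
    background depth by more than `threshold` (same unit as the depth maps).
    """
    return [
        [(bg - d) > threshold for d, bg in zip(d_row, bg_row)]
        for d_row, bg_row in zip(depth, background_depth)
    ]


def composite(live, recovered, mask, alpha=1.0):
    """Blend recovered background over occluder pixels.

    alpha=1.0 removes the occluder completely, alpha=0.0 leaves the live
    image untouched, and values in between give a see-through occluder.
    """
    return [
        [(1.0 - alpha) * l + alpha * r if m else l
         for l, r, m in zip(live_row, rec_row, mask_row)]
        for live_row, rec_row, mask_row in zip(live, recovered, mask)
    ]


# Background wall at 2.0 m; a hand-like occluder at roughly 0.5-0.6 m.
background_depth = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
depth = [[2.0, 2.0, 0.6], [2.0, 0.5, 0.6]]
live = [[10, 10, 200], [10, 200, 200]]        # bright occluder pixels
recovered = [[10, 10, 12], [10, 11, 12]]      # assumed background estimate

mask = foreground_mask(depth, background_depth)
removed = composite(live, recovered, mask, alpha=1.0)
see_through = composite(live, recovered, mask, alpha=0.5)
print(mask)
print(removed)
print(see_through)
```

In a full system the mask would come from a segmentation network or a fused depth model, and `recovered` from inpainting or second-pass rendering; only the final blending step is shown here.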

3. Evaluation Metrics, Artifacts, and System Performance

Diminished Reality quality is quantitatively measured through:

Metric	Description	Paper
LPIPS ↓ / PSNR ↑ / MAE ↓	Learned perceptual similarity (LPIPS), peak signal-to-noise ratio (PSNR), and mean absolute error (MAE)	(Gsaxner et al., 2023)
FID / W-FID ↓	Distributional match of generated/inpainted content to real	(Li et al., 2024)
Depth RMSE ↓	Geometric fidelity of inpainted depth	(Gsaxner et al., 2023)
Segmentation mIoU ↑	Mask accuracy for instance/object segmentation	(Jansen et al., 27 Jan 2026)
Frame Rate (fps), Latency (ms)	Runtime performance, e.g., DeepDR 4.43 ms per 256² frame, MIRAGE 20–39 fps	(Gsaxner et al., 2023; Jansen et al., 27 Jan 2026)
Recovery Rate (%)	Fraction of occluded pixels successfully replaced in background recovery	(Habert et al., 2017)
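
Several of the metrics in the table have simple closed forms. The sketch below shows standard definitions of MAE, PSNR, and IoU on flattened pixel lists, plus one plausible reading of recovery rate as the percentage of occluded pixels that were replaced. The function names and the `recovery_rate` definition are assumptions for illustration; the cited papers may compute these quantities differently (for example, masked or weighted variants). LPIPS and FID require pretrained networks and are omitted.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equally sized pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return intersection / union if union else 1.0


def recovery_rate(occluded, recovered):
    """Percentage of occluded pixels for which background was recovered."""
    n_occluded = sum(1 for o in occluded if o)
    if n_occluded == 0:
        return 100.0
    n_recovered = sum(1 for o, r in zip(occluded, recovered) if o and r)
    return 100.0 * n_recovered / n_occluded


print(mae([0, 10], [0, 20]))                                # 5.0
print(iou([True, True, False], [True, False, False]))       # 0.5
print(recovery_rate([True] * 4, [True, True, True, False])) # 75.0
```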

Artifacts can arise due to segmentation errors (residuals or missed occlusions), context-incoherent or temporally inconsistent inpainting (flicker, ghosting), or optical misalignment (parallax, blending artifacts in OST-HMDs). In the medical domain, accuracy of anatomical structure recovery is critical, with recovery rates reported from 45% to 97% depending on occluder proximity (Habert et al., 2017).

Practical pipelines trade off spatial detail against runtime and memory; mask expansion, resolution downsizing, and asynchronous pipeline scheduling are common optimizations.
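
Two of these optimizations are easy to show in isolation. The sketch below is an assumption-laden illustration, not a pipeline from the cited works: `dilate` expands a binary mask by a square neighbourhood so that small segmentation errors at object boundaries are also inpainted, and `downsample` reduces resolution by nearest-neighbour subsampling. Both names and the `radius` and `factor` parameters are hypothetical.

```python
def dilate(mask, radius=1):
    """Expand a boolean mask by `radius` pixels in every direction."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Mark the square neighbourhood, clipped to the image bounds.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = True
    return out


def downsample(image, factor=2):
    """Keep every `factor`-th pixel along both axes (nearest neighbour)."""
    return [row[::factor] for row in image[::factor]]


mask = [[False] * 5 for _ in range(5)]
mask[2][2] = True                       # single detected pixel
expanded = dilate(mask, radius=1)
print(sum(sum(row) for row in expanded))  # 9

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(downsample(image, factor=2))        # [[1, 3], [9, 11]]
```

A larger `radius` hides more boundary residue but enlarges the region the inpainter must fill, which is one concrete form of the detail-versus-runtime trade-off.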

4. Applications Across Domains

DR has been demonstrated across multiple real-world and experimental platforms:

Surgical Visualization: Automated removal of hands/instruments from intraoperative video to enhance field awareness and enable fused anatomical/fluoroscopic overlays (Habert et al., 2017; Li et al., 2024).
Automotive Safety and Cognitive Load Reduction: Real-time removal (inpainting or desaturation) of irrelevant or distracting objects in the driver’s field of view, supporting higher-level situation awareness and decluttering (Jansen et al., 27 Jan 2026).
Office and Daily Life: DiminishAR’s AR-based camouflage and substitution interventions (HoloLens 2) visually “erase” distractors (e.g., smartphones), leading to cognitive performance matching that of physically removing the object. This is achieved via 3D pose estimation, background-matched texturing, and contextual occlusion (Lee et al., 2024).
Frontline and Industrial Work: DR-enabled glasses for welding (attenuating all but the arc), construction sites (removing non-critical visual/auditory distractions), or emergency response (suppressing irrelevant channels) (Mann et al., 2024).
Mobile and Metaverse: Edge-offloaded object removal/substitution on mobile AR headsets, enabling real-time scene adaptation and interaction in multi-user virtual–physical spaces (Ke et al., 2023).

5. Technical and Theoretical Challenges

Key research challenges include:

Temporal Coherence: Mitigating flicker and structural discontinuities in video DR remains difficult; ConvLSTM-based temporal encoding and optical-flow-based temporal losses are active research directions (Gsaxner et al., 2023; Li et al., 2024).
Structure Preservation: Faithful completion of complex geometry and semantics demands cross-modal (RGB–depth), structure-aware fusion (e.g., SPADE decoders, multi-stream networks) and robust semantic priors (Gsaxner et al., 2023).
Latency and Wearability: Mobile, low-power, and real-time DR systems must minimize end-to-end latency (target <50 ms) and energy, requiring edge-efficient networks and selective cloud computation (Ke et al., 2023; Mann et al., 2024).
Segmentation and Masking: Reliable mask generation is foundational; performance bottlenecks and error rates are directly related to mIoU and the quality of object proposals. In automotive safety, segmentation IoU must approach 100% to avoid “false omission” of critical entities (Jansen et al., 27 Jan 2026).
Sensor Fusion: Combining RGB, depth, audio, and inertial data is needed for robust multi-modal attenuation, especially for full-context DR (audio, visual, spatial) (Mann et al., 2024).
Ethics and Safety: Selective reality manipulation introduces risks of information suppression, bias, and “dark patterns”; explicit icons and opt-in mechanisms are necessary for user and bystander transparency (Mann et al., 2024; Jansen et al., 27 Jan 2026; Lee et al., 2024).
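
The latency figures quoted in this article are related by simple arithmetic: a frame time of $t$ milliseconds corresponds to $1000/t$ frames per second. The sketch below converts between the two and checks a per-frame budget. The stage timings in `stages` are invented placeholders, not measurements from any cited system, and the check assumes stages run sequentially (asynchronous scheduling would raise throughput without shortening end-to-end latency). All function names are hypothetical.

```python
def fps_to_ms(fps):
    """Per-frame time budget in milliseconds for a target frame rate."""
    return 1000.0 / fps


def ms_to_fps(ms):
    """Frame rate achievable when one frame takes `ms` milliseconds."""
    return 1000.0 / ms


def fits_budget(stage_ms, target_fps):
    """True if the summed sequential stage times fit in one frame period."""
    return sum(stage_ms.values()) <= fps_to_ms(target_fps)


# Figures quoted in the text.
print(round(ms_to_fps(4.43), 1))   # 225.7 (DeepDR at 4.43 ms per frame)
print(round(fps_to_ms(30), 1))     # 33.3  (budget for 30 fps)
print(round(fps_to_ms(38.9), 1))   # 25.7  (frame time at 38.9 fps)

# Hypothetical stage timings for an edge-offloaded pipeline, in ms.
stages = {
    "segmentation": 12.0,
    "inpainting": 9.0,
    "network_round_trip": 8.0,
    "compositing": 2.0,
}
print(fits_budget(stages, target_fps=30))  # True  (31.0 ms <= 33.3 ms)
print(fits_budget(stages, target_fps=60))  # False (31.0 ms > 16.7 ms)
```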

6. Comparative Performance and Empirical Outcomes

Empirical studies demonstrate DR’s impact across both technical and user-centric metrics. In (Lee et al., 2024), AR-based DR (camouflage or contextual substitution) reduced the cognitive load of smartphone presence, yielding OSPAN and RSPM scores statistically indistinguishable from the “physically removed” condition (e.g., OSPAN scores µ≈13.4 for DR vs. µ=14.5 for removal). Temporal robustness, user experience evaluations, and recovery rates have likewise been benchmarked in medical and operational settings (Habert et al., 2017; Jansen et al., 27 Jan 2026).

User studies and system evaluations highlight both technical efficacy (plausibility, minimal flicker, frame rates >200 fps for DeepDR (Gsaxner et al., 2023)) and subjective factors (habituation to holograms, demand for controllability and observability of DR state).

7. Future Directions

The trajectory of DR research is toward ubiquitous, context-aware, and controllable systems:

Advanced Segmentation and Inpainting: Integration of transformer models, multi-scale attention, and improved shadow/object mask extraction (e.g., SAM) (Gsaxner et al., 2023; Li et al., 2024).
Domain-Adaptation and Few-Shot Learning: Medical and safety-critical DR will incorporate domain-specific adaptation for improved anatomical/scene plausibility (Li et al., 2024).
Continuous Blending-Entropy Optimization: Dynamic adaptation of $E_b$ via real-time user/task modeling to maintain optimal cognitive load (Tiefenbacher et al., 2016).
Multi-Modal and Multi-User Systems: Synchronization and consistent background/pose sharing in collaborative XR environments (Ke et al., 2023).
Wearable Architectures: A move toward lightweight (<50 g), real-time (≥60 fps) “smart eyeglasses” integrating AI for scene understanding and adaptive DR (Mann et al., 2024).
Ethical/Regulatory Frameworks: Guidelines for transparency, control, and user consent to mitigate social, safety, and privacy risks (Mann et al., 2024; Lee et al., 2024).
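
Continuous blending-entropy optimization, listed among the directions above, could take many forms. One purely hypothetical illustration is a greedy budget rule: given per-region entropy estimates and a target $E_b$, diminish the highest-entropy non-critical regions until the total falls under the target, and never touch regions flagged as safety-critical. The region names, entropy values, `critical` flag, and the function `select_regions_to_diminish` are all invented for this sketch and do not come from the cited works.

```python
def select_regions_to_diminish(regions, budget):
    """Greedily pick non-critical regions to remove until E_b <= budget.

    regions: list of (name, entropy, critical) tuples.
    Returns (names_to_remove, remaining_entropy). If the budget cannot be
    met without removing critical regions, the remaining entropy stays
    above the budget and the caller must handle that case.
    """
    total = sum(entropy for _, entropy, _ in regions)
    removable = sorted(
        (r for r in regions if not r[2]),
        key=lambda r: r[1],
        reverse=True,  # largest entropy contribution first
    )
    removed = []
    for name, entropy, _ in removable:
        if total <= budget:
            break
        removed.append(name)
        total -= entropy
    return removed, total


scene = [
    ("pedestrian", 3.0, True),    # safety-critical: must stay visible
    ("billboard", 5.0, False),
    ("parked_car", 2.0, False),
    ("road_sign", 1.5, True),
]

names, remaining = select_regions_to_diminish(scene, budget=7.0)
print(names, remaining)  # ['billboard'] 6.5
```

In practice the budget itself would be driven by a user or task model, and the criticality flags by the segmentation stage, which is why mask reliability and user-visible control over the DR state matter for any such scheme.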

Diminished Reality, by leveraging deep generative modeling, volumetric perception, and real-time AI, is positioned as a central component of future XR systems, enabling context-driven information attenuation for improved cognition, safety, and attention management.