
Video See-Through HMD Systems

Updated 31 December 2025
  • Video see-through HMD technologies are systems that digitally reconstruct both real-world views and synthetic content using forward-facing cameras and internal displays.
  • They face practical challenges such as spatial warping from lens distortions and measurable latencies (around 21 ms), which affect precise user interaction.
  • Mitigation strategies involve software pre-warping, hardware optimizations, and personalized calibration to enhance spatial accuracy and improve user experience.

Video see-through (VST) head-mounted display (HMD) technologies enable mediated perception of the physical world by streaming digitized video from forward-facing cameras to internal screens positioned before each eye. Distinct from optical see-through (OST) systems, which overlay virtual elements onto a minimally modified view of the real world, VST HMDs reconstruct both real and synthetic environments pixelwise, introducing a series of perceptual deviations. These deviations arise from camera and lens distortions, display characteristics, and systematic latency, each contributing to spatial warping and temporal imprecision experienced during real-world interaction tasks (Lange et al., 2024). The following sections provide a comprehensive technical overview of the architecture, experimental evaluation, quantitative findings, implications for system design, and prospects for further research.

1. System Architecture and Deviation Sources

The VST HMD capture pipeline uses two forward-facing cameras (positioned roughly 10 cm in front of the eyes) to sample real-world imagery. Camera intrinsics (focal length, aperture, resolution) and extrinsics fundamentally shape the sampled view. Radial pincushion distortion, arising from both the camera optics and the HMD's non-Fresnel eye-lenses, induces spatial warping that increases toward the image periphery. Captured frames are processed by the device's graphics pipeline (e.g., Varjo Base) and rendered onto internal Bionic™ micro-OLED displays situated behind ultra-wide non-Fresnel lenses.
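To make the radial warping concrete, the Python sketch below applies the standard Brown radial distortion polynomial to normalized image coordinates. The coefficient values and sign convention are illustrative assumptions, not parameters of the Varjo pipeline; the point is that displacement grows with distance from the optical center, matching the peripheral error magnification reported below.

```python
import numpy as np

def radial_distort(points, k1, k2=0.0):
    """Brown radial distortion model on normalized coordinates
    (origin at the optical center). With this sign convention,
    k1 > 0 pushes points outward (pincushion-like); k1 < 0 pulls
    them inward (barrel-like). Conventions vary across tools."""
    pts = np.asarray(points, dtype=float)
    r2 = np.sum(pts**2, axis=-1, keepdims=True)  # squared radius per point
    scale = 1.0 + k1 * r2 + k2 * r2**2           # radial gain grows with radius
    return pts * scale

# Illustrative coefficient: a peripheral point is displaced far more
# than a central one under the same lens model
# (displacements: ~1.5e-4 at r=0.1 vs ~0.11 at r=0.9, in normalized units).
print(radial_distort([[0.1, 0.0], [0.9, 0.0]], k1=0.15))
```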

The display pipeline introduces compounding distortions: additional pincushion warping from the display optics, together with altered brightness, contrast, and color, further degrades the correspondence between mediated and direct vision. Latency from frame acquisition to display output is measured at 21.13 ms (± 3.83%). Additional error sources include vergence–accommodation conflict from the HMD's fixed 1.5 m focal plane, motion-parallax mismatches under head movement, a restricted field of view (FOV) relative to natural vision, and calibration errors such as head-tracking jitter and interpupillary distance (IPD) mis-calibration. Whereas OST HMDs use transparent optics that leave the real-world view largely undistorted, VST systems render real and virtual content through the same pipeline, so these compounded distortions affect perception of both equally.
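As a rough illustration of why this latency matters for interaction, the short sketch below converts the measured end-to-end latency into apparent world displacement during head rotation. Only the 21.13 ms latency and the 35 cm viewing distance come from the text; the head-rotation rate is an assumed value.

```python
import math

latency_s = 21.13e-3        # measured end-to-end latency (from the text)
head_rate_deg_s = 50.0      # assumed moderate head-rotation speed
distance_m = 0.35           # touchscreen distance from the camera plane

# During rotation, the displayed world lags the head by rate * latency.
angular_lag_deg = head_rate_deg_s * latency_s
lag_mm = math.tan(math.radians(angular_lag_deg)) * distance_m * 1000.0

print(f"angular lag: {angular_lag_deg:.2f} deg -> ~{lag_mm:.1f} mm at 35 cm")
# ~1.06 deg -> ~6.5 mm, on the order of the baseline touch error itself.
```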

2. Experimental Protocols for Perceptual Evaluation

A structured human-in-the-loop experiment was implemented to quantify perceptual deviations in a static VST-HMD usage scenario (Lange et al., 2024). Two conditions were evaluated:

  • No-HMD (baseline): Participants viewed and interacted with a touchscreen using unaided vision, with a chin-rest for head immobilization.
  • HMD (VST view): Participants wore a statically mounted Varjo XR-3, with head tracking disabled to isolate static distortion effects.

Interaction took place on a 6 × 8 grid displayed on a Surface Studio touchscreen located 35 cm from the HMD/camera plane. In each sequence, 35 pseudo-randomly placed markers were flashed one at a time for 2 seconds; after a marker disappeared, participants had 3 seconds to tap its remembered location (“blind reaching”). Two sequences (70 trials per condition) were completed; only the first touch per marker was recorded, and a trial was marked as missed if no touch was registered within the window.
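A minimal sketch of one such sequence is given below, assuming a hypothetical touchscreen API (show_marker, hide_marker, poll_touch); the grid dimensions and timings follow the protocol described above.

```python
import random
import time

GRID_ROWS, GRID_COLS = 6, 8
FLASH_S, RESPONSE_S = 2.0, 3.0
N_MARKERS = 35

def run_sequence(show_marker, hide_marker, poll_touch, seed=0):
    """Run one 35-marker blind-reaching sequence and log first touches."""
    cells = [(r, c) for r in range(GRID_ROWS) for c in range(GRID_COLS)]
    results = []
    for target in random.Random(seed).sample(cells, N_MARKERS):
        flash_start = time.monotonic()
        show_marker(target)
        time.sleep(FLASH_S)                       # marker visible for 2 s
        hide_marker()
        touch = None
        deadline = time.monotonic() + RESPONSE_S  # 3 s response window
        while touch is None and time.monotonic() < deadline:
            touch = poll_touch()                  # first registered touch only
        results.append({"target": target,
                        "touch": touch,           # None marks a missed response
                        "rt": time.monotonic() - flash_start if touch else None})
    return results
```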

Spatial error for each trial $i$ was formalized as

$$d_i = \sqrt{(x_{i,\mathrm{reported}} - x_{i,\mathrm{target}})^2 + (y_{i,\mathrm{reported}} - y_{i,\mathrm{target}})^2},$$

and response latency per trial as $t_i = t_{i,\mathrm{touch}} - t_{i,\mathrm{flash\,start}}$. Statistical comparisons, including paired t-tests and one-way ANOVAs, were performed across conditions and spatial regions of the screen.
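A minimal Python sketch of these metrics and the paired comparison follows; the per-participant arrays are random stand-ins for the study's data, used only to show the computation.

```python
import numpy as np
from scipy import stats

def distance_error(reported, target):
    """Euclidean error d_i between reported and target touch points."""
    return np.linalg.norm(np.asarray(reported) - np.asarray(target), axis=-1)

def response_time(t_touch, t_flash_start):
    """Response latency t_i = t_touch - t_flash_start per trial."""
    return np.asarray(t_touch) - np.asarray(t_flash_start)

# Paired t-test across 25 participants' condition means (stand-in data;
# 25 participants give the df of 24 reported below).
rng = np.random.default_rng(0)
no_hmd = rng.normal(6.5, 2.1, size=25)
vst_hmd = rng.normal(11.8, 3.8, size=25)
t, p = stats.ttest_rel(no_hmd, vst_hmd)
print(f"t(24) = {t:.3f}, p = {p:.4g}")
```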

3. Quantitative Findings on Perceptual Deviations

Empirical results demonstrated increased positional inaccuracy and response latency with video see-through HMDs:

| Condition | Mean Distance Error (mm) | SD (mm) | Mean Response Time (s) | SD (s) |
|-----------|--------------------------|---------|------------------------|--------|
| No-HMD    | 6.529                    | 2.131   | 2.126                  | 0.134  |
| VST-HMD   | 11.817                   | 3.813   | 2.232                  | 0.169  |

Paired t-tests confirmed significantly larger mean distance errors in the VST condition (t(24) = −8.762, p < 0.001) and longer response times (t(24) = −3.715, p = 0.001). Spatial analysis revealed error magnification toward the periphery:

  • Horizontal extremes (leftmost/rightmost columns): 13.44 mm mean error, compared with central columns
  • Vertical extremes (top/bottom rows): 13.39 mm mean error, compared with middle rows

One-way ANOVAs confirmed significant spatial variation (F(3,69) = 14.966, p < 0.001; F(2,46) = 13.225, p < 0.001). No significant across-row or across-column variation in error was found in the baseline condition.
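The spatial test can be sketched as a one-way ANOVA over per-region error samples. The four column bands and their values below are illustrative stand-ins, chosen only to mirror the reported degrees of freedom (four groups give a between-groups df of 3).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in per-participant mean errors (mm) for four column bands.
far_left  = rng.normal(13.4, 3.5, size=24)
mid_left  = rng.normal(10.5, 3.0, size=24)
mid_right = rng.normal(10.5, 3.0, size=24)
far_right = rng.normal(13.4, 3.5, size=24)

f, p = stats.f_oneway(far_left, mid_left, mid_right, far_right)
print(f"F = {f:.3f}, p = {p:.4g}")  # significant p indicates peripheral magnification
```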

Learning effects were limited: mean distance error did not decrease significantly from the first to the second trial sequence, although response times shortened across sequences (no-HMD: t(24) = 2.399, p = 0.025; VST-HMD: t(24) = 4.747, p < 0.001). This suggests motor adaptation to the task alongside persistent static perceptual inaccuracies.

4. Design, Calibration, and User Experience Implications

Mitigation strategies for VST perceptual errors are software- and hardware-centric:

  • Distortion compensation: Apply inverse barrel pre-warping to camera feeds to counteract pincushion effects; employ per-device calibration by mapping 2D error fields via the touchscreen task and correcting spatial output through point-wise look-up tables before rendering (a minimal sketch of this look-up-table correction follows this list).
  • Hardware adjustments: Optimize lens geometry (fresnel/aspheric), refine camera intrinsics (less distortion, higher resolution), and minimize end-to-end latency through enhanced sensors, efficient codecs, and higher refresh-rate panels.
  • User experience adaptation: Provide visual feedback trials (reticle/cursor) to facilitate adaptation to static spatial distortions, integrate haptic cues or boundary reminders to ameliorate depth and scale underestimations, and utilize interactive onboarding calibration for personalized distortion correction.

A plausible implication is that combining both low-level (optical, electronic) and high-level (software, calibration, UX) interventions is required to approach the fidelity of direct perception in VST-mediated tasks.

5. Comparative Analysis with Optical See-Through HMDs

OST systems, employing transparent optics, offer minimal distortion of the real world, relegating perceptual deviation predominantly to virtual overlay registration and luminance contrast issues. In VST systems, the real and virtual are rendered equivalently on the same pixel matrix, leading to all captured distortions—optical and electronic—affecting both domains. This fundamental architectural difference underscores the unique calibration, compensation, and system evaluation requirements for VST technology, particularly in applications requiring precise real-world correspondence.

6. Recommendations and Future Research Directions

Further research initiatives are recommended to address the persistent spatial and temporal deviations identified in VST HMD systems (Lange et al., 2024):

  • Apply evaluation protocols to a broader range of VST headsets (including consumer-grade models) to characterize and compare distortion profiles.
  • Expand to dynamic scenarios involving mobile users and 6 DOF hand-tracking, focusing on motion-parallax, motion-to-photon latency, and dynamic calibration.
  • Isolate depth-perception factors via experiments manipulating vergence–accommodation conflict with adjustable focal planes, quantifying depth estimation (under/overestimation effects).
  • Conduct closed-loop compensation studies testing effectiveness of both software (distortion pre-warping) and hardware (optical optimizations) pipelines in reducing spatial errors and cybersickness.
  • Develop real-time adaptive calibration methods leveraging machine learning to continuously refine spatial distortion maps based on observed user interaction such as blind-reaching performance (a minimal adaptive-update sketch follows this list).
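A minimal sketch of such online refinement, referenced in the last bullet, is an exponentially weighted update of per-cell offsets from blind-reaching residuals; a learned model could replace this simple rule, and all names and parameters here are assumptions.

```python
import numpy as np

class AdaptiveDistortionMap:
    """Per-cell (dx, dy) offset estimates, refined from observed residuals."""

    def __init__(self, rows=6, cols=8, learning_rate=0.2):
        self.offsets = np.zeros((rows, cols, 2))
        self.lr = learning_rate

    def update(self, cell, reported_xy, target_xy):
        """Blend the newest residual into the running estimate for a cell."""
        residual = np.asarray(reported_xy) - np.asarray(target_xy)
        r, c = cell
        self.offsets[r, c] = (1 - self.lr) * self.offsets[r, c] + self.lr * residual

    def correct(self, cell, xy):
        """Pre-shift a point by the negated estimated offset for its cell."""
        r, c = cell
        return np.asarray(xy, dtype=float) - self.offsets[r, c]

m = AdaptiveDistortionMap()
m.update((0, 7), reported_xy=[0.93, 0.02], target_xy=[0.90, 0.00])
print(m.correct((0, 7), [0.90, 0.00]))  # shifted opposite the observed error
```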

These directions reflect a consensus that achieving perceptual parity with direct vision in VST HMDs will require iterative, cross-disciplinary advances in optics, real-time graphics, AI-powered calibration, and human factors engineering.
