- The paper introduces a unified two-stage Gaussian optimization framework for reconstructing renderable 4D HDR scenes from unposed, alternating-exposure monocular videos.
- It leverages vision foundation model priors to initialize depth, motion, and dynamic masks, enabling robust camera pose refinement and temporally consistent HDR reconstruction.
- Quantitative benchmarks demonstrate superior performance in rendering quality and speed, with notable improvements in PSNR, SSIM, LPIPS, and HDR-TAE over prior methods.
High Dynamic Range 4D Gaussian Splatting from Alternating-Exposure Monocular Videos
Introduction and Motivation
Mono4DGS-HDR introduces a unified framework for reconstructing renderable 4D HDR scenes from unposed monocular LDR videos with alternating exposures. The method addresses the practical scenario where a dynamic HDR scene is captured with a single handheld camera, without access to camera poses or a multi-camera rig. Previous HDR novel view synthesis (NVS) approaches have been limited to static scenes or require known camera parameters, and existing dynamic scene methods are not robust to alternating-exposure inputs. Mono4DGS-HDR is the first to tackle this challenging task, leveraging a two-stage optimization based on Gaussian Splatting and integrating vision foundation model priors for initialization and regularization.
Figure 1: Mono4DGS-HDR reconstructs high-quality 4D HDR scenes from unposed monocular LDR videos with alternating exposures, outperforming direct extensions of prior methods.
Methodology
Vision Foundation Model Priors
The framework begins by extracting 2D priors from vision foundation models: video depth estimation, sparse long-term 2D pixel trajectories, and dynamic masks derived from epipolar error maps. These priors are computed with off-the-shelf models (DepthCrafter for video depth, SpatialTracker for trajectories, and RAFT optical flow for the epipolar error maps) and remain effective even with alternating-exposure inputs, though they are noisy and incomplete. Bundle adjustment over the static tracklets yields initial camera parameters.
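The paper's exact mask formulation is not reproduced here, but the idea of flagging pixels whose flow correspondences violate the dominant (static) epipolar geometry can be sketched as follows; the Sampson-distance threshold, the RANSAC subsampling, and the array shapes are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
import cv2

def dynamic_mask_from_flow(flow, thresh=2.0, n_samples=5000):
    """Rough dynamic mask from a dense flow field of shape (H, W, 2).

    A fundamental matrix is fit to the (mostly static) correspondences with
    RANSAC; pixels with large Sampson epipolar error are marked dynamic.
    Generic sketch, not the paper's exact formulation.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts0 = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    pts1 = pts0 + flow.reshape(-1, 2).astype(np.float32)

    # Estimate F from a random subset to keep RANSAC fast.
    idx = np.random.choice(len(pts0), min(n_samples, len(pts0)), replace=False)
    F, _ = cv2.findFundamentalMat(pts0[idx], pts1[idx], cv2.FM_RANSAC, 1.0, 0.999)

    # Per-pixel Sampson distance to the estimated epipolar geometry.
    ones = np.ones((len(pts0), 1), np.float32)
    p0 = np.hstack([pts0, ones])          # homogeneous points, frame t
    p1 = np.hstack([pts1, ones])          # homogeneous points, frame t+1
    Fp0 = p0 @ F.T                        # epipolar lines in frame t+1
    Ftp1 = p1 @ F                         # epipolar lines in frame t
    num = np.sum(p1 * Fp0, axis=1) ** 2
    den = Fp0[:, 0] ** 2 + Fp0[:, 1] ** 2 + Ftp1[:, 0] ** 2 + Ftp1[:, 1] ** 2
    sampson = num / np.maximum(den, 1e-8)
    return (sampson > thresh).reshape(h, w)
```

A mask of this kind is what separates the static tracklets used for bundle adjustment from the dynamic content handled by the deformable Gaussians.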
Two-Stage Gaussian Optimization
Mono4DGS-HDR employs a two-stage optimization strategy:
- Video HDR Gaussian Training: Dynamic HDR Gaussians are learned in an orthographic camera coordinate space, eliminating the need for camera poses. This stage enables robust initial HDR video reconstruction and provides consistent brightness across frames, facilitating reliable camera pose optimization via photometric reprojection error.
- Video-to-World Gaussian Transformation and World Gaussian Fine-Tuning: The learned video Gaussians are transformed into world space using the initial camera parameters (a minimal sketch of the rigid part of this transformation follows this list). The transformation comprises dynamic/static identification with occlusion handling, attribute transformation (position, rotation, scaling), and re-fitting of Gaussian scaling based on 2D covariance invariance. The resulting world Gaussians are then jointly optimized with the camera poses, leveraging an HDR photometric reprojection loss for dense supervision.
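For the rigid part of the attribute transformation only (ignoring the orthographic-to-perspective adjustment and the 2D-covariance-based scaling re-fit), a minimal PyTorch sketch could look like the following; tensor layouts, the quaternion convention, and the function names are assumptions:

```python
import torch
import torch.nn.functional as F

def quat_to_rotmat(q):
    """Unit quaternions (N, 4) in (w, x, y, z) order -> rotation matrices (N, 3, 3)."""
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)

def video_to_world(means_cam, quats_cam, R_c2w, t_c2w):
    """Rigidly move Gaussians from the per-frame camera (video) space to world space.

    means_cam: (N, 3) Gaussian centers, quats_cam: (N, 4) orientations,
    R_c2w: (3, 3) camera-to-world rotation, t_c2w: (3,) translation.
    """
    means_world = means_cam @ R_c2w.T + t_c2w
    rots_world = R_c2w.unsqueeze(0) @ quat_to_rotmat(F.normalize(quats_cam, dim=-1))
    return means_world, rots_world
```

In the paper, the subsequent scaling re-fit compensates for the change of projection model so that the transformed Gaussians reproduce the same 2D covariances they had in video space.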
Figure 2: Overview of Mono4DGS-HDR, showing 2D prior extraction, two-stage Gaussian optimization, and transformation from video to world Gaussians.
Figure 3: Video-to-world Gaussian transformation strategy, including dynamic/static identification, attribute transformation, and scaling re-fitting.
Temporal Luminance Regularization
To ensure temporal consistency of HDR appearance, a temporal luminance regularization (TLR) strategy is introduced. TLR uses a flow-guided photometric loss to align per-pixel HDR irradiance between consecutive frames, propagating well-supervised dynamic content to poorly supervised time steps and stabilizing appearance variations.
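The exact loss form is not reproduced here; a minimal PyTorch sketch of the idea, warping the rendered HDR irradiance of frame t+1 back to frame t with optical flow and penalizing the log-luminance difference (the log-domain L1 and bilinear warping are assumptions), could read:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img, flow):
    """Backward-warp img (B, C, H, W) by a dense flow field (B, 2, H, W) in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij")
    gx = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)          # (B, H, W, 2), normalized coords
    return F.grid_sample(img, grid, align_corners=True)

def temporal_luminance_loss(hdr_t, hdr_t1, flow_t_to_t1, eps=1e-5):
    """L1 between the log irradiance of frame t and frame t+1 warped back to t."""
    hdr_t1_warped = warp_with_flow(hdr_t1, flow_t_to_t1)
    return (torch.log(hdr_t + eps) - torch.log(hdr_t1_warped + eps)).abs().mean()
```

In practice, occluded or unreliable regions would typically be masked out, e.g., via a forward-backward flow consistency check, before averaging the loss.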
Loss Functions and Optimization
The overall objective combines LDR RGB loss, depth loss, flow/track loss, unit exposure loss, Gaussian motion regularization (as-rigid-as-possible, velocity, acceleration), temporal luminance regularization, and HDR photometric reprojection loss. Gaussian densification is performed in both stages, with dynamic Gaussian pruning in the second stage.
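Written out as a weighted sum (the weight symbols below are placeholders, not the paper's notation), the overall objective has the form

$$
\mathcal{L}_{\text{total}} = \lambda_{\text{rgb}}\mathcal{L}_{\text{rgb}} + \lambda_{\text{d}}\mathcal{L}_{\text{depth}} + \lambda_{\text{f}}\mathcal{L}_{\text{flow}} + \lambda_{\text{e}}\mathcal{L}_{\text{exp}} + \lambda_{\text{m}}\mathcal{L}_{\text{motion}} + \lambda_{\text{tlr}}\mathcal{L}_{\text{TLR}} + \lambda_{\text{rp}}\mathcal{L}_{\text{reproj}},
$$

where $\mathcal{L}_{\text{motion}}$ bundles the as-rigid-as-possible, velocity, and acceleration regularizers, and $\mathcal{L}_{\text{reproj}}$ is the HDR photometric reprojection loss used for dense supervision in the second stage.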
Experimental Results
Benchmark and Evaluation
A new benchmark is constructed from publicly available datasets, comprising synthetic and real-world dynamic scenes with alternating-exposure LDR videos. Evaluation metrics include PSNR, SSIM, LPIPS, and HDR-TAE for temporal stability.
Mono4DGS-HDR demonstrates superior performance over adapted baselines (GaussHDR, HDR-HexPlane, MoSca, SplineGS, GFlow) in both rendering quality and speed. Notably, it achieves higher PSNR, SSIM, lower LPIPS, and lower HDR-TAE across all tracks. GaussHDR and HDR-HexPlane are limited by their inability to handle dynamic scenes and monocular videos, while GFlow and SplineGS fail to recover HDR scenes due to their reliance on per-frame optimization and photometric reprojection loss, respectively.
Figure 4: HDR visual comparisons on train/test frames, showing superior quality of Mono4DGS-HDR.
Figure 5: Additional HDR visual comparisons on train/test frames, further demonstrating quality improvements.
Figure 6: HDR visual comparisons under fix-view-change-time setting.
Figure 7: HDR visual comparisons under fix-time-change-view setting.
Ablation Studies
Ablation experiments validate the effectiveness of the key components, including the two-stage optimization, the video-to-world Gaussian transformation with scaling re-fitting, and the temporal luminance regularization.
Implementation Details
The system is implemented in PyTorch, optimized with Adam, and trained on a single RTX 3090 GPU. The pipeline integrates codebases from 3DGS, MoSca, SaV, GaussHDR, SpatialTracker, DepthCrafter, and RAFT. Gaussian attributes are initialized from 2D priors, and dynamic Gaussian trajectories are modeled using cubic Hermite splines. Tone-mapping is performed via MLPs, and loss weights are empirically tuned for stability and convergence.
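As an illustration of the trajectory model, a cubic Hermite spline can be evaluated per Gaussian as below; the uniform knot spacing, the tangent parameterization, and the tensor shapes are assumptions rather than the authors' implementation:

```python
import torch

def hermite_eval(p0, p1, m0, m1, u):
    """Cubic Hermite interpolation between knot positions p0, p1 (N, 3)
    with tangents m0, m1 (N, 3) at normalized time u in [0, 1]."""
    u2, u3 = u * u, u ** 3
    h00 = 2 * u3 - 3 * u2 + 1
    h10 = u3 - 2 * u2 + u
    h01 = -2 * u3 + 3 * u2
    h11 = u3 - u2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

def gaussian_positions(knots, tangents, t, t0, dt):
    """Evaluate per-Gaussian centers at time t from spline knots/tangents (K, N, 3),
    assuming uniformly spaced knots starting at t0 with spacing dt."""
    k = int((t - t0) // dt)
    k = max(0, min(k, knots.shape[0] - 2))   # clamp to a valid knot interval
    u = (t - t0 - k * dt) / dt
    return hermite_eval(knots[k], knots[k + 1], tangents[k], tangents[k + 1], u)
```

The knots and tangents are optimized per dynamic Gaussian, so evaluating the spline at any query time yields a temporally smooth trajectory for rendering.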
Limitations and Future Directions
Mono4DGS-HDR relies on the quality of 2D priors; inaccuracies in depth, flow, or track estimation can lead to suboptimal geometry and motion reconstruction. Dynamic mask extraction from optical flow may imperfectly separate static and dynamic content. The method does not address image blur from fast motion, though recent works (Deblur4DGS, Casual3DHDR) suggest possible extensions for simultaneous deblurring and HDR reconstruction.
Implications and Prospects
Mono4DGS-HDR advances the state-of-the-art in monocular 4D HDR scene reconstruction, enabling novel-view rendering and temporally consistent HDR video synthesis from casually captured, unposed LDR videos. The two-stage optimization and integration of vision foundation model priors provide a robust framework for dynamic scene understanding under challenging exposure conditions. Future work may focus on improving prior extraction, handling motion blur, and extending the approach to more complex exposure patterns and real-time applications.
Conclusion
Mono4DGS-HDR establishes a new paradigm for HDR dynamic scene reconstruction from unposed monocular videos with alternating exposures. Its unified two-stage Gaussian optimization, video-to-world transformation, and temporal regularization yield state-of-the-art results in both quality and speed, with strong implications for practical HDR video synthesis and dynamic scene modeling.