OmniDepth: 360° Monocular Depth Estimation

Updated 4 July 2026

OmniDepth is a learning-based system for dense monocular depth estimation on 360° equirectangular panoramas, addressing spherical distortions and global context.
It introduces the 360D dataset, a large rendered corpus combining synthetic and scanned indoor scenes, to facilitate supervised training.
Its two architectures, UResNet and RectNet, predict radial depth maps directly from the panorama and outperform perspective-trained models on the same inputs, with RectNet adding a distortion-aware design.

Works cited:

"OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas" (Zioulis et al., 2018)
"OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion" (Li et al., 2022)
"360MonoDepth: High-Resolution 360° Monocular Depth Estimation" (Rey-Area et al., 2021)
"Cross-Domain Synthetic-to-Real In-the-Wild Depth and Normal Estimation for 3D Scene Understanding" (Bhanushali et al., 2022)
"Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model" (Endres et al., 30 Mar 2025)

OmniDepth is a learning-based system for dense monocular depth estimation on omnidirectional indoor panoramas in equirectangular format, together with a large rendered dataset, 360D, for supervised training on 360° imagery. It addresses a specific failure mode of monocular depth models trained on projective images: such models assume a pinhole projection with uniform sampling and local context within a limited field of view, whereas equirectangular panoramas represent the entire sphere, exhibit latitude-dependent distortions, preserve wrap-around continuity at the left and right borders, and require global context for effective reasoning. In this setting, OmniDepth introduced direct end-to-end prediction on spherical panoramas rather than treating 360° images as a sequence of perspective views or cubemap faces, and it established a baseline for subsequent omnidirectional depth research (Zioulis et al., 2018).

1. Problem setting and conceptual scope

OmniDepth was introduced to close the gap between monocular depth estimation for perspective images and dense depth estimation for omnidirectional panoramas. The core claim is empirical as well as architectural: state-of-the-art monocular depth models trained on projective images produce suboptimal results on equirectangular inputs, even when the spherical image is split into perspective faces via cubemap projection. This is attributed to the mismatch between conventional convolutional assumptions and spherical image structure, especially nonuniform distortion toward the poles, seam continuity, and the need for broader spatial context than narrowly local receptive fields can capture (Zioulis et al., 2018).

The system is framed as dense monocular depth estimation directly in the 360° domain. In the terminology of the original work, the input is a single 512×256 equirectangular RGB panorama and the output is a dense depth map of the same size. The task is restricted to indoor scenes, and the depth variable is the radial distance from the camera center to the visible surface point along the viewing ray, rather than the pinhole $z$-buffer quantity used in conventional perspective rendering.
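The difference between radial distance and $z$-buffer depth can be made concrete for an ordinary pinhole camera. The sketch below is illustrative only and is not taken from the paper; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced for the example. It uses the fact that a pixel with normalized coordinates $(x, y)$ and $z$-depth $z$ lies at distance $z\sqrt{1 + x^2 + y^2}$ from the camera center.

```python
import numpy as np

def z_depth_to_radial(z, fx, fy, cx, cy):
    """Convert a pinhole z-buffer depth map of shape (H, W) to radial distance.

    fx, fy, cx, cy are hypothetical pinhole intrinsics in pixels.
    """
    h, w = z.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx  # normalized image coordinates
    y = (v - cy) / fy
    return z * np.sqrt(1.0 + x ** 2 + y ** 2)

# A flat wall 2 m in front of the camera: z is constant everywhere,
# but the radial distance grows away from the principal point.
z = np.full((3, 3), 2.0)
r = z_depth_to_radial(z, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
# r[1, 1] is 2.0 at the principal point; r[0, 0] is 2*sqrt(3), about 3.46
```

On a full sphere no single image plane exists, which is why the radial quantity is the natural one to store for panoramas.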

This formulation placed OmniDepth among contemporaneous 360° vision methods that had emphasized stereo spherical matching, layout estimation, saliency, or classification, often through cubemap projections or spectral spherical CNNs. Its defining distinction was to demonstrate dense monocular depth estimation trained and executed directly on equirectangular panoramas, with distortion-aware design and large-scale synthetic-plus-scan training data (Zioulis et al., 2018).

2. Dataset construction and rendering pipeline

A central component of OmniDepth is the 360D dataset, created by rendering spherical panoramas from existing indoor 3D datasets and pairing RGB panoramas with ground-truth omnidirectional depth maps. The stated motivation was practical: acquiring high-quality 360° RGB-D data is difficult because of sensor occlusions, low resolution and visible scanners, and inconsistent stereo baselines. The dataset therefore re-used two synthetic CAD datasets, SunCG and SceneNet, and two realistic scanned datasets, Stanford2D3D and Matterport3D, rendered with the Cycles path tracer (Zioulis et al., 2018).

The rendering procedure placed a spherical camera and a uniform point light at the same position $c$ in the scene. For CAD houses, the camera and light were placed at the center of each house; for scans, camera poses came from the scanning process, yielding multiple panoramas per building. Data augmentation was performed by rotating the camera in $90^\circ$ steps, generating 4 distinct viewpoints per pose. In total, 11,118 SunCG houses were used, corresponding to 24.36% of the full set, and all scenes from Stanford2D3D and Matterport3D were rendered, yielding 94,098 panorama renders and 23,524 unique viewpoints. After filtering, the final train set comprised 34,679 RGB panoramas with depth, while SceneNet was used entirely as validation and the initial held-out test set contained 1,298 samples (Zioulis et al., 2018).

The dataset also encoded invalid regions explicitly. Missing geometry in scans or CAD scenes produced holes or infinity-valued depths, which were marked with a binary mask $M(p)$. Scenes were filtered to exclude samples with more than 5% of pixels having depth greater than 20 m or less than 0.5 m. Depth values were stored in meters. The dataset was released publicly at http://vcl.iti.gr/360-dataset/.
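The masking and filtering rules can be sketched in a few lines of NumPy. This is a minimal illustration, not the dataset's released tooling: the function names are hypothetical, and it assumes that holes are stored as zero or infinite depth and that such pixels count toward the 5% out-of-range budget.

```python
import numpy as np

def valid_mask(depth):
    """Binary mask M(p): 1 where depth is finite and positive, 0 elsewhere.

    Assumption: holes are encoded as 0 or +inf in the depth map.
    """
    return (np.isfinite(depth) & (depth > 0)).astype(np.uint8)

def keep_sample(depth, near=0.5, far=20.0, max_fraction=0.05):
    """Return True if at most max_fraction of pixels lie outside [near, far] meters."""
    out_of_range = (depth > far) | (depth < near)
    return bool(out_of_range.mean() <= max_fraction)

# Toy 10x10 depth map in meters with 4 missing pixels (4% of the image).
depth = np.full((10, 10), 3.0)
depth[0, :4] = np.inf
mask = valid_mask(depth)
accepted = keep_sample(depth)

# With 6 missing pixels (6%) the sample would be rejected.
depth_bad = depth.copy()
depth_bad[0, :6] = np.inf
rejected = not keep_sample(depth_bad)
```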

Component	Specification	Role
3D sources	SunCG, SceneNet, Stanford2D3D, Matterport3D	Synthetic and scanned indoor scenes
Rendering engine	Cycles path tracer	Panorama and depth generation
Input/output resolution	512×256	Equirectangular RGB and depth
Total renders	94,098	Rendered panoramas
Unique viewpoints	23,524	Distinct camera poses
Final train set	34,679	RGB panoramas with depth

The scale of 360D was significant in context because it was considerably larger than similar projective datasets, and it provided a supervised training resource specifically matched to spherical imagery rather than repurposed perspective supervision.

3. Spherical geometry and depth representation

OmniDepth adopts the conventional equirectangular-to-sphere parameterization, but it makes explicit that the stored depth is radial distance. Let the equirectangular image have width $W$ and height $H$. Longitude $\lambda \in [-\pi,\pi)$ and latitude $\phi \in [-\pi/2,\pi/2]$ map to pixel coordinates

$u = W \cdot \frac{\lambda + \pi}{2\pi}, \qquad v = H \cdot \frac{\phi + \pi/2}{\pi}.$

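Conversely, given pixel $(u,v)$,

$\lambda = 2\pi \frac{u}{W} - \pi, \qquad \phi = \pi \frac{v}{H} - \frac{\pi}{2}.$

The unit direction vector is

$d(\lambda,\phi) = (\cos\phi \cos\lambda,\ \cos\phi \sin\lambda,\ \sin\phi)^\top,$

written here with the third axis pointing up; other axis conventions permute or negate the components. A visible 3D point along the ray is then

$P(u,v) = c + D(u,v)\, d(\lambda,\phi),$

where $c$ is the camera center and $D(u,v)$ is the ground-truth depth stored in the spherical depth buffer (Zioulis et al., 2018).

This distinction matters because the depth map is not a perspective depth image reparameterized onto the sphere. A pixel corresponds to a 3D point via a radial ray from the camera center. In practical terms, the model estimates the full-sphere field of radial distances, and invalid regions are excluded by the mask $M(p)$ during both training and evaluation.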
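The chain from pixel to longitude and latitude, then to a unit ray, then to a 3D point can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the paper: the function names are hypothetical, the axis convention (third axis up, longitude zero along the first axis) is one common choice, and sampling at pixel centers with a half-pixel offset is an assumption.

```python
import numpy as np

def pixel_to_lonlat(u, v, width, height):
    """Invert u = W*(lon + pi)/(2*pi) and v = H*(lat + pi/2)/pi."""
    lon = 2.0 * np.pi * u / width - np.pi
    lat = np.pi * v / height - np.pi / 2.0
    return lon, lat

def lonlat_to_direction(lon, lat):
    """Unit ray direction; assumed convention: third axis up."""
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

def backproject(depth, camera_center=(0.0, 0.0, 0.0)):
    """Lift an (H, W) radial depth map to an (H, W, 3) point map."""
    h, w = depth.shape
    # Assumption: each pixel is sampled at its center, hence the 0.5 offset.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    lon, lat = pixel_to_lonlat(u, v, w, h)
    rays = lonlat_to_direction(lon, lat)
    return np.asarray(camera_center) + depth[..., None] * rays

# A constant radial depth of 2 m back-projects to a sphere of radius 2.
points = backproject(np.full((256, 512), 2.0))
radii = np.linalg.norm(points, axis=-1)
```

Because the depth is radial, every back-projected point lies exactly `depth` meters from the camera center, regardless of latitude.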

The representation also exposes an important limitation of standard CNNs. Although equirectangular images provide a dense rectangular tensor, their sampling density is not uniform over the sphere. Every image row spans the full $2\pi$ of longitude, but the circle of latitude it samples shrinks in proportion to $\cos\phi$, so pixels near the poles cover far less of the sphere than pixels near the equator and scene content there appears horizontally stretched. OmniDepth therefore treated spherical distortions as a first-class architectural concern rather than as a nuisance to be absorbed by generic convolution alone.
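The non-uniform sampling can be quantified by the solid angle each pixel covers. The sketch below is an illustration rather than part of OmniDepth; the function name is hypothetical. It uses the exact area of a latitude band, $\Delta\lambda\,(\sin\phi_{\text{top}} - \sin\phi_{\text{bottom}})$, and the per-pixel values sum to the full sphere, $4\pi$ steradians.

```python
import numpy as np

def row_solid_angles(width, height):
    """Solid angle (steradians) covered by one pixel in each image row."""
    lat_edges = np.linspace(-np.pi / 2.0, np.pi / 2.0, height + 1)
    dlon = 2.0 * np.pi / width
    # Exact area of a longitude/latitude cell on the unit sphere.
    return dlon * np.diff(np.sin(lat_edges))

omega = row_solid_angles(512, 256)
total = omega.sum() * 512            # all pixels together cover the sphere
ratio = omega[128] / omega[0]        # equator row versus pole row
```

At 512×256, a pixel in an equatorial row covers more than a hundred times the solid angle of a pixel in the outermost polar row, which is the imbalance that distortion-aware filters try to compensate for.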

4. Architectures and supervised optimization

OmniDepth introduced two fully convolutional encoder-decoders: UResNet and RectNet. Both use ELU activations throughout, avoid batch normalization, and predict dense depth maps from a single 512×256 equirectangular RGB panorama. UResNet is an unbalanced ResNet-style encoder-decoder with skip connections and a shallower decoder. The encoder contains two input convolution blocks followed by four downscaling residual blocks, and the decoder contains one upscaling block and three up-prediction blocks that generate multi-scale depth predictions. Its effective receptive field is 190×190. RectNet, by contrast, was designed specifically for spherical panoramas, using latitude-adaptive preprocessing, limited downsampling, and dilated convolutions to enlarge the receptive field to approximately half of the input dimensions, reaching 266×276 (Zioulis et al., 2018).

RectNet’s early preprocessing blocks concatenate outputs from square and rectangular convolution filters, with rectangle aspect ratios varying across rows to respect latitude-dependent distortion while preserving filter area and total output channel count. The network reduces spatial resolution by only 4× in order to retain dense features. Within its dilation blocks, 1×1 convolutions decorrelate spatial features and increase capacity without adding spatial distortion sensitivity. Despite having far fewer parameters than UResNet, approximately 8.8M versus approximately 51.2M, RectNet achieved better accuracy, which the paper attributes to the larger receptive field and distortion-aware design.
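The effect of dilation and limited downsampling on receptive field can be illustrated with the standard recurrence for stacked convolutions: each layer adds `(kernel - 1) * dilation * jump` to the receptive field, where `jump` is the product of all earlier strides. The sketch below is generic; the layer lists are hypothetical and do not reproduce the actual UResNet or RectNet configurations or their reported receptive fields.

```python
def receptive_field(layers):
    """Theoretical receptive field along one axis.

    layers: iterable of (kernel_size, stride, dilation) tuples, in order.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Hypothetical stacks of three 3-wide convolutions each.
plain = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])
strided = receptive_field([(3, 2, 1), (3, 2, 1), (3, 1, 1)])
```

In this toy comparison the dilated stack reaches the same receptive field as the strided one without reducing spatial resolution, which is the trade-off the RectNet description points to.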

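Training is fully supervised with perfect rendered ground-truth depth $D(p)$, with invalid or infinite regions ignored by the mask $M(p)$. Writing $\hat{D}(p)$ for the predicted depth, the depth and smoothness terms are

$E_{\text{depth}}(p) = \lVert D(p) - \hat{D}(p) \rVert^2,$

$E_{\text{smooth}}(p) = \lVert \nabla \hat{D}(p) \rVert^2.$

The final masked multi-scale objective is

$E_{\text{loss}}(p) = \sum_i \alpha_i M(p) E_{\text{depth}}(p) + \sum_i \beta_i M(p) E_{\text{smooth}}(p),$

where $i$ indexes the prediction scales and $\alpha_i$, $\beta_i$ are the per-scale weights of the depth and smoothness terms. UResNet and RectNet each use their own set of loss weights. Optimization used Caffe, a single NVIDIA Titan X GPU, Xavier initialization, and ADAM with default parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ and initial learning rate 0.0002 (Zioulis et al., 2018).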
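A masked multi-scale objective of this kind can be sketched in NumPy as follows. This is a minimal illustration, not the paper's Caffe implementation: it assumes a squared-error depth term and a squared-gradient smoothness term, uses forward differences for the gradient, normalizes by the number of valid pixels, and takes placeholder weights that are not the values used in the paper.

```python
import numpy as np

def depth_term(pred, gt):
    """Per-pixel squared depth error (assumed form of the depth term)."""
    return (gt - pred) ** 2

def smoothness_term(pred):
    """Per-pixel squared gradient magnitude using forward differences."""
    gx = np.zeros_like(pred)
    gy = np.zeros_like(pred)
    gx[:, :-1] = pred[:, 1:] - pred[:, :-1]
    gy[:-1, :] = pred[1:, :] - pred[:-1, :]
    return gx ** 2 + gy ** 2

def masked_multiscale_loss(preds, gts, masks, alphas, betas):
    """Sum over scales of masked, weighted depth and smoothness terms."""
    total = 0.0
    for pred, gt, mask, alpha, beta in zip(preds, gts, masks, alphas, betas):
        valid = mask.astype(bool)
        count = max(int(valid.sum()), 1)
        # Zero out invalid ground truth first so inf never enters the arithmetic.
        gt_safe = np.where(valid, gt, 0.0)
        per_pixel = alpha * depth_term(pred, gt_safe) + beta * smoothness_term(pred)
        total += float(np.where(valid, per_pixel, 0.0).sum()) / count
    return total

# Single-scale toy example with one invalid (infinite) ground-truth pixel.
gt = np.full((4, 4), 2.0)
gt[0, 0] = np.inf
mask = np.isfinite(gt)
loss_perfect = masked_multiscale_loss([np.full((4, 4), 2.0)], [gt], [mask], [1.0], [0.1])
loss_offset = masked_multiscale_loss([np.full((4, 4), 3.0)], [gt], [mask], [1.0], [0.1])
```

Masking before summation keeps holes and infinite depths from contributing to the loss, so the network is never penalized for regions without ground truth.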

Architecture	Design emphasis	Receptive field	Parameters
UResNet	ResNet-style encoder-decoder, multi-scale prediction	190×190	~51.2M
RectNet	Distortion-aware preprocessing, dilations, limited downsampling	266×276	~8.8M

The architectural contrast became one of the enduring lessons of the work: for omnidirectional depth estimation, receptive field structure and distortion handling mattered more than depth or parameter count alone.

5. Evaluation protocols and quantitative results

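Evaluation used standard depth metrics (Abs Rel, Sq Rel, RMSE, RMSE(log), and threshold accuracies $\delta < 1.25$, $\delta < 1.25^2$, and $\delta < 1.25^3$), with invalid pixels masked by $M(p)$. For cross-method comparison where predictions came from models trained at different scales, median scaling was applied using

$\tilde{D}(p) = s\,\hat{D}(p), \qquad s = \frac{\operatorname{median}(D)}{\operatorname{median}(\hat{D})},$

where $D$ is the ground-truth depth and $\hat{D}$ the predicted depth.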
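The metrics and the median-scaling step can be sketched as below. The definitions follow the usual conventions of the monocular depth literature; the function name and dictionary keys are choices made for this example rather than anything prescribed by the paper.

```python
import numpy as np

def depth_metrics(pred, gt, mask, median_scale=False):
    """Standard depth metrics over valid pixels, with optional median scaling."""
    valid = mask.astype(bool)
    p, g = pred[valid], gt[valid]
    if median_scale:
        # Align the prediction's scale to the ground truth before scoring.
        p = p * (np.median(g) / np.median(p))
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "sq_rel": float(np.mean((p - g) ** 2 / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

# A prediction that is correct up to a global factor of 2.
gt = np.array([[1.0, 2.0], [4.0, np.inf]])
mask = np.isfinite(gt)
pred = 2.0 * gt
raw = depth_metrics(pred, gt, mask)
scaled = depth_metrics(pred, gt, mask, median_scale=True)
```

In the toy example the unscaled prediction fails every threshold, while median scaling removes the global factor and yields zero error, which is why the step matters when comparing models trained at different depth scales.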

The paper reported results on the held-out test set, on SceneNet as unseen synthetic validation, and under an “S2R” protocol that trains on synthetic SunCG, fine-tunes on Matterport3D, and evaluates on Stanford2D3D and SceneNet (Zioulis et al., 2018).

On the test set after 10 epochs with the full train set, UResNet obtained Abs Rel 0.0835, Sq Rel 0.0416, RMSE 0.3374, RMSE(log) 0.1204, $\delta < 1.25$ 0.9319, $\delta < 1.25^2$ 0.9889, and $\delta < 1.25^3$ 0.9968. RectNet improved these values to Abs Rel 0.0702, Sq Rel 0.0297, RMSE 0.2911, RMSE(log) 0.1017, $\delta < 1.25$ 0.9574, $\delta < 1.25^2$ 0.9933, and $\delta < 1.25^3$ 0.9979. On SceneNet, RectNet again outperformed UResNet, with Abs Rel 0.1077 versus 0.1218 and RMSE 0.3572 versus 0.4066. Under the S2R protocol on Stanford2D3D, RectNet-S2R reached Abs Rel 0.0824 and RMSE 0.3998, compared with UResNet-S2R at 0.1226 and 0.4890 (Zioulis et al., 2018).

The paper also compared OmniDepth against projective-image monocular baselines applied directly to equirectangular images and to cubemap faces. Directly on equirectangular images, the baselines reported much worse errors: Godard et al. had Abs Rel 0.4747, Laina et al. 0.3181, and Liu et al. 0.4202. Even under cubemap projection, which was more favorable to perspective-trained models, OmniDepth remained superior. The per-face evaluation showed the same pattern: RectNet achieved Abs Rel 0.0080 and RMSE 0.1113, versus 0.0453 and 1.6559 for Godard et al., 0.0300 and 0.3152 for Laina et al., and 0.0312 and 0.3048 for Liu et al. These results were used to argue that training directly in the 360 domain, with architectures tailored to equirectangular distortion and global context, was more effective than adapting perspective models post hoc (Zioulis et al., 2018).

Qualitatively, OmniDepth also generalized to unseen real 360° panoramas from the Sun360 “Room” and “Indoors” splits, producing plausible depth maps. Among the projective baselines, only FCRN, associated in the paper with Laina et al., yielded somewhat reasonable outputs, whereas the others suffered from severe distortions and lack of global reasoning.

6. Limitations, applications, and later developments

OmniDepth was restricted to indoor scenes, and its rendering assumptions introduced several realism gaps: camera vertical alignment, constant lighting, and baked illumination in scans reduced variability. Scan and CAD imperfections caused holes that required masking. Extreme latitudes remained difficult because distortions are strongest there, and wrap-around continuity was only implicit in the equirectangular representation, since standard spatial convolutions did not enforce circular padding at seams (Zioulis et al., 2018).
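For illustration, the seam limitation is usually addressed by padding the panorama circularly along the horizontal axis before convolution. The sketch below shows the idea with `numpy.pad`; it is not part of OmniDepth, the function name is hypothetical, and replicating the top and bottom rows is a simplification, since a true continuation across a pole would also shift longitude by half the image width.

```python
import numpy as np

def pad_equirect(image, pad):
    """Pad an (H, W) or (H, W, C) panorama: wrap columns, replicate rows."""
    extra = [(0, 0)] * (image.ndim - 2)
    # Horizontal wrap makes the left and right borders neighbors.
    wrapped = np.pad(image, [(0, 0), (pad, pad)] + extra, mode="wrap")
    # Vertical edge replication is a simplification at the poles.
    return np.pad(wrapped, [(pad, pad), (0, 0)] + extra, mode="edge")

img = np.arange(12).reshape(3, 4)
padded = pad_equirect(img, 1)
# The new leftmost column repeats the original rightmost column, and vice versa.
```

With this padding, a filter centered on the first column sees the last column as its left neighbor, so features are computed consistently across the seam.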

The paper nevertheless identified several application domains for dense omnidirectional depth: VR and AR scene understanding, including novel view synthesis, stereoscopic rendering, and 3D object compositing; indoor robotics and navigation; and 3D mapping in omnidirectional settings. Its future directions included unsupervised learning via view synthesis tailored to 360° videos, GAN-based realism enhancement or adversarial domain adaptation for synthetic-to-real transfer, and broader outdoor datasets.

Subsequent work expanded this line of research in several distinct directions. “360MonoDepth” (Rey-Area et al., 2021) revisited the problem through tangent images to support 2K and 4K panoramas, using deformable multi-scale alignment and gradient-domain blending. “OmniFusion” (Li et al., 2022) combined tangent-image processing, geometry-aware feature fusion, transformer-based global aggregation, and iterative refinement to mitigate spherical distortion. “Cross-Domain Synthetic-to-Real In-the-Wild Depth and Normal Estimation for 3D Scene Understanding” (Bhanushali et al., 2022) extended omnidirectional depth estimation to outdoor settings and joint depth-normal prediction using the OmniHorizon dataset and UBotNet. In stereo settings, “Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model” (Endres et al., 30 Mar 2025) integrated a pre-trained monocular depth foundation model into omnidirectional stereo matching. This suggests that OmniDepth’s lasting contribution was not only a first direct equirectangular monocular baseline, but also a formulation of omnidirectional depth estimation as a geometry-specific learning problem rather than a minor variant of perspective depth estimation.

In historical terms, OmniDepth is best understood as the point at which dense depth estimation for 360° panoramas became a dedicated research area with its own data, geometry, and architectural assumptions. Its combination of a rendered training corpus, spherical depth definition, direct equirectangular prediction, and distortion-aware design established a reference framework against which later omnidirectional depth methods continued to differentiate themselves (Zioulis et al., 2018).