360Anything: Geometry-Free Lifting of Images and Videos to 360°
Abstract: Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
Explain it Like I'm 14
What is this paper about?
This paper introduces 360Anything (called OmniDiT throughout this summary), an AI method that turns any normal photo or video (what you see through a regular camera) into a full 360° panorama. A 360° panorama lets you look all around (left, right, up, and down), like you’re standing inside the scene. OmniDiT does this without needing to know anything about the camera that took the picture or video, information that older methods usually require.
What questions does it try to answer?
- Can we convert regular photos or videos into 360° panoramas without knowing the camera’s settings (like field of view or tilt)?
- Can an AI learn the “geometry” (how parts of the scene relate in 3D) just from data, instead of using hand-crafted rules?
- How can we stop “seams” from showing up where a 360° panorama wraps around?
- Do the generated panoramas work well enough to recreate 3D scenes?
- Can the model guess the camera’s field of view and orientation (tilt/roll) even though it wasn’t trained specifically to do that?
How does it work?
Here’s the big picture: OmniDiT uses a powerful AI called a diffusion transformer to learn how to place a normal view inside a 360° scene and then “fill in” everything you can’t see.
A quick guide to 360° panoramas
- A 360° image is often stored in something called “equirectangular projection” (ERP). Think of how a globe becomes a flat world map—it stretches the sphere into a rectangle. A 360° photo is like that: the left and right edges of the rectangle connect to each other.
- Because the sides wrap around, the left and right edges have to match up exactly. Any visible line where they meet is called a “seam,” and it breaks the feeling of being inside the scene.
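To make the globe-to-map idea concrete, here is a minimal Python sketch that converts a viewing direction into a pixel position in an ERP image. The function name and the angle conventions (longitude 0° at the image center, latitude positive upward) are assumptions chosen for illustration, not details taken from the paper.

```python
def direction_to_erp_pixel(lon_deg, lat_deg, width, height):
    """Map a viewing direction to (column, row) in an equirectangular image.

    Assumed convention: longitude in [-180, 180] grows to the right,
    latitude in [-90, 90] grows upward, and the image center looks at (0, 0).
    """
    u = (lon_deg + 180.0) / 360.0      # 0..1 across the width
    v = (90.0 - lat_deg) / 180.0       # 0..1 down the height
    col = (u * width) % width          # wrap: both side edges are the same meridian
    row = min(max(v * height, 0.0), height - 1)  # clamp at the poles
    return col, row


center = direction_to_erp_pixel(0.0, 0.0, 2048, 1024)
left_edge = direction_to_erp_pixel(-180.0, 0.0, 2048, 1024)
right_edge = direction_to_erp_pixel(180.0, 0.0, 2048, 1024)
```

Looking straight ahead lands in the middle of the image, while longitudes of -180° and +180° land on the same column. That shared column is exactly where a seam can appear if the two edges do not match.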
The core idea: learning without geometry info
Older methods usually needed the camera’s details (like field of view, and the angles it was pointing) to line up the input image with the panorama. OmniDiT avoids this by learning directly from data:
- It turns both the input image/video and the target 360° panorama into sequences of small pieces called “tokens” (like cutting a picture into many tiny tiles).
- It then sticks these sequences together and uses a transformer (a type of AI that’s great at learning relationships) to figure out where the input view should go on the 360° canvas and to generate the missing parts.
- This process is done with a diffusion model, which you can think of as starting from a noisy image and gradually “cleaning” it into a sharp panorama, guided by what the input image shows.
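The "cut into tiles, then stick the sequences together" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real model works on compressed VAE latents rather than raw pixels, and the helper names `patchify` and `build_sequence`, the patch size, and the ordering (input tokens first, panorama tokens second) are hypothetical.

```python
def patchify(image, patch):
    """Cut an H x W image (list of rows) into non-overlapping patch x patch
    tiles and flatten each tile into one token (a flat list of values)."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens


def build_sequence(pers_image, noisy_pano, patch):
    """Concatenate condition tokens and target tokens into one sequence.
    Returns the sequence and the number of leading condition tokens."""
    cond = patchify(pers_image, patch)
    target = patchify(noisy_pano, patch)
    return cond + target, len(cond)


# Toy data: a 4x4 "perspective image" and a 4x8 "noisy panorama".
pers = [[r * 4 + c for c in range(4)] for r in range(4)]
pano = [[0.0] * 8 for _ in range(4)]
sequence, n_cond = build_sequence(pers, pano, patch=2)
```

Nothing in the sequence says where the input view belongs on the panorama. A transformer attending over all tokens has to work that out from the content, which is the sense in which the approach is geometry-free.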
To keep things simple and stable, the model is trained to produce “upright” panoramas (the gravity direction points down, like a level horizon). This is called making the output “canonical.” For video training, the authors first adjust (stabilize) each training clip so the “up” direction is consistent.
Avoiding ugly seams
There’s a common problem with 360° panoramas: a visible line (seam) where the left and right edges meet.
- Why it happens: Many modern image generators compress images into a smaller “latent” space using a VAE (a kind of image compressor). Inside this compressor, the edges of an image are padded with zeros (blank space). That creates a mismatch at the wrap-around boundary of a 360° image, which later shows up as seams.
- The fix (Circular Latent Encoding): Before compressing the panorama, the method wraps the image around—copying a strip from the right edge to the left, and from the left edge to the right—so the compressor “sees” a continuous image. After compression, those extra strips are dropped. This preserves smooth wrap-around and gets rid of seams at the source.
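The wrap-encode-drop recipe can be shown on a single image row. In this sketch, `toy_encoder` is a made-up stand-in for a convolutional VAE encoder (a 3-tap average with zero padding, then subsampling); it is not the real encoder, and the padding width is a free parameter here, whereas this summary reports that the paper fixes it to W/8.

```python
def toy_encoder(row, stride=2):
    """Hypothetical stand-in for a convolutional encoder on one image row:
    3-tap average with zero padding at both ends, then subsampling."""
    padded = [0.0] + list(row) + [0.0]
    blurred = [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
               for i in range(len(row))]
    return blurred[::stride]


def circular_latent_encode(row, pad, stride=2):
    """Wrap `pad` pixels around each side, encode, then drop the latent
    columns that were produced from the wrapped strips."""
    assert pad % stride == 0 and 0 < pad <= len(row)
    wrapped = list(row[-pad:]) + list(row) + list(row[:pad])
    latent = toy_encoder(wrapped, stride)
    drop = pad // stride
    return latent[drop:len(latent) - drop]


row = [5.0] * 8                      # a perfectly uniform panorama row
plain = toy_encoder(row)             # zero padding darkens the first latent
seamless = circular_latent_encode(row, pad=2)
```

For this uniform row, the plain encoding has a first value that differs from the rest, because the zeros leak in at the boundary. The circular version stays uniform and has the same length, since the zeros now only affect the strips that are thrown away.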
Training and data (in everyday terms)
- The model is fine-tuned on lots of paired examples: a 360° scene and a matching normal-view crop from it.
- The camera angles and field of view are varied during training so the model learns to handle many different views.
- For video, they also use realistic camera motions so the model learns stable, smooth panoramas over time.
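Building one training pair can be sketched as "pick a random camera, then cut the matching view out of the panorama." The sampling ranges below are the ones this summary reports for training (FoV 30° to 120°, pitch -60° to 60°, roll -15° to 15°). Everything else is an assumption for illustration: the function names, the uniform sampling, the axis and sign conventions, the square crop, and the nearest-neighbour lookup.

```python
import math
import random


def sample_camera(rng):
    """Draw one random camera within the reported training ranges."""
    return {"fov": rng.uniform(30.0, 120.0),
            "pitch": rng.uniform(-60.0, 60.0),
            "roll": rng.uniform(-15.0, 15.0)}


def pixel_to_lonlat(x, y, size, fov, pitch, roll):
    """Direction (lon, lat in degrees) seen by pixel (x, y) of a square pinhole view."""
    focal = 0.5 * size / math.tan(math.radians(fov) / 2.0)
    # Ray in camera coordinates: +x right, +y up, +z forward.
    dx, dy, dz = x + 0.5 - size / 2.0, size / 2.0 - (y + 0.5), focal
    r = math.radians(roll)   # roll: rotate about the forward axis
    dx, dy = dx * math.cos(r) - dy * math.sin(r), dx * math.sin(r) + dy * math.cos(r)
    p = math.radians(pitch)  # pitch: rotate about the right axis (positive looks up)
    dy, dz = dy * math.cos(p) + dz * math.sin(p), -dy * math.sin(p) + dz * math.cos(p)
    lon = math.degrees(math.atan2(dx, dz))
    lat = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return lon, lat


def crop_from_panorama(pano, size, cam):
    """Nearest-neighbour perspective crop from an ERP image (list of rows)."""
    height, width = len(pano), len(pano[0])
    crop = []
    for y in range(size):
        row = []
        for x in range(size):
            lon, lat = pixel_to_lonlat(x, y, size, cam["fov"], cam["pitch"], cam["roll"])
            col = int((lon + 180.0) / 360.0 * width) % width
            line = min(int((90.0 - lat) / 180.0 * height), height - 1)
            row.append(pano[line][col])
        crop.append(row)
    return crop


# Toy panorama whose "pixels" record their own (row, column) position.
pano = [[(r, c) for c in range(8)] for r in range(4)]
level = {"fov": 90.0, "pitch": 0.0, "roll": 0.0}
center = crop_from_panorama(pano, 3, level)[1][1]  # pixel looking straight ahead
random_cam = sample_camera(random.Random(0))
random_crop = crop_from_panorama(pano, 3, random_cam)
```

The panorama and the crop together form one training example. The camera values are used only to make the crop and are never shown to the model.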
What did they find, and why does it matter?
The authors tested OmniDiT on both images and videos and compared it to strong previous methods.
Main results and why they’re important:
- Better 360° images and videos: OmniDiT produces higher-quality, more realistic panoramas, often beating previous state-of-the-art methods on standard quality metrics. This means crisper details, fewer distortions, and more consistent results.
- No camera info needed: Even without camera settings, it preserves the input view well and places it correctly on the 360° canvas. This makes the method practical for “in-the-wild” content, like random phone videos without any metadata.
- Seam-free panoramas: Their circular encoding fix reduces seam artifacts clearly, improving the look of the final panoramas.
- Strong geometry understanding: Even though it wasn’t trained specifically for it, OmniDiT can estimate camera field of view and orientation fairly accurately. That shows it learned real 3D relationships from data.
- Works for 3D reconstruction: The generated 360° videos can be turned into 3D scenes (using a technique called 3D Gaussian Splatting), letting you “fly through” the scene. This suggests the model’s outputs are not just pretty—they’re also geometrically consistent.
What could this lead to?
- Easier virtual world creation: Turning everyday photos and videos into full 360° panoramas can help build immersive VR/AR spaces, games, and simulations much faster.
- Better tools for filmmakers and creators: You can expand a narrow shot into a whole environment without special cameras or careful setup.
- Useful for robots and mapping: 360° views help robots and drones understand their surroundings; doing this without camera info makes real-world use simpler.
- Stronger 3D vision research: The idea of learning geometry “implicitly” (without handcrafted rules) may inspire similar progress in other 3D tasks.
In short, OmniDiT shows that with enough data and the right architecture, an AI can learn to “think in 3D” and generate high-quality 360° panoramas from ordinary media—reliably, seamlessly, and without needing camera details.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following points identify what remains missing, uncertain, or unexplored in the paper, phrased to be actionable for future research:
- Training-time reliance on external geometric tools: The “geometry-free” approach still depends on COLMAP and GeoCalib to canonicalize training videos. The impact of failures or noise in these tools on model learning and inference is not quantified. Explore self-supervised or joint learning approaches that estimate gravity and pose without external SfM.
- Sensitivity to canonicalization errors: No analysis of how inaccuracies in per-frame pose or gravity estimation during preprocessing affect downstream video generation quality and temporal stability. Establish error tolerance and robustness bounds.
- Unclear handling of yaw: Camera orientation evaluation reports pitch and roll but omits yaw, although yaw governs placement of the perspective input on the 360° canvas. Develop and evaluate yaw estimation protocols for single images and videos and quantify ambiguity under limited FoV.
- Camera parameter inference via brute-force search: FoV and pose are inferred with exhaustive search over the generated panorama, which is compute-heavy and non-differentiable. Investigate explicit camera-parameter heads, differentiable alignment losses, or amortized inference to replace search.
- Limited camera parameter range: Training samples FoV in [30°, 120°], pitch in [−60°, 60°], roll in [−15°, 15°]. Extrapolation beyond these ranges (e.g., ultra-wide/telephoto FoVs, extreme tilts) is untested. Assess robustness and retraining needs for out-of-range inputs.
- Single-view conditioning only: The method is demonstrated with one perspective image or a single monocular video. Extend to multi-view or mixed-modal conditioning (e.g., multiple images, sparse video clips) and study how the model fuses inconsistent views.
- ERP-only representation: The approach targets ERP despite known vertical distortion near poles. Investigate hybrid or adaptive representations (ERP + cubemap/spherical harmonics) and quantify trade-offs in distortion, seam handling, and learning stability.
- Seam handling limited to horizontal wrap-around: Circular Latent Encoding addresses horizontal seams but not potential artifacts at vertical boundaries/poles. Analyze residual artifacts at the top/bottom, and design pole-aware encoders/decoders.
- VAE dependency and padding choice: CLE is a workaround for zero-padding in a pre-trained convolutional VAE; the VAE itself is not retrained with circular convolutions or boundary-aware padding. Evaluate retraining VAEs with circular or reflection padding and quantify gains vs. CLE.
- CLE hyperparameters unstudied: The padding width w′ = W/8 is fixed heuristically. Conduct sensitivity analyses to w′, resolution scaling, and encoder depth to identify optimal configurations and failure modes.
- Lack of explicit geometric consistency metrics: Beyond PSNR/LPIPS (covered regions) and FVD/VBench (perspective crops), geometric fidelity (e.g., depth/normal consistency, epipolar constraints) is not measured. Introduce panorama-specific geometric metrics and compare with ground-truth panoramas.
- Outpainted content consistency: Semantic and structural consistency of hallucinated regions is not evaluated quantitatively (e.g., object permanence across views/time, horizon alignment). Develop cross-view consistency metrics and identity tracking across temporal panoramas.
- Data domain gaps: Image training data are predominantly synthetic indoor scenes; evaluation includes SUN360, but systematic outdoor/long-range generalization is underexplored. Curate diverse outdoor datasets and analyze cross-domain performance and failure modes.
- Video trajectory diversity: While real trajectories from DynPose100K are added, a systematic study of how trajectory statistics (e.g., speed, rotation rates, shake, zoom) affect generalization is absent. Build trajectory diversity benchmarks and training curricula.
- Temporal consistency and identity preservation: VBench Motion Smoothness is limited; strong metrics for temporal coherence, identity persistence, and temporal geometry consistency are missing. Incorporate optical-flow-based and identity-tracking metrics.
- Controllability of panorama placement and orientation: The model produces gravity-aligned panoramas but does not allow user control over global orientation (yaw), horizon tilt, or where the input view is “placed.” Design conditioning interfaces for explicit control and evaluate trade-offs.
- Text conditioning effects: Captions guide outpainting, yet the impact of prompt quality/length and text-image alignment on geometry and semantics is weakly analyzed (CLIP-score only). Ablate caption usage, prompt engineering, and multimodal alignment losses.
- Efficiency and scalability: Training/inference costs for large DiTs (FLUX/Wan) at higher resolutions and longer videos are not reported. Study memory/latency trade-offs, token sequence length scaling, and efficient attention (e.g., sparse/flash) for high-res, long-duration panoramas.
- Conditioning architecture choices: Sequence concatenation is adopted without a controlled comparison to alternatives (e.g., cross-attention, FiLM, gating, learned alignment modules). Benchmark conditioning strategies under noisy/no-metadata scenarios.
- Placement ambiguity under limited FoV: For very narrow FoVs or textureless scenes, multiple placements can satisfy local evidence. Quantify uncertainty and propose probabilistic placement or multi-hypothesis generation frameworks.
- Robustness to challenging content: Failure cases (e.g., rolling shutter, fast motion blur, specular/transparent surfaces, repetitive patterns) are not presented. Create stress-test suites and analyze systematic weaknesses.
- 3D reconstruction fidelity from generated panoramas: 3DGS results are qualitative and limited to static scenes; no quantitative comparison to reconstructions from ground-truth panoramas is provided. Measure geometric accuracy (e.g., point-cloud distances, mesh metrics) and extend to dynamic scene reconstruction (e.g., Deformable 3DGS).
- Alignment between generated 3D and real scale/metric: Generated panoramas may not be metrically accurate; scale and foreshortening ambiguities are unaddressed. Investigate metric grounding via monocular depth priors or scale cues.
- Panoramic evaluation protocols: Current metrics rely on perspective crops; there is a lack of standardized panorama-native perceptual and distortion metrics. Propose benchmark protocols and human studies in immersive 360° viewers.
- Failure analysis for canonical vs. non-canonical training: The ablation shows trade-offs (PSNR vs. FVD/VBench), but root causes and mitigation (e.g., curriculum learning, pose-conditional training) are not explored. Develop training strategies that combine canonical generation with pose-conditioned preservation.
- Dataset openness and reproducibility: Dependencies on non-open datasets/captions (Gemini 2.5) and external preprocessing pipelines may hinder replication. Provide open, standardized training/evaluation corpora and share preprocessing scripts.
- Safety and downstream risks: Hallucinated 360° content may mislead robotics/AR applications; the paper does not discuss safety or uncertainty quantification. Explore confidence estimation, abstention policies, and risk-aware generation for downstream use.
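The brute-force camera search mentioned in the list above can be pictured as a plain grid search: try every candidate (FoV, pitch, roll), render what the camera would see, and keep the candidate that best matches the input. The sketch below is a generic illustration, not the paper's procedure. The error measure (mean squared error), the grid steps, and `toy_render` are assumptions; `toy_render` is a hypothetical stand-in for re-projecting the generated panorama.

```python
import itertools


def grid_search_camera(observed, render, fovs, pitches, rolls):
    """Exhaustively score every (fov, pitch, roll) candidate and return
    (error, fov, pitch, roll) for the best match to `observed`."""
    best = None
    for fov, pitch, roll in itertools.product(fovs, pitches, rolls):
        candidate = render(fov, pitch, roll)
        error = sum((a - b) ** 2 for a, b in zip(candidate, observed)) / len(observed)
        if best is None or error < best[0]:
            best = (error, fov, pitch, roll)
    return best


def toy_render(fov, pitch, roll):
    """Hypothetical renderer: returns a tiny 'image' that depends on the camera.
    A real one would crop the generated panorama with these parameters."""
    return [fov / 120.0, pitch / 60.0, roll / 15.0]


observed = toy_render(70, -10, 5)  # pretend this is the input photo
best = grid_search_camera(observed, toy_render,
                          fovs=range(30, 121, 10),
                          pitches=range(-60, 61, 10),
                          rolls=range(-15, 16, 5))
```

Even this coarse grid evaluates 10 × 13 × 7 = 910 candidates, which hints at why the search is compute-heavy. The hard argmin over a discrete grid is also why it is non-differentiable.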
Practical Applications
Overview
OmniDiT introduces a geometry-free method to convert arbitrary perspective images and videos into seamless, gravity-aligned 360° panoramas without camera metadata, and a Circular Latent Encoding (CLE) technique that eliminates seam artifacts at the equirectangular boundary. Below are practical applications that leverage these findings and methods, grouped by deployment horizon.
Immediate Applications
These can be deployed with current models and workflows, typically as offline or near–real-time tools, with standard GPU resources and standard software integrations.
- 360° outpainting for post-production and VFX
- Sectors: Media & Entertainment, Advertising, Creative Studios
- What: Expand a single frame or clip into a full 360° plate for immersive edits, set extensions, and immersive promo materials.
- Tools/products/workflows: Plugins for Adobe After Effects/Premiere, Nuke; standalone batch converters; web APIs for “360-from-anything.”
- Assumptions/dependencies: Offline compute; outputs are generative and not ground-truth—suitable for creative, not documentary fidelity; rights to source media.
- Skybox/environment map creation from a single image
- Sectors: Gaming, VR/AR, VFX, 3D Rendering
- What: Generate 360° environment maps (skyboxes) from concept art, location photos, or frames for fast world-building and IBL setups.
- Tools/products/workflows: Unity/Unreal plugins; Blender add-on; DCC pipeline nodes.
- Assumptions/dependencies: LDR output (not HDR); for physically accurate lighting, post-processing or HDR-specific training is needed.
- Social media and creator tools for 360 content
- Sectors: Social Platforms, UGC Apps, Marketing
- What: One-click conversion of photos or short videos to 360 posts/stories; immersive promos.
- Tools/products/workflows: Mobile app filters; web uploaders with 360 viewers; YouTube 360 conversion filter.
- Assumptions/dependencies: Resolution used in the paper (~1024×2048 images, ~512×1024 videos) is acceptable for social but not premium VR.
- Real estate and tourism: 360 tours from smartphone clips
- Sectors: Real Estate, Travel & Hospitality
- What: Turn a short handheld walkthrough into a 360 panorama and optional static 3D walk-through using 3D Gaussian Splatting (3DGS).
- Tools/products/workflows: Mobile capture app + cloud processing (OmniDiT → COLMAP rig + 3DGS); realtor CMS integration.
- Assumptions/dependencies: Works best in static scenes; generated context may differ from reality—disclosure required; sufficient lighting/texture.
- Fast VR/AR prototyping with canonicalized 360 backgrounds
- Sectors: VR/AR, UX Prototyping, Training Demos
- What: Use gravity-aligned, seam-free panoramas as backgrounds for demos and design iterations.
- Tools/products/workflows: Drop-in assets for Unity/Unreal/Three.js; simple pipeline from a reference shot to usable 360.
- Assumptions/dependencies: Generative completion is not metrically accurate; best for prototyping and mood-setting.
- Zero-shot camera metadata recovery for datasets and pipelines
- Sectors: Academia (Computer Vision), Software Tooling, Digital Asset Management
- What: Estimate FoV, pitch, roll from a single image by generating a panorama and optimizing alignment; fill missing EXIF/calibration for SfM/SLAM or curation.
- Tools/products/workflows: CLI/library that wraps OmniDiT and a grid search; integration with COLMAP preprocessing.
- Assumptions/dependencies: Extra compute (panorama generation + search); accuracy is strong but not SOTA in all domains; domain shift can affect results.
- Panoramic dataset augmentation for model training
- Sectors: ML Engineering, Content Generation
- What: Augment training corpora for panoramic generation/editing models with seam-free ERP samples using Circular Latent Encoding.
- Tools/products/workflows: Training pipeline modification to add CLE (circularly pad each panorama along longitude before VAE encoding, then drop the padded latent columns); synthetic data generation to expand diversity.
- Assumptions/dependencies: Access to VAE encoder internals; retraining or fine-tuning required; data licenses for redistribution.
- 3D scene flythroughs from monocular videos
- Sectors: Creative Tech, Education, Cultural Heritage
- What: Outpaint to full 360 and reconstruct static scenes with 3DGS for interactive flythroughs and exhibits.
- Tools/products/workflows: OmniDiT → cubemap rig + COLMAP → vanilla 3DGS; web viewers for interactive exploration.
- Assumptions/dependencies: Static scenes only (vanilla 3DGS); generated content may not match the physical scene; pose estimation quality matters.
- Seam elimination in panoramic diffusion models via CLE
- Sectors: ML R&D, Video/Imaging Product Teams
- What: Adopt Circular Latent Encoding to eliminate ERP boundary seams during training (works for images and videos).
- Tools/products/workflows: Circularly pad the ERP input along longitude before VAE encoding so the encoder's zero-padding no longer touches the wrap-around boundary; drop padded latent columns post-encoding.
- Assumptions/dependencies: Access to model/VAE training code; retrain or fine-tune to benefit.
- Gravity-aligned panorama canonicalization as a preprocessing service
- Sectors: Software/Media Tooling, Archive Digitization
- What: Convert arbitrarily oriented panoramas or 360 captures into a consistent, upright orientation for downstream use.
- Tools/products/workflows: Batch processor aligning gravity and removing seams; integration with DAMs and 360 viewers.
- Assumptions/dependencies: Performance depends on scene cues; failure cases in texture-poor or highly symmetric scenes.
- Automated metadata recovery for photo/video archives
- Sectors: Digital Preservation, Newsrooms, Libraries
- What: Recover approximate camera FoV and horizon orientation for archival assets with missing metadata to aid cataloging and search.
- Tools/products/workflows: Batch estimation pipeline; indexing by estimated intrinsics/extrinsics.
- Assumptions/dependencies: Domain shift (e.g., historical film stock) may reduce accuracy; verification recommended.
- General equirectangular ML use-cases beyond 360 photos (seam-free training)
- Sectors: Earth Observation, Weather/Climate, Cartography
- What: Apply CLE to any ERP-like data (global maps) to improve generative/denoising models without longitude seams.
- Tools/products/workflows: Modify VAE encoder padding; retrain existing ERP generative models.
- Assumptions/dependencies: ERP-like topology must hold; access to training code and data.
Long-Term Applications
These require further scaling, research, or engineering (e.g., higher resolutions, real-time performance, HDR, trust & safety controls), or carry higher risk if deployed naively.
- Real-time or low-latency 360 conversion for live streaming
- Sectors: Media, Social, Live Events
- Potential products: Edge-optimized or distilled OmniDiT variants; GPU/ASIC inference on-device; live 360 broadcast filters.
- Dependencies: Model distillation/acceleration; streaming-grade reliability; compute costs; quality trade-offs.
- HDR panoramic lighting for physically-based rendering
- Sectors: VFX, ArchViz, Games
- Potential products: HDR skybox/IBL generation from a photo; auto-exposure and radiometric calibration.
- Dependencies: HDR-capable training data and loss; tone mapping/inverse tone mapping; photometric consistency.
- Robotics situational awareness from monocular cameras
- Sectors: Robotics, Surveillance, Drones
- Potential products: 360 situational awareness overlays; map completion for teleoperation UIs; training data augmentation for 360 perception models.
- Dependencies: Reliability under domain shift; minimizing hallucinations; safety validation; real-time constraints.
- Immersive sports/news replays from sparse viewpoints
- Sectors: Sports Broadcasting, Journalism
- Potential products: Reconstructed 360 sequences from single/limited cameras for AR/VR replays.
- Dependencies: Accuracy and provenance concerns; editorial standards; legal/ethical review to avoid misleading reconstructions.
- AR cloud and map completion
- Sectors: AR Platforms, Mapping
- Potential products: Fill missing environment context for spatial anchors and cloud maps; improved user experience in sparse scans.
- Dependencies: Metric consistency; robust geometry under dynamic scenes; joint optimization with SLAM.
- Camera calibration “as-a-service” replacing supervised estimators
- Sectors: Computer Vision, Photogrammetry, OEMs
- Potential products: API for robust FoV/horizon estimation across domains; batch pipeline for camera fleets.
- Dependencies: Broader training and benchmarking; corner-case handling (fisheye, extreme FoV); uncertainty estimates.
- 8K+ 360 image/video generation for premium headsets
- Sectors: Premium VR, Cinematic XR
- Potential products: High-res, seam-free, gravity-aligned 360 assets; stitched sequences for immersive films.
- Dependencies: Scaling DiTs and VAEs; memory and training data; temporal stability at high resolutions.
- 4D world generation for simulation and training
- Sectors: Simulation, Training, Gaming
- Potential products: Use OmniDiT outputs to bootstrap 3D/4D scene reconstructions for interactive sims; coupling with dynamic 3DGS/NeRFs.
- Dependencies: Handling dynamics and moving subjects; consistent multi-view geometry; long-sequence stability.
- Policy and provenance frameworks for synthetic 360 content
- Sectors: Policy, Platforms, Media Standards
- Potential products: Disclosure guidelines, C2PA-based provenance for 360 assets; labeling tools integrated in export pipelines; detection/forensics.
- Dependencies: Cross-industry adoption; clear UX for disclosure; regulatory alignment.
- Education, training, and therapy content at scale
- Sectors: Education, Public Safety Training, Healthcare (e.g., exposure therapy)
- Potential products: Rapid creation of immersive learning environments from commodity footage.
- Dependencies: Quality control and content validation; ethical review; domain-specific guardrails.
Key Assumptions and Dependencies (cross-cutting)
- Generative fidelity and safety: Outputs are plausible but may hallucinate missing regions; not suitable for safety-critical decisions without verification.
- Domain coverage: Current models are trained primarily on specific domains (e.g., indoor images; mixed video datasets). Outdoor/extreme cases may require fine-tuning.
- Compute: Diffusion transformers are compute-intensive; many uses are currently offline or batch. Real-time use needs model optimization.
- Resolution and dynamics: Paper results focus on ~1–2K ERP and static-scene 3D recon; professional VR and dynamic 3D require scaling and new methods.
- Licensing and rights: Use of source media and model weights must comply with licenses; generated content should include disclosure where appropriate.
- Integration effort: Circular Latent Encoding requires access to (and modifications of) the VAE encoder in training; camera estimation via OmniDiT adds inference overhead.
Glossary
- 3D Gaussian Splatting (3DGS): A fast differentiable 3D scene representation/rendering technique that models scenes with anisotropic Gaussian primitives for efficient optimization and view synthesis. "The generated consistent panoramas enable 3D scene reconstruction via 3D Gaussian Splatting (row 3)."
- Bundle adjustment: A nonlinear optimization process that jointly refines camera poses and 3D structure by minimizing reprojection error; “constrained” indicates additional priors/constraints are imposed (e.g., rig geometry). "and perform constrained bundle adjustment using a cubemap rig."
- Canonical Coordinate constraint: A training constraint that enforces all generated panoramas to be in a standard, gravity-aligned upright orientation regardless of input pose. "Instead, we enforce a Canonical Coordinate constraint, for which the model is trained to generate panoramas in a standard, gravity-aligned upright orientation, regardless of the camera pose of the input $X_{pers}$."
- Camera intrinsics and extrinsics: Intrinsics are internal camera parameters (e.g., focal length/FoV), while extrinsics describe the camera’s pose (position and orientation) in a scene. "We propose OmniDiT, a novel DiT-based architecture for “in-the-wild” perspective to canonical panorama generation that implicitly infers camera intrinsics and extrinsics, eliminating the need for camera calibration."
- Channel concatenation: A conditioning strategy that concatenates inputs along the channel dimension (e.g., stacking latents) to align conditioning and target features. "Consequently, when off-the-shelf camera estimators fail, channel-concatenation approaches break down completely due to the reliance on pixel-aligned conditioning; see \Cref{app-subset:channel-concat-failure}."
- Circular Latent Encoding: The paper’s technique for seam-free panorama generation that applies circular padding before VAE encoding and removes the padded latent parts, ensuring circular continuity in latent space. "We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation."
- Circular padding: A padding scheme where pixels wrap around (e.g., left edge pads from right edge), preserving periodic continuity and avoiding boundary artifacts. "We propose a simple solution that uses circular padding when encoding VAE latents."
- CLIP-FID: A variant of FID computed on CLIP embeddings to assess semantic similarity between generated and real images. "To measure visual quality, we report Fréchet Inception Distance (FID)~\cite{FID}, Kernel Inception Distance (KID)~\cite{KID}, FID on CLIP~\cite{CLIP} features (CLIP-FID), and FID on features of an auto-encoder fine-tuned on panorama images (FAED)~\cite{PanFusion}."
- CLIP-score (CS): A metric that uses CLIP to measure alignment between an image and a text prompt (higher is better). "We also report CLIP-score (CS)~\cite{CLIPScore} for text alignment."
- COLMAP: A widely used structure-from-motion and multi-view stereo toolbox for estimating camera poses and sparse/dense reconstructions. "We first apply COLMAP~\cite{Colmap} to estimate per-frame camera pose, and rotate each frame to have zero rotation relative to the first frame."
- Cubemap representation: A spherical image representation using six square faces (front, back, left, right, up, down) to reduce equirectangular distortion. "A few works explore the cubemap representation to eliminate large distortions inherent in the ERP~\cite{CubeDiff, DreamCube}."
- Diffusion transformer (DiT): A diffusion model architecture that replaces U-Nets with transformers operating on token sequences for denoising. "We implement the denoiser as a diffusion transformer (DiT)~\cite{DiT, FLUX, Wan2.1}."
- Discontinuity score (DS): A metric for quantifying seams or discontinuities at the boundaries of panoramic images. "We report the discontinuity score~(DS)~\cite{OmniFID-DS} to quantify seam artifacts."
- Equirectangular Projection (ERP): A mapping from spherical imagery to a 2D rectangle where longitude and latitude map to horizontal and vertical coordinates, respectively. "Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space."
- FAED: A panorama-specific metric computed as FID over features of an auto-encoder trained on panoramic imagery. "To measure visual quality, we report Fréchet Inception Distance (FID)~\cite{FID}, Kernel Inception Distance (KID)~\cite{KID}, FID on CLIP~\cite{CLIP} features (CLIP-FID), and FID on features of an auto-encoder fine-tuned on panorama images (FAED)~\cite{PanFusion}."
- Field-of-View (FoV): The angular extent of the observable scene captured by the camera; larger FoV covers more of the environment. "However, this strategy requires known camera metadata, such as Field-of-View (FoV) and camera pose (yaw, pitch, roll) \cite{Argus}."
- Flow matching: A generative modeling framework that learns a velocity/denoising field to transport noise to data along a continuous path. "We adopt the flow matching framework~\cite{FlowMatching, RFSampler}, which learns a denoiser that maps from the standard normal distribution to the distribution of panorama data $Y_{equi} \sim p_{\text{data}}$."
- Forward diffusion process: In diffusion models, the corruption process that incrementally adds noise to clean data according to a noise schedule. "The forward diffusion process adds noise to clean data to obtain a noisy input $Y_{equi}^{t}$ at time $t$; i.e., $Y_{equi}^{t} = (1-t) Y_{equi} + t \bm{\epsilon}$."
- Fréchet Inception Distance (FID): A distributional distance between generated and real images computed in Inception feature space (lower is better). "To measure visual quality, we report Fréchet Inception Distance (FID)~\cite{FID}, Kernel Inception Distance (KID)~\cite{KID}, FID on CLIP~\cite{CLIP} features (CLIP-FID), and FID on features of an auto-encoder fine-tuned on panorama images (FAED)~\cite{PanFusion}."
- FVD: Fréchet Video Distance; a video-level quality metric comparing the statistics of generated and real videos in a learned feature space. "We also report FVD~\cite{FVD}, Imaging Quality, Aesthetic Quality, and Motion Smoothness from VBench~\cite{VBench} to evaluate overall quality."
- GeoCalib: A learned camera calibration method that estimates quantities such as the gravity direction from an image; the paper runs it on the stabilized video. "Then, we run GeoCalib~\cite{GeoCalib} to estimate the global gravity direction of the stabilized video, and rotate the video to align the gravity direction with the vertical axis."
- Gravity-aligned: Refers to orienting panoramas so that the vertical axis corresponds to gravity, producing upright, canonical views. "OmniDiT lifts arbitrary perspective images (row 1) and videos (row 2) to seamless, gravity-aligned 360° panoramas."
- Inductive bias: Model architecture assumptions that bias learning toward certain solutions; geometric inductive biases encode camera/spherical geometry. "prior works often rely on strong geometric inductive biases, such as explicitly projecting the perspective input to the target Equirectangular Projection (ERP) space"
- Kernel Inception Distance (KID): An MMD-based alternative to FID computed on Inception features to assess image quality/diversity. "To measure visual quality, we report Fréchet Inception Distance (FID)~\cite{FID}, Kernel Inception Distance (KID)~\cite{KID}, FID on CLIP~\cite{CLIP} features (CLIP-FID), and FID on features of an auto-encoder fine-tuned on panorama images (FAED)~\cite{PanFusion}."
- Latent space: The compressed feature space learned by a VAE (or similar), where diffusion models often operate for efficiency. "To generate samples at a high resolution, modern diffusion models typically operate in the latent space of a pre-trained convolution-based VAE~\cite{LDM}."
- LPIPS: Learned Perceptual Image Patch Similarity; a perceptual metric measuring distance between images in deep feature space (lower is better). "To measure input preservation, we report PSNR and LPIPS~\cite{LPIPS} between ground-truth and generated panorama videos within regions covered by the perspective video."
- Outpainting: Generating new content outside the bounds of the provided image or video crop to complete a larger field of view. "Given a perspective video with $T$ frames, $X_{\text{pers}} \in \mathbb{R}^{T \times h \times w \times 3}$ (we treat image as a special case with $T=1$) and a caption, our goal is to outpaint a 360° panoramic video $Y_{\text{equi}} \in \mathbb{R}^{T \times H \times W \times 3}$."
- Patchified: The process of dividing a latent/image into fixed-size patches and flattening them into a sequence for transformer input. "The latent representation $y_{\text{equi}}$ is then patchified and flattened into a 1D sequence of tokens that is provided as input to the DiT."
- Pixel-aligned conditioning: Conditioning where the input perspective image is geometrically projected into ERP so its pixels align with the target panorama. "they typically project $X_{\text{pers}}$ into the ERP space to obtain $X_{\text{equi}}$, which is pixel-aligned with the generation target $Y_{\text{equi}}$."
- PSNR: Peak Signal-to-Noise Ratio; a fidelity metric measuring the ratio between maximum signal power and reconstruction error (higher is better). "To measure input preservation, we report PSNR and LPIPS~\cite{LPIPS} between ground-truth and generated panorama videos within regions covered by the perspective video."
- Rig-based COLMAP: A COLMAP setup that models multiple views as a rigid rig to jointly optimize their poses with shared constraints. "we employ rig-based COLMAP~\cite{ColmapRigs, Colmap} on the generated panoramic video to obtain camera poses."
- Rotated denoising: An inference-time trick for panoramas that cyclically shifts the image across diffusion steps to mitigate seam artifacts. "Prior works often attribute this to the generation process, employing inference time tricks such as rotated denoising (shifting the panorama cyclically across sampling steps)~\cite{PanoWan, PanoDiffusion}."
- Sequence concatenation: Conditioning strategy that concatenates token sequences of conditioning inputs and targets so the transformer attends across them jointly. "Instead of enforcing spatial correspondence via projection into the ERP space, we employ a simple sequence concatenation mechanism inspired by recent image editing models~\cite{Qwen-Image, FLUX-Kontext}."
- Spherical attention: An attention mechanism adapted to spherical image geometry to enable cross-view information exchange. "Imagine360~\cite{Imagine360} duplicates the denoising U-Net in AnimateDiff~\cite{AnimateDiff} to process panorama and perspective views separately, connected by spherical attention for information exchange."
- Spherical convolutions: Convolutional operations defined on the sphere, used to better handle panoramic distortions. "To handle the geometric properties of panoramas, they often inject strong inductive bias such as spherical convolutions~\cite{SphericalPer2Pano1, SphericalPer2Pano2}"
- Structure-from-Motion (SfM): A pipeline that reconstructs 3D structure and camera motion from multiple images. "We thank Noah Snavely and Richard Tucker for implementing the pipeline for running structure-from-motion at scale."
- Timestep shifting: An inference technique that shifts the diffusion timestep schedule for improved sampling quality/stability. "At inference time, we use FLUX's default sampler~\cite{RFSampler} with 50 sampling steps and timestep shifting of 3.16."
- Token sequences: Ordered sequences of vectors (tokens) fed into a transformer, obtained by patchifying and flattening latents/images. "By treating the perspective input and the panorama target simply as token sequences, OmniDiT learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information."
- Variational Autoencoder (VAE): A latent variable model used to encode/decode images; diffusion often operates in its latent space for efficiency. "To generate samples at a high resolution, modern diffusion models typically operate in the latent space of a pre-trained convolution-based VAE~\cite{LDM}."
- VBench: A benchmark suite providing video evaluation metrics like Imaging Quality, Aesthetic Quality, and Motion Smoothness. "We also report FVD~\cite{FVD}, Imaging Quality, Aesthetic Quality, and Motion Smoothness from VBench~\cite{VBench} to evaluate overall quality."
- Zero-padding: Padding with zeros at convolution boundaries, which can introduce edge artifacts in feature maps and latents. "VAEs utilize zero-padding in convolutional layers~\cite{CNNPosInfo}."
- Zero-shot: Performing a task without task-specific training or labeled supervision on that task. "Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating OmniDiT's deep geometric understanding and broader utility in computer vision tasks."
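The Flow matching, Forward diffusion process, and Timestep shifting entries above can be made concrete with a small sketch. It is written in plain Python on flat lists and is only an illustration under stated assumptions: the function names are hypothetical, the velocity target (noise minus data) is the common rectified-flow convention rather than something the glossary confirms, and the shift formula is one widely used form whose match to the paper's sampler is assumed.

```python
def forward_noise(clean, noise, t):
    """Linear interpolation between clean data (t=0) and noise (t=1)."""
    return [(1.0 - t) * y + t * e for y, e in zip(clean, noise)]


def velocity_target(clean, noise):
    """Time derivative of the interpolation above: noise - clean.

    Assumption: the common rectified-flow regression target.
    """
    return [e - y for y, e in zip(clean, noise)]


def shift_timestep(t, shift=3.16):
    """Warp t in [0, 1] toward the noisy end; endpoints stay fixed.

    Assumption: one common form of timestep shifting,
    t' = shift * t / (1 + (shift - 1) * t).
    """
    return shift * t / (1.0 + (shift - 1.0) * t)
```

With `shift > 1`, a uniform grid of timesteps is warped so that more sampling steps are spent at high noise levels, which is the usual motivation for shifting at higher resolutions.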
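To illustrate the Patchified and Sequence concatenation entries above, here is a minimal sketch on a single-channel 2D grid of numbers. The helper names and the row-major patch order are assumptions made for illustration; a real DiT works on multi-channel latents and adds positional information, which this toy version omits.

```python
def patchify(grid, patch):
    """Split an H x W grid into patch x patch tiles, flattened row-major.

    Returns a list of tokens; each token is a flat list of patch*patch values.
    """
    height, width = len(grid), len(grid[0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [grid[top + dy][left + dx]
                 for dy in range(patch) for dx in range(patch)]
            )
    return tokens


def concat_sequences(condition_tokens, target_tokens):
    """Join conditioning and target tokens into one sequence for attention."""
    return condition_tokens + target_tokens


# Hypothetical toy inputs: a 2x2 "perspective" latent and a 2x4 "panorama" latent.
pers_latent = [[9, 9], [9, 9]]
pano_latent = [[0, 1, 2, 3], [4, 5, 6, 7]]
sequence = concat_sequences(patchify(pers_latent, 2), patchify(pano_latent, 2))
```

Note that nothing in the concatenated sequence encodes where the perspective view lands on the panorama; in this conditioning style the model has to learn that correspondence from data.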
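The Zero-padding and Rotated denoising entries above both concern the left/right boundary of an equirectangular panorama, which is physically continuous. The toy sketch below filters one image row with a small blur kernel and compares zero-padding with wraparound (circular) padding. It is a generic illustration, not the paper's Circular Latent Encoding; all names are hypothetical.

```python
def pad_row(row, pad, mode):
    """Pad a 1D row on both sides with zeros or with wrapped-around values."""
    if mode == "zero":
        return [0.0] * pad + list(row) + [0.0] * pad
    if mode == "circular":
        return list(row[-pad:]) + list(row) + list(row[:pad])
    raise ValueError("mode must be 'zero' or 'circular'")


def blur_row(row, mode, kernel=(0.25, 0.5, 0.25)):
    """Apply a small 1D filter; output has the same length as the input."""
    pad = len(kernel) // 2
    padded = pad_row(row, pad, mode)
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(row))
    ]


def roll_row(row, shift):
    """Cyclically shift a row, as rotated denoising does between steps."""
    shift %= len(row)
    return list(row[-shift:]) + list(row[:-shift]) if shift else list(row)
```

On a constant row, circular padding leaves every value unchanged, while zero-padding darkens the two boundary samples. Those boundary samples are exactly where the panorama wraps around, which is why border effects of this kind can show up as a visible seam.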
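The PSNR entry above describes measuring input preservation only within regions covered by the perspective video. A minimal sketch of such a masked PSNR follows, assuming pixel values in [0, 1] and a binary coverage mask supplied by the caller; the function name and the flat-list layout are illustrative assumptions, not the paper's evaluation code.

```python
import math


def masked_psnr(pred, target, mask, max_value=1.0):
    """PSNR in dB over the pixels where mask is truthy.

    PSNR = 10 * log10(max_value^2 / MSE); returns inf for a perfect match.
    """
    sq_errors = [(p - g) ** 2 for p, g, m in zip(pred, target, mask) if m]
    if not sq_errors:
        raise ValueError("mask selects no pixels")
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Restricting the error to the mask keeps the score from penalizing outpainted regions, where many different completions are equally plausible.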