WorldLens: 3D Rendering and Benchmarking

Updated 4 July 2026

WorldLens names two related systems: a platform for real-time 3D Gaussian Splatting rendering with interactive navigation, and a comprehensive benchmark for evaluating driving world models.
The rendering platform combines automatic image-based lighting, efficient collision detection, and training-rendering co-design, which cuts the Gaussian count substantially while keeping PSNR, SSIM, and LPIPS close to the baseline.
The benchmark assesses world models along visual, geometric, behavioral, and human-alignment axes, using 24 standardized dimensions to provide actionable insights.

In recent arXiv literature, WorldLens denotes two technically distinct but conceptually related systems: a rendering-and-interaction layer for 3D Gaussian Splatting worlds in HY-World 2.0 (HY-World et al., 15 Apr 2026), and a full-spectrum benchmark ecosystem for evaluating driving world models across appearance, geometry, control, downstream utility, and human judgment (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). In both usages, the term refers to an interface between a generated or reconstructed world and its practical use: in one case through real-time exploration, character movement, collision handling, and lighting control; in the other through systematic measurement of whether generated worlds are visually convincing, geometrically coherent, behaviorally reliable, and human-aligned. A broader reading of the surrounding literature suggests that the term sits within a larger family of “lens” concepts in machine learning and imaging, including perspective-aware latent-variable modeling, eye-perspective rendering, and lensless or de-lensing image formation (Dinakar et al., 2022, Emsenhuber et al., 15 Sep 2025, Bezzam et al., 2022, Sabella, 2022).

1. WorldLens as a term in contemporary machine learning and graphics

Within the 2025–2026 literature represented here, WorldLens is not a single standardized object. One strand uses the name for a runtime platform that makes generated 3D worlds explorable and character-ready in the HY-World 2.0 pipeline (HY-World et al., 15 Apr 2026). Another uses it for a benchmark, dataset, and evaluation agent for driving world models, designed to assess whether generated worlds behave like coherent worlds rather than merely resembling realistic videos (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025).

The shared conceptual core is that both systems sit after world generation. HY-World 2.0 already performs panorama generation, trajectory planning, world expansion, and world composition; WorldLens is the final “experience” layer that supports “interactive exploration of 3D worlds with character support” (HY-World et al., 15 Apr 2026). In the driving-world-model literature, WorldLens is introduced because existing metrics emphasize FID, FVD, LPIPS, and similar appearance-oriented measures, while often missing whether the generated world preserves geometry, obeys physics, supports planners, or aligns with human realism judgments (Kong et al., 11 May 2026).

This suggests a useful distinction between two meanings of the term. One meaning is deployment-facing: a rendering stack that turns a reconstructed 3D asset into a navigable world. The other is evaluation-facing: an assessment stack that turns a generated driving video into a multi-axis measurement of world fidelity. The shared name reflects this shared role, even though the systems differ in architecture and domain.

2. WorldLens in HY-World 2.0: rendering-and-interaction layer for 3D worlds

In HY-World 2.0, WorldLens is described as “a high-performance 3D Gaussian Splatting (3DGS) rendering platform” that enables “interactive exploration of 3D worlds with character support” (HY-World et al., 15 Apr 2026). It is the rendering-and-interaction layer of the larger HY-World 2.0 system, which accepts text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. The world assets rendered by WorldLens come from HY-World 2.0’s composition stage, including the optimized 3DGS scene itself and a mesh extracted from it.

The platform is explicitly characterized by four capabilities: a flexible engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design (HY-World et al., 15 Apr 2026). These features position WorldLens not as a static viewer but as a runtime layer for user-controlled navigation, physically plausible character motion, and lighting control. Because the design is described as engine-agnostic, it is intended to interface with different deployment targets rather than remain tied to a single engine runtime.

WorldLens is built around Gaussian-splat-based rendering rather than mesh-rasterization-based rendering. The final scene is composed as a 3DGS representation initialized from the expanded point cloud $\mathbf{\tilde{P}}$, while the extracted mesh is used “as underlying collision proxies” and as a geometric aid for navigation and interaction (HY-World et al., 15 Apr 2026). The paper explicitly states that WorldLens supports “real-time collision detection and physically plausible feedback,” which is necessary for movement through complex environments such as stairs and indoor layouts.

Automatic IBL is one of the platform’s principal rendering-quality features. The paper states that WorldLens provides “automatic IBL lighting”, meaning that environment lighting is inferred and applied automatically rather than authored manually. The detailed lighting pipeline is not fully specified, but the claim is that lighting control is integrated into the rendering stack so that outputs are not only geometrically correct but also perceptually coherent in brightness, color balance, and environmental illumination (HY-World et al., 15 Apr 2026).

The paper is comparatively sparse about WorldLens as a standalone renderer. It does not report a dedicated ablation isolating the engine-agnostic design or the automatic IBL module, nor does it provide separate rendering FPS or latency numbers for the platform itself (HY-World et al., 15 Apr 2026). As a result, WorldLens is best understood as a platform-level integration layer rather than a separately benchmarked neural renderer.

3. Rendering architecture, collision proxies, and training-rendering co-design

The design of WorldLens is tightly coupled to the world composition stage of HY-World 2.0. That stage reconstructs aligned depths via WorldMirror 2.0, fuses them into an extended point cloud $\mathbf{\tilde{P}}$, and then optimizes a 3DGS model whose Gaussians are parameterized by opacity $\sigma_k$, mean $\boldsymbol{\mu}_k$, and covariance $\mathbf{\Sigma}_k = \mathbf{R}_k \mathbf{S}_k \mathbf{S}_k^T \mathbf{R}_k^T$ (HY-World et al., 15 Apr 2026). The renderer is trained with the combined objective

$\mathcal{L}_{\text{GS}} = \mathcal{L}_{\text{color}} + \mathcal{L}_{\text{geo}} + \mathcal{L}_{\text{reg}} + \mathcal{L}_{\text{mask}}.$

The photometric term uses $\mathcal{L}_1$, SSIM, and LPIPS, while the geometric term supervises depth and normals (HY-World et al., 15 Apr 2026).
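The covariance factorization $\mathbf{\Sigma}_k = \mathbf{R}_k \mathbf{S}_k \mathbf{S}_k^T \mathbf{R}_k^T$ can be illustrated with a minimal sketch. It assumes the rotation is stored as a unit quaternion and the scale as three per-axis values, as is common in 3DGS implementations; the source does not state how HY-World 2.0 stores these parameters, and the function names `quat_to_rot` and `covariance` are hypothetical.

```python
import math

def quat_to_rot(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    Assumption: rotations are stored as quaternions; the source does not say so.
    """
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(quat, scales):
    """Build Sigma = R S S^T R^T with S = diag(scales)."""
    R = quat_to_rot(*quat)
    # M = R S: scale column j of R by scales[j].
    M = [[R[i][j] * scales[j] for j in range(3)] for i in range(3)]
    # Sigma = M M^T, which is symmetric and positive semi-definite by construction.
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Identity rotation: the covariance is diagonal with the squared scales.
sigma_identity = covariance((1.0, 0.0, 0.0, 0.0), (0.5, 0.2, 0.1))

# 90-degree rotation about z: the x and y extents swap.
half = math.sqrt(0.5)
sigma_rotated = covariance((half, 0.0, 0.0, half), (0.5, 0.2, 0.1))
```

Factoring the covariance this way keeps it symmetric and positive semi-definite for any values of the rotation and scale parameters, which is why optimizers can update those parameters freely.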

For WorldLens, the most consequential design theme is training-rendering co-design. The paper states explicitly that WorldLens is not an isolated renderer; it benefits from optimization choices made upstream to improve runtime rendering quality and efficiency (HY-World et al., 15 Apr 2026). The primary example is the combination of adaptive densification with MaskGaussian. The authors argue that uniform voxel downsampling reduces Gaussian count but harms high-frequency detail, whereas ordinary densification restores detail but introduces too many Gaussians and produces floaters, especially in sky regions where depth supervision is missing.

To address this, the Gaussian cloud is partitioned into sky and scene subsets; standard growth is applied only to the scene subset; then MaskGaussian is used to probabilistically prune redundant Gaussians. The paper defines a binary mask $M_k \in \{0,1\}$ sampled via Gumbel-Softmax and renders with masked transmittance: $\mathbf{c}(\mathbf{x}) = \sum_{k=1}^{N} M_k \, \mathbf{c}_k \, \sigma_k \, T_k,\quad T_{k+1} = T_k (1 - M_k \sigma_k).$ A sparsity regularizer

$\mathcal{L}_{\text{mask}} = \lambda_m \left(\frac{1}{N}\sum_{k=1}^{N}M_k\right)^2$

encourages compact representations (HY-World et al., 15 Apr 2026).
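The masked compositing rule and the sparsity regularizer above can be sketched for a single pixel as follows. This is an illustrative sketch only: it takes the binary masks as given (the Gumbel-Softmax sampling step is omitted), treats each $\sigma_k$ as the Gaussian's already-evaluated opacity at that pixel, and assumes the Gaussians are sorted front to back. The names `composite` and `mask_loss` are hypothetical.

```python
def composite(colors, opacities, masks):
    """Front-to-back compositing: c = sum_k M_k c_k sigma_k T_k,
    with T_1 = 1 and T_{k+1} = T_k (1 - M_k sigma_k)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, sigma, mask in zip(colors, opacities, masks):
        weight = mask * sigma * transmittance
        pixel = [p + weight * ch for p, ch in zip(pixel, color)]
        # A masked-out Gaussian (mask == 0) leaves the transmittance unchanged.
        transmittance *= 1.0 - mask * sigma
    return pixel, transmittance

def mask_loss(masks, lambda_m):
    """Sparsity regularizer: lambda_m * (mean of M_k) ** 2."""
    mean_mask = sum(masks) / len(masks)
    return lambda_m * mean_mask ** 2

# Three Gaussians (red, green, blue); the green one is pruned by its mask.
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
opacities = [0.5, 0.5, 1.0]
masks = [1, 0, 1]

pixel, remaining = composite(colors, opacities, masks)
loss = mask_loss(masks, lambda_m=0.1)
```

Because a masked Gaussian contributes neither color nor occlusion, pruning it changes the image only through what it used to hide or add, which is what lets the regularizer drive the mean mask value down without a large quality penalty.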

This optimization is directly relevant to WorldLens because it reduces active Gaussian count and therefore lowers rasterization cost. Averaged over 10 scenes, the paper reports that the full configuration reduces the Gaussian count by about 77% relative to the 6M baseline while keeping quality close: 25.023 PSNR / 0.747 SSIM / 0.215 LPIPS versus the 6M baseline’s 25.176 / 0.751 / 0.209 (HY-World et al., 15 Apr 2026). Elsewhere, the paper describes the same efficiency-quality tradeoff more concretely as a reduction from 5.254M Gaussians to 1.383M with only a 0.14 dB PSNR drop (HY-World et al., 15 Apr 2026).
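The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the 1.383M count is the final Gaussian count in both statements; the source does not say so explicitly, and the variable names are illustrative.

```python
def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (1.0 - after / before)

# Assumption: 1.383M is the full configuration's Gaussian count in both cases.
final_millions = 1.383

vs_6m_pct = reduction_pct(6.0, final_millions)       # about 77%, as reported
vs_5254m_pct = reduction_pct(5.254, final_millions)  # just under 74%

# Quality deltas against the 6M baseline, taken from the reported averages.
psnr_drop_db = 25.176 - 25.023   # about 0.15 dB
ssim_drop = 0.751 - 0.747        # about 0.004
lpips_increase = 0.215 - 0.209   # about 0.006
```

Under that assumption, the 77% figure matches the 6M baseline and the 5.254M starting point gives a slightly smaller relative reduction. The two PSNR drops (about 0.15 dB and 0.14 dB) are measured against different baselines, so they need not agree exactly.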

Collision handling follows a hybrid rendering-geometry design. Earlier in the HY-World 2.0 pipeline, a low-resolution panoramic mesh is computed for strict collision detection during trajectory planning, and later a mesh is extracted from the optimized 3DGS via TSDF fusion and marching cubes (HY-World et al., 15 Apr 2026). WorldLens uses these meshes as collision proxies so that users or characters do not pass through solid structures. The 3DGS therefore provides appearance, while the mesh provides the collision surface.
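The paper does not specify how WorldLens queries its collision proxies. As one generic illustration of mesh-based blocking, the sketch below tests a character's movement step against proxy triangles with the Möller-Trumbore intersection test, restricted to the segment between the current and target positions. All function names are hypothetical, and a production system would typically add a spatial acceleration structure and a character radius.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def segment_hits_triangle(p, q, tri, eps=1e-9):
    """Möller-Trumbore test limited to the segment from p to q."""
    v0, v1, v2 = tri
    d = sub(q, p)
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    a = dot(e1, h)
    if abs(a) < eps:
        return False  # step is parallel to the triangle's plane
    f = 1.0 / a
    s = sub(p, v0)
    u = f * dot(s, h)
    if u < 0.0 or u > 1.0:
        return False
    qv = cross(s, e1)
    v = f * dot(d, qv)
    if v < 0.0 or u + v > 1.0:
        return False
    t = f * dot(e2, qv)
    return 0.0 <= t <= 1.0  # hit must lie between p and q

def move(position, target, triangles):
    """Return `target` if the step is unobstructed, otherwise stay in place."""
    for tri in triangles:
        if segment_hits_triangle(position, target, tri):
            return position
    return target

# A single wall triangle in the plane x = 1.
wall = [((1.0, -1.0, -1.0), (1.0, 1.0, -1.0), (1.0, 0.0, 1.0))]

blocked = move((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), wall)  # step crosses the wall
free = move((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), wall)     # step stops short of it
```

The split of responsibilities mirrors the text: the splats are never consulted for collision, only the mesh proxy is.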

4. WorldLens as a benchmark for driving world models

A second and independent use of WorldLens appears in the driving-world-model literature, where it names a unified benchmark intended to evaluate whether generated driving worlds are realistic not only visually but also geometrically, behaviorally, and perceptually (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). The motivating problem is that world models can generate photorealistic dash-cam or multi-view driving videos yet still fail under reconstruction or closed-loop planning. The benchmark is introduced to address the gap between how realistic generated worlds appear and how realistically they behave (Kong et al., 11 May 2026).

WorldLens is organized around five complementary aspects and 24 standardized dimensions. The five aspects are Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). The stated goal is to evaluate generated driving worlds across the pipeline from pixel fidelity to 4D geometric consistency, closed-loop planning utility, and human perceptual realism (Kong et al., 11 May 2026).

The benchmark’s central claim is that current models are specialists rather than all-rounders. Across evaluations of models including MagicDrive, DreamForge, DriveDreamer-2, OpenDWM, DiST-4D, and $\mathcal{X}$-Scene, the papers state that no single model dominates across all axes (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). The trade-offs are systematic: texture-rich models violate geometry, geometry-aware models may lack behavioral fidelity, and even the strongest models receive human realism ratings of only around 2–3 out of 10 (Kong et al., 11 May 2026).

The benchmark also reframes what counts as fidelity. Rather than treating world-model evaluation as a matter of clip realism alone, it asks whether a model builds a world that is reconstructable, actionable, useful to perception systems, and aligned with human judgment (Liang et al., 11 Dec 2025). This is a notably different use of the term WorldLens from HY-World 2.0, but the conceptual role is again that of an interface layer—here between generative outputs and their practical assessment.

5. Evaluation protocol, quantitative structure, and WorldLens-Agent

The WorldLens driving benchmark distributes its 24 dimensions across the five aspects in a fixed schema (Kong et al., 11 May 2026).

Aspect	Number of dimensions	Representative dimensions
Generation	8	Subject Fidelity, Temporal Consistency, Cross-View Consistency
Reconstruction	4	Photometric Error, Geometric Discrepancy, Novel-View Quality
Action-Following	4	Displacement Error, PDMS, Route Completion, Arena Driving Score
Downstream Task	4	Map Segmentation, NDS, AMOTA, SparseOcc RayIoU
Human Preference	4	World Realism, Physical Plausibility, 3D/4D Consistency, Behavioral Safety

The Generation aspect includes Subject Fidelity, Subject Coherence, Subject Consistency, Depth Discrepancy, Temporal Consistency, Semantic Consistency, Perceptual Discrepancy, and Cross-View Consistency (Kong et al., 11 May 2026). The Reconstruction aspect asks whether a coherent 3D or 4D world can be recovered from the generated sequence, using Photometric Error, Geometric Discrepancy, Novel-View Quality, and Novel-View Discrepancy (Kong et al., 11 May 2026). The Action-Following aspect measures Displacement Error, Open-Loop Adherence via PDMS, Route Completion, and Closed-Loop Adherence via Arena Driving Score (Kong et al., 11 May 2026). The Downstream Task aspect uses BEVFusion, NDS, AMOTA, and SparseOcc RayIoU to determine whether synthetic videos are useful for real perception systems (Kong et al., 11 May 2026). Human Preference scores the four subjective dimensions on a 1–10 scale (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025).
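The five-aspect schema can be written down as a small lookup structure, which makes the 8 + 4 + 4 + 4 + 4 = 24 split easy to verify. The dictionary below is only a sketch assembled from the names listed above; it is not an artifact released with the benchmark, the Downstream Task entries follow the summary table, and `WORLDLENS_SCHEMA` and `aspect_of` are hypothetical names.

```python
# Aspect -> dimension names, transcribed from the text above.
WORLDLENS_SCHEMA = {
    "Generation": [
        "Subject Fidelity", "Subject Coherence", "Subject Consistency",
        "Depth Discrepancy", "Temporal Consistency", "Semantic Consistency",
        "Perceptual Discrepancy", "Cross-View Consistency",
    ],
    "Reconstruction": [
        "Photometric Error", "Geometric Discrepancy",
        "Novel-View Quality", "Novel-View Discrepancy",
    ],
    "Action-Following": [
        "Displacement Error", "Open-Loop Adherence",
        "Route Completion", "Closed-Loop Adherence",
    ],
    # Names follow the summary table rather than the running text.
    "Downstream Task": [
        "Map Segmentation", "NDS", "AMOTA", "SparseOcc RayIoU",
    ],
    "Human Preference": [
        "World Realism", "Physical Plausibility",
        "3D/4D Consistency", "Behavioral Safety",
    ],
}

def aspect_of(dimension):
    """Return the aspect a dimension belongs to, or None if unknown."""
    for aspect, dimensions in WORLDLENS_SCHEMA.items():
        if dimension in dimensions:
            return aspect
    return None

counts = {aspect: len(dims) for aspect, dims in WORLDLENS_SCHEMA.items()}
total_dimensions = sum(counts.values())
```

The sketch deliberately stops short of aggregating scores: the dimensions mix error-style and score-style metrics on different scales, so any per-aspect average would need normalization choices that the text above does not describe.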

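The later paper gives explicit formulas for many of these metrics. For example, Perceptual Discrepancy is written as an FVD-style Fréchet distance, which in its standard form over the feature statistics $(\mu_r, \Sigma_r)$ of real videos and $(\mu_g, \Sigma_g)$ of generated videos reads

$d_F^2 = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),$

while Route Completion is the fraction of the planned route that the agent completes in closed-loop simulation, and Closed-Loop Adherence is the Arena Driving Score, which combines that route-completion rate with PDMS.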
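To make the FVD-style Fréchet distance mentioned above concrete, the sketch below computes a simplified version that assumes diagonal feature covariances, so the matrix square root reduces to per-dimension square roots. This simplification is an assumption made here for readability. FVD as normally computed uses full covariances of features from a pretrained video network, and the benchmark's exact feature extractor is not described in this article. The function names are hypothetical.

```python
import math

def feature_stats(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1)
           for j in range(d)]
    return mean, var

def frechet_distance_diag(real, generated):
    """Fréchet distance between Gaussians with DIAGONAL covariances.

    ||mu_r - mu_g||^2 + sum_j (var_r[j] + var_g[j] - 2 * sqrt(var_r[j] * var_g[j]))
    """
    mu_r, var_r = feature_stats(real)
    mu_g, var_g = feature_stats(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term

# Toy features: the generated set is the real set shifted by 1 in the first dimension.
real_feats = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
gen_feats = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]

same = frechet_distance_diag(real_feats, real_feats)
shifted = frechet_distance_diag(real_feats, gen_feats)
```

With identical variances the covariance term vanishes, so a pure shift of the feature mean contributes exactly its squared length; lower values indicate that generated features are statistically closer to real ones.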

The reconstruction protocol is likewise explicit: generated videos are lifted into a 4D Gaussian field, re-rendered at original training poses and novel camera poses, and then compared against the relevant views (Liang et al., 11 Dec 2025).

A distinctive component of the ecosystem is WorldLens-26K, a human-annotated preference dataset containing 26,808 entries, each pairing a numerical score with a free-text rationale (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). The annotation procedure uses 10 annotators, split into two independent groups, with each annotation taking about 128 seconds on average, for a total exceeding 930 hours (Kong et al., 11 May 2026). The annotation interface displays four synchronized modalities: the generated video, semantic segmentation mask, estimated depth map, and 3D bounding box overlay (Kong et al., 11 May 2026).
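The annotation-effort figures are mutually consistent, as a short calculation shows. It assumes each of the 26,808 entries was annotated once at the reported average time; the per-annotator split is an illustrative even division, not a figure from the papers.

```python
entries = 26_808          # WorldLens-26K entries
seconds_per_entry = 128   # reported average annotation time
annotators = 10

total_seconds = entries * seconds_per_entry
total_hours = total_seconds / 3600             # about 953 hours
hours_per_annotator = total_hours / annotators  # assumes an even split
```

The result, roughly 953 hours, agrees with the stated total "exceeding 930 hours".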

From these annotations, the papers derive WorldLens-Agent, a vision-language evaluator trained with LoRA-based supervised fine-tuning on Qwen3-VL-8B or the closely related Qwen3-VL / Qwen2.5-VL family (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). Its input comprises the video and synchronized auxiliary modalities; its output is both a dimension-specific score and a free-text explanation (Kong et al., 11 May 2026). This design is intended to make large-scale evaluation scalable, automatic, and explainable.

6. Empirical findings, limitations, and relation to adjacent “lens” research

The empirical results of the driving-benchmark literature are notable for how little they validate appearance-only evaluation. The papers state that open-loop PDMS scores are roughly 71%–79%, while closed-loop Route Completion drops to 6%–14%, with RLGF cited at 13.51% (Kong et al., 11 May 2026). Human realism scores remain around 2–3 out of 10 across the four human dimensions (Kong et al., 11 May 2026). These results support the benchmark’s central thesis that visually convincing videos are not necessarily usable worlds.

Model-specific comparisons reinforce that conclusion. OpenDWM is described as strongest in Subject Fidelity and Subject Coherence, with 36.30 and 83.13 respectively, but weaker in geometry and downstream utility (Kong et al., 11 May 2026). DiST-4D is reported to dominate in Perceptual Discrepancy, Cross-View Consistency, and all Reconstruction metrics, with values such as 58.08 for perceptual discrepancy, 389.78 for cross-view consistency, 0.066 photometric error, 0.080 geometric discrepancy, and 43.09% novel-view quality (Kong et al., 11 May 2026). DriveDreamer-2 is strongest in some semantic and geometric metrics, such as 85.91% semantic consistency and 0.073 geometric discrepancy (Kong et al., 11 May 2026). These concrete numbers matter because they show that benchmark leadership depends sharply on which aspect is measured.

Both WorldLens systems also have explicit limitations. In HY-World 2.0, WorldLens is not isolated in a dedicated technical section, and the paper does not provide standalone FPS or latency numbers (HY-World et al., 15 Apr 2026). In the driving benchmark, the authors note that the framework currently focuses on driving scenarios, so extending it to indoor, aerial, or humanoid settings would require new task-specific metrics and cues (Liang et al., 11 Dec 2025). WorldLens-26K may also inherit annotator bias, and WorldLens-Agent inherits the limitations of both its base model and its supervision (Liang et al., 11 Dec 2025).

The surrounding literature clarifies why the name “WorldLens” is technically resonant. In “Lensing Machines: Representing Perspective in Latent Variable Models”, a lens is a mapping between machine-learned latent-variable distributions and human semantic descriptions, enabling perspective-aware models through a mixed-initiative loop (Dinakar et al., 2022). In “See What I Mean? Mobile Eye-Perspective Rendering for Optical See-through Head-mounted Displays”, the central problem is that camera-view understanding must be re-rendered from the user’s eye perspective, and the paper compares Plane-Proxy EPR, Mesh-Proxy EPR, and Gaze-Proxy EPR as software-based solutions (Emsenhuber et al., 15 Sep 2025). In LenslessPiCam, imaging is performed without a lens and reconstructed computationally through a PSF-based forward model and inverse optimization (Bezzam et al., 2022). In “Removing fluid lensing effects from spatial images”, machine learning is used as a proof of concept to remove fluid lensing distortions from shallow-water imagery (Sabella, 2022). These works do not define WorldLens directly, but they illustrate a broader technical pattern in which a “lens” mediates between raw representation and usable perception.

Taken together, the literature supports a compact synthesis. WorldLens names systems that make worlds usable—either by rendering them interactively, or by evaluating whether they deserve to be treated as coherent worlds at all. In HY-World 2.0, usability means navigable 3DGS scenes with lighting control, collision handling, and character support (HY-World et al., 15 Apr 2026). In the driving-world-model benchmark, usability means world fidelity measured across visual realism, 4D geometry, planner compatibility, downstream perception performance, and human judgment (Kong et al., 11 May 2026, Liang et al., 11 Dec 2025). The term therefore occupies a meaningful position at the intersection of rendering, simulation, evaluation, and perspective-aware computation.