
Semantic RayIoU in 3D Occupancy Evaluation

Updated 26 November 2025
  • Semantic RayIoU is a visibility-aware metric that evaluates 3D semantic occupancy by computing IoU along each camera ray.
  • It overcomes the drawbacks of dense voxel mIoU by focusing on the visible surface, providing depth-consistent penalization and perceptual alignment.
  • As the primary metric in Occ3D benchmarks, it aggregates per-class performance, typically averaged over multiple distance thresholds.

Semantic RayIoU is a visibility-aware intersection-over-union metric for evaluating 3D semantic occupancy predictions, primarily in camera-centric or multi-view environments such as those encountered in autonomous driving. Unlike traditional volumetric IoU computed over dense voxel grids, Semantic RayIoU assesses the overlap between predicted and ground-truth occupancy along each camera ray, aggregating results per semantic class and optionally at varying distance thresholds. This metric has emerged as the principal evaluation criterion in the Occ3D benchmark and its derivatives, due to its ability to align metric incentives with the perceptual quality and relevance of surface predictions as observed from image sensors, rather than rewarding spurious volumetric completion behind visible surfaces.

1. Formal Definition of Semantic RayIoU

Let $\mathcal{R}$ denote the set of all camera rays, and $\mathcal{C} = \{1, \dots, N\}$ the set of semantic classes. Each ray $r \in \mathcal{R}$ passes through an ordered sequence of voxels $V_r$. For each voxel $v \in V_r$, the ground-truth and predicted labels are $y^{gt}(v)$ and $\hat{y}(v)$ respectively, both in $\{0\} \cup \mathcal{C}$ (with $0$ denoting free or empty space).

For semantic RayIoU, one computes, for each class $c$:

$$\mathrm{IoU}_r(c) = \frac{\sum_{v \in V_r} \left[\, y^{gt}(v) = c \,\wedge\, \hat{y}(v) = c \,\right]}{\sum_{v \in V_r} \left[\, y^{gt}(v) = c \,\vee\, \hat{y}(v) = c \,\right]}$$

where $[\cdot]$ is the Iverson bracket. The per-class RayIoU is averaged over all rays:

$$\mathrm{RayIoU}(c) = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathrm{IoU}_r(c)$$

The Semantic RayIoU is the mean of the per-class RayIoUs:

$$\mathrm{RayIoU}_{\mathrm{sem}} = \frac{1}{N} \sum_{c=1}^{N} \mathrm{RayIoU}(c)$$

Variants include occupancy RayIoU (collapsing all occupied classes into one) and dynamic-class RayIoU (averaging over a subset of dynamic classes) (Lilja et al., 21 Nov 2025, Chen et al., 12 Nov 2024, Wang et al., 14 Sep 2024).
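As a concrete check of the definition, the following minimal sketch (NumPy; the six-voxel ray and class labels are hypothetical) computes $\mathrm{IoU}_r(c)$ for a single ray:

import numpy as np

# Hypothetical single ray with 6 ordered voxels; 0 = free, 1 = car, 2 = road.
y_gt   = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

c = 1  # evaluate class "car"
inter = np.sum((y_gt == c) & (y_pred == c))  # 2 voxels agree on "car"
union = np.sum((y_gt == c) | (y_pred == c))  # 3 voxels carry "car" in GT or prediction
print(inter / union)                         # IoU_r(car) = 2/3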

2. Motivation and Advantages Over Voxel-Level mIoU

Voxel-level mIoU evaluates the overlap between predicted and ground-truth occupancy over the dense 3D grid, assigning equal weight to every voxel regardless of its relevance to visible geometry. This approach suffers from several pathologies:

  • Small misalignments in depth can flip the status of numerous voxels along a ray, producing disproportionate penalties.
  • Filling in regions behind the first visible surface can artificially inflate IoU.
  • Background dominance: nearby ground-plane voxels overwhelm the metric at the expense of fine structure.

Semantic RayIoU resolves these pathologies by emulating the image-formation process of vision sensors: each camera pixel defines a ray, and evaluation occurs strictly along the visible path of that ray. This approach provides the following benefits, illustrated by the sketch after the list:

  • Depth-consistent penalization: a depth shift is tolerated as long as the first surface is aligned.
  • Robustness against background or unseen regions: rays traversing only empty space are typically ignored.
  • Perceptual alignment: metric faithfully reflects accuracy of the visible scene (Liu et al., 2023, Yu et al., 15 Jun 2024, Lilja et al., 21 Nov 2025).
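To make the fill-in pathology concrete, the following minimal sketch contrasts dense voxel IoU with a ray evaluation truncated at the first occupied ground-truth voxel. The single-ray data and the truncation rule are illustrative assumptions, not any benchmark's reference implementation:

import numpy as np

# One ray of 10 voxels; 0 = free, 1 = occupied. Densified GT marks the
# occluded interior behind the first surface (index 4) as occupied.
gt     = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
pred_a = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])  # thin visible surface only
pred_b = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # blindly fills behind it

def dense_iou(gt, pred):
    return np.sum((gt == 1) & (pred == 1)) / np.sum((gt == 1) | (pred == 1))

def ray_iou(gt, pred):
    stop = np.argmax(gt == 1) + 1  # truncate just past first visible GT surface
    g, p = gt[:stop], pred[:stop]
    return np.sum((g == 1) & (p == 1)) / np.sum((g == 1) | (p == 1))

print(dense_iou(gt, pred_a), dense_iou(gt, pred_b))  # 0.167 vs 1.0
print(ray_iou(gt, pred_a), ray_iou(gt, pred_b))      # 1.0   vs 1.0

The dense score rewards the hallucinated interior, while the visibility-aware score is indifferent to voxels that no ray can observe.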

3. Implementation Details and Calculation Procedure

The standard evaluation pipeline proceeds as follows:

  • For each camera view and each pixel, cast a ray through the 3D grid to obtain an ordered voxel sequence VrV_r.
  • For every class cc, compute along each ray the intersection and union counts, then compute per-ray IoU.
  • Average per-ray IoUs first over rays, then over semantic classes to get the final score.
  • Hardware acceleration via batch vectorization on GPU is routine; empty rays (where both GT and prediction are empty for a class) may be omitted from the denominator (Lilja et al., 21 Nov 2025, Chen et al., 12 Nov 2024).

A representative implementation sketch of this procedure (NumPy; ray casting is assumed to have already produced the per-ray voxel sequences):

import numpy as np

def semantic_ray_iou(y_gt, y_pred, rays, num_classes):
    # y_gt, y_pred: flat per-voxel label arrays (0 = free); rays: iterable of
    # ordered voxel-index arrays, one per cast camera ray (casting assumed).
    per_class = [[] for _ in range(num_classes)]
    for V_r in rays:
        gt, pred = y_gt[V_r], y_pred[V_r]
        for c in range(1, num_classes + 1):
            inter = np.sum((gt == c) & (pred == c))
            union = np.sum((gt == c) | (pred == c))
            if union > 0:                      # skip rays empty for class c
                per_class[c - 1].append(inter / union)
    ray_iou = [float(np.mean(v)) if v else 0.0 for v in per_class]
    return float(np.mean(ray_iou))             # mean over classes
Further refinements can restrict calculation to voxels within a distance threshold, yielding RayIoU at 1m, 2m, or 4m (averaged for the Occ3D-nuScenes benchmark) (Wang et al., 14 Sep 2024, Yu et al., 15 Jun 2024).
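A hedged sketch of this distance-thresholded restriction, assuming per-voxel depths along the ray are available (the rule follows the description above, not any benchmark's reference code):

import numpy as np

def ray_iou_within(gt, pred, depths, c, max_dist):
    # Restrict the per-ray IoU for class c to voxels whose distance from
    # the sensor origin (depths, in meters) is at most max_dist.
    near = depths <= max_dist
    inter = np.sum((gt[near] == c) & (pred[near] == c))
    union = np.sum((gt[near] == c) | (pred[near] == c))
    return inter / union if union > 0 else None  # None: ray skipped

# Occ3D-nuScenes-style aggregation averages the scores at 1 m, 2 m, and 4 m.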

4. Metric Variants and Their Role in Current Benchmarks

The Occ3D-nuScenes and related datasets report several variants:

  • RayIoU@1m, @2m, @4m: restrict evaluation to voxels within 1, 2, or 4 meters of the sensor origin; the three scores are averaged on Occ3D-nuScenes.
  • Semantic RayIoU: mean over all classes, typically excluding "free" or "empty" class.
  • Dynamic RayIoU: mean over dynamic object classes only.
  • Panoptic RayIoU and Ray-level Panoptic Quality (RayPQ): for combined semantic-instance metrics (Yu et al., 15 Jun 2024).
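The class-collapsing variants reduce to a relabeling step before the same per-ray computation. A minimal sketch, with hypothetical class ids:

import numpy as np

y_gt = np.array([0, 1, 3, 3, 7, 0])   # toy per-voxel semantic labels
DYNAMIC = {1, 3}                      # hypothetical dynamic-class ids

occ = (y_gt > 0).astype(np.int64)     # occupancy RayIoU: one occupied class
dyn = np.where(np.isin(y_gt, list(DYNAMIC)), y_gt, 0)  # dynamic RayIoU input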

RayIoU is now the principal semantic occupancy metric in leading works and competitions (Chen et al., 12 Nov 2024, Yu et al., 15 Jun 2024, Wang et al., 14 Sep 2024, Liu et al., 2023, Chen et al., 3 Jul 2025, Lilja et al., 21 Nov 2025).

5. Empirical Properties and Comparative Performance

Empirical comparisons from recent studies are summarized below (Occ3D-nuScenes, 256×704 input):

Method              RayIoU (%)  mIoU (%)  FPS
SparseOcc (2024)    36.1        30.9      15.0
Panoptic-FlashOcc   38.5        31.6      24.2
OPUS                41.2        –         8.2
ALOcc-3D            43.7        38.0      6.0

ALOcc-3D achieved a 2.5-point absolute gain in RayIoU over the prior best, with consistent gains at all distance thresholds (Chen et al., 12 Nov 2024).

6. Limitations and Open Challenges

Despite its strengths, Semantic RayIoU presents several limitations:

  • It assesses correctness only along each camera ray, primarily at the visible surface; semantics and occupancy behind the first visible surface are not measured.
  • It does not fully capture volumetric completeness, especially in the presence of self-occlusion.
  • Sparse or subvoxel-accurate field predictors can be more precise than the voxelized ground truth, leading to metric saturation effects (Lilja et al., 21 Nov 2025).
  • As currently defined, per-class RayIoUs are averaged; this may underweight rare classes (Lilja et al., 21 Nov 2025, Wang et al., 14 Sep 2024).

Active research seeks to extend the metric to panoptic settings (RayPQ; Yu et al., 15 Jun 2024) or to multi-hit ray evaluations that capture more of the underlying 3D structure.

7. Distinction from "Semantic Rays" in Localization

The notion of "semantic ray" as used in floorplan localization (Grader et al., 12 Jul 2025)—namely, discrete, equiangular probes from an unknown camera center with categorical labels (e.g., wall, door)—is structurally similar to the class-labeled voxel traversals underlying Semantic RayIoU. However, that work does not define or employ a RayIoU-style metric. Instead, evaluation is strictly recall-based at predefined spatial and angular tolerances, reflecting the method's focus on pose estimation rather than volumetric reconstruction. No intersection-over-union measure over sets of semantic rays, nor any corresponding weighting, is defined or used in that class of problems.


In summary, Semantic RayIoU delivers a rigorous, camera-aligned, class-sensitive evaluation for 3D semantic occupancy models, advancing empirical fidelity in vision-based scene understanding beyond traditional volumetric overlap measures. It is now firmly established as the evaluation standard for contemporary research in large-scale, multi-view semantic occupancy prediction.
