
IRIS Benchmark: Cross-Domain Evaluations

Updated 30 March 2026
  • IRIS Benchmark is a diverse set of evaluation protocols with public datasets spanning sim-to-real object detection, inverse dynamics, iris localization, and presentation attack detection.
  • It employs standardized metrics such as mAP, IoU, and ODE residuals to assess algorithm performance across industrial, biometric, and digital pathology domains.
  • The benchmarks guide future research by emphasizing open-set evaluation, cross-sensor generalization, and practical deployment strategies.

IRIS Benchmark

The term "IRIS Benchmark" encompasses a diverse set of testbeds and evaluation protocols across computer vision, biometrics, digital pathology, physical system identification, and privacy-preserving eye-tracking. This article details the most prominent IRIS benchmarks described in peer-reviewed literature, focusing on those with formal evaluation protocols and openly available datasets, emphasizing rigorous technical details and referencing the underlying primary sources.

1. IRIS in Industrial Object Perception: The Industrial Real-Sim Imagery Set

The Industrial Real-Sim Imagery Set (IRIS) is a domain-anchored benchmark for sim-to-real transfer in industrial object detection and pose estimation, introduced alongside the SynthRender framework. Its core motivation is to provide a standardized platform for assessing object-detection pipelines trained on synthetic data and evaluated on challenging RGB-D imagery acquired under semi-uncontrolled, realistic factory conditions (Araya-Martinez et al., 24 Feb 2026).

Dataset Composition:

  • 32 distinct classes: O-rings, washers, steel balls, split pins, fasteners, and pneumatic fittings.
  • Real test set: 508 Zivid 2 Plus MR60 RGB-D images, ∼20,000 bounding boxes, spanning controlled lighting, direct sunlight, diverse backgrounds, and robot-mounted scenes.
  • Synthetic training set: 8,000 high-fidelity renders with physics-based placement, randomized physically-based rendering (PBR) materials, and multi-domain lighting/camera parameters.

Annotation and Data Collection:

  • Real images: Object placement in various configurations and lighting regimes, manual annotation in COCO/YOLO format.
  • Synthetic images: CAD models or lower-overhead 3D Gaussian Splatting (3DGS), MeshyAI, or TRELLIS 2D→3D asset pipelines.
  • Each render includes pixel-aligned RGB, depth, normals, segmentation, 6D poses, and metadata.

Benchmark Task:

Multi-class object detection under sim-to-real transfer. Training is performed solely on synthetic images; evaluation is on the held-out real dataset.

Metrics:

  • mAP@50 (VOC-style): $\mathrm{mAP}@50 = \frac{1}{C}\sum_{c=1}^{C}\mathrm{AP}_c\big|_{\mathrm{IoU}\ge 0.5}$
  • mAP@50:95: $\mathrm{mAP}@[.50:.05:.95] = \frac{1}{10}\sum_{k=0}^{9}\mathrm{AP}\big|_{\mathrm{IoU}=0.50+0.05k}$ (both are sketched in code after this list)
  • Class-wise precision and recall.
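
For concreteness, here is a minimal NumPy sketch of both mAP variants, using greedy matching and all-point interpolation. This is an illustrative reference implementation, not the benchmark's official scorer, and the data-structure choices are assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-12)

def average_precision(dets, gts, iou_thr):
    """AP for one class. dets: [(score, box)], gts: [box]; greedy matching."""
    dets = sorted(dets, key=lambda d: -d[0])          # descending confidence
    matched = [False] * len(gts)
    tp, fp = np.zeros(len(dets)), np.zeros(len(dets))
    for i, (_, box) in enumerate(dets):
        ious = [iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thr and not matched[j]:
            tp[i], matched[j] = 1.0, True
        else:
            fp[i] = 1.0
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.arange(len(dets)) + 1, 1)
    env = np.maximum.accumulate(prec[::-1])[::-1]     # precision envelope
    ap, prev_r = 0.0, 0.0
    for r, p in zip(rec, env):                        # all-point interpolation
        ap, prev_r = ap + (r - prev_r) * p, r
    return ap

def map_at(dets_by_class, gts_by_class, iou_thr=0.5):
    """mAP at one IoU threshold: mean AP over the C classes (mAP@50 here)."""
    return float(np.mean([average_precision(dets_by_class[c], gts_by_class[c], iou_thr)
                          for c in dets_by_class]))

def map_50_95(dets_by_class, gts_by_class):
    """mAP@[.50:.05:.95]: mean of mAP over the 10 IoU thresholds."""
    return float(np.mean([map_at(dets_by_class, gts_by_class, 0.50 + 0.05 * k)
                          for k in range(10)]))
```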

Reported Results:

Trained solely on synthetic data, YOLOv11-m achieves 95.3% mAP@50; few-shot adaptation with 5 real images raises this to 98.5%. Ablations show that domain-randomized PBR textures and physics-based object layout dominate the performance gains. A protocol sketch follows.
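
The train-on-synthetic / evaluate-on-real loop can be reproduced with the Ultralytics YOLO API, which ships YOLOv11 weights. The dataset config names below are hypothetical placeholders, and the hyperparameters are not the authors' published recipe:

```python
from ultralytics import YOLO

# Hypothetical dataset configs: iris_synthetic.yaml points at the 8,000
# synthetic renders, iris_real.yaml at the 508-image real Zivid test set.
model = YOLO("yolo11m.pt")                                      # pretrained YOLOv11-m
model.train(data="iris_synthetic.yaml", epochs=100, imgsz=640)  # synthetic only
metrics = model.val(data="iris_real.yaml")                      # sim-to-real evaluation
print(metrics.box.map50, metrics.box.map)                       # mAP@50, mAP@50:95
```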

Benchmark Advances and Guidelines:

  • Emphasize PBR material randomization to prevent overfitting.
  • Use physics simulations to arrange objects and generate varied hard negatives.
  • Exponential sampling of light intensity and color increases robustness (see the sampling sketch after this list).
  • Most performance benefits accrue by 4,000–6,000 synthetic images.
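
The exponential-sampling guideline is commonly implemented as log-uniform (exponential-scale) randomization, so samples spread evenly across orders of magnitude. A sketch with assumed bounds; the intensity range and color-tint range are illustrative, not the benchmark's published values:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Log-uniform sampling: uniform in log-space, exponential spread in
# linear space. Bounds below are illustrative assumptions.
lo, hi = 50.0, 5000.0
light_intensity = float(np.exp(rng.uniform(np.log(lo), np.log(hi))))
light_color = rng.uniform(0.8, 1.2, size=3)   # per-channel RGB multiplier
```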

Data Accessibility:

Complete IRIS benchmark datasets, assets, synthetic images, and annotation tools are publicly released via HuggingFace and GitHub (Araya-Martinez et al., 24 Feb 2026).

2. IRIS for Inverse Dynamics and Equation Discovery from Video

The IRIS benchmark for system identification comprises 220 high-resolution (4K/60fps) monocular videos of real-world dynamical systems, designed to standardize unsupervised physical parameter recovery and governing-equation selection from raw pixel data (Khanbayov et al., 17 Mar 2026).

Dataset Design:

  • 8 categories of physical phenomena: 5 single-body (e.g., free fall, sliding, pendulum, torsional oscillator) and 3 multi-body (e.g., colliding pendulums, cone arrays).
  • Each experimental configuration contains 10 trials, with precise ground-truth measurements (geometry, friction, damping), uncertainty estimates, and segmentation masks.

Open ODE Bank:

Eight candidate classes of governing ODEs (second-order, coupled oscillators, nonlinear) are provided; each video is labeled with its ground-truth physics equation family.

Evaluation Protocol:

Five axes:

  1. Parameter estimation accuracy (MAE, variance): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert\hat{\theta}_i - \theta^{*}_i\rvert$
  2. Identifiability (parameter gradient norm, ODE residual error): $G_{\gamma}^{(e)} = \lVert\nabla_{\gamma} L^{(e)}\rVert_2$ (axes 1–2 are sketched in code after this list)
  3. Extrapolation error (distance between predicted dynamics and encoder output on held-out frames).
  4. Robustness to input degradation.
  5. Governing-equation selection accuracy.
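
A sketch of how axes 1 and 2 might be computed. The parameter names and the finite-difference treatment of the fit loss are illustrative assumptions, not the released evaluation scripts:

```python
import numpy as np

def mae(theta_hat, theta_star):
    """Axis 1: mean absolute error over the n recovered parameters."""
    theta_hat, theta_star = np.asarray(theta_hat), np.asarray(theta_star)
    return float(np.mean(np.abs(theta_hat - theta_star)))

def gradient_norm(loss_fn, gamma, eps=1e-5):
    """Axis 2: identifiability score ||grad_gamma L||_2, estimated by
    central finite differences on a scalar fit loss L(gamma)."""
    gamma = np.asarray(gamma, dtype=float)
    grad = np.zeros_like(gamma)
    for i in range(gamma.size):
        step = np.zeros_like(gamma)
        step[i] = eps
        grad[i] = (loss_fn(gamma + step) - loss_fn(gamma - step)) / (2 * eps)
    return float(np.linalg.norm(grad))
```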

Baselines:

  • Latent-space encoding with an Euler–Cromer physics integrator (integration step sketched after this list).
  • Multi-step rollout loss for longer-horizon supervision (stable only on some single-body systems).
  • Equation-class selection via CNN classifiers, multi-step prompting of VLMs, path-based oracles.
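
A minimal Euler–Cromer (semi-implicit Euler) rollout of the kind the latent-space baseline couples to its encoder; the damped-pendulum ODE and parameter values here are illustrative, not a specific IRIS configuration:

```python
import numpy as np

def euler_cromer_rollout(theta0, omega0, g_over_l, damping, dt, steps):
    """Euler-Cromer: update velocity first, then position with the NEW
    velocity -- the ordering that keeps oscillators numerically stable."""
    theta, omega = theta0, omega0
    traj = np.empty(steps)
    for t in range(steps):
        omega += dt * (-g_over_l * np.sin(theta) - damping * omega)
        theta += dt * omega
        traj[t] = theta
    return traj

# Illustrative damped pendulum sampled at the benchmark's 60 fps frame rate.
traj = euler_cromer_rollout(theta0=0.3, omega0=0.0, g_over_l=9.81 / 0.5,
                            damping=0.05, dt=1 / 60, steps=600)
```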

Performance Insights:

  • CNN-based identification achieves near-oracle accuracy in-distribution (99.3% on IRIS), while prompt-based methods reach 65–73%.
  • Parameter estimation is critically sensitive to integration scheme correctness; multi-step losses can destabilize multi-body regime recovery.

Toolkit and Access:

All videos, ground-truth data, ODE bank, code (including robust latent-space and equation-selection baselines), and standardized evaluation scripts are publicly released (Khanbayov et al., 17 Mar 2026).

3. Iris Location Benchmarks in Biometrics

The "IRIS Benchmark" of (Severo et al., 2018) defines a protocol, six annotated datasets, and standard metrics for robust and reproducible evaluation of iris-region detection algorithms in biometric imaging.

Problem Statement:

Find the smallest axis-aligned square bounding box enclosing the iris circle in a given image.
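
Given a localized iris circle, the target box is a one-line geometric conversion; a sketch assuming the circle is parameterized as center and radius (cx, cy, r):

```python
def iris_circle_to_square(cx, cy, r, width, height):
    """Smallest axis-aligned square enclosing the iris circle, clipped to
    image bounds (clipping near borders can make the box non-square)."""
    x1, y1 = max(0.0, cx - r), max(0.0, cy - r)
    x2, y2 = min(float(width), cx + r), min(float(height), cy + r)
    return x1, y1, x2, y2   # side length 2r away from the image border
```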

Datasets:

  • Six NIR or visible-light datasets (IIIT-CLI, NDCLD-15, NDCCL, MobBIOfake, CASIA-IrisV3, BERC-MI), with consistent annotation protocol.
  • Boxes are drawn by expert annotators; normalization enables cross-sensor comparisons.

Evaluation Metrics:

  • Intersection over Union (IoU) between the predicted and annotated boxes, the primary accuracy measure (reported as a percentage below).
  • Per-image inference time, to compare runtime cost across methods.

Algorithms Compared:

  • Daugman’s integro-differential operator (parametric, circle fitting)
  • HOG + SVM sliding-window detector
  • Fast YOLO-based deep learning detector (fine-tuned on iris data)

Results:

YOLO models achieve superior IoU (typically 91–99% intra-sensor) and real-time GPU inference (0.02 s/image); classical methods trail in both accuracy and speed but retain value for training-limited and resource-constrained settings.

Key Recommendations:

Optimal deployment uses YOLO-style detectors on GPU with large, varied datasets; HOG+SVM is viable for low-power scenarios; Daugman’s method is robust but slow. Proposed extensions include visible-light/cross-spectral expansions and cross-validation of annotation variance (Severo et al., 2018).

4. Benchmarks for Iris Presentation Attack Detection (PAD/Liveness)

A major application of IRIS benchmarks is in iris liveness detection and PAD.

4.1 LivDet-Iris Series

LivDet-Iris is the primary open benchmark and competition series for PAD, providing blind evaluation of software-only, algorithmic entries on never-seen NIR iris images with diverse Presentation Attack Instruments (PAIs) (Tinsley et al., 2023, Das et al., 2020).

Dataset Structure:

  • Seven PAI types in 2023: printouts, textured contact lenses, electronic displays, prosthetics/doll eyes, plus three GAN-synthesized categories (StyleGAN2/3, low-to-high quality).
  • Test sets: e.g., LivDet-Iris 2023—13,332 ISO-compliant images, 6,500 bona fide, 6,832 PAIs.
  • No training set is released; methods validated in fully open-set protocols; instructional samples provided.

Metrics:

  • ISO/IEC 30107-3 definitions:
    • Attack Presentation Classification Error Rate (APCER)
    • Bona-fide Presentation Classification Error Rate (BPCER)
    • Average Classification Error Rate (ACER)
    • Weighted vs. equal-weight ACER for per-PAI performance analysis and fairness.
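
A minimal sketch of these error rates computed from binary decisions. Note that the official protocol computes APCER per PAI type and then applies the weighted or equal-weight averaging above, which this sketch omits:

```python
import numpy as np

def pad_error_rates(is_attack_true, is_attack_pred):
    """APCER, BPCER, ACER from boolean labels/decisions (True = attack)."""
    y = np.asarray(is_attack_true, dtype=bool)
    p = np.asarray(is_attack_pred, dtype=bool)
    apcer = float(np.mean(~p[y]))    # attacks accepted as bona fide
    bpcer = float(np.mean(p[~y]))    # bona fide rejected as attacks
    acer = (apcer + bpcer) / 2       # average classification error rate
    return apcer, bpcer, acer
```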

Algorithmic Evaluation:

  • Attention-based pixel-wise binary supervision (Fraunhofer IGD), two-stream fusion CNNs (BUCEA), Swin Transformer+linear classifier (HDA-IDVC).
  • Physical PAIs are suppressed to <2% APCER by top teams; synthetic (GAN) attacks still yield 30–52% error even for the best methods.
  • Human benchmark: reviewers are nearly as challenged as algorithms by high-fidelity synthetics (MQ/HQ), with ACER ≈ 40%.

Insights:

  • Broad training set diversity is decisive for generalization: e.g., MSU PAD Algorithm 2 (APCER 2.76%) outperformed competing entries relying on narrower data (USACH/TOC, 59.1%).
  • Open-set, cross-sensor evaluation is key; generalization to unseen attack types remains unsolved.
  • GAN-generated and digital-injection attacks are recognized as the central open challenge for future PAD benchmarks.
  • Continuous benchmarking is enabled via the BEAT platform, supporting reproducibility and ongoing comparisons (Tinsley et al., 2023, Das et al., 2020).

4.2 Open-Source PAD Baselines

Open-source PAD solutions (BSIF ensembles feeding SVM, random-forest, and MLP classifiers), such as that of McGrath et al. (2018), serve as reference methods, trained and evaluated on NDCLD'15 and LivDet-Iris splits. Same-dataset accuracy exceeds 99%; cross-dataset generalization (subject- and sensor-disjoint) yields 84–87% CCR, matching previous LivDet-Iris winners (McGrath et al., 2018). A sketch of the BSIF+SVM pipeline follows.
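
This sketch assumes the BSIF filter bank is already available as an array of learned kernels (released implementations ship precomputed filters); the feature extraction and classifier choices below are illustrative, not the published code:

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.svm import SVC

def bsif_histogram(image, filters):
    """BSIF descriptor: convolve with n learned filters, binarize each
    response at zero, pack the n bits into a per-pixel code, histogram it.
    image: 2D float array; filters: (n, k, k) precomputed kernels."""
    code = np.zeros(image.shape, dtype=np.int64)
    for bit, kernel in enumerate(filters):
        resp = convolve2d(image, kernel, mode="same", boundary="symm")
        code |= (resp > 0).astype(np.int64) << bit
    hist = np.bincount(code.ravel(), minlength=2 ** len(filters))
    return hist / hist.sum()

# Illustrative training loop; train_images, labels (1 = attack), and
# `filters` are assumed to exist.
# X = np.stack([bsif_histogram(img, filters) for img in train_images])
# clf = SVC(kernel="rbf").fit(X, labels)
```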

4.3 Fusion of 2D/3D Cues

The IRIS benchmark of Fang et al. (2020) is based on the NDCLD’15 and NDIris3D datasets; it evaluates PAD methods that fuse BSIF-based 2D textural features with photometric-stereo-based 3D convexity estimates. Open-set protocols (unseen lens types, cross-sensor) are emphasized. OSPAD-fusion achieves 94.8% accuracy (APCER = 6.4%, BPCER = 4.1%) on LG4000, outperforming standalone 2D and 3D methods (Fang et al., 2020).
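
For orientation only, a score-level fusion of the two cues is sketched below; the paper's OSPAD-fusion applies its own cascaded decision rule, which this sketch does not reproduce:

```python
def fuse_pad_scores(score_2d, score_3d, weight_2d=0.5, threshold=0.5):
    """Weighted score-level fusion of a BSIF-based 2D texture score and a
    photometric-stereo 3D convexity score (higher = more attack-like).
    Weights and threshold are illustrative, not the paper's values."""
    fused = weight_2d * score_2d + (1.0 - weight_2d) * score_3d
    return fused >= threshold   # True => flag as presentation attack
```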

5. IRIS Benchmarks in Digital Pathology and Privacy

Distinct from biometric and PAD contexts, IRIS may refer to benchmarking in pathology rendering or privacy-preserving eye-tracking.

  • Digital Pathology Rendering (Iris Core): Evaluates tile buffering, frame rate, and high-fidelity rendering latencies across hardware; median 10 ms buffer for a new field-of-view, 1.39 GB/s throughput, >10× lower latency than previous systems (Landvater et al., 21 Apr 2025).
  • Privacy-Preserving Iris Obfuscation: Benchmarks blurring, noising, downsampling, rubber-sheet, and style-transfer obfuscation for utility-privacy trade-offs, reporting the drop in iris-recognition accuracy, segmentation/gaze error, and impostor attack risk. No universal winner emerges; optimality is application-dependent (Wang et al., 14 Apr 2025). Three of the transform families are sketched below.
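
Three of the benchmarked obfuscation families are easy to state concretely; the parameters below are illustrative, not the benchmark's exact settings:

```python
import cv2
import numpy as np

def blur(img, ksize=15):
    """Gaussian blur; ksize must be odd."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def noise(img, sigma=20.0):
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def downsample(img, factor=4):
    """Down- then up-sample to destroy fine iris texture at fixed resolution."""
    h, w = img.shape[:2]
    small = cv2.resize(img, (w // factor, h // factor),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
```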

6. Significance and Future Directions

The IRIS family of benchmarks, distinct in application domain but unified by public datasets, reproducible metrics, and rigorous protocols, forms a critical substrate for reproducible progress across industrial vision, dynamical system identification, biometrics, and privacy. Common themes include:

  • Emphasis on open evaluation and test-set sequestration to ensure fair comparison.
  • Multi-dimensional performance metrics, often reflecting both utility and robustness.
  • Persistent challenges: sim-to-real gap, cross-sensor generalization, open-set attacks, and practical deployment considerations.

Across these applications, IRIS benchmarks provide both competitive baselines and actionable guidelines for dataset construction, algorithmic validation, and future research. All referenced datasets, source code, and protocols are publicly available via the cited repositories and project pages (Araya-Martinez et al., 24 Feb 2026, Khanbayov et al., 17 Mar 2026, Das et al., 2020, Tinsley et al., 2023).

