IrisFormer: Transformer for VIS Iris Recognition

Updated 6 November 2025
  • The paper introduces IrisFormer, a transformer-based model that partitions normalized iris images into patches and employs multi-head self-attention with relative positional encoding for robust feature extraction.
  • It leverages VIS-specific preprocessing such as red-channel extraction and gamma correction to enhance texture details, addressing variability in iris pigmentation and illumination.
  • Empirical results on UBIRIS.v2 and CUVIRIS show significantly reduced Equal Error Rates compared to CNN and classical methods, demonstrating feasibility for real-time smartphone deployment.

IrisFormer designates a transformer-based architecture for iris recognition in the visible spectrum (VIS), targeting practical deployment on commodity smartphones and seeking to overcome the distinct challenges introduced by VIS imaging, such as variable pigmentation, illumination, and typical off-angle acquisition scenarios. IrisFormer, as detailed in (Venkataswamy et al., 7 Oct 2025), is adapted for VIS iris matching, departing from classical near-infrared (NIR)-centric models and CNN-based pipelines, and demonstrates state-of-the-art performance—especially when image acquisition is standardized and robust segmentation is utilized.

1. Architectural Foundations

IrisFormer is constructed as a patch-level transformer encoder, leveraging global attention for contextual iris feature modeling. The system receives a normalized iris image (typically 512×64 in polar coordinates following Daugman-style segmentation and rubber-sheet normalization). The pipeline proceeds as follows:

  • Patch Partitioning: The normalized strip is divided into non-overlapping 16×16 patches.
  • Linear Embedding: Each patch is projected into a 384-dimensional vector.
  • Transformer Encoder: Patch embeddings are input to a 12-layer transformer, utilizing multi-head self-attention for context modeling. Relative positional encoding (RoPE) is integrated to provide invariance to rotational offsets common in iris acquisition.
  • Feature Matching: Similarity between iris samples is computed via patch-wise cosine similarity, preserving local discriminative texture.
  • Loss Function: Training employs a margin-based triplet loss:

$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; d(f(x_a), f(x_p)) - d(f(x_a), f(x_n)) + m\right)$$

where f(·) denotes the IrisFormer embedding, d is the patch-wise cosine distance, and m is the inter-class margin.
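A minimal PyTorch-style sketch of this pipeline (patch embedding, 12-layer encoder, patch-wise cosine matching, and the triplet loss) is given below. It is illustrative only, not the authors' implementation: the head count and margin value are assumptions, and the relative positional encoding (RoPE) is omitted, whereas the described model injects it into the self-attention layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IrisFormerSketch(nn.Module):
    """Illustrative patch-level transformer encoder (not the authors' code).

    Input: normalized iris strip of shape (B, 1, 64, 512) in polar coordinates.
    The strip is split into non-overlapping 16x16 patches (4 x 32 = 128 tokens),
    each projected to a 384-dimensional embedding and passed through a 12-layer
    transformer encoder. RoPE is omitted here for brevity; in the described
    model it is applied inside the attention layers.
    """
    def __init__(self, dim=384, depth=12, heads=6, patch=16):
        super().__init__()
        # patchify + linear projection in one strided convolution
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                    # x: (B, 1, 64, 512)
        tokens = self.embed(x).flatten(2).transpose(1, 2)    # (B, 128, 384)
        return self.encoder(tokens)                          # per-patch embeddings

def patch_cosine_distance(a, b):
    """Mean patch-wise cosine distance between two embedded samples."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return 1.0 - (a * b).sum(-1).mean(-1)                    # (B,)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss over patch-cosine distances (margin is illustrative)."""
    d_ap = patch_cosine_distance(anchor, positive)
    d_an = patch_cosine_distance(anchor, negative)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```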

The transformer is trained to learn robust, global representations within the annular iris texture, potentially capturing micro-structural cues that are attenuated in VIS imaging due to melanin absorption.

2. VIS-Specific Adaptation and Preprocessing

To address the low signal and contrast of non-NIR iris images, IrisFormer explicitly incorporates the following preprocessing modifications (a code sketch follows the list):

  • Red-Channel Extraction: Post-segmentation, only the red image band is retained. This maximizes texture detail for both darkly-pigmented and lightly-pigmented irides in VIS, exploiting the decreased absorption and higher SNR in the red spectrum for iris patterns.
  • Gamma Correction: A fixed γ = 0.7 correction is applied to enhance detail across the full dynamic range of iris colors.
  • Augmentations: During training, random horizontal pixel shifts and random patch masking are used. The former simulates gaze/capture misalignment; the latter increases robustness to occlusion, glare, and partial lens covering.
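A minimal NumPy sketch of these steps, assuming a normalized RGB strip with values in [0, 1]; the shift range and masking ratio are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def preprocess_vis(strip_rgb, gamma=0.7):
    """Red-channel extraction + gamma correction on a normalized RGB iris strip.

    strip_rgb: float array in [0, 1] of shape (64, 512, 3), polar coordinates.
    """
    red = strip_rgb[..., 0]                         # keep only the red band
    return np.power(np.clip(red, 0.0, 1.0), gamma)  # fixed gamma = 0.7

def augment(strip, max_shift=32, mask_ratio=0.1, patch=16, rng=None):
    """Training-time augmentation: random horizontal shift + random patch masking.

    The shift simulates gaze/capture misalignment; masking simulates occlusion
    and glare. max_shift and mask_ratio are assumed values, not from the paper.
    """
    rng = rng or np.random.default_rng()
    shift = int(rng.integers(-max_shift, max_shift + 1))
    strip = np.roll(strip, shift, axis=1)           # shift along the angular axis

    h, w = strip.shape
    n_patches = (h // patch) * (w // patch)
    n_masked = int(mask_ratio * n_patches)
    for idx in rng.choice(n_patches, size=n_masked, replace=False):
        r, c = divmod(int(idx), w // patch)
        strip[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return strip
```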

Unlike prior architectures trained on NIR or mixed-spectrum datasets, all IrisFormer model training is conducted solely on the challenging UBIRIS.v2 VIS dataset, thus ensuring domain adaptation without overfitting to capture devices or protocols.

3. Training Regimen and Protocols

The full training protocol is as follows (a schematic optimizer and training-loop sketch follows the list):

  • Dataset: UBIRIS.v2, encompassing more than 11,000 images of 261 subjects, exhibiting strong variability in image blur, occlusion, gaze, and illumination.
  • Optimization: AdamW optimizer with cosine learning-rate decay.
  • Supervision: Triplet loss is used, generating anchor-positive-negative triplets from within the dataset.
  • Evaluation: All-vs-all protocol, with same-iris comparisons scored as genuine and cross-subject comparisons as impostor.
  • Zero-shot Protocol: For cross-dataset evaluation (e.g., on the CUVIRIS set), the IrisFormer weights are frozen after UBIRIS.v2 training; no tuning is performed on the target set.
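A schematic training setup matching this recipe, building on the sketch classes from Section 1; torch's AdamW and cosine-annealing schedule are used, while `triplet_loader` is a hypothetical placeholder for an anchor/positive/negative sampler and all hyperparameter values below are assumptions:

```python
import torch

# AdamW with cosine learning-rate decay, as described; lr, weight decay,
# and epoch count are illustrative assumptions, not values from the paper.
model = IrisFormerSketch()                        # sketch model from Section 1
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for anchor, positive, negative in triplet_loader:   # hypothetical triplet sampler over UBIRIS.v2
        loss = triplet_loss(model(anchor), model(positive), model(negative))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                     # cosine decay per epoch
```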

The segmentation model preceding IrisFormer, termed LightIrisNet, is a MobileNetV3-based multitask network explicitly designed for lightweight, real-time edge deployment (<10M parameters), and it provides the normalized polar iris images on which the matcher operates.
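The summary above specifies only LightIrisNet's backbone family and size budget; the block below is a purely hypothetical sketch of what a MobileNetV3-based multitask segmenter could look like (a mask head plus a boundary-parameter head feeding normalization), with head design and output sizes assumed for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small

class LightIrisNetSketch(nn.Module):
    """Hypothetical MobileNetV3-based multitask segmenter (illustrative only).

    One head predicts an iris/pupil/background mask; another regresses boundary
    parameters (e.g. pupil/limbus ellipse coefficients) used for rubber-sheet
    normalization. The real LightIrisNet's heads and losses are not given here.
    """
    def __init__(self, n_mask_classes=3, n_boundary_params=10):
        super().__init__()
        self.backbone = mobilenet_v3_small(weights=None).features   # lightweight feature extractor
        c = 576                                                      # MobileNetV3-Small feature channels
        self.mask_head = nn.Conv2d(c, n_mask_classes, kernel_size=1)
        self.boundary_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, n_boundary_params))

    def forward(self, x):
        feats = self.backbone(x)                                     # (B, 576, H/32, W/32)
        mask = F.interpolate(self.mask_head(feats), size=x.shape[-2:],
                             mode='bilinear', align_corners=False)   # upsample to input size
        return mask, self.boundary_head(feats)
```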

4. Benchmarking and Empirical Results

Recognition performance is reported using Equal Error Rate (EER) and True Accept Rate (TAR) at fixed False Accept Rates (FAR). Results reflect both intra- and cross-dataset generalization, including challenging smartphone-acquired data.

Dataset     Method               EER (%)
UBIRIS.v1   IrisFormer           4.15
UBIRIS.v2   IrisFormer           5.12
MICHE-I     IrisFormer           8.78
CUVIRIS     OSIRIS (classical)   0.76
CUVIRIS     IrisFormer (VIS)     0.057
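The figures above come from genuine/impostor score distributions under the all-vs-all protocol. A minimal NumPy sketch of how EER and TAR at a fixed FAR are commonly computed from such scores (not the authors' evaluation code) is:

```python
import numpy as np

def eer_and_tar(genuine, impostor, far_target=1e-3):
    """Compute EER and TAR@FAR from similarity scores (higher = more similar)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])   # false accept rate
    frr = np.array([(genuine  <  t).mean() for t in thresholds])   # false reject rate
    i = np.argmin(np.abs(far - frr))                               # crossover point
    eer = (far[i] + frr[i]) / 2.0
    j = np.argmin(np.abs(far - far_target))                        # fixed-FAR operating point
    tar = 1.0 - frr[j]
    return eer, tar
```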

On the CUVIRIS set (ISO/IEC 29794-6–compliant, 752 images, 47 subjects, Samsung S21 Ultra), IrisFormer achieves an EER of 0.057%, outperforming both classical (OSIRIS) and contemporary CNN matchers (e.g., DeepIrisNet2 EER ≈ 8.5–8.8%). The DET curves in the referenced paper show near-zero error for light-eyed subjects by both IrisFormer and OSIRIS, but IrisFormer also markedly improves over handcrafted methods for dark-eyed individuals, suggesting increased robustness to iris pigmentation—a major barrier in VIS approaches.

A plausible implication is that transformer architectures, when provided sufficient VIS domain-specific augmentation and channel selection, close the performance gap previously observed between light-pigmented and dark-pigmented irides in non-NIR scenarios.

5. Comparison to Baseline Methods and Practical Deployment

IrisFormer is contrasted with three main categories:

  • Classical (hand-crafted): OSIRIS, based on log–Gabor filters and Daugman encoding. While highly effective when quality is controlled, performance degrades on heavily pigmented VIS images, especially with suboptimal lighting.
  • CNN-based Matchers: DeepIrisNet2 (EER 8.5–8.8% on UBIRIS/MICHE) and SCNN (5–7% EER). IrisFormer exhibits substantially lower EER in cross-dataset generalization.
  • Hybrid and LBP/WLD descriptors: These show 7–15% EER on legacy datasets, illustrating the importance of global attention and spatial coherence modeling in modern approaches.

Deployment for real-world smartphone iris recognition is supported via the lightweight segmentation model, integration with ISO-compliant quality acquisition apps, and open-source release of code, models, and a public subset of the CUVIRIS dataset. This demonstrates that, under controlled acquisition and robust segmentation, transformer matchers are feasible for on-device execution and deliver superior generalizability.

6. Reproducibility, Protocols, and Public Resources

The full pipeline (acquisition app, LightIrisNet segmentation, and IrisFormer matching) is reproducible using publicly released assets: the open-source code, trained models, and a public subset of the CUVIRIS dataset.

Standardized protocols encompass segmentation (ellipse/flexible boundaries), normalization (parameters provided in the original paper), all-vs-all matching, and systematic evaluation splits. All benchmarks are reproducible with these resources, supporting further research and comparative studies.
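As a reference point for the normalization step, a minimal sketch of Daugman-style rubber-sheet mapping with simple circular boundaries follows; the released pipeline fits ellipse/flexible boundaries and uses its own parameters, so this is illustrative only:

```python
import numpy as np

def rubber_sheet(image, pupil_xy, pupil_r, iris_r, out_h=64, out_w=512):
    """Map the annular iris region to a fixed polar rectangle (rubber-sheet model).

    Assumes concentric circular pupil/limbus boundaries for simplicity; the
    described pipeline fits ellipse/flexible boundaries instead.
    """
    cx, cy = pupil_xy
    thetas = np.linspace(0, 2 * np.pi, out_w, endpoint=False)
    radii = np.linspace(0, 1, out_h)
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i, r in enumerate(radii):
        rho = pupil_r + r * (iris_r - pupil_r)    # interpolate between the two boundaries
        xs = np.clip((cx + rho * np.cos(thetas)).astype(int), 0, image.shape[1] - 1)
        ys = np.clip((cy + rho * np.sin(thetas)).astype(int), 0, image.shape[0] - 1)
        out[i] = image[ys, xs]                    # nearest-neighbour sampling along each ray
    return out
```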

7. Significance and Implications

IrisFormer demonstrates that transformer-based sequence models, when adapted for VIS-specific characteristics and trained with appropriate augmentation, yield state-of-the-art accuracy for practical smartphone iris recognition, with EER rates that approach those of well-controlled NIR systems and surpass legacy and CNN alternatives in domain transfer. The architecture’s performance across pigmentation and device variation, coupled with public code and data, indicates immediate applicability for biometric authentication tasks where visible-light constraints preclude NIR hardware.

This suggests that future research should focus on further optimizing lightweight transformer variants for edge devices and investigating potential improvements in VIS-domain segmentation and normalization to exploit the full capacity of transformer matchers.
