NFGlassNet: Dynamic Glass Detection
- NFGlassNet is a glass surface detection framework that leverages dynamic flash/no-flash reflection cues to overcome limitations of traditional RGB methods.
- It employs a dual Swin-Transformer backbone with specialized modules, RCMM and RGAM, to extract multi-scale features and fuse appearance with reflection cues.
- Experimental results on the NFGD dataset show a significant IoU improvement (+4.21%) over baselines, highlighting its robustness in detecting transparent objects.
NFGlassNet is a glass surface detection framework that leverages reflection dynamics observed in pairs of flash and no-flash images to identify glass regions, overcoming limitations inherent to traditional RGB-based and boundary-dependent methods. By explicitly modeling the physical phenomenon in which reflections on glass surfaces appear or disappear depending on the direction of illumination and the use of flash, NFGlassNet augments feature extraction and attention mechanisms to deliver improved localization of transparent surfaces in complex visual environments (Yan et al., 21 Nov 2025).
1. Problem Context and Physical Principle
The detection of glass surfaces in natural images is inherently challenging due to their lack of intrinsic color, their transparency, and the absence of consistent boundary cues in most real-world scenes. Prior approaches typically exploit boundary cues (e.g., window or door frames) or rely on static reflection features. However, these fail in scenarios with frameless glass, partial occlusions, or ambiguous background reflections.
NFGlassNet is predicated on the observation that, when a flash is fired from the brighter side of a glass surface toward a darker region, the pre-existing scene reflections on the glass tend to vanish. Conversely, when a flash is fired from the dark side toward a brighter region, new distinct reflections are induced on the glass. This dynamic, physically rooted contrast is highly discriminative for glass detection and is explicitly exploited as a supervisory signal and feature in NFGlassNet.
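To make the cue concrete, the following toy sketch (an illustration for exposition, not NFGlassNet's learned mechanism) thresholds the per-pixel difference of an aligned flash/no-flash pair; regions where reflections appear or disappear under flash produce large differences:

```python
# Toy illustration of the flash/no-flash reflection cue (an assumption for
# exposition, not NFGlassNet's learned mechanism): pixels whose intensity
# changes strongly between aligned shots are candidates for reflection
# appearance/disappearance on glass.
import torch

def reflection_change_map(flash: torch.Tensor, no_flash: torch.Tensor,
                          thresh: float = 0.15) -> torch.Tensor:
    """flash, no_flash: aligned (3, H, W) images scaled to [0, 1]."""
    diff = (flash - no_flash).abs().mean(dim=0)  # per-pixel intensity change
    return (diff > thresh).float()               # crude binary change mask

# Example with random stand-in tensors; real input would be an aligned pair.
mask = reflection_change_map(torch.rand(3, 256, 256), torch.rand(3, 256, 256))
```

In practice the flash also brightens the entire scene, so raw differencing fires on many non-glass pixels; this is precisely why NFGlassNet learns the contrast with dedicated modules rather than relying on pixel differences.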
2. Architecture Overview
NFGlassNet processes a perfectly aligned flash/no-flash image pair using a dual-encoder backbone and modular aggregation layers as follows:
- Backbone: Two parallel Swin-Transformer V2 encoders (not weight-tied) extract multi-scale feature sets from each image.
- Reflection Contrast Mining Module (RCMM): At each scale, RCMM compares the flash/no-flash features to mine dynamic reflection cues and predicts scale-specific reflection feature maps and per-scale reflection masks.
- Reflection Guided Attention Module (RGAM): Fuses reflection features and appearance features via dual cross-attention, yielding refined glass-centric features.
- Decoder: A progressive decoder upsamples and merges outputs from all stages to produce the final glass surface segmentation mask.
This architecture explicitly models the flash-induced divergence in reflection appearance as the principal cue for glass surface localization.
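A minimal structural sketch of this pipeline is given below; the encoder, RCMM, RGAM, and decoder internals are stubbed, and all names and shapes are illustrative assumptions rather than the authors' code:

```python
import torch.nn as nn

class NFGlassNetSkeleton(nn.Module):
    """Structural sketch: two untied encoders, per-scale RCMM + RGAM, decoder."""
    def __init__(self, encoder_f, encoder_nf, rcmms, rgams, decoder):
        super().__init__()
        self.encoder_f = encoder_f    # Swin-V2 encoder for the flash image
        self.encoder_nf = encoder_nf  # separate (not weight-tied) encoder
        self.rcmms = rcmms            # one RCMM per feature scale
        self.rgams = rgams            # one RGAM per feature scale
        self.decoder = decoder        # progressive upsampling decoder

    def forward(self, img_flash, img_no_flash):
        feats_f = self.encoder_f(img_flash)      # multi-scale feature lists
        feats_nf = self.encoder_nf(img_no_flash)
        fused = []
        for f_f, f_nf, rcmm, rgam in zip(feats_f, feats_nf,
                                         self.rcmms, self.rgams):
            refl_feat, refl_mask = rcmm(f_f, f_nf)  # mine reflection dynamics
            fused.append(rgam(f_f, refl_feat))      # fuse appearance + reflection
        return self.decoder(fused)                  # final glass mask logits
```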
3. Reflection Contrast Mining Module (RCMM)
RCMM consists of two main branches, together with fusion and supervision components:
- Dilated-Conv Contrast Branch: For both the flash and no-flash features, four separate 3×3 dilated convolutions with increasing dilation rates generate features at multiple receptive fields. Pairwise sums of these features are computed for all index combinations, and their flash/no-flash differences are concatenated to form a multi-scale reflection contrast descriptor that captures the appearance or disappearance of reflections due to flash.
- UNet Prediction Branches: Two compact UNets process the flash and no-flash features independently to predict complementary reflection maps.
- Feature Fusion: The contrast descriptor and the UNet features are fused by elementwise (Hadamard) product followed by addition; the unified reflection feature is then projected to scale-specific reflection predictions.
- Pseudo-Ground-Truth Generation: Lacking real pixel-level reflection annotations, NFGlassNet synthesizes pseudo-GT by:
- Masking the input images by the ground-truth glass mask,
- Applying a state-of-the-art reflection detector (e.g., LANet) to the masked region,
- Using the output as the supervisory signal for reflection prediction.
- Reflection Loss: The reflection output at each scale is supervised by a combined SSIM and L1 loss,
  $\mathcal{L}_{\text{ref}} = \lambda\,\mathcal{L}_{\text{SSIM}} + (1-\lambda)\,\mathcal{L}_{1}$,
  where $\lambda$ is a weighting coefficient balancing the structural (SSIM) and pixelwise (L1) terms.
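A minimal sketch of the dilated-conv contrast branch follows; the dilation rates (1, 2, 4, 8), the channel width, and the 1×1 fusion projection are illustrative assumptions (the source elides the exact values), and the pairwise contrast is interpreted as flash-minus-no-flash differences of the pairwise sums:

```python
# Sketch of the RCMM contrast branch under assumed hyperparameters; the
# dilation rates and projection are placeholders, not the paper's values.
import itertools
import torch
import torch.nn as nn

class ContrastBranch(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # one 3x3 dilated conv per rate, spatial size preserved via padding=d
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        n_pairs = len(list(itertools.combinations(range(len(dilations)), 2)))
        self.proj = nn.Conv2d(n_pairs * channels, channels, 1)

    def forward(self, feat_flash, feat_no_flash):
        ff = [b(feat_flash) for b in self.branches]
        fn = [b(feat_no_flash) for b in self.branches]
        contrasts = []
        for i, j in itertools.combinations(range(len(ff)), 2):
            # pairwise sums per modality, then flash-vs-no-flash difference
            contrasts.append((ff[i] + ff[j]) - (fn[i] + fn[j]))
        return self.proj(torch.cat(contrasts, dim=1))  # contrast descriptor
```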
4. Reflection Guided Attention Module (RGAM)
RGAM fuses the reflection features and the “glass appearance” features with a dual cross-attention scheme:
- Dual Cross-Attention:
- Querying Reflection by Glass: Generates attention map $A_1$, with the query derived from the glass appearance features and the key and value derived from the reflection features.
- Querying Glass by Reflection: Constructs attention map $A_2$ with the roles swapped: query from the reflection features, key and value from the appearance features.
- Shared Attention Map: Both attention maps are shifted to be non-negative (by subtracting their global minima), multiplied element-wise, and normalized with a softmax:
  $A = \mathrm{softmax}\big((A_1 - \min A_1) \odot (A_2 - \min A_2)\big)$.
- Feature Aggregation: Both value tensors are reweighted by the shared map and summed, focusing downstream processing on regions that are simultaneously likely to belong to glass (per appearance features) and to exhibit reflection changes (per reflection features).
This module enforces that detected glass regions must satisfy both appearance and dynamic reflection criteria.
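A sketch of the shared-attention computation follows, under stated assumptions: single-head attention over flattened spatial tokens with 1×1 convolution projections (the paper's exact projection and normalization details are not reproduced here):

```python
# Sketch of RGAM's shared cross-attention; projection layout is an assumption.
import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q_g = nn.Conv2d(channels, channels, 1)  # query from appearance
        self.k_r = nn.Conv2d(channels, channels, 1)  # key from reflection
        self.q_r = nn.Conv2d(channels, channels, 1)  # query from reflection
        self.k_g = nn.Conv2d(channels, channels, 1)  # key from appearance
        self.v_g = nn.Conv2d(channels, channels, 1)  # value from appearance
        self.v_r = nn.Conv2d(channels, channels, 1)  # value from reflection

    def forward(self, glass_feat, refl_feat):
        b, c, h, w = glass_feat.shape
        flat = lambda t: t.flatten(2)  # (B, C, H*W)
        # A1: glass queries reflection; A2: reflection queries glass
        a1 = flat(self.q_g(glass_feat)).transpose(1, 2) @ flat(self.k_r(refl_feat))
        a2 = flat(self.q_r(refl_feat)).transpose(1, 2) @ flat(self.k_g(glass_feat))
        # shift each map by its global minimum, combine multiplicatively, softmax
        shared = ((a1 - a1.amin()) * (a2 - a2.amin())).softmax(dim=-1)
        # reweight both value tensors with the shared map and sum
        v = shared @ flat(self.v_r(refl_feat)).transpose(1, 2) \
          + shared @ flat(self.v_g(glass_feat)).transpose(1, 2)
        return v.transpose(1, 2).reshape(b, c, h, w)
```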
5. Dataset: NFGD (No-Flash & Flash Glass Dataset)
The NFGD dataset comprises 3,312 perfectly aligned no-flash/flash image pairs, captured with a Canon EOS 80D and controlled flash settings (GN=60 at ISO 100, powers of 1/2, 1/4, 1/8). Sampling covers highly variable scenes—indoors and outdoors, multiple glass types and arrangements, diverse lighting, and camera-to-surface configurations. Alignment is tightly controlled via a custom camera script, preventing parallax or viewpoint inconsistencies. Ground-truth glass masks are pixelwise and manually annotated. The dataset is partitioned 80/20 into training/test splits, with class distribution and region bias closely matched.
6. Training Methodology and Loss Functions
- Input resolution: 384×384.
- Backbones: Dual Swin-Transformer V2 encoders, pretrained on ImageNet.
- Batch size: 2.
- Optimizer: AdamW with standard weight decay; training runs for 150 epochs on an RTX 4090.
- Data augmentation: Random cropping, horizontal flipping, rotation.
- Supervision:
- At each decoder stage, predict the glass mask and the reflection maps.
- Total loss: the glass-mask and reflection losses are summed over all decoder stages,
  $\mathcal{L}_{\text{total}} = \sum_{i} \big(\mathcal{L}^{(i)}_{\text{glass}} + \lambda_r\,\mathcal{L}^{(i)}_{\text{ref}}\big)$,
  with $\lambda_r$ weighting the reflection supervision against the glass-mask supervision.
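A sketch of this multi-scale supervision, assuming a binary cross-entropy glass-mask loss and a simplified 11×11-window SSIM; the weights `lam_ref` and `lam_ssim` are placeholders, not the paper's values:

```python
# Sketch of the multi-scale loss; loss weights and the BCE choice are
# assumptions for illustration, not confirmed by the source.
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, C1=0.01**2, C2=0.03**2):
    """Simplified single-scale SSIM over 11x11 mean windows, inputs in [0,1]."""
    mu_p = F.avg_pool2d(pred, 11, 1, 5)
    mu_t = F.avg_pool2d(target, 11, 1, 5)
    var_p = F.avg_pool2d(pred * pred, 11, 1, 5) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, 11, 1, 5) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, 11, 1, 5) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + C1) * (2 * cov + C2)) / (
        (mu_p ** 2 + mu_t ** 2 + C1) * (var_p + var_t + C2))
    return 1 - ssim.mean()

def total_loss(glass_preds, refl_preds, glass_gt, refl_gts,
               lam_ref=1.0, lam_ssim=0.85):
    """glass_preds/refl_preds: per-stage logits; refl_gts: per-stage pseudo-GT."""
    loss = 0.0
    for g, r, r_gt in zip(glass_preds, refl_preds, refl_gts):
        gt = F.interpolate(glass_gt, size=g.shape[-2:], mode='nearest')
        loss = loss + F.binary_cross_entropy_with_logits(g, gt)
        r_prob = torch.sigmoid(r)  # reflection pseudo-GT assumed same size
        loss = loss + lam_ref * (lam_ssim * ssim_loss(r_prob, r_gt)
                                 + (1 - lam_ssim) * F.l1_loss(r_prob, r_gt))
    return loss
```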
7. Experimental Results and Limitations
Quantitative Comparison
On the NFGD dataset, NFGlassNet outperforms prior RGB, RGB-Thermal (Huo et al.), and RGB-NIR (Yan et al.) methods across all metrics:
| Method | IoU (%) | Fβ | MAE | BER | ACC |
|---|---|---|---|---|---|
| GSDNet (RGB) | 80.21 | 0.887 | 0.094 | 0.108 | 0.887 |
| RGB-T (Huo et al.) | 83.93 | 0.909 | 0.065 | 0.072 | 0.912 |
| RGB-NIR (Yan et al.) | 82.85 | 0.899 | 0.083 | 0.085 | 0.903 |
| NFGlassNet | 86.17 | 0.922 | 0.052 | 0.053 | 0.934 |
Ablation studies confirm that:
- RCMM: Adding RCMM to the baseline increases IoU by +2.89%.
- RGAM: Adding RGAM to the baseline increases IoU by +0.28%.
- Both RCMM & RGAM: Full model improvement is +4.21% IoU over baseline.
- Supervising the reflection prediction in addition to the glass mask is crucial; removing this loss reduces IoU to 82.21% (from 86.17%), and using only reflection loss yields IoU ≈ 73.5%.
Failures are observed predominantly with non-glass specular materials (e.g., glossy tiles), where pure reflection cues are an insufficient discriminant.
8. Implications, Insights, and Extensions
NFGlassNet establishes that dynamic cues—specifically, the appearance and disappearance of reflections under controlled lighting—provide a physically grounded and highly discriminative foundation for glass surface detection. This principled cue supersedes static boundary or RGB-based features, especially in the context of boundaryless or minimal-feature glass.
Potential directions include the integration of polarization or near-infrared modalities to separate refractive glass surfaces from strongly specular, non-glass reflectors, adaptation to real-time video with flash/no-flash toggling, and compensation for illumination variation or camera motion with geometric alignment. The public release of dataset, code, and models is expected to facilitate broader research into transparent object segmentation leveraging active illumination paradigms (Yan et al., 21 Nov 2025).