Image-Selection Relation: Lensing & Vision
- Image-Selection Relation (ISR) is a formalism that isolates image-defining properties by separating geometric focusing from residual lensing deflections.
- In gravitational lensing, ISR underpins mass-sheet transformations that preserve image positions while rescaling magnification and time delays.
- In vision–language evaluation, ISR supports the BISON protocol by providing interpretable, fine-grained metrics for text-to-image matching.
The Image-Selection Relation (ISR) is a formalism used in two distinct contexts: optical gravitational lensing theory and the evaluation of vision–language models. In gravitational lensing, ISR provides the mathematical structure underlying image formation, magnification, and time-delay invariance under mass-sheet transformations. In computer vision, ISR underpins the Binary Image Selection (BISON) evaluation protocol for text-to-image matching models, delivering interpretable metrics for fine-grained image–text correspondence. Both applications employ an ISR framework to isolate the image-defining properties of complex mappings from either physical lensing or embedding-based model scoring.
1. Mathematical Structure in Gravitational Lensing
In gravitational lensing, the total ray deflection at position $\vec{\theta}$ in the deflector plane is given by

$$\vec{\alpha}(\vec{\theta}) = F\left(\vec{\theta} - \vec{\beta}\right),$$

where $\vec{\beta}$ is the (scaled) unlensed source position, and $F$ is the geometric distance factor. This separates into two terms: a "geometric focusing" term

$$\vec{\alpha}_{\mathrm{foc}}(\vec{\theta}) = F\,\vec{\theta}$$

and a remainder,

$$\vec{\alpha}_{\mathrm{ISR}}(\vec{\theta}) = \vec{\alpha}(\vec{\theta}) - F\,\vec{\theta},$$

which defines the Image-Selection Relation:

$$\vec{\alpha}(\vec{\theta}_i) - F\,\vec{\theta}_i = -F\,\vec{\beta} \qquad \text{for every image position } \vec{\theta}_i.$$

The ISR specifies that, after removing geometric focusing, the image-forming lens must deflect all candidate rays by the same constant vector. With $\vec{c} \equiv -F\,\vec{\beta}$, the relation reads $\vec{\alpha}_{\mathrm{ISR}}(\vec{\theta}_i) = \vec{c}$ (Gorenstein, 4 Jan 2026).
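The "same constant vector" condition can be checked numerically on a standard toy model. The sketch below uses a point-mass lens (my own illustrative choice, not a model from the cited paper) in one dimension, with scaled deflection $\theta_E^2/\theta$ and Einstein radius $\theta_E$: both images of a single source must map back to the same source position $\beta$, i.e. every image ray satisfies the same relation.

```python
# Numeric sanity check of the ISR on a point-mass toy lens (illustrative,
# not taken from the cited paper). Lens equation: beta = theta - thetaE**2/theta.
# Both images of one source must recover the same beta, i.e. after removing
# geometric focusing every image ray is deflected by the same constant vector.
import math

thetaE = 1.0   # Einstein radius (assumed units)
beta = 0.3     # unlensed source position (assumed value)

# Image positions solve theta**2 - beta*theta - thetaE**2 = 0.
disc = math.sqrt(beta**2 + 4.0 * thetaE**2)
images = [(beta + disc) / 2.0, (beta - disc) / 2.0]

for theta in images:
    deflection = thetaE**2 / theta   # scaled deflection alpha(theta)
    residual = theta - deflection    # should equal beta for every image
    print(f"theta = {theta:+.4f}  ->  theta - alpha = {residual:+.4f}")
```

Both printed residuals equal $\beta = 0.3$, confirming that the two image rays share one constant source offset.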
2. Scaling Symmetry and Connection to the Mass-Sheet Transformation
A key feature of the ISR in lensing is its invariance under uniform rescaling:

$$\vec{\alpha}_{\mathrm{ISR}}(\vec{\theta}) \;\to\; \lambda\,\vec{\alpha}_{\mathrm{ISR}}(\vec{\theta}), \qquad \vec{\beta} \;\to\; \lambda\,\vec{\beta}.$$

This symmetry leaves image positions unchanged while scaling magnifications as $\mu \to \mu/\lambda^{2}$ and time delays as $\Delta t \to \lambda\,\Delta t$. Restoring the geometric focusing yields the classic Mass-Sheet Transformation (MST): the original mass profile is rescaled and a uniform sheet is added, preserving image locations but rescaling magnification and delays (Gorenstein, 4 Jan 2026). In convergence notation, the transformed profile is

$$\kappa_{\lambda}(\vec{\theta}) = \lambda\,\kappa(\vec{\theta}) + (1 - \lambda).$$
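The MST invariance can be demonstrated in a one-dimensional toy setup (again a point-mass deflection law of my own choosing): applying $\alpha_{\lambda}(\theta) = \lambda\,\alpha(\theta) + (1-\lambda)\,\theta$ maps the same image positions to the rescaled source $\lambda\beta$, so image locations are preserved.

```python
# 1-D sketch of the mass-sheet transformation on a point-mass toy lens:
# rescale the deflection and add a uniform sheet,
#   alpha_mst(theta) = lam * alpha(theta) + (1 - lam) * theta.
# The same image positions then map to the rescaled source lam * beta.
import math

def alpha(theta, thetaE=1.0):
    return thetaE**2 / theta  # point-mass scaled deflection

def alpha_mst(theta, lam):
    return lam * alpha(theta) + (1.0 - lam) * theta

beta, lam = 0.3, 0.8  # assumed source position and MST parameter
disc = math.sqrt(beta**2 + 4.0)
for theta in [(beta + disc) / 2.0, (beta - disc) / 2.0]:  # the two images
    beta_orig = theta - alpha(theta)          # = beta
    beta_mst = theta - alpha_mst(theta, lam)  # = lam * beta: same images, rescaled source
    print(f"theta = {theta:+.4f}: beta = {beta_orig:+.4f}, beta_mst = {beta_mst:+.4f}")
```

Since $\theta - \alpha_{\lambda}(\theta) = \lambda(\theta - \alpha(\theta))$, the derivative $d\beta/d\theta$ also picks up a factor $\lambda$, which is the 1-D shadow of the $\mu \to \mu/\lambda^{2}$ magnification rescaling.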
3. ISR in Vision–Language Evaluation: The BISON Protocol
In text-to-image matching, ISR formalizes the evaluation setup where a model selects, for each fine-grained text query $c$, the correct image from a pair of semantically similar candidates $\{I_1, I_2\}$. The selection function is:

$$\hat{I}(c) = \arg\max_{I \in \{I_1, I_2\}} s(c, I),$$

where $s(c, I)$ is the model's compatibility function. For retrieval systems, $s(c, I) = \cos\big(\phi(I), \psi(c)\big)$ with image and text embeddings $\phi(I)$ and $\psi(c)$, while for captioning models, $s(c, I) = \log p(c \mid I)$ (Hu et al., 2019).
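For the retrieval case, the selection rule is a cosine-similarity argmax over the two candidates. A minimal sketch, with toy embedding vectors standing in for real image/text encoders (none of the values below come from the BISON paper):

```python
# Sketch of the ISR selection rule for a retrieval-style scorer, using cosine
# similarity between toy embeddings. The vectors are illustrative stand-ins,
# not outputs of the models evaluated in the BISON paper.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select(query_emb, image_embs):
    """Return the index of the candidate image that best matches the query."""
    scores = [cosine(query_emb, e) for e in image_embs]
    return int(np.argmax(scores)), scores

query = np.array([0.9, 0.1, 0.0])          # toy text embedding psi(c)
candidates = [np.array([1.0, 0.0, 0.1]),   # fine-grained match
              np.array([0.2, 0.9, 0.1])]   # semantically similar distractor
idx, scores = select(query, candidates)
print(idx, [round(s, 3) for s in scores])  # picks candidate 0
```

Swapping `cosine` for a caption log-likelihood $\log p(c \mid I)$ gives the captioning-model variant of the same rule.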
4. Dataset Construction and Properties (COCO-BISON)
The COCO-BISON dataset, derived from the COCO validation split, implements ISR/BISON evaluation in three stages:
| Stage | Description | Examples Retained |
|---|---|---|
| 1 | Pairwise similarity via FastText caption embeddings | 67,564 candidate pairs |
| 2 | Annotator selection of discriminative captions | 61,861 triples (91.6%) |
| 3 | Verification by independent annotators | 54,253 triples (87.7%) |
Key statistics include 54,253 query–image triples, broad coverage of the COCO-val images, and a text-query distribution matching that of the training corpus (Hu et al., 2019).
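Stage 1 of the pipeline above can be sketched as nearest-neighbor pairing over caption embeddings. The sketch below substitutes a deterministic bag-of-words count vector for the FastText embeddings actually used for COCO-BISON, and the captions are invented examples:

```python
# Illustrative stage-1 candidate-pair mining: embed every caption, then pair
# each image with its most similar other image. A bag-of-words count vector
# stands in for the FastText embeddings used in the real pipeline.
import numpy as np

captions = {
    "img1": "man riding brown horse on beach",
    "img2": "man riding dark horse near water",
    "img3": "bowl of fruit on wooden table",
}

# Vocabulary over all caption words; each caption becomes a count vector.
vocab = {w: i for i, w in enumerate(
    sorted({w for c in captions.values() for w in c.split()}))}

def embed(caption):
    vec = np.zeros(len(vocab))
    for w in caption.split():
        vec[vocab[w]] += 1.0
    return vec

embs = {k: embed(c) for k, c in captions.items()}

def most_similar(key):
    """Return the other image whose caption embedding is closest in cosine similarity."""
    others = [o for o in captions if o != key]
    sims = [float(np.dot(embs[key], embs[o]) /
                  (np.linalg.norm(embs[key]) * np.linalg.norm(embs[o])))
            for o in others]
    return others[int(np.argmax(sims))]

print(most_similar("img1"))   # → img2, the near-duplicate horse scene
```

Stages 2 and 3 then replace this automatic pairing with human judgments, which is what keeps the retained triples discriminative.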
5. ISR’s Advantages Over Traditional Retrieval and Captioning Metrics
Traditional image-retrieval metrics such as Recall@k, and captioning scores (BLEU, CIDEr, METEOR, SPICE), suffer from ambiguous negatives and poor correspondence with human judgments of correctness. Notably, 56% of retrieval "errors" are in fact genuine matches, and captioning metrics reward generic captions. ISR/BISON directly addresses these issues:
- Binary accuracy is defined as
  $$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\hat{I}(c_i) = I_i^{*}\right],$$
  where $I_i^{*}$ is the annotated correct image for query $c_i$.
- The contrasting images are explicitly labeled, and negatives are fine-grained semantic distractors.
- The protocol is interpretable, has low variance, and focuses evaluation on fine-grained grounding rather than generic correspondence (Hu et al., 2019).
6. Experimental Protocols and Observed Performance
Experimental results on COCO-BISON demonstrate consistent ordering among retrieval and captioning models, with BISON accuracy universally higher than Recall@1:
| System | Recall@1 | Recall@5 | BISON Accuracy |
|---|---|---|---|
| ConvNet+BoW | 45.19 | 79.26 | 80.48 |
| ConvNet+Bi-GRU | 49.34 | 82.22 | 81.75 |
| Obj+Bi-GRU | 53.97 | 85.26 | 83.90 |
| SCAN i2t | 52.35 | 84.44 | 84.94 |
| SCAN t2i | 54.10 | 85.58 | 85.89 |
For captioning systems, BISON accuracy exposes the gap to human-level matching: models outperform humans on BLEU/CIDEr but not on BISON accuracy (human: 100%) (Hu et al., 2019). This suggests BISON is more reflective of true visual–textual matching.
7. Compact Algorithm and Theoretical Summary
The ISR/BISON evaluation operates algorithmically as follows:
- For each triple $(c_i, I_i^{+}, I_i^{-})$, compute model scores $s_i^{+} = s(c_i, I_i^{+})$ and $s_i^{-} = s(c_i, I_i^{-})$.
- Predict $\hat{y}_i = 1$ if $s_i^{+} > s_i^{-}$, else $\hat{y}_i = 0$.
- BISON accuracy is the fraction of triples where $\hat{y}_i = 1$.
Formally,

$$\mathrm{Acc}_{\mathrm{BISON}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[s(c_i, I_i^{+}) > s(c_i, I_i^{-})\right].$$
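The algorithm reduces to a few lines of code. A minimal sketch, with toy score pairs (the numbers below are invented, not results from the paper):

```python
# BISON accuracy from precomputed score pairs: the fraction of triples whose
# positive (correct) image out-scores its distractor. Scores are toy numbers.
def bison_accuracy(score_pairs):
    """score_pairs: iterable of (s_positive, s_distractor) per triple."""
    pairs = list(score_pairs)
    hits = sum(1 for s_pos, s_neg in pairs if s_pos > s_neg)
    return hits / len(pairs)

scores = [(0.91, 0.48), (0.55, 0.62), (0.73, 0.70), (0.80, 0.10)]
print(bison_accuracy(scores))   # 0.75 (3 of 4 triples correct)
```

In practice each pair would come from the compatibility function $s(c, I)$ of the model under evaluation, applied to the verified COCO-BISON triples.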
A plausible implication is that ISR in both physical lensing and vision–language modeling naturally isolates the image-defining mechanics (via deflection or compatibility scoring) and makes manifest the core invariance properties or interpretability of model outputs. The geometric–optical origin of image invariance in mass-sheet transformations directly parallels ISR’s role in evaluating model grounding precision.
References
- "Evaluating Text-to-Image Matching using Binary Image Selection (BISON)" (Hu et al., 2019)
- "The Optical Origin of the Mass-Sheet Transformation" (Gorenstein, 4 Jan 2026)