
Image-Selection Relation: Lensing & Vision

Updated 11 January 2026
  • Image-Selection Relation (ISR) is a formalism that isolates image-defining properties by separating geometric focusing from residual lensing deflections.
  • In gravitational lensing, ISR underpins mass-sheet transformations that preserve image positions while rescaling magnification and time delays.
  • In vision–language evaluation, ISR supports the BISON protocol by providing interpretable, fine-grained metrics for text-to-image matching.

The Image-Selection Relation (ISR) is a formalism used in two distinct contexts: optical gravitational lensing theory and the evaluation of vision–language models. In gravitational lensing, ISR provides the mathematical structure underlying image formation, magnification, and time-delay invariance under mass-sheet transformations. In computer vision, ISR underpins the Binary Image Selection (BISON) evaluation protocol for text-to-image matching models, delivering interpretable metrics for fine-grained image–text correspondence. Both applications use the ISR framework to isolate the image-defining properties of a complex mapping, whether physical lensing or embedding-based model scoring.

1. Mathematical Structure in Gravitational Lensing

In gravitational lensing, the total ray deflection at position $b$ in the deflector plane is given by

$$\Lambda(b) = \frac{b_s - b}{\mathfrak{D}},$$

where $b_s$ is the (scaled) unlensed source position and $\mathfrak{D} = D_d D_{ds} / D_s$ is the geometric distance factor. This separates into two terms: a "geometric focusing" term

$$F_g(b) = \alpha_E(b) = -\frac{b}{\mathfrak{D}}$$

and a remainder,

$$\alpha_I(b) = \Lambda(b) - F_g(b),$$

which defines the Image-Selection Relation:

$$\alpha_I(b) = \frac{b_s}{\mathfrak{D}}.$$

The ISR specifies that, after removing geometric focusing, the image-forming lens must deflect all candidate rays by the same constant vector. With $\alpha_I(b) = \nabla \psi'(b)$, the relation reads $\nabla \psi'(b) = b_s / \mathfrak{D}$ (Gorenstein, 4 Jan 2026).
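
As a minimal numerical illustration, the sketch below uses a toy 1-D point-mass lens in units where $\mathfrak{D} = 1$ (an assumed setup, not taken from the paper) and confirms that the residual deflection $\alpha_I$ takes the same constant value $b_s$ at every image of a given source:

```python
import numpy as np

# Toy 1-D point-mass lens in units where the distance factor D = 1 and the
# Einstein radius theta_E sets the deflection scale (assumed setup).
theta_E = 1.0
b_s     = 0.3   # unlensed source position

# Image positions solve b - theta_E**2 / b = b_s,
# i.e. the quadratic b**2 - b_s*b - theta_E**2 = 0.
images = np.roots([1.0, -b_s, -theta_E**2])

for b in images:
    Lambda  = -theta_E**2 / b   # total deflection of the point-mass model
    F_g     = -b                # geometric focusing, F_g(b) = -b / D with D = 1
    alpha_I = Lambda - F_g      # residual deflection, constant across images
    print(f"image b = {b:+.4f}   alpha_I = {alpha_I:.4f}   (b_s = {b_s})")
```

Both images return $\alpha_I = 0.3 = b_s$, as the ISR requires.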

2. Scaling Symmetry and Connection to the Mass-Sheet Transformation

A key feature of ISR in lensing is its invariance under uniform rescaling:

$$\epsilon\,\alpha_I(b) = \epsilon\,\frac{b_s}{\mathfrak{D}}.$$

This symmetry leaves image positions unchanged while scaling magnifications as $\mu \mapsto \epsilon^{-2}\,\mu$ and time delays as $\Delta t \mapsto \epsilon\,\Delta t$. Restoring the geometric focusing yields the classic Mass-Sheet Transformation (MST): the original mass profile is rescaled and a uniform sheet is added, preserving image locations while rescaling magnification and delays (Gorenstein, 4 Jan 2026). In convergence notation, the transformed profile is

$$\tilde{\kappa}(\theta) = \epsilon\,\kappa(\theta) + (1 - \epsilon).$$
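
The invariance can be checked numerically. In the sketch below (the same toy point-mass setup as above, with the source position rescaled to $\epsilon\,b_s$ as the MST requires), the transformed lens yields identical image positions while magnification and delay rescale as stated:

```python
import numpy as np

# Toy MST demo on the 1-D point-mass lens (assumed setup, units with D = 1).
# Transformed deflection: alpha_t(b) = eps*alpha(b) + (1 - eps)*b, i.e. the
# original profile rescaled plus a uniform sheet.
theta_E, b_s, eps = 1.0, 0.3, 0.8

# Original lens:  b - theta_E**2 / b = b_s
#   <=>  b**2 - b_s*b - theta_E**2 = 0
img_orig = np.sort(np.roots([1.0, -b_s, -theta_E**2]))

# MST lens with rescaled source:  b - eps*theta_E**2/b - (1 - eps)*b = eps*b_s
#   <=>  eps*(b**2 - b_s*b - theta_E**2) = 0   -- the same quadratic
img_mst = np.sort(np.roots([eps, -eps * b_s, -eps * theta_E**2]))

print(img_orig, img_mst)     # identical image positions
mu, dt = 12.0, 30.0          # example magnification and delay [days]
print(mu / eps**2, eps * dt) # observables rescale: mu/eps**2, eps*dt
```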

3. ISR in Vision–Language Evaluation: The BISON Protocol

In text-to-image matching, ISR formalizes the evaluation setup in which a model selects, for each fine-grained text query $t_i$, the correct image $I_i^+$ from a pair $(I_i^+, I_i^-)$ of semantically similar candidates. The selection function is

$$\hat{I}_i = \arg\max_{I \in \{I_i^+, I_i^-\}} s(I, t_i),$$

where $s: I \times T \rightarrow \mathbb{R}$ is the model's compatibility function. For retrieval systems, $s(I, t) = \langle f_V(I), f_T(t) \rangle$ with image and text embeddings $f_V$ and $f_T$, while for captioning models, $s(I, t) = \log p(t \mid I) / |t|$ (Hu et al., 2019).
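
A schematic rendering of these two scoring choices is sketched below; the embeddings and token log-probabilities are hypothetical stand-ins, not outputs of any model evaluated in Hu et al. (2019):

```python
import numpy as np

def s_retrieval(f_V_I: np.ndarray, f_T_t: np.ndarray) -> float:
    """Retrieval score: inner product of image and text embeddings."""
    return float(np.dot(f_V_I, f_T_t))

def s_captioning(token_logprobs: list[float]) -> float:
    """Captioning score: length-normalized log-likelihood log p(t|I) / |t|."""
    return sum(token_logprobs) / len(token_logprobs)

def select(score_pos: float, score_neg: float) -> str:
    """ISR selection: argmax over the candidate pair (I+, I-)."""
    return "I+" if score_pos >= score_neg else "I-"

# Example: a query embedding that aligns better with the positive image.
f_T = np.array([0.2, 0.9, 0.1])
print(select(s_retrieval(np.array([0.1, 0.8, 0.0]), f_T),
             s_retrieval(np.array([0.7, 0.1, 0.3]), f_T)))   # -> "I+"
```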

4. Dataset Construction and Properties (COCO-BISON)

The COCO-BISON dataset, derived from the COCO validation split, implements ISR/BISON evaluation in three stages:

| Stage | Description | Examples Retained |
|-------|-------------|-------------------|
| 1 | Pairwise similarity via FastText caption embeddings | 67,564 candidate pairs |
| 2 | Annotator selection of discriminative captions | 61,861 triples (91.6%) |
| 3 | Verification by independent annotators | 54,253 triples (87.7%) |

Key statistics include 54,253 query–image triples, coverage of $\approx 95.5\%$ of COCO-val images, and a text-query distribution matching that of the training corpus (Hu et al., 2019).
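
A rough sketch of stage 1 (candidate-pair mining) follows; the caption vectors `E` are toy placeholders standing in for FastText caption embeddings, and the nearest-neighbor pairing is an assumed simplification of the paper's procedure:

```python
import numpy as np

def mine_pairs(E: np.ndarray) -> list[tuple[int, int]]:
    """Pair each caption with its most similar caption by cosine similarity."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T
    np.fill_diagonal(sims, -np.inf)   # exclude trivial self-matches
    return [(i, int(np.argmax(sims[i]))) for i in range(len(E))]

# Toy stand-in for FastText caption embeddings (4 captions, 3-d vectors).
E = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.1],
              [0.0, 1.0, 0.2],
              [0.1, 0.9, 0.3]])
print(mine_pairs(E))   # -> [(0, 1), (1, 0), (2, 3), (3, 2)]
```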

5. ISR’s Advantages Over Traditional Retrieval and Captioning Metrics

Traditional image-retrieval metrics such as Recall@k, and captioning scores such as BLEU, CIDEr, METEOR, and SPICE, suffer from ambiguous negatives and correspond poorly to human judgments of correctness. Notably, 56% of retrieval "errors" are in fact genuine matches, and captioning metrics tend to prefer generic captions. ISR/BISON directly addresses these issues:

  • Binary accuracy is defined as:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\hat{I}_i = I_i^+]$$

  • The contrasting images are explicitly labeled, and negatives are fine-grained semantic distractors.
  • The protocol is interpretable and low-variance, and it focuses evaluation on fine-grained grounding rather than generic correspondence (Hu et al., 2019).

6. Experimental Protocols and Observed Performance

Experimental results on COCO-BISON show a consistent ordering among retrieval and captioning models, with BISON accuracy consistently higher than Recall@1 for the retrieval systems:

| System | Recall@1 | Recall@5 | BISON Accuracy |
|--------|----------|----------|----------------|
| ConvNet+BoW | 45.19 | 79.26 | 80.48 |
| ConvNet+Bi-GRU | 49.34 | 82.22 | 81.75 |
| Obj+Bi-GRU | 53.97 | 85.26 | 83.90 |
| SCAN i2t | 52.35 | 84.44 | 84.94 |
| SCAN t2i | 54.10 | 85.58 | 85.89 |

For captioning systems, BISON accuracy exposes the gap to human-level matching: models outperform humans on BLEU/CIDEr but fall short on BISON accuracy (human: 100%) (Hu et al., 2019). This suggests that BISON better reflects true visual–textual matching.

7. Compact Algorithm and Theoretical Summary

The ISR/BISON evaluation operates algorithmically as follows:

  • For each triple $(t_i, I_i^+, I_i^-)$, compute the model scores $s^+ = s(I_i^+, t_i)$ and $s^- = s(I_i^-, t_i)$.
  • Predict $\hat{I}_i = I_i^+$ if $s^+ \geq s^-$, else $\hat{I}_i = I_i^-$.
  • BISON accuracy is the fraction of triples with $\hat{I}_i = I_i^+$.

Formally,

$$D = \left\{ (t_i, I_i^+, I_i^-) \right\}_{i=1}^N, \quad \hat{I}_i = \arg\max_{I \in \{I_i^+, I_i^-\}} s(I, t_i), \quad \text{Accuracy} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\hat{I}_i = I_i^+]$$

(Hu et al., 2019).
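
Putting the pieces together, here is a self-contained sketch of the evaluation loop; the scorer and the triples are hypothetical placeholders, not COCO-BISON data:

```python
import numpy as np

def bison_accuracy(triples, score) -> float:
    """Fraction of triples (t, I+, I-) where the model ranks I+ over I-.
    Ties go to I+, matching the s+ >= s- rule above."""
    correct = sum(score(I_pos, t) >= score(I_neg, t)
                  for t, I_pos, I_neg in triples)
    return correct / len(triples)

# Toy example: 2-d "embeddings" scored by inner product.
score = lambda I, t: float(np.dot(I, t))
triples = [
    (np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.2, 0.8])),
    (np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([0.8, 0.2])),
]
print(bison_accuracy(triples, score))   # -> 1.0
```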

A plausible implication is that ISR, in both physical lensing and vision–language modeling, isolates the image-defining mechanics (via deflection in one case, compatibility scoring in the other), making manifest the core invariance properties of the former and the interpretability of the latter. The geometric–optical origin of image invariance under mass-sheet transformations directly parallels ISR's role in evaluating the precision of model grounding.
