LIHE: Hyperbolic-Euclidean Framework for WGREC
- The paper introduces LIHE, a novel framework for weakly-supervised referring expression comprehension that splits linguistic instances and applies a hybrid similarity approach.
- It employs a two-stage process using a frozen vision-language model for decomposing expressions and YOLOv3 for anchor extraction to accurately ground multiple or zero targets.
- Experimental results demonstrate that LIHE outperforms existing methods by addressing supervisory ambiguity and semantic collapse, leading to improved generalization in WGREC tasks.
The Linguistic Instance-Split Hyperbolic-Euclidean (LIHE) framework is a two-stage architecture for Weakly-Supervised Generalized Referring Expression Comprehension (WGREC). WGREC is an extension of Weakly-Supervised Referring Expression Comprehension (WREC) that allows natural language queries (“referring expressions”) to refer to zero, one, or multiple objects within an image, using only weak image-level supervision. LIHE uniquely combines a referential decoupling mechanism for instance-split language analysis with a hybrid hyperbolic-Euclidean similarity module (HEMix) to address supervisory signal ambiguity and semantic representation collapse in weakly supervised grounding tasks (Shi et al., 15 Nov 2025).
1. Problem Formulation and Motivating Challenges
WGREC is defined as follows: given an image $I$ and a referring expression $T$, the model must identify a set of bounding boxes $\{b_i\}_{i=1}^{N}$ with $N \geq 0$, using only weak supervision, i.e., image-text pairs and a binary label that indicates whether $T$ refers to any object in $I$.
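To make the setting concrete, a weakly supervised training pair and the corresponding inference-time output could look like the following illustrative example (all values hypothetical, not from the paper):

```python
# Illustrative WGREC samples (hypothetical data, for exposition only).
# Training supervision is weak: an image-expression pair plus a binary
# "does the expression refer to anything?" label -- no box annotations.
weak_training_sample = {
    "image": "coco_000000123456.jpg",        # hypothetical file name
    "expression": "the two people on the left",
    "refers_to_something": True,             # binary image-level label
}

# At inference the model must output zero, one, or multiple boxes.
inference_output = {
    "the two people on the left": [          # multi-target: two boxes
        (34, 120, 210, 380),
        (15, 140, 160, 400),
    ],
    "the giraffe on the sofa": [],           # no-target: empty prediction
}
```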
Two major challenges have historically limited progress in this setting:
- Supervisory Signal Ambiguity: Conventional WREC assumes a single target via a winner-takes-all contrastive loss, inherently constraining the system to select exactly one referent and failing in multi-target or no-target cases.
- Semantic Representation Collapse: Standard Euclidean contrastive learning, when applied to hierarchically-related concepts, merges child categories (e.g., “left man” vs. “left woman”) via a shared ancestral anchor (e.g., “left person”), diminishing the granularity of category discrimination.
LIHE resolves both issues through a two-stage process: Stage 1—Referential Decoupling to segment expressions and predict target counts; Stage 2—Referent Grounding via the HEMix module to combine Euclidean and hyperbolic representations.
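The overall control flow can be summarized in a short sketch, with the two stages represented as stand-in callables (the function names here are illustrative, not the paper's API):

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]

def lihe_predict(
    image,
    expression: str,
    referential_decoupling: Callable[[str], Tuple[int, List[str]]],  # Stage 1 (frozen VLM)
    referent_grounding: Callable[[object, str], List[Box]],          # Stage 2 (HEMix grounding)
) -> List[Box]:
    """High-level LIHE flow: decompose the expression, then ground each sub-expression."""
    n, sub_expressions = referential_decoupling(expression)
    if n == 0:                      # no-target case: Stage 2 is skipped entirely
        return []
    boxes: List[Box] = []
    for phrase in sub_expressions:  # each sub-expression is a one-target WREC problem
        boxes.extend(referent_grounding(image, phrase))
    return boxes
```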
2. Referential Decoupling
The first stage converts a potentially compositional referring expression $T$ into single-instance sub-expressions $\{t_j\}_{j=1}^{n}$, where $n$ can be zero.
This is achieved using a frozen large Vision-Language Model (VLM), operated in a zero-shot, prompt-driven paradigm. The prompt is constructed from four components:
| Prompt Component | Function |
|---|---|
| General instruction | Defines the decomposition task |
| Output format constraint | Ensures structured, non-redundant output: the predicted count $n$ is emitted first, followed by the enumerated sub-expressions |
| Few in-context examples | Boost decomposition accuracy |
| Query of the referring expression ("The referring expression is: {T}") | Closes the prompt context |
Input to the VLM is the fully assembled prompt. The VLM produces a structured output $(n, \{t_j\}_{j=1}^{n})$, with $n$ the predicted instance count and $t_j$ the decomposed single-instance phrases. If $n = 0$, Stage 2 is skipped as there are no targets.
The VLM is pre-trained and frozen (no learning occurs in Stage 1). The prompt constraints enforce unique outputs per referent and mitigate hallucination and redundancy. A minimal sketch of prompt assembly and output parsing follows.
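In this sketch, the prompt wording, the output format (`N=<count>` followed by one phrase per line), and the helper names are assumptions for illustration, not the paper's exact prompt:

```python
import re
from typing import List, Tuple

def build_decoupling_prompt(expression: str, examples: List[Tuple[str, List[str]]]) -> str:
    """Assemble the four prompt components described above (wording is illustrative)."""
    instruction = (
        "Split the referring expression into single-instance sub-expressions. "
        "If it refers to no object, output N=0."
    )
    format_rule = "First output 'N=<count>', then list each sub-expression on its own line."
    shots = "\n".join(
        f"Expression: {src}\nN={len(subs)}\n" + "\n".join(subs) for src, subs in examples
    )
    query = f"The referring expression is: {expression}"
    return "\n\n".join([instruction, format_rule, shots, query])

def parse_vlm_output(raw: str) -> Tuple[int, List[str]]:
    """Recover (predicted count n, sub-expressions) from the constrained output format."""
    n_match = re.search(r"N\s*=\s*(\d+)", raw)
    n = int(n_match.group(1)) if n_match else 0
    phrases = [line.strip() for line in raw.splitlines() if line.strip() and "N=" not in line]
    return n, phrases[:n]          # if n == 0, Stage 2 is skipped

# Example: build a prompt and parse a mock VLM response.
prompt = build_decoupling_prompt(
    "the two men to the left of the car",
    examples=[("the dog and the cat", ["the dog", "the cat"])],
)
n, subs = parse_vlm_output("N=2\nthe man on the left\nthe woman in red")
print(n, subs)   # -> 2 ['the man on the left', 'the woman in red']
```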
3. Referent Grounding and the HEMix Similarity Framework
Stage 2 localizes each decomposed sub-expression within the image, reducing the problem to one-target WREC. The pipeline comprises anchor extraction, joint embedding, and anchor selection via hybrid similarity scoring.
- Anchor Extraction: YOLOv3 (pre-trained on MS-COCO, weights frozen) detects candidate bounding boxes on resized input images. The top 10% of proposals by objectness score are retained as anchors.
- Joint Embedding:
  - A visual encoder maps each anchor's features to an embedding vector.
  - A text encoder maps each sub-expression to a 512-dimensional embedding.
  - Two sets of linear projection heads produce the Euclidean and hyperbolic embeddings.
- Contrastive Loss: for each sub-expression $t$, its positive anchor $a^{+}$, and negative anchors $\{a^{-}_{k}\}$, an InfoNCE-style objective with temperature $\tau$ is minimized:

  $$\mathcal{L}_{\text{con}} = -\log \frac{\exp\!\left(s(a^{+}, t)/\tau\right)}{\exp\!\left(s(a^{+}, t)/\tau\right) + \sum_{k} \exp\!\left(s(a^{-}_{k}, t)/\tau\right)},$$

  where the similarity $s(\cdot,\cdot)$ is given by HEMix below.
- HEMix: a parametric blend of Euclidean and hyperbolic similarities,

  $$s_{\text{HEMix}}(v, t) = \alpha\, s_{\text{euc}}(v, t) + (1-\alpha)\, s_{\text{hyp}}(v, t), \qquad \alpha \in [0, 1],$$

  where the Euclidean term $s_{\text{euc}}$ is computed between the Euclidean projections of anchor and text embeddings, and the hyperbolic term $s_{\text{hyp}}$ (Lorentz model) is computed from features projected onto the hyperboloid: spatial coordinates $\mathbf{x}$ come from a learnable linear projection, the time coordinate is lifted as $x_0 = \sqrt{1/c + \lVert\mathbf{x}\rVert^2}$, and similarity is derived from the Lorentz inner product $\langle x, y \rangle_{\mathcal{L}} = -x_0 y_0 + \mathbf{x}^{\top}\mathbf{y}$.
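As a concrete reference for the blend, the following NumPy sketch assumes cosine similarity for the Euclidean branch and negative Lorentzian geodesic distance for the hyperbolic branch; the paper's exact functional forms, curvature handling, and treatment of the blend weight $\alpha$ may differ:

```python
import numpy as np

def lorentz_lift(x: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Lift a Euclidean vector onto the Lorentz (hyperboloid) model with
    time coordinate x0 = sqrt(1/c + ||x||^2)."""
    x0 = np.sqrt(1.0 / c + np.sum(x * x, axis=-1, keepdims=True))
    return np.concatenate([x0, x], axis=-1)

def lorentz_inner(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Lorentz inner product <u, v>_L = -u0*v0 + sum_i ui*vi."""
    return -u[..., 0] * v[..., 0] + np.sum(u[..., 1:] * v[..., 1:], axis=-1)

def hyperbolic_similarity(x: np.ndarray, y: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Negative Lorentzian geodesic distance used as a similarity (one common choice)."""
    u, v = lorentz_lift(x, c), lorentz_lift(y, c)
    arg = np.clip(-c * lorentz_inner(u, v), 1.0, None)   # arccosh needs arg >= 1
    return -np.arccosh(arg) / np.sqrt(c)

def hemix_similarity(v_euc, t_euc, v_hyp, t_hyp, alpha: float = 0.5) -> np.ndarray:
    """alpha-blend of a Euclidean cosine score and a hyperbolic score."""
    cos = np.sum(v_euc * t_euc, axis=-1) / (
        np.linalg.norm(v_euc, axis=-1) * np.linalg.norm(t_euc, axis=-1) + 1e-8
    )
    return alpha * cos + (1.0 - alpha) * hyperbolic_similarity(v_hyp, t_hyp)

# Toy usage: one anchor embedding vs. one sub-expression embedding (random features).
rng = np.random.default_rng(0)
v_e, t_e, v_h, t_h = (rng.normal(size=16) for _ in range(4))
print(hemix_similarity(v_e, t_e, v_h, t_h, alpha=0.6))
```

Because the cosine score is bounded in $[-1, 1]$ while the negative geodesic distance is unbounded below, a practical implementation would typically rescale or temperature-normalize each branch before blending.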
A formal proposition shows that, for two similarity estimators whose errors are not perfectly correlated, there exists an optimal blend parameter $\alpha^{*}$ such that the mean squared error of HEMix is lower than that of either similarity alone.
No auxiliary margin or ranking loss is applied beyond the contrastive objective $\mathcal{L}_{\text{con}}$.
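The intuition behind the proposition can be checked with a small simulation (not taken from the paper): two unbiased estimates of the same similarity with imperfectly correlated noise are blended, and an intermediate blend weight attains lower mean squared error than either estimate alone:

```python
import numpy as np

rng = np.random.default_rng(1)
true_sim = 0.7                       # ground-truth similarity being estimated
n = 100_000

# Two unbiased estimators with correlated noise (correlation < 1).
noise = rng.multivariate_normal(
    mean=[0.0, 0.0],
    cov=[[0.04, 0.012],              # std 0.2 each, correlation 0.3
         [0.012, 0.04]],
    size=n,
)
est_euc = true_sim + noise[:, 0]
est_hyp = true_sim + noise[:, 1]

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = alpha * est_euc + (1 - alpha) * est_hyp
    mse = np.mean((blended - true_sim) ** 2)
    print(f"alpha={alpha:.2f}  MSE={mse:.5f}")
# The minimum MSE occurs at an intermediate alpha, not at 0 or 1.
```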
4. Implementation Details
Major architectural and training design points include:
- Frozen YOLOv3 detector pre-trained on MS-COCO, with a fixed input resolution.
- Text encoder maps sub-expressions (up to 15 tokens) to 512-dimensional vectors.
- Linear projections for both anchor and text features into Euclidean and hyperbolic spaces.
- Hyperbolic mapping via learnable linear projection instead of exponential map for improved stability.
- Training with the AdamW optimizer, batch size 64, over 25 epochs.
- Hardware: WGREC results obtained on A6000 48 GB; WREC on A100 40 GB.
At inference, the process includes anchor extraction, joint feature embedding, HEMix similarity computation, and threshold-based anchor selection for bounding box prediction.
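Putting Stage 2 together, the inference path might look like the sketch below; the detector, encoders, dot-product score, and the 0.5 selection threshold are stand-ins for exposition (only the top-10% objectness cutoff is taken from the description above):

```python
from typing import List, Tuple
import numpy as np

Box = Tuple[float, float, float, float]

def select_anchors(boxes: List[Box], objectness: np.ndarray, keep_frac: float = 0.10) -> List[Box]:
    """Keep the top fraction of detector proposals by objectness score."""
    k = max(1, int(len(boxes) * keep_frac))
    top = np.argsort(-objectness)[:k]
    return [boxes[i] for i in top]

def ground_sub_expression(
    anchor_feats: np.ndarray,      # (num_anchors, d) visual embeddings
    text_feat: np.ndarray,         # (d,) embedding of one sub-expression
    anchors: List[Box],
    threshold: float = 0.5,        # hypothetical value; the paper's threshold may differ
) -> List[Box]:
    """Score every anchor against the sub-expression and keep those above threshold."""
    sims = anchor_feats @ text_feat          # stand-in for the HEMix score
    return [box for box, s in zip(anchors, sims) if s > threshold]

def referent_grounding(sub_expressions, anchors, anchor_feats, encode_text) -> List[Box]:
    """Stage 2 over all decomposed sub-expressions (an empty list means no targets)."""
    predictions: List[Box] = []
    for phrase in sub_expressions:
        predictions.extend(ground_sub_expression(anchor_feats, encode_text(phrase), anchors))
    return predictions

# Toy usage with random features and a trivial text-encoder stub.
rng = np.random.default_rng(0)
all_boxes: List[Box] = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 8, 8)]
objectness = rng.random(len(all_boxes))
anchors = select_anchors(all_boxes, objectness)          # keeps max(1, 10% of proposals)
anchor_feats = rng.normal(size=(len(anchors), 16))
encode_text = lambda phrase: rng.normal(size=16)         # stub text encoder
print(referent_grounding(["the left box"], anchors, anchor_feats, encode_text))
```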
5. Experimental Results and Analysis
LIHE establishes the first weakly-supervised WGREC baseline and demonstrates substantial performance advantages over prior weakly supervised methods on both generalized and standard REC tasks.
- WGREC Benchmarks (gRefCOCO, Ref-ZOM): LIHE correctly handles both zero-target and multi-target scenarios, whereas the single-target baseline RefCLIP* fails entirely on no-target cases (N-acc of 0.0).

| Method | gRefCOCO-val (Precision / N-acc) | Ref-ZOM-val (Precision / N-acc) |
|---|---|---|
| RefCLIP* | 17.85 / 0.0 | 35.78 / 0.0 |
| LIHE | 39.61 / 67.49 | 50.36 / 97.70 |

- WREC Benchmarks (RefCOCO, RefCOCO+, RefCOCOg): RefCLIP*+HEMix improves RefCLIP* by +1.0–1.5% [email protected] across splits (e.g., RefCOCO-val: 59.88 → 60.95).
- Ablation Studies:
- HEMix outperforms Euclidean and hyperbolic similarity alone, with average improvements of +1.53% on WREC, +0.90% on WGREC.
- Omitting the in-context examples from the prompt reduces N-acc from 67.49% to 49.00%.
- Learnable linear projection for hyperbolic embedding results in +5.24% gain on RefCOCO-val and +1.23% on gRefCOCO-val versus exponential mapping.
- Cross-dataset generalization: training on gRefCOCO and testing on RefCOCO+ achieves 43.22% for LIHE vs. 38.91% for RefCLIP*.
- Qualitative Analysis: LIHE achieves correct multi-target grounding under occlusion and identifies no-target cases, but can suffer from VLM hallucinations, repeated phrase decompositions, and missed fine details.
6. Significance and Broader Context
LIHE represents a methodological advance for weakly supervised image-language understanding where expressions may not uniquely ground to a single object. Its two-stage division—linguistic decomposition via VLM prompt engineering and referential grounding via hybrid geometry similarity—addresses foundational issues in weak supervision and semantic fidelity. The HEMix similarity is agnostic to downstream tasks and demonstrates plug-and-play compatibility, as evidenced by its improvements on standard WREC baselines.
Extensive evaluation on gRefCOCO, Ref-ZOM, RefCOCO, RefCOCO+, and RefCOCOg substantiates both the architectural contributions of LIHE and the broader utility of hyperbolic-Euclidean hybrid representations (Shi et al., 15 Nov 2025).