
LIHE: Hyperbolic-Euclidean Framework for WGREC

Updated 19 November 2025
  • The paper introduces LIHE, a novel framework for weakly-supervised referring expression comprehension that splits linguistic instances and applies a hybrid similarity approach.
  • It employs a two-stage process using a frozen vision-language model for decomposing expressions and YOLOv3 for anchor extraction to accurately ground multiple or zero targets.
  • Experimental results demonstrate that LIHE outperforms existing methods by addressing supervisory ambiguity and semantic collapse, leading to improved generalization in WGREC tasks.

The Linguistic Instance-Split Hyperbolic-Euclidean (LIHE) framework is a two-stage architecture for Weakly-Supervised Generalized Referring Expression Comprehension (WGREC). WGREC is an extension of Weakly-Supervised Referring Expression Comprehension (WREC) that allows natural language queries (“referring expressions”) to refer to zero, one, or multiple objects within an image, using only weak image-level supervision. LIHE uniquely combines a referential decoupling mechanism for instance-split language analysis with a hybrid hyperbolic-Euclidean similarity module (HEMix) to address supervisory signal ambiguity and semantic representation collapse in weakly supervised grounding tasks (Shi et al., 15 Nov 2025).

1. Problem Formulation and Motivating Challenges

WGREC is defined as follows: given an image $I$ and an expression $T$, the model must identify a set of bounding boxes $\mathcal{B}^* = \{b_1, ..., b_k\}$ with $k \in \{0, 1, 2, ...\}$ using only weak supervision, i.e., image-text pairs $(I, T)$ and a binary label $v \in \{0, 1\}$ indicating whether $T$ refers to any object in $I$.

Two major challenges have historically limited progress in this setting:

  • Supervisory Signal Ambiguity: Conventional WREC assumes a single target via a winner-takes-all contrastive loss,

$$a^\ast = \arg\max_a \phi(T, I, a),$$

inherently constraining the system to select exactly one referent and failing in multi-target or no-target cases.

  • Semantic Representation Collapse: Standard Euclidean contrastive learning, when applied to hierarchically-related concepts, merges child categories (e.g., “left man” vs. “left woman”) via a shared ancestral anchor (e.g., “left person”), diminishing the granularity of category discrimination.

LIHE resolves both issues through a two-stage process: Stage 1—Referential Decoupling to segment expressions and predict target counts; Stage 2—Referent Grounding via the HEMix module to combine Euclidean and hyperbolic representations.

2. Referential Decoupling

The first stage converts a potentially compositional referring expression $T$ into $k$ single-instance sub-expressions $\{t_1, ..., t_k\}$, where $k$ can be zero.

This is achieved using a frozen large vision-language model (VLM) operated in a zero-shot, prompt-driven paradigm. The prompt $P$ is constructed from four components:

  • $P_G$ — general instruction (task definition).
  • $P_C$ — output format constraint (output $k$ first, then enumerate the $t_i$); ensures structured, non-redundant output.
  • $P_E$ — a few in-context examples; boosts decomposition accuracy.
  • $P_Q$ — query of the referring expression ("The referring expression is: {T}"); closes the prompt context.

Input to the VLM consists of $(P_G, P_C, P_E, P_Q, I)$. The VLM produces $(K, \mathcal{T}_D)$, with $K \in \mathbb{N}$ the predicted instance count and $\mathcal{T}_D$ the decomposed phrases. If $K = 0$, Stage 2 is skipped as there are no targets.

The VLM is pre-trained and frozen (no learning in Stage 1). The prompt constraint $P_C$ enforces unique outputs per referent and mitigates hallucination and redundancy.
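The Stage 1 interface can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact prompt wording, the output grammar (`k: <int>` followed by dashed phrases), and the function names are assumptions made for the example; the actual VLM call is stubbed out.

```python
import re

def build_prompt(expression, examples):
    """Assemble the four-part decomposition prompt (P_G, P_C, P_E, P_Q).
    All wording here is illustrative, not the paper's exact prompt."""
    p_g = ("You are given an image and a referring expression. "
           "Split the expression into single-instance sub-expressions.")
    p_c = ("First output the number of referents on its own line as 'k: <int>', "
           "then list each sub-expression on its own line as '- <phrase>'. "
           "Do not repeat phrases.")
    p_e = "\n".join(examples)                          # few-shot examples (P_E)
    p_q = f"The referring expression is: {expression}"  # query (P_Q)
    return "\n\n".join([p_g, p_c, p_e, p_q])

def parse_decomposition(vlm_output):
    """Parse (K, T_D) from the format-constrained VLM reply."""
    k_match = re.search(r"k:\s*(\d+)", vlm_output)
    k = int(k_match.group(1)) if k_match else 0   # no match -> zero targets
    phrases = re.findall(r"^-\s*(.+)$", vlm_output, flags=re.MULTILINE)
    return k, phrases[:k]                          # drop redundant extras

# Hypothetical VLM reply for "the man on the left and the woman in red":
reply = "k: 2\n- the man on the left\n- the woman in red"
k, sub_exprs = parse_decomposition(reply)
# k == 2; if k were 0, Stage 2 would be skipped entirely
```

In practice `reply` would come from the frozen VLM given the image and the assembled prompt; the format constraint $P_C$ is what makes this simple parse reliable.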

3. Referent Grounding and the HEMix Similarity Framework

Stage 2 localizes each decomposed sub-expression tt within the image, reducing the problem to one-target WREC. The pipeline comprises anchor extraction, joint embedding, and anchor selection via hybrid similarity scoring.

  • Anchor Extraction: YOLOv3 (pre-trained on MS-COCO, weights frozen) detects bounding box candidates on input images (resized to $416 \times 416$). The top 10% by objectness score are retained as anchors.
  • Joint Embedding:
    • A visual encoder maps anchor features to $f_v \in \mathbb{R}^{512}$.
    • A text encoder maps each sub-expression $t$ to $f_t \in \mathbb{R}^{512}$.
    • Two sets of linear projection heads produce Euclidean $(W_{EV}, W_{ET})$ and hyperbolic $(W_{HV}, W_{HT})$ embeddings.
  • Contrastive Loss: For each positive anchor $f_{a^0}^i$ and sub-expression $t^i$, with negatives $\{f_{a^n}^j\}$,

$$\mathcal{L}_c = -\log \frac{\exp(\mathrm{HEMix}(f_{a^0}^i, f_t^i)/\tau)}{\sum_{j=1}^N \sum_{n=0}^{M-1} \mathbb{1}_{\neg(i=j \land n=0)} \exp(\mathrm{HEMix}(f_{a^n}^j, f_t^i)/\tau)}$$

with temperature $\tau$.
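A direct NumPy transcription of this loss is sketched below, assuming the pairwise similarities (HEMix or any other scoring function) have already been computed into a tensor; the tensor layout and batch shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def contrastive_loss(sims, tau=0.07):
    """Contrastive loss over N image-expression pairs with M anchors each.

    sims[j, n, i] = similarity between anchor n of image j and sub-expression i;
    by convention n = 0 is the positive anchor for expression i when j = i.
    Following the indicator in the formula, the positive term (j=i, n=0) is
    excluded from the denominator.
    """
    N, M, _ = sims.shape
    e = np.exp(sims / tau)
    total = 0.0
    for i in range(N):
        num = e[i, 0, i]                 # positive pair (a^0_i, t_i)
        denom = e[:, :, i].sum() - num   # every other anchor as a negative
        total += -np.log(num / denom)
    return total / N
```

Raising the positive-pair similarity (or lowering negative similarities) decreases the loss, which is the gradient signal that pulls matched anchor and text embeddings together.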

  • HEMix: A parametric blend of Euclidean and hyperbolic similarities:

$$\mathrm{HEMix}(f_v, f_t) = (1-\alpha)\,\mathrm{Sim}_E(f_v, f_t) + \alpha\,\mathrm{Sim}_H(f_v, f_t), \quad \alpha \in (0,1),$$

where:
  • Euclidean: $\mathrm{Sim}_E(f_v, f_t) = \langle f_v W_{EV}, f_t W_{ET} \rangle$.
  • Hyperbolic (Lorentz model): features are projected to $z_v, z_t$; time-like coordinates are lifted via $x_0^v = \sqrt{\|z_v\|^2 + \kappa^{-1}}$ and $x_0^t = \sqrt{\|z_t\|^2 + \kappa^{-1}}$; similarity is the Lorentz inner product $\mathrm{Sim}_H = -x_0^v x_0^t + \langle z_v, z_t \rangle$.
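The two similarity branches and their blend can be written out directly from these definitions. A minimal NumPy sketch, with illustrative dimensions and an assumed value of $\alpha$ (the paper treats $\alpha$ as a hyperparameter; 0.3 here is arbitrary):

```python
import numpy as np

def sim_euclidean(fv, ft, W_EV, W_ET):
    """Inner product of linearly projected features (Euclidean branch)."""
    return float((fv @ W_EV) @ (ft @ W_ET))

def sim_lorentz(zv, zt, kappa=1.0):
    """Lorentz inner product after lifting the time-like coordinate
    x0 = sqrt(||z||^2 + 1/kappa) onto the hyperboloid."""
    x0v = np.sqrt(np.dot(zv, zv) + 1.0 / kappa)
    x0t = np.sqrt(np.dot(zt, zt) + 1.0 / kappa)
    return float(-x0v * x0t + np.dot(zv, zt))

def hemix(fv, ft, W_EV, W_ET, W_HV, W_HT, alpha=0.3, kappa=1.0):
    """Convex blend of Euclidean and hyperbolic similarity, alpha in (0, 1)."""
    zv, zt = fv @ W_HV, ft @ W_HT   # learnable linear hyperbolic projection
    return (1 - alpha) * sim_euclidean(fv, ft, W_EV, W_ET) \
         + alpha * sim_lorentz(zv, zt, kappa)
```

Note the sanity check implied by the lifting formula: the Lorentz inner product of any lifted point with itself is exactly $-1/\kappa$, confirming the points lie on the hyperboloid.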

A formal proposition demonstrates that for estimators with correlation $\rho < 1$, there exists an optimal blend parameter $\alpha^* \in (0,1)$ such that the mean squared error of HEMix is lower than that of either similarity alone.
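The intuition behind this proposition is the standard variance-reduction argument for blending imperfectly correlated estimators. A small Monte Carlo sketch (the numbers are illustrative, not from the paper): for two unbiased, equally noisy estimators with correlation $\rho$, the blend $(1-\alpha)X + \alpha Y$ at $\alpha = 1/2$ has variance $\sigma^2(1+\rho)/2 < \sigma^2$ whenever $\rho < 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, n = 0.5, 1.0, 200_000

# Two unbiased, equally noisy estimators of the true value 0, correlation rho < 1
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

def mse(alpha):
    """Empirical MSE of the blended estimator (1-alpha)*x + alpha*y."""
    return np.mean(((1 - alpha) * x + alpha * y) ** 2)

# The interior blend beats either endpoint:
# mse(0.0) ~ 1.0, mse(1.0) ~ 1.0, mse(0.5) ~ (1+rho)/2 = 0.75
```

The same argument carries over to HEMix: as long as the Euclidean and hyperbolic similarity scores are not perfectly correlated, some interior $\alpha^*$ strictly improves on both.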

No auxiliary margin or ranking loss is applied beyond Lc\mathcal{L}_c.

4. Implementation Details

Major architectural and training design points include:

  • Frozen YOLOv3 (MS-COCO), input $416 \times 416$.
  • Text encoder maps sub-expressions (up to 15 tokens) to 512-dimensional vectors.
  • Linear projections for both anchor and text features into Euclidean and hyperbolic spaces.
  • Hyperbolic mapping via learnable linear projection instead of exponential map for improved stability.
  • Training with the AdamW optimizer, learning rate $1 \times 10^{-4}$, batch size 64, over 25 epochs.
  • Hardware: WGREC results obtained on A6000 48 GB; WREC on A100 40 GB.

At inference, the process includes anchor extraction, joint feature embedding, HEMix similarity computation, and threshold-based anchor selection for bounding box prediction.
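The final selection step can be sketched as follows. The threshold value, function names, and the use of cosine similarity as a stand-in scorer are all illustrative assumptions; in LIHE the scorer would be HEMix over the projected anchor and text embeddings.

```python
import numpy as np

def select_anchors(anchor_feats, text_feat, similarity_fn, threshold=0.8):
    """Score every retained anchor against one sub-expression and keep all
    anchors clearing the threshold, yielding 0, 1, or many boxes."""
    scores = np.array([similarity_fn(a, text_feat) for a in anchor_feats])
    return np.flatnonzero(scores >= threshold)

# Toy usage with cosine similarity standing in for HEMix:
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

anchors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]
query = np.array([1.0, 0.0])
selected = select_anchors(anchors, query, cosine, threshold=0.8)
# selected -> indices [0, 2]: anchors 0 and 2 both match the query direction
```

Thresholding rather than taking an arg-max is what lets the model return multiple boxes for multi-target expressions and an empty set when no anchor matches.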

5. Experimental Results and Analysis

LIHE establishes the first weakly-supervised WGREC baseline and demonstrates substantial performance advantages over prior weakly supervised methods on both generalized and standard REC tasks.

  • WGREC Benchmarks (gRefCOCO, Ref-ZOM), reported as precision / no-target accuracy (N-acc):

Method | gRefCOCO-val | Ref-ZOM-val
RefCLIP* | 17.85/0.0 | 35.78/0.0
LIHE | 39.61/67.49 | 50.36/97.70

    • LIHE correctly handles both zero-target and multi-target scenarios.
  • WREC Benchmarks (RefCOCO, RefCOCO+, RefCOCOg):
    • RefCLIP*+HEMix improves RefCLIP* by +1.0–1.5% Acc@0.5 across splits (e.g., RefCOCO-val: 59.88 → 60.95).
  • Ablation Studies:
    • HEMix outperforms Euclidean and hyperbolic similarity alone, with average improvements of +1.53% on WREC, +0.90% on WGREC.
    • Omitting the in-context examples ($P_E$) from the prompt reduces N-acc from 67.49% to 49.00%.
    • Learnable linear projection for hyperbolic embedding results in +5.24% gain on RefCOCO-val and +1.23% on gRefCOCO-val versus exponential mapping.
    • Cross-dataset generalization: training on gRefCOCO and testing on RefCOCO+ achieves 43.22% for LIHE vs. 38.91% for RefCLIP*.
  • Qualitative Analysis: LIHE achieves correct multi-target grounding under occlusion, identifies no-target cases, but can experience VLM hallucinations, repeated phrase decompositions, and missed fine details.

6. Significance and Broader Context

LIHE represents a methodological advance for weakly supervised image-language understanding where expressions may not uniquely ground to a single object. Its two-stage division—linguistic decomposition via VLM prompt engineering and referential grounding via hybrid geometry similarity—addresses foundational issues in weak supervision and semantic fidelity. The HEMix similarity is agnostic to downstream tasks and demonstrates plug-and-play compatibility, as evidenced by its improvements on standard WREC baselines.

Extensive evaluation on gRefCOCO, Ref-ZOM, RefCOCO, RefCOCO+, and RefCOCOg substantiates both the architectural contributions of LIHE and the broader utility of hyperbolic-Euclidean hybrid representations (Shi et al., 15 Nov 2025).
