GOT-Edit: Semantic Edit Representations & Tracking

Updated 3 July 2026
  • GOT-Edit names several frameworks; the language-side variant applies vectorized edit representations and unbalanced optimal transport to the evaluation of grammatical error correction.
  • It leverages explicit chain-of-thought reasoning combined with multi-guidance diffusion for semantically precise visual generation and editing in interactive settings.
  • The system integrates semantic and geometric features via online model editing with null-space constraints to enhance object tracking accuracy under occlusions and clutter.

GOT-Edit is the title for multiple advanced methodologies across computer vision and language technologies, notably including: (i) the UOT-ERRANT metric for grammatical error correction (GEC) evaluation via optimal transport over semantic edit representations (Goto et al., 5 Feb 2026); (ii) a reasoning-driven visual generation and editing pipeline based on explicit chain-of-thought (GoT) representations (Fang et al., 13 Mar 2025); and (iii) a geometry-aware object tracking framework implementing online model editing with semantic and geometric cues (Chen et al., 9 Feb 2026). This article focuses on the rigorous technical formulation, pipelines, and empirical findings associated with each major GOT-Edit methodology as they appear in the arXiv literature.

1. Edit Representation and GEC Evaluation: UOT-ERRANT

UOT-ERRANT (also referenced as GOT-Edit in some contexts) is an evaluation metric for grammatical error correction that departs from simple surface overlap and embedding similarity by focusing on the induced sentence edits. For a source sentence $S$, ERRANT-style edits $\mathcal{E} = \{e_1, \ldots, e_{|\mathcal{E}|}\}$ are extracted for both hypothesis and reference corrections.

Edit Vector Definition:

For each edit $e \in \mathcal{E}$, its vector is the semantic delta induced by its application (or removal) with respect to the encoded sentence:

$$V(e, \mathcal{E}, S) = \mathrm{Enc}(S_{\mathcal{E}}) - \mathrm{Enc}(S_{\mathcal{E} \setminus \{e\}})$$

where $\mathrm{Enc}(\cdot)$ is a sentence encoder (e.g., mean-pooled BERT/ELECTRA), $S_{\mathcal{E}}$ is the fully edited sentence, and $S_{\mathcal{E} \setminus \{e\}}$ is the same sentence with $e$ reverted.
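The edit-vector definition can be illustrated with a minimal sketch. The encoder here (`toy_encode`, a mean of deterministic per-token random vectors) is a hypothetical stand-in for a real sentence encoder, and the `(start, end, replacement)` edit format is an assumption made for illustration rather than ERRANT's own data structure.

```python
import hashlib
import numpy as np

DIM = 64  # toy embedding width (assumption, not from the source)

def toy_encode(tokens):
    """Hypothetical stand-in for Enc(.): mean of deterministic per-token vectors."""
    if not tokens:
        return np.zeros(DIM)
    vecs = []
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "little")
        vecs.append(np.random.default_rng(seed).standard_normal(DIM))
    return np.mean(vecs, axis=0)

def apply_edits(source, edits):
    """Apply non-overlapping (start, end, replacement_tokens) edits to a token list."""
    out, cursor = [], 0
    for start, end, repl in sorted(edits, key=lambda e: (e[0], e[1])):
        out.extend(source[cursor:start])
        out.extend(repl)
        cursor = end
    out.extend(source[cursor:])
    return out

def edit_vectors(source, edits, encode=toy_encode):
    """Return Enc(S_E) - Enc(S_{E without e}) for every edit e in E."""
    full = encode(apply_edits(source, edits))
    return [full - encode(apply_edits(source, edits[:k] + edits[k + 1:]))
            for k in range(len(edits))]

source = "He go to school yesterday .".split()
edits = [(1, 2, ["went"]), (2, 3, ["to", "the"])]
vectors = edit_vectors(source, edits)
```

Each vector is computed against the fully edited sentence, so an edit's representation depends on the other edits applied alongside it.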

Unbalanced Optimal Transport Formulation:

Edit alignment between hypothesis ($\{v_i^{\mathrm{hyp}}\}_{i=1}^{n}$) and reference ($\{v_j^{\mathrm{ref}}\}_{j=1}^{m}$) edit vectors proceeds via unbalanced optimal transport (UOT):

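  • Edit "mass" vectors: $a \in \mathbb{R}_{\ge 0}^{n}$ over the hypothesis edits and $b \in \mathbb{R}_{\ge 0}^{m}$ over the reference edits
  • Cost matrix: $C \in \mathbb{R}^{n \times m}$, where $C_{ij}$ is the cost of aligning $v_i^{\mathrm{hyp}}$ with $v_j^{\mathrm{ref}}$
  • KL-relaxed, entropy-regularized optimization:

$$P^{*} = \arg\min_{P \ge 0} \; \langle C, P \rangle + \varepsilon\,\Omega(P) + \tau\,\mathrm{KL}(P\mathbf{1}_m \,\|\, a) + \tau\,\mathrm{KL}(P^{\top}\mathbf{1}_n \,\|\, b)$$

where $\Omega$ is the entropic regularizer with strength $\varepsilon$ and $\tau$ weights the KL relaxation of the two marginal constraints.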

Solution employs Schmitzer's stabilized Sinkhorn algorithm as implemented in POT.
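As a rough illustration of KL-relaxed, entropy-regularized transport, the sketch below runs the plain scaling iteration for unbalanced Sinkhorn in NumPy. It is not Schmitzer's stabilized variant (it has no log-domain absorption), and the values of `eps`, `tau`, the masses, and the cost matrix are illustrative assumptions. In practice POT's `ot.unbalanced` module (for example `sinkhorn_unbalanced`) would be used instead.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.1, tau=1.0, n_iter=500):
    """Plain scaling iterations for entropic UOT with KL-relaxed marginals.

    a: (n,) hypothesis masses, b: (m,) reference masses, C: (n, m) costs.
    eps is the entropic strength, tau the marginal relaxation weight.
    """
    a, b, C = (np.asarray(x, dtype=float) for x in (a, b, C))
    K = np.exp(-C / eps)               # Gibbs kernel
    power = tau / (tau + eps)          # damping exponent from the KL relaxation
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]  # transport plan

# Two hypothesis edits, three reference edits; the third reference edit
# has no cheap partner, so it ends up with less transported mass.
a = np.array([1.0, 1.0])
b = np.array([1.0, 1.0, 1.0])
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
P = unbalanced_sinkhorn(a, b, C)
```

Because the marginals are only softly enforced, the row and column sums of `P` need not equal `a` and `b`, which is what lets unmatched edits remain partly untransported.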

Soft Match Scoring:

Given the transport plan $P$ between hypothesis and reference edits,

  • True positives: the edit mass that $P$ transports between hypothesis and reference edits
  • False positives: the hypothesis edit mass left unmatched by $P$
  • False negatives: the reference edit mass left unmatched by $P$

Precision, recall, and $F_\beta$ are then computed in standard form.
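A small sketch of soft scoring from a transport plan follows. It assumes, as one plausible reading rather than the paper's confirmed definition, that true positives are the total transported mass and that false positives and false negatives are the leftover hypothesis and reference mass (clamped at zero). The default `beta=0.5` follows common GEC practice and is also an assumption.

```python
import numpy as np

def soft_prf(P, a, b, beta=0.5):
    """Soft precision, recall, and F_beta from a transport plan.

    Assumed definitions (not confirmed by the source):
      TP = total transported mass,
      FP = hypothesis mass left over, FN = reference mass left over.
    """
    P, a, b = (np.asarray(x, dtype=float) for x in (P, a, b))
    tp = P.sum()
    fp = max(a.sum() - tp, 0.0)
    fn = max(b.sum() - tp, 0.0)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f

# Two hypothesis edits partially matched against three reference edits.
a = [1.0, 1.0]
b = [1.0, 1.0, 1.0]
P = [[0.9, 0.0, 0.0],
     [0.0, 0.6, 0.0]]
precision, recall, f_half = soft_prf(P, a, b)
```

With these inputs the transported mass is 1.5, giving a precision of 0.75 and a recall of 0.5.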

Empirical Evaluation:

UOT-ERRANT reports Pearson $r$ and Spearman $\rho$ correlations on SEEDA-E Base and leads in average ranking over edit-level metrics. In +Fluency evaluations, its soft alignment provides notably superior ranking of system outputs with diverse edits (Goto et al., 5 Feb 2026).

2. Reasoning-Driven Visual Generation and Editing: GoT-Edit

GoT-Edit refers to a paradigm where explicit language-based reasoning—"chain-of-thought" reasoning chains—guides both image generation and editing. This enables multi-stage, semantically precise, and spatially grounded edits, especially in instruction-guided visual tasks (Fang et al., 13 Mar 2025).

Pipeline Stages:

  1. Chain Generation: A Qwen2.5-VL MLLM generates a sequence of stepwise, natural-language reasoning statements, each paired with bounding-box coordinates. For editing, the input is a source image together with an edit instruction; for generation, it is a text prompt.

  2. Semantic and Spatial Guidance Extraction: From the reasoning chain, obtain:
     - Semantic embeddings from MLLM cross-attention.
     - Spatial features via mask-encoding of the bounding-box regions.
     - Reference image embeddings via a VAE encoder.

  3. Multi-Guidance Diffusion: An SDXL-style U-Net diffusion model integrates the semantic, spatial, and reference-image guidance through cross-attention and conditioning, employing classifier-free guidance.
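One common way to apply classifier-free guidance with several conditions is a nested sum of differences between noise predictions made with progressively more conditioning. The sketch below shows that generic pattern only. The ordering of the conditions, the function name `multi_guidance_cfg`, and the scale values are assumptions for illustration and are not taken from Fang et al.

```python
import numpy as np

def multi_guidance_cfg(eps_none, eps_ref, eps_ref_spatial, eps_full,
                       w_ref=1.0, w_spatial=1.0, w_semantic=1.0):
    """Generic nested classifier-free guidance over three conditions.

    eps_none:        prediction with no conditioning
    eps_ref:         prediction with reference-image guidance only
    eps_ref_spatial: prediction with reference + spatial guidance
    eps_full:        prediction with reference + spatial + semantic guidance
    Each weight scales the contribution of the condition added at that step.
    """
    return (eps_none
            + w_ref * (eps_ref - eps_none)
            + w_spatial * (eps_ref_spatial - eps_ref)
            + w_semantic * (eps_full - eps_ref_spatial))

rng = np.random.default_rng(0)
shape = (4, 8, 8)  # toy latent shape (assumption)
e0, e1, e2, e3 = (rng.standard_normal(shape) for _ in range(4))
guided = multi_guidance_cfg(e0, e1, e2, e3, w_ref=1.5, w_spatial=4.0, w_semantic=7.5)
```

Setting every weight to 1 recovers the fully conditioned prediction, and setting every weight to 0 recovers the unconditional one, which makes the role of each scale easy to check.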

Interactive Editing:

Any reasoning chain step can be interactively revised (e.g., object location, attribute, or object identity), and new edits are generated without retraining by recalculating the guidance features and rerunning the diffusion model.
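The interactive revision loop can be pictured with a small data-structure sketch. The `ReasoningStep` type, the `revise_step` helper, and the box-to-mask rasterization are hypothetical names and simplifications introduced here; the actual chain format and mask encoder are not specified in this article.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive pixels

@dataclass(frozen=True)
class ReasoningStep:
    text: str
    box: Box

def revise_step(chain: List[ReasoningStep], index: int, **changes) -> List[ReasoningStep]:
    """Return a new chain with one step updated; the original chain is untouched."""
    return [replace(s, **changes) if i == index else s for i, s in enumerate(chain)]

def boxes_to_mask(chain: List[ReasoningStep], height: int, width: int):
    """Rasterize every step's box into one binary mask (1 inside any box)."""
    mask = [[0] * width for _ in range(height)]
    for step in chain:
        x0, y0, x1, y1 = step.box
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                mask[y][x] = 1
    return mask

chain = [ReasoningStep("replace the red mug with a blue mug", (2, 2, 5, 5)),
         ReasoningStep("keep the table unchanged", (0, 5, 8, 8))]
# The user moves the first object's box; the spatial guidance is then rebuilt.
revised = revise_step(chain, 0, box=(4, 1, 7, 4))
mask = boxes_to_mask(revised, height=8, width=8)
```

After a revision, only the guidance inputs change, so the diffusion model can be rerun on the new mask and text without any weight updates.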

Empirical Benchmarks:

GoT-Edit achieves 0.64 on GenEval overall, CLIP-I 0.864, and CLIP-T 0.276 on Emu-Edit, outperforming prior generalist approaches and offering competitive coverage across edit types (Fang et al., 13 Mar 2025).

3. Geometry-Aware Object Tracking with Online Model Editing

GOT-Edit in the context of object tracking introduces a framework where semantic and geometric cues from 2D video frames are jointly exploited via online editing of predictor weights, enabling robustness to occlusions and clutter without requiring depth sensors (Chen et al., 9 Feb 2026).

Model Components and Fusion:

  • Semantic Features: Extracted from a frozen DINOv2-L backbone.
  • Geometric Features: Extracted using a Visual Geometry Grounded Transformer (VGGT), pre-trained for monocular geometric tasks (pose, dense point, and depth estimation), producing intermediate spatial feature maps.

Online Model Editing via Null-Space Constraint:

The method draws on AlphaEdit's associative memory formulation. Tracking weights $W$ are decomposed into a semantic component $W_{\mathrm{sem}}$ and a geometry-driven perturbation $\Delta W$. The perturbation $\Delta W$ is projected into the null space of the semantic features to obtain $\Delta W'$:

$$\Delta W' = \Delta W \, P_{\mathrm{null}}$$

where $P_{\mathrm{null}}$ is constructed from the singular vectors with near-zero singular values of the regularized Gram matrix of semantic features after covariance whitening.
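The null-space constraint can be checked numerically with a short sketch. It follows the AlphaEdit-style idea of projecting an update so that it leaves responses to stored keys unchanged. The names `K`, `W_sem`, and `delta_W`, the feature layout, and the relative tolerance are assumptions, and the regularization and covariance whitening steps mentioned above are omitted for brevity.

```python
import numpy as np

def null_space_projector(K, rel_tol=1e-8):
    """Projector onto the null space of the semantic features.

    K: (d, N) matrix whose columns are semantic feature vectors.
    Eigenvectors of the Gram matrix K K^T with near-zero eigenvalues span
    the directions that no semantic feature excites.
    """
    gram = K @ K.T
    eigvals, eigvecs = np.linalg.eigh(gram)
    keep = eigvals <= rel_tol * max(eigvals.max(), 1e-12)
    U = eigvecs[:, keep]
    return U @ U.T

rng = np.random.default_rng(0)
d, n_feats, n_out = 16, 6, 4            # toy sizes (assumption)
K = rng.standard_normal((d, n_feats))   # semantic features
W_sem = rng.standard_normal((n_out, d))
delta_W = rng.standard_normal((n_out, d))

P_null = null_space_projector(K)
delta_W_proj = delta_W @ P_null         # geometry update restricted to the null space
W_final = W_sem + delta_W_proj
```

Because `P_null @ K` is numerically zero, `W_final @ K` matches `W_sem @ K`, so the geometric perturbation cannot disturb the responses to the semantic features.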

Tracking Pipeline:

  1. Align and fuse semantic/geometric features with a learned spatial gating mask.
  2. Transformer-based predictor outputs both the semantic weights and the geometry perturbation.
  3. Final weights, formed by adding the null-space-projected perturbation to the semantic weights, are used for localization.
  4. Box regression outputs bounding box offsets.
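The fusion and localization steps can be sketched in a few lines of NumPy. The convex per-location gate and the linear scoring filter below are simplifying assumptions chosen for illustration; the article does not specify the exact form of the gating mask or of the localization head.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(sem, geo, gate_logits):
    """Blend aligned feature maps with a per-location gate in [0, 1].

    sem, geo: (H, W, C) semantic and geometric features.
    gate_logits: (H, W) raw gate values (learned in a real tracker).
    """
    g = sigmoid(gate_logits)[..., None]
    return g * sem + (1.0 - g) * geo

def localize(features, weights):
    """Score every location with a linear filter and return the peak (row, col)."""
    scores = features @ weights  # (H, W) score map
    peak = np.unravel_index(np.argmax(scores), scores.shape)
    return peak, scores

rng = np.random.default_rng(0)
sem = rng.standard_normal((4, 4, 3))
geo = rng.standard_normal((4, 4, 3))
fused = gated_fusion(sem, geo, np.zeros((4, 4)))  # zero logits give an even blend
weights = np.array([1.0, -1.0, 0.5])              # stands in for the final weights
peak, scores = localize(fused, weights)
```

Inspecting the gate values per location shows where the tracker leans on geometric rather than semantic evidence.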

Training and Evaluation:

The loss combines hinge-based classification with Generalized IoU regression. GOT-Edit is trained on LaSOT, GOT10k, TrackingNet, and COCO, and is tested on AVisT, NfS, OTB, VOT2020/2022, and others, consistently showing 2–3% SUC gains over baselines.
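For reference, Generalized IoU and a hinge term can be written compactly. The sketch below uses the standard GIoU definition together with a textbook hinge loss; the exact hinge variant, the margin, and the weighting between the two terms are assumptions, since the article only names the loss families.

```python
def giou(box_a, box_b):
    """Generalized IoU for (x0, y0, x1, y1) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def hinge(score, label, margin=1.0):
    """Textbook hinge loss for a label in {-1, +1} (assumed variant)."""
    return max(0.0, margin - label * score)

def tracking_loss(score, label, pred_box, gt_box, reg_weight=1.0):
    """Classification hinge plus (1 - GIoU) regression penalty."""
    return hinge(score, label) + reg_weight * (1.0 - giou(pred_box, gt_box))

loss = tracking_loss(score=0.5, label=1, pred_box=(0, 0, 2, 2), gt_box=(1, 1, 3, 3))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes because the enclosing-box term still varies with their distance.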

Efficiency:

At 378×378 resolution, total per-frame runtime is ~127 ms (8 fps), with the editing procedure accounting for ~17 ms overhead.

4. Practical Implementation Details

UOT-ERRANT/GOT-Edit

  1. Extract ERRANT edits from source to hypothesis and source to reference.
  2. Generate each edit vector using the sentence encoder delta.
  3. Construct mass vectors and cost matrix.
  4. Solve the UOT problem using the stabilized Sinkhorn algorithm available in the POT library.
  5. Compute soft scores for true positives, false positives, and false negatives from the transport matrix.
  6. Combine into an $F_\beta$ score, selecting the maximal value across multiple references.
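Steps 3 to 6 can be strung together in a compact sketch that takes precomputed edit vectors. Several choices here are assumptions made only so the example runs end to end: unit mass per edit, cosine distance as the cost, a plain (non-stabilized) scaling iteration in place of POT, leftover-mass definitions for false positives and negatives, and `beta=0.5`. It also assumes at least one non-zero edit vector on each side.

```python
import numpy as np

def cosine_cost(H, R):
    """Pairwise cosine distance between hypothesis and reference edit vectors."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    Rn = R / np.linalg.norm(R, axis=1, keepdims=True)
    return 1.0 - Hn @ Rn.T

def uot_plan(a, b, C, eps=0.1, tau=1.0, n_iter=500):
    """Unbalanced Sinkhorn scaling iterations (not the stabilized variant)."""
    K, p = np.exp(-C / eps), tau / (tau + eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** p
        v = (b / (K.T @ u)) ** p
    return u[:, None] * K * v[None, :]

def f_beta(H, R, beta=0.5):
    """Soft F_beta between one hypothesis and one reference edit set."""
    a, b = np.ones(len(H)), np.ones(len(R))  # unit mass per edit (assumption)
    tp = uot_plan(a, b, cosine_cost(H, R)).sum()
    fp, fn = max(a.sum() - tp, 0.0), max(b.sum() - tp, 0.0)
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)

def score_against_references(H, references, beta=0.5):
    """Step 6: keep the best score over all available references."""
    return max(f_beta(H, R, beta) for R in references)

H = np.array([[1.0, 0.0], [0.0, 1.0]])        # two hypothesis edit vectors
references = [np.array([[0.0, 1.0]]),         # reference with one edit
              np.array([[1.0, 0.0], [0.0, 1.0]])]  # reference matching both edits
best = score_against_references(H, references)
```

Taking the maximum rewards a hypothesis for agreeing with any one valid correction instead of penalizing it for disagreeing with the others.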

GoT-Edit

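Algorithm 1 (Editing Inference):

  1. Given a source image and an edit instruction, generate the reasoning chain with per-step bounding boxes using the MLLM.
  2. Extract semantic embeddings, spatial features, and reference image embeddings from the chain and the source image.
  3. Run the multi-guidance diffusion model with classifier-free guidance to produce the edited image.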

GOT-Edit Tracker

  • Extract and align semantic and VGGT geometric features.
  • Fuse using spatial gating masks.
  • Stack and encode features with positional embeddings, then predict localization weights and box regression.
  • Apply null-space projection and combine weights for localization.

5. Interpretability and Generalization Potential

UOT-ERRANT/GOT-Edit:

Provides an interpretable transport matrix $P$ as a soft alignment between edits, where off-diagonal or fractional entries reveal semantic proximity between noisy hypotheses and references. This interpretability supports system diagnostics and linguistic analysis.
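Reading a transport matrix as a soft alignment can be as simple as listing, for each hypothesis edit, the reference edit that receives most of its mass. The helper below is a hypothetical diagnostic, and the `min_mass` cutoff for calling an edit unmatched is an arbitrary illustrative value.

```python
import numpy as np

def describe_alignment(P, hyp_labels, ref_labels, min_mass=0.05):
    """List (hypothesis edit, best reference edit or None, transported mass)."""
    P = np.asarray(P, dtype=float)
    rows = []
    for i, label in enumerate(hyp_labels):
        j = int(np.argmax(P[i]))
        mass = float(P[i, j])
        partner = ref_labels[j] if mass >= min_mass else None
        rows.append((label, partner, round(mass, 3)))
    return rows

P = [[0.90, 0.05],
     [0.01, 0.02]]
report = describe_alignment(P, ["go -> went", "a -> the"], ["go -> went", "insert 'to'"])
# report == [("go -> went", "go -> went", 0.9), ("a -> the", None, 0.02)]
```

Fractional masses between the extremes are the interesting cases, since they flag hypothesis edits that are semantically close to a reference edit without matching it exactly.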

Potential for generalization extends to any domain with localized “edit” operations, not limited to GEC—applicable to text simplification, ASR correction, and image editing (with alternate encoders capturing the edit effect).

GoT-Edit:

Editing pipeline is highly interpretable due to explicit reasoning chains and supports direct user manipulation for targeted image synthesis or editing tasks.

GOT-Edit Tracker:

Soft fusion and null-space constraints render the contribution of geometric and semantic cues explicit, allowing inspection of which signal domains are dominating tracker adaptation under different visual scenarios.

6. Empirical Benchmarks and Limitations

Quantitative Performance Summary

| System | Task | Key Metrics (Test Set) | Value/Improvement |
|---|---|---|---|
| UOT-ERRANT | GEC eval | Pearson $r$, Spearman $\rho$ (SEEDA) | +Fluency domain, improved ranking |
| GoT-Edit | Visual editing | GenEval overall | 0.64 vs. 0.63 (JanusFlow) |
| GOT-Edit | Tracking | SUC (LaSOT) | 79.8% (+2.3); OP50: 73.7% |

Known Limitations

  • UOT-ERRANT: Exact performance depends on encoder choice; extension beyond ERRANT requires calibration.
  • GoT-Edit: Dataset curation can produce imperfect chains; high compute requirements for large-scale training and interactive diffusion.
  • GOT-Edit tracker: Fusion is contingent on properly trained spatial gating; full robustness in extremely cluttered/occluded scenarios is an open challenge.

7. Research Outlook and Extensions

GOT-Edit, in all its forms, exemplifies a trend towards tightly integrating semantic reasoning, explicit edit modeling, and robust cross-modality fusion in both language and vision tasks. Suggested avenues of extension include memory-augmented MLLMs for extended reasoning consistency, chain manipulation modules for enhanced controllability, and extension of geometric reasoning to higher-order cues (polygon masks, 3D segmentation). The generalizability of vector-based edit modeling to domains such as ASR correction and vision further underscores its foundational methodological significance (Goto et al., 5 Feb 2026, Fang et al., 13 Mar 2025, Chen et al., 9 Feb 2026).
