Geometry-guided Cross-View Attention

Updated 4 July 2026
  • Geometry-guided Cross-view Attention (GCA) is an approach where cross-view interactions are enhanced using explicit geometric cues like depth maps, epipolar lines, and semantic meshes.
  • GCA refines traditional attention by replacing or constraining appearance-based queries with geometry-derived signals to better align multi-view correspondences.
  • It has demonstrated improvements in tasks such as sparse-view reconstruction, satellite localization, anomaly detection, multi-view face generation, and 3D texture synthesis.

Geometry-guided Cross-view Attention (GCA) denotes a family of attention mechanisms in which cross-view token interactions are constrained, parameterized, or supervised by explicit geometric structure rather than left to unstructured full attention. Across sparse-view reconstruction, ground-to-satellite localization, industrial anomaly detection, multi-view face generation, and 3D texture synthesis, the central objective is the same: reduce erroneous cross-view matching by anchoring attention to depth-induced correspondences, epipolar lines, projected viewing frusta, semantic mesh parts, or canonical 3D coordinates (Cao et al., 12 May 2026, Shi et al., 2023, Lentsch et al., 2022, Liu et al., 14 Mar 2025, Choi et al., 26 Jun 2026, Liu et al., 26 Nov 2025).

1. Historical emergence and problem setting

The term has been used in multiple research lines that share a common diagnosis of naïve cross-view attention. In sparse-view reconstruction, standard multi-view self-attention can fail because corrupted target renderings provide unreliable queries; GeoQuery names this failure mode “query contamination” and attributes inconsistent refinement to erroneous cross-view retrieval from damaged query features (Cao et al., 12 May 2026). In 3D texture generation, CaliTex attributes cross-view inconsistency to “attention ambiguity,” where unstructured full attention across tokens and modalities produces geometric confusion and unstable appearance-structure coupling (Liu et al., 26 Nov 2025). In ground-to-satellite localization, coarse retrieval-based matching is limited by the sampling density of database satellite images, motivating geometry-guided refinement of relative rotation and translation (Shi et al., 2023). In industrial anomaly detection, purely data-driven cross-view attention is reported to disregard the geometric properties of multi-camera systems (Liu et al., 14 Mar 2025).

These formulations differ in task and architecture, but they converge on the same technical premise: view correspondence should not be inferred solely from appearance similarity when reliable geometric constraints are available. A plausible implication is that GCA is best understood not as a single module, but as a design principle for restricting the admissible attention graph.

| Domain | Geometry signal | Attention modification |
|---|---|---|
| Sparse-view reconstruction (Cao et al., 12 May 2026) | depth maps + camera poses | proxy queries + local-window cross-view attention |
| Ground-to-satellite localization (Shi et al., 2023) | projection model from overhead to ground | scene-specific local MHCA |
| Cross-view pose estimation (Lentsch et al., 2022) | HFoV slice masks | ground-guided aerial reweighting + geometry-guided pooling |
| Industrial anomaly detection (Liu et al., 14 Mar 2025) | fundamental matrix + epipolar lines | masked cross-attention |
| Multi-view face generation (Choi et al., 26 Jun 2026) | canonical UV position map | cross-attention alignment loss |
| 3D texture generation (Liu et al., 26 Nov 2025) | semantic mesh parts + geometry condition | part-aligned and condition-routed attention |

2. Core technical patterns

A recurrent pattern is to replace appearance-derived queries with geometry-derived queries. In GeoQuery, the reference pixel $u_r$ is back-projected to 3D with metric depth $D^r(u_r)$, transformed by the relative camera pose, and projected into the target plane to obtain a dense correspondence field $\mathcal C_{t\to r}(u_t)$ and validity mask $M_{t\to r}(u_t)$. A proxy feature is then sampled from the reference feature map $F^r$ as

$$F^{r\to t}(u_t)=M_{t\to r}(u_t)\odot \mathrm{Sample}\bigl(F^r,\mathcal C_{t\to r}(u_t)\bigr),$$

and projected into the query space by

$$Q(u_t)=W_QF^{r\to t}(u_t).$$

This proxy query completely replaces the corrupted rendering feature $F^t$ as the attention query (Cao et al., 12 May 2026).
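
The warping step behind the proxy feature can be sketched in plain Python. This is a minimal illustration, not GeoQuery's implementation: it assumes a pinhole camera with intrinsics shared by both views, nearest-pixel splatting with a z-buffer, and nearest-neighbor sampling. The function names `warp_reference_to_target` and `proxy_features` are hypothetical, and the learned projection $W_Q$ is left out.

```python
import math

def warp_reference_to_target(depth_r, K, R, t):
    """Forward-warp reference pixels into the target view.

    depth_r: H x W nested list of metric depths in the reference view.
    K: (fx, fy, cx, cy) pinhole intrinsics (assumed shared by both views).
    R, t: rotation (3x3) and translation (3,) from reference to target frame.
    Returns (corr, valid): corr[v][u] is the reference pixel (u_r, v_r) that
    lands on target pixel (u, v); valid[v][u] is 1 where such a pixel exists.
    """
    fx, fy, cx, cy = K
    H, W = len(depth_r), len(depth_r[0])
    corr = [[None] * W for _ in range(H)]
    valid = [[0] * W for _ in range(H)]
    zbuf = [[math.inf] * W for _ in range(H)]
    for v_r in range(H):
        for u_r in range(W):
            d = depth_r[v_r][u_r]
            if d <= 0:
                continue  # no usable depth at this reference pixel
            # Back-project to a 3D point in the reference camera frame.
            X = [(u_r - cx) / fx * d, (v_r - cy) / fy * d, d]
            # Apply the relative camera pose.
            Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
            if Y[2] <= 0:
                continue  # point falls behind the target camera
            # Project into the target image plane (nearest pixel).
            u_t = round(fx * Y[0] / Y[2] + cx)
            v_t = round(fy * Y[1] / Y[2] + cy)
            # Keep the closest surface when several points hit one pixel.
            if 0 <= u_t < W and 0 <= v_t < H and Y[2] < zbuf[v_t][u_t]:
                zbuf[v_t][u_t] = Y[2]
                corr[v_t][u_t] = (u_r, v_r)
                valid[v_t][u_t] = 1
    return corr, valid

def proxy_features(feat_r, corr, valid):
    """F^{r->t}(u_t) = M(u_t) * Sample(F^r, C(u_t)) with nearest sampling."""
    channels = len(feat_r[0][0])
    out = []
    for v, row in enumerate(corr):
        out_row = []
        for u, c in enumerate(row):
            if valid[v][u]:
                out_row.append(list(feat_r[c[1]][c[0]]))
            else:
                out_row.append([0.0] * channels)  # masked-out location
        out.append(out_row)
    return out
```

Target pixels with no warped reference point keep a zero proxy feature, which mirrors the role of the validity mask in the equation above.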

A second pattern is to preserve ordinary attention scoring but restrict the support set by geometry. The epipolar attention module for industrial anomaly detection first computes the epipolar line $\ell_b=F_{ab}^\top p_{aj}$ from the fundamental matrix and a reference patch center $p_{aj}$, then defines a binary mask over support-view patches from the distance between each candidate patch center and $\ell_b$. The attention weights are computed under this mask, so only support-view tokens near the epipolar line remain reachable (Liu et al., 14 Mar 2025).
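
A minimal sketch of epipolar masking follows, assuming a hard distance threshold (called `tau` here, a hypothetical name) and a softmax renormalized over the surviving candidates. The exact mask definition and score normalization used in the paper are not reproduced; only the line formula $\ell_b=F_{ab}^\top p_{aj}$ comes from the text.

```python
import math

def epipolar_line(F, p):
    """Return l = F^T p for a homogeneous reference point p = (x, y, 1)."""
    return [sum(F[i][j] * p[i] for i in range(3)) for j in range(3)]

def point_line_distance(line, q):
    """Perpendicular distance from pixel q = (x, y) to line a*x + b*y + c = 0.

    Assumes a non-degenerate line, i.e. (a, b) != (0, 0).
    """
    a, b, c = line
    return abs(a * q[0] + b * q[1] + c) / math.hypot(a, b)

def epipolar_mask(F, p_ref, centers, tau):
    """1 for support-view patch centers within tau of the epipolar line."""
    line = epipolar_line(F, (p_ref[0], p_ref[1], 1.0))
    return [1 if point_line_distance(line, q) <= tau else 0 for q in centers]

def masked_attention_weights(scores, mask):
    """Softmax over attention scores restricted to positions with mask == 1."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)  # nothing reachable for this query
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

With a fundamental matrix for pure horizontal translation, the epipolar line of a reference point is the image row through it, so the mask keeps only patches at roughly the same height regardless of how high their raw attention score is.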

A third pattern uses grouped or routed attention instead of explicit geometric projection. CaliTex partitions a mesh into semantic parts, renders part-colored maps for six viewpoints, groups tokens into sets according to part overlap, and applies self-attention only within each group. Cross-view communication is therefore restricted to tokens sharing at least one part label, while per-view full attention is retained as a separate intra-view term (Liu et al., 26 Nov 2025).
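
The grouping rule can be illustrated as a pair of boolean attention masks. This is a sketch under stated assumptions rather than CaliTex's code: each token is assumed to carry the set of part labels visible in its patch, and the function names are hypothetical.

```python
def part_aligned_mask(token_parts):
    """Mask for part-aligned attention over tokens flattened across views.

    token_parts[i] is the set of part labels covered by token i.
    mask[i][j] is True when tokens i and j share at least one part label.
    """
    n = len(token_parts)
    return [[bool(token_parts[i] & token_parts[j]) for j in range(n)]
            for i in range(n)]

def intra_view_mask(token_views):
    """Mask for the separate per-view full-attention term.

    token_views[i] is the index of the view token i was rendered from.
    """
    n = len(token_views)
    return [[token_views[i] == token_views[j] for j in range(n)]
            for i in range(n)]
```

For four tokens with part sets `[{0}, {1}, {0, 1}, {2}]` taken from views `[0, 0, 1, 1]`, token 0 can reach token 2 across views through part 0 but cannot reach token 1 through the part-aligned mask; tokens 0 and 1 still interact through the intra-view mask. A token covering no part, such as pure background, has an empty row in the part-aligned mask and would rely on the intra-view term alone.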

A fourth pattern leaves the attention graph dense but directly supervises it with 3D correspondence. GeoFace extracts appearance tokens from the input views and geometry tokens from a canonical UV position map, computes cross-attention in both directions between the two token sets, and imposes a bidirectional cross-entropy alignment loss against one-hot correspondence targets obtained from nearest-neighbor matches in 3D space, subject to a distance threshold in canonical FLAME units (Choi et al., 26 Jun 2026).
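
One way to read this supervision is as a cross-entropy between each attention row and a one-hot nearest-neighbor target. The sketch below assumes that every appearance and geometry token has an associated 3D position, that rows without a match inside the threshold are skipped, and that the two directions are summed with equal weight. None of these choices is confirmed by the source, and all names (`nn_targets`, `attention_ce`, `tau`) are hypothetical.

```python
import math

def nn_targets(src_xyz, dst_xyz, tau):
    """Nearest destination index per source point, or None beyond tau."""
    targets = []
    for p in src_xyz:
        dists = [math.dist(p, q) for q in dst_xyz]
        j = min(range(len(dists)), key=dists.__getitem__)
        targets.append(j if dists[j] <= tau else None)
    return targets

def attention_ce(attn, targets, eps=1e-12):
    """Mean cross-entropy of attention rows against one-hot targets.

    attn[i] is a probability row over destination tokens; rows whose
    target is None contribute nothing to the loss.
    """
    losses = [-math.log(row[j] + eps)
              for row, j in zip(attn, targets) if j is not None]
    return sum(losses) / len(losses) if losses else 0.0

def bidirectional_alignment_loss(attn_a2g, attn_g2a, xyz_a, xyz_g, tau):
    """Sum of appearance-to-geometry and geometry-to-appearance terms."""
    loss_a2g = attention_ce(attn_a2g, nn_targets(xyz_a, xyz_g, tau))
    loss_g2a = attention_ce(attn_g2a, nn_targets(xyz_g, xyz_a, tau))
    return loss_a2g + loss_g2a
```

The loss is near zero when each attention row puts its mass on the geometrically nearest token and grows as attention spreads to unrelated tokens, which is how a dense attention map can be shaped without masking it.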

Taken together, these variants indicate that GCA can operate at three distinct levels: query construction, attention masking, and attention supervision. This suggests that “geometry-guided” does not refer to a single operator, but to where geometry intervenes in the attention pipeline.

3. Diffusion and generative formulations

In diffusion-based sparse-view reconstruction, GeoQuery is implemented as a render-and-refine pipeline built on 3D Gaussian Splatting and a U-Net diffusion backbone. The data flow per diffusion step is explicit: render the target view from the current 3DGS, extract the target and reference features $F^t$ and $F^r$ with the U-Net encoder, build $\mathcal C_{t\to r}$ and $M_{t\to r}$ from precomputed depth and poses, apply GCA to obtain geometry-guided features, decode the result, compute losses, and update 3DGS. Cross-view aggregation is confined to a square local window, with one window size reported as the best FID-versus-complexity trade-off. After geometry-guided aggregation, a learned spatial gate fuses the geometry-guided features with the global self-attention branch, and GCA modules are inserted into the low-resolution blocks of the U-Net (Cao et al., 12 May 2026).
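
The two structural ingredients of this step, a bounded window and a per-pixel gate, can be sketched as follows. Both functions are illustrative assumptions: the window is taken to be centered on a given pixel and clipped at the image border, and the gate is taken to be a sigmoid-activated convex blend. The source states only that a learned spatial gate fuses the two branches, so the exact form may differ.

```python
import math

def local_window(center, k, height, width):
    """Pixel coordinates of a k x k window centered on `center`, clipped."""
    u0, v0 = center
    r = k // 2
    return [(u, v)
            for v in range(max(0, v0 - r), min(height, v0 + r + 1))
            for u in range(max(0, u0 - r), min(width, u0 + r + 1))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_geo, f_global, gate_logits):
    """Per-pixel blend g * f_geo + (1 - g) * f_global, g = sigmoid(logit).

    f_geo, f_global: H x W x C nested lists; gate_logits: H x W nested list.
    """
    out = []
    for geo_row, glob_row, logit_row in zip(f_geo, f_global, gate_logits):
        row = []
        for geo, glob, logit in zip(geo_row, glob_row, logit_row):
            g = sigmoid(logit)
            row.append([g * a + (1.0 - g) * b for a, b in zip(geo, glob)])
        out.append(row)
    return out
```

Under this reading, a gate close to zero at a pixel falls back to the global branch, which is the behavior one would want wherever the validity mask reports no usable correspondence.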

CaliTex develops a closely related but differently named formulation, “geometry-calibrated attention,” for view-coherent 3D texture generation. Its backbone is a two-stage DiT. Stage 1 applies a Single-View DiT independently to each of six views with full cross-attention over concatenated noise, geometry-condition, and reference tokens. Stage 2 concatenates all six views’ noise tokens, all six views’ condition tokens, and one averaged reference token set into a single sequence, then processes this sequence through 38 Transformer blocks. Within each block, Condition-Routed Attention (CRA) splits the token set into two overlapping groups: Group-1, comprising all geometry-condition tokens and reference tokens, and Group-2, comprising all noise tokens and geometry-condition tokens. One of the branches combines Part-Aligned Attention (PAA) with per-view intra-view full attention.

Implementation choices are unusually explicit. The model builds on FLUX.1-Kontext DiT with a LoRA adapter of rank 16, six fixed canonical camera poses, and PartField clustering for the semantic parts; the paper also specifies the resolution, latent downsample factor, patch size, per-view token count, part count, and Multi-View DiT feature dimension. Training uses 80k meshes from Objaverse-XL and TexVerse, 600 GPU-hours on 8 A100s, and a flow-matching loss. On a held-out suite of renders, the model reports the lowest FID and CLIP-FID, the best semantic fidelity as measured by CMMD, CLIP-I, and LPIPS, and user-study ratings for Quality, GeoAlign, and MV-Cons. In the ablation, pixel-level MV-MSE is lower for the full model than for the variants without PAA or without CRA (Liu et al., 26 Nov 2025).

GeoFace extends the same general direction to multi-view face generation. It employs a dual-stream latent U-Net with shared 3D attention layers across eight streams: one reference image stream, six target image streams, and one geometry stream representing a canonical FLAME UV position map. At the cross-attention layers, appearance and geometry tokens are flattened and attended jointly; the geometry stream receives a learned camera token, while the appearance streams receive Plücker-ray camera embeddings. Geometry-guided cross-attention alignment is supervised at a single decoder layer, for which the paper specifies the token resolution, embedding size, and head configuration. Training combines RGB denoising loss, geometry denoising loss, and the alignment loss, and inference uses DDIM sampling for 50 steps with classifier-free guidance on both streams. The reported outcome on RenderMe-360 and NeRSemble is improved visual quality and cross-view geometric consistency relative to existing methods (Choi et al., 26 Jun 2026).

These generative formulations show that geometry can enter diffusion attention in at least three ways: as the source of replacement queries, as the partitioning rule for sparse cross-view exchange, or as a training signal that shapes otherwise dense cross-attention maps.

4. Localization and cross-view pose estimation

In ground-to-satellite localization, geometry-guided cross-view attention arises from explicit camera projection. One formulation lifts ground-view features into an overhead map under a 3-DoF pose model with yaw, planar translation, and an assumed scene-point height. Each overhead pixel maps to the ground image through the projection in Equation 1 of the paper, yielding an initial synthesized overhead feature map. A multi-head self-attention block first aggregates overhead context, after which cross-view attention is localized by geometry: for each overhead pixel, the corresponding ground-image column is known from the projection equation, so keys and values are collected only from a fixed-radius neighborhood of that column, and the cross-view update is computed by attention over this restricted set. The module is applied only at the coarsest-resolution feature map.
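
The restriction of keys and values to a column neighborhood can be sketched as an index-selection step. The sketch assumes that the neighborhood spans every row of the selected columns and that columns are clipped at the image border rather than wrapped. The projection that yields the column index is not reproduced, and the names `column_neighborhood` and `gather_column_keys` are hypothetical.

```python
def column_neighborhood(col, radius, width):
    """Column indices within `radius` of `col`, clipped to [0, width - 1]."""
    lo = max(0, col - radius)
    hi = min(width - 1, col + radius)
    return list(range(lo, hi + 1))

def gather_column_keys(ground_feat, col, radius):
    """Collect ground-view tokens from the columns around `col`.

    ground_feat: H x W x C nested list of ground-image features.
    Returns the tokens column by column, top to bottom within each column.
    """
    height, width = len(ground_feat), len(ground_feat[0])
    cols = column_neighborhood(col, radius, width)
    return [ground_feat[v][u] for u in cols for v in range(height)]
```

For a feature map of height H, each overhead pixel then attends to at most (2 * radius + 1) * H ground tokens instead of all H * W, which is where the locality of the cross-view step comes from.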

A neural pose optimizer then refines relative rotation and auxiliary translation using two Swin-transformer blocks and two small MLP heads, in a two-iteration coarse-to-fine procedure over three pyramid levels. After rotation alignment, dense translation is estimated by normalized cross-correlation modulated by an uncertainty map, which yields the final probability map. The implementation uses a VGG16 encoder and a U-Net decoder, with separate ground-image resolutions for KITTI/Ford and for Oxford, and is trained for 5 epochs with batch size 3; inference time is reported on an RTX 3090. On KITTI Test1 under initialization noise, the paper reports improved lateral and azimuth accuracy, and removing the uncertainty map reduces lateral@1m accuracy (Shi et al., 2023).

SliceMatch uses a different geometry-guided construction for cross-view pose estimation. The ground descriptor is formed by first applying a self-attention mask, then partitioning the masked ground feature map into vertical stripes and average-pooling each stripe into a slice descriptor. For the aerial branch, each slice descriptor generates a cosine-similarity map against the aerial feature map, which is concatenated to the aerial feature map and passed through convolutions plus Sigmoid to obtain an attention mask. Geometry enters through precomputed masks that describe, for each candidate pose, which aerial cells fall inside the corresponding horizontal field-of-view slice. Weighted average pooling over the masked, attention-weighted aerial features yields pose-dependent slice descriptors, and these are concatenated into a pose-specific aerial descriptor. Candidate pose grids differ between training and inference, masks are computed offline once, and inference over all poses reduces to large matrix multiplication plus normalization and cosine similarity. With VGG16 or ResNet50 backbones, the method reports that cross-view attention improves mean localization error, median localization error, and median orientation error. On VIGOR with VGG16, it reports better median localization and orientation errors than the best previous global-descriptor method; the paper also reports the frame rate on a Tesla V100 and the per-pair cost of GCA pooling plus pose scoring (Lentsch et al., 2022).
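
The pose-scoring step can be sketched as masked weighted pooling followed by cosine similarity. This is a loop-based illustration on nested lists, whereas the paper describes a batched matrix formulation; the data layout and the names `masked_weighted_pool`, `cosine`, and `score_poses` are assumptions made for the example.

```python
import math

def masked_weighted_pool(feat, attn, mask):
    """Weighted average of H x W x C features over cells where mask == 1."""
    channels = len(feat[0][0])
    acc, total = [0.0] * channels, 0.0
    for f_row, a_row, m_row in zip(feat, attn, mask):
        for f, a, m in zip(f_row, a_row, m_row):
            w = a * m  # attention weight, zeroed outside the slice
            total += w
            for c in range(channels):
                acc[c] += w * f[c]
    return [x / total for x in acc] if total > 0 else acc

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def score_poses(ground_slices, aerial_feat, slice_attn, pose_masks):
    """Cosine similarity between ground and aerial descriptors per pose.

    ground_slices: list of N slice descriptors (each a list of C floats).
    slice_attn[n]: H x W aerial attention map conditioned on slice n.
    pose_masks[p][n]: H x W binary mask of aerial cells in slice n at pose p.
    """
    ground_desc = [x for s in ground_slices for x in s]
    scores = []
    for masks in pose_masks:
        aerial_desc = []
        for n, mask in enumerate(masks):
            aerial_desc.extend(
                masked_weighted_pool(aerial_feat, slice_attn[n], mask))
        scores.append(cosine(ground_desc, aerial_desc))
    return scores
```

The predicted pose is then the candidate with the highest score. Because the masks depend only on the candidate pose grid and not on image content, they can be computed once and reused, as the text notes.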

Both localization systems demonstrate a distinctive form of GCA: geometry does not merely regularize attention weights after the fact, but determines where cross-view comparison is even meaningful.

5. Epipolar-constrained fusion for anomaly detection

The epipolar attention module for multi-view industrial anomaly detection offers one of the clearest formulations of geometry as an attention mask. Given calibrated or uncalibrated views, a $3\times 3$ fundamental matrix $F_{ab}$ is estimated from point correspondences via the normalized eight-point algorithm with rank-2 enforcement. The epipolar constraint

$$p_{aj}^\top F_{ab}\,p_{bk}=0,$$

written here for a reference patch center $p_{aj}$ and a corresponding support-view point $p_{bk}$ in homogeneous coordinates, implies that a patch center in the support view must lie near the line $\ell_b=F_{ab}^\top p_{aj}$. This line-level geometry is converted into a binary patch mask, which then filters a single-head cross-attention block over DINOv2 tokens (Liu et al., 14 Mar 2025).

The architecture uses a frozen DINOv2 ViT with tokens extracted at layer 7; the paper also specifies the patch size, channel width, and per-view token count. For each reference view $a$, one epipolar attention block is applied per support view $b$, with learned attention projections. The resulting fused tokens are stored in separate per-view memory banks, and inference uses nearest-neighbor distances in feature space to assign anomaly scores.
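
Nearest-neighbor scoring against per-view memory banks can be sketched as below. This is a simplified illustration: it uses exhaustive Euclidean search, keeps every training token in the bank, and takes the image score as the maximum patch score. Those are assumptions in the spirit of PatchCore-style scoring, not details confirmed by the source, and the function names are hypothetical.

```python
import math

def build_memory_banks(fused_tokens_by_view):
    """Store fused tokens from normal training images, one bank per view.

    fused_tokens_by_view: dict mapping view index -> list of token vectors.
    """
    return {view: [tuple(tok) for tok in tokens]
            for view, tokens in fused_tokens_by_view.items()}

def patch_scores(tokens, bank):
    """Anomaly score per token: distance to its nearest bank entry."""
    return [min(math.dist(tok, entry) for entry in bank) for tok in tokens]

def image_score(tokens, bank):
    """Image-level score, taken here as the largest patch score."""
    return max(patch_scores(tokens, bank))
```

A test token that lies far from every stored normal token in its own view's bank receives a high score, so a defect visible in only one camera can still be flagged in that view.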

An important aspect of this work is that geometry-guided masking alone is not sufficient. The paper reports that adding the epipolar attention module without pretraining decreases multi-class image-AUROC relative to PatchCore with a DINOv2 backbone, because the attention projections are random. Performance recovers with DeepSVDD pretraining, and further results are reported for multi-center pretraining, for multi-center pretraining plus negative-sample regularization, and for the full system with multi-view memory bank, which achieves a gain over PatchCore. The pretraining objective combines a compactness loss toward cluster centers with a negative regularization term built from multi-view perturbations; optimization runs for 50 epochs with AdamW, with a fixed weight on the negative term. The benchmark is Real-IAD, with 30 object categories, 5 synchronized camera views, and high-resolution images (Liu et al., 14 Mar 2025).

This case is especially instructive because it counters a common simplification: explicit geometry can sharply delimit valid correspondences, but the attention projections still require task-specific pretraining to become useful.

6. Empirical tendencies, limitations, and recurrent misconceptions

Across domains, reported gains are largest when the unconstrained query is unreliable or the correspondence search space is structurally ambiguous. GeoQuery’s ablation on 3-view Mip-NeRF360 compares PSNR for global attention only, for GCA using rendering-based queries, and for GCA using proxy queries, with the results directly supporting the claim that geometry-derived queries outperform corrupted rendering-derived ones. Its region-level PSNR analysis further shows that the gain is concentrated in difficult regions: on high-error pixels, GeoQuery reaches a higher PSNR than DIFIX3D+ and 3DGS (Cao et al., 12 May 2026).

A second tendency is that constrained neighborhoods often outperform unrestricted attention. GeoQuery reports that a bounded local window gives the best FID-versus-complexity trade-off, and that larger or unconstrained windows degrade performance. The ground-to-satellite transformer uses only a fixed-radius column neighborhood at the coarsest scale. CaliTex restricts cross-view interactions to semantically matched parts, while preserving full attention within each view. A plausible implication is that many cross-view settings are not attention-limited in the usual sense; they are correspondence-limited, so increasing the reachable token set can worsen retrieval quality (Cao et al., 12 May 2026, Shi et al., 2023, Liu et al., 26 Nov 2025).

A third tendency is that geometry guidance rarely replaces global reasoning altogether. GeoQuery fuses geometry-guided features with the global self-attention branch through a learned spatial gate rather than discarding the global branch. CaliTex augments part-aligned cross-view attention with intra-view full attention and feed-forward residual processing. The localization transformer still relies on Swin-based global context aggregation and an uncertainty-guided dense translation search after the geometry-guided synthesis step. GeoFace supervises cross-attention at one decoder layer rather than imposing geometric hard constraints at every layer. These designs indicate that geometry acts as a constraint on correspondence, not a substitute for semantic modeling (Cao et al., 12 May 2026, Shi et al., 2023, Choi et al., 26 Jun 2026, Liu et al., 26 Nov 2025).

The limitations reported in the literature are similarly consistent. GeoQuery depends on accurate metric depth and explicit correspondences; in texture-less or highly specular regions, depth may fail, disabling GCA and forcing reliance on the global branch. Large viewpoint gaps or occlusions can produce $M_{t\to r}=0$ over broad areas, again leaving diffusion to hallucinate without geometry support (Cao et al., 12 May 2026). In the anomaly-detection setting, epipolar masking with untrained projections reduces performance rather than improving it, demonstrating that geometry-aware sparsity can be counterproductive when the feature space is not aligned to the task (Liu et al., 14 Mar 2025). In the ground-to-satellite transformer, the projection model assumes tilt and roll are approximately zero and uses a fixed scene-point height; this clarifies that the geometry prior is a modeling assumption rather than a complete scene reconstruction (Shi et al., 2023).

One recurrent misconception is that GCA is synonymous with a single architectural block. The literature instead shows several non-equivalent implementations: geometry-induced proxy queries, epipolar masks, local scene-specific windows, frustum-slice pooling, semantic part grouping, and supervised alignment of dense cross-attention maps. What unifies them is not operator form but the principle that cross-view attention should respect the admissible geometry of the scene, camera system, or underlying 3D object.
