Joint Geometric Optimization

Updated 4 July 2026

Joint geometric optimization is an approach that simultaneously optimizes geometry and its dependent variables by replacing sequential pipelines with a unified objective.
It employs residual designs and diverse solvers, such as Gauss-Newton and implicit differentiation, to enforce geometric consistency from varied cues like photometric and reprojection errors.
Applications span computer vision, 3D reconstruction, mesh registration, and communications, where directly coupling geometry improves both accuracy and computational efficiency.

Within the surveyed literature, joint geometric optimization denotes formulations in which geometry is optimized together with the variables that both constrain it and depend on it. The coupled variables differ by domain—optical flow and camera motion, depth and flow, camera poses and Gaussian parameters, mesh geometry and vertex colors, object poses and temporal states, feature matches and geometric models, constellation geometry and symbol probabilities, or movable-region shape and antenna positions—but the recurring premise is that geometric consistency should enter the objective directly rather than be enforced only in a separate downstream stage (Jiang et al., 2020, Xiao et al., 4 Jun 2025, Xiong et al., 5 Mar 2026, Cai et al., 6 Nov 2025, Li et al., 2020, Isack et al., 2013, Aoudia et al., 2020, Ye et al., 5 Apr 2026).

1. Canonical objective structures

A central pattern is the replacement of pipeline-style estimation by a single coupled objective. In unsupervised optical flow and egomotion, the formulation is explicitly bi-level: the optical-flow network predicts $V=F(I,I';\theta_F)$, while camera motion is obtained by a lower-level geometric solver,

$\min_{\theta_F}\;L_{\rm photo}\bigl(V;I,I'\bigr)+\lambda_{\rm geo}L_{\rm epi}\!\bigl(V,T^*(V)\bigr), \qquad T^*(V)=\arg\min_T L_{\rm epi}(V,T).$

The same paper parameterizes the essential matrix as $E(\theta)=[t]_\times R$ and uses epipolar geometry to couple dense flow and relative pose (Jiang et al., 2020).
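As a minimal sketch of how the parameterization $E(\theta)=[t]_\times R$ couples correspondences to relative pose, the following NumPy snippet builds an essential matrix and evaluates the algebraic epipolar residual on synthetic points. The function names, the pose convention (a point $X$ in the first camera maps to $RX+t$ in the second), and the test scene are assumptions made for illustration and are not taken from the cited paper.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz,  ty],
                     [ tz, 0.0, -tx],
                     [-ty,  tx, 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for a relative pose that maps X to R @ X + t (assumed convention)."""
    return skew(t) @ R

def epipolar_residuals(E, x, x_prime):
    """Algebraic residual x'^T E x per correspondence; inputs are (N, 3) homogeneous points."""
    return np.einsum("ni,ij,nj->n", x_prime, E, x)

# Synthetic rigid scene (all values are made up for illustration).
rng = np.random.default_rng(0)
a = 0.1  # rotation angle about the y-axis, in radians
R = np.array([[ np.cos(a), 0.0, np.sin(a)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.5, 0.1, 0.05])
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))  # points in camera-1 frame
X2 = X1 @ R.T + t                                                  # same points in camera-2 frame
x = X1 / X1[:, 2:3]        # normalized homogeneous coordinates, view 1
x_prime = X2 / X2[:, 2:3]  # normalized homogeneous coordinates, view 2

E = essential_matrix(R, t)
residuals = epipolar_residuals(E, x, x_prime)
max_abs_residual = float(np.max(np.abs(residuals)))  # near zero when the scene is rigid
```

For a rigid scene the residual vanishes up to rounding, so any flow that moves a pixel off its epipolar line produces a nonzero residual that an epipolar loss can penalize.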

Bundle-adjustment-style formulations recur in 3D reconstruction. GloSplat jointly optimizes camera poses $\{\mathbf T_i\}$, Gaussian-splat parameters $\mathcal G$, and SfM track points $\{\mathbf X_k\}$ through

$\min_{\{\mathbf T_i\},\,\mathcal G,\,\{\mathbf X_k\}} L_{\rm photo}\bigl(\{\mathbf T_i\},\mathcal G\bigr)+\lambda_{\rm BA}L_{\rm BA}\bigl(\{\mathbf T_i\},\{\mathbf X_k\}\bigr),$

where $L_{\rm photo}$ is a rendering loss and $L_{\rm BA}$ is a reprojection loss over preserved feature tracks (Xiong et al., 5 Mar 2026). GPJA uses an analogous additive structure for facial mesh registration,

$L_{\rm total}(\mathbf x)=L_{\rm photo}(\mathbf x)+\lambda_{\rm geo}L_{\rm geo}(\mathbf x)+\lambda_{\rm reg}L_{\rm reg}(\mathbf x),$

with smoothness enforced by Laplacian-based reparametrization (Wang et al., 2024). Texture-guided Gaussian-mesh joint optimization likewise minimizes a single objective jointly over vertex positions and per-vertex RGB colors (Cai et al., 6 Nov 2025).

Outside vision, the same design appears with different objective functionals. In coded modulation, PS-GS maximizes the bit-wise mutual information under an average power constraint, with the optimization variables collecting geometry, probability, and demapper parameters (Aoudia et al., 2020). In movable-antenna DOA estimation, the objective becomes CRB minimization over both the antenna coordinates and a movable region of fixed area, under minimum-spacing constraints (Ye et al., 5 Apr 2026). This suggests that “joint geometric optimization” is better understood as an optimization pattern than as a domain-specific algorithm.

2. Geometric couplings and residual design

The coupling is implemented through residuals that make geometry observable from complementary cues. In rigid optical flow and egomotion, the fundamental relation is the epipolar constraint $\tilde{x}'^{\top}E\,\tilde{x}=0$ on homogeneous corresponding points $\tilde{x}\leftrightarrow\tilde{x}'$, and a one-sided Sampson-like distance is used as the epipolar residual.

Photometric consistency, forward-backward consistency, smoothness, and the epipolar term are then combined in the teacher and student losses (Jiang et al., 2020).
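One common way to turn the epipolar constraint into a one-sided geometric residual is to measure the squared distance from the second-view point to the epipolar line induced by the first-view point. The sketch below shows that form. It is an illustrative assumption, not necessarily the exact expression used in the cited paper, and the function name and `eps` stabilizer are hypothetical.

```python
import numpy as np

def one_sided_epipolar_distance(E, x, x_prime, eps=1e-12):
    """Squared distance from each x' to the epipolar line E x (normalized in image 2 only)."""
    l = x @ E.T                                # row n is the epipolar line E @ x[n]
    algebraic = np.sum(x_prime * l, axis=1)    # algebraic residual x'^T E x
    normal_sq = l[:, 0] ** 2 + l[:, 1] ** 2    # squared norm of the line normal
    return algebraic ** 2 / (normal_sq + eps)

# Pure sideways translation t = (1, 0, 0) with R = I, so E = [t]_x and
# epipolar lines are horizontal (y' = y).
E = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
x = np.array([[0.2, 0.3, 1.0],
              [0.2, 0.3, 1.0]])
x_prime = np.array([[0.1, 0.30, 1.0],   # lies on the epipolar line
                    [0.1, 0.35, 1.0]])  # 0.05 above the epipolar line
d = one_sided_epipolar_distance(E, x, x_prime)
```

Here the first correspondence gives a zero residual and the second gives the squared vertical offset, which is the behavior a geometric (rather than purely algebraic) epipolar term is meant to provide.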

In sparse-view reconstruction, JointSplat defines a per-pixel matching probability from flow-matching confidence and an occlusion mask. Used as a weight, this probability lets the depth-consistency term make flow contribute strongly where matches are reliable and suppress misleading gradients in uncertain areas (Xiao et al., 4 Jun 2025). In stereo 3D object tracking, the joint spatial-temporal objective couples per-pixel local depth, local coordinates, centroid position, yaw, stereo photometric alignment, temporal reprojection, and a pose prior, with all historic cues summarized by a marginalization prior (Li et al., 2020).
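The effect of probability weighting can be illustrated with a small sketch. The loss form below (a weighted mean absolute difference between a predicted depth and a depth implied by flow matches) is an assumption chosen for illustration. The names `depth_pred`, `depth_from_flow`, and `match_prob` are hypothetical and do not reproduce the JointSplat definition.

```python
import numpy as np

def weighted_depth_consistency(depth_pred, depth_from_flow, match_prob, eps=1e-8):
    """Weighted mean absolute depth difference; weights are per-pixel match probabilities."""
    w = np.clip(match_prob, 0.0, 1.0)
    per_pixel = np.abs(depth_pred - depth_from_flow)
    return float(np.sum(w * per_pixel) / (np.sum(w) + eps))

depth_pred = np.array([[2.0, 2.1],
                       [3.0, 3.2]])
depth_from_flow = np.array([[2.0, 2.3],
                            [3.0, 9.0]])   # bottom-right value comes from a bad match
match_prob = np.array([[1.0, 0.5],
                       [1.0, 0.0]])        # the bad match is flagged as unreliable

loss_weighted = weighted_depth_consistency(depth_pred, depth_from_flow, match_prob)
loss_uniform = weighted_depth_consistency(depth_pred, depth_from_flow,
                                          np.ones_like(match_prob))
```

With uniform weights the single bad match dominates the loss, whereas the probability-weighted version ignores it, which is the sense in which misleading gradients are suppressed.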

Point-line pose refinement uses a different residual structure. The point term is the standard 2D reprojection error. The line term combines midpoint distance and an angle term derived from the sine between observed and reprojected image lines, with Huber-robustified aggregation over all point and line correspondences (Gao et al., 2021). Joint pose and curvature refinement via quadrics couples pose increments and local surface geometry through shared residuals, so that pose refinement and principal-curvature refinement are solved in one least-squares system (Spek et al., 2017).
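The two ingredients of the point-line residual described above (midpoint distance plus a sine-based angle term, aggregated with a Huber loss) can be sketched as follows. The exact weighting and parameterization in the cited work are not reproduced here. The function names, the 2D endpoint representation, and the default `delta` are assumptions for illustration.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def line_residual(obs_a, obs_b, proj_a, proj_b):
    """Midpoint distance and |sin(angle)| between an observed and a reprojected 2D segment."""
    mid_dist = np.linalg.norm(0.5 * (obs_a + obs_b) - 0.5 * (proj_a + proj_b))
    d_obs = obs_b - obs_a
    d_proj = proj_b - proj_a
    cross = d_obs[0] * d_proj[1] - d_obs[1] * d_proj[0]   # 2D cross product
    sin_angle = abs(cross) / (np.linalg.norm(d_obs) * np.linalg.norm(d_proj))
    return float(mid_dist), float(sin_angle)

# Two segments that share a midpoint but are perpendicular.
obs_a, obs_b = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
proj_a, proj_b = np.array([0.0, -1.0]), np.array([0.0, 1.0])
mid_dist, sin_angle = line_residual(obs_a, obs_b, proj_a, proj_b)

small_cost = float(huber(np.array(0.5)))   # inside the quadratic region
large_cost = float(huber(np.array(3.0)))   # inside the linear region
```

The midpoint term alone would report a perfect fit for this pair, so the angle term is what exposes the orientation error; the Huber loss then limits the influence of gross outliers in either term.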

An earlier multi-view reconstruction example is joint fitting and matching, where feature assignment and model selection are optimized together. The fit-&-match energies extend the assignment problem by combining geometric transfer error, appearance cost, label costs, and optionally a Potts-style spatial regularizer. This replaces appearance-only pre-matching by a coupled optimization over matches and geometric models (Isack et al., 2013).

3. Optimization procedures and solver architectures

The surveyed methods use markedly different solvers, but a consistent theme is that the optimization machinery is chosen to preserve the coupling rather than factor it away. In the optical-flow/egomotion bi-level model, the lower-level camera-motion estimate is produced by a black-box RANSAC+IRLS solver, yet training remains end-to-end because the stationarity condition of the lower problem is differentiated implicitly. This allows back-propagation through the geometric layer independent of its implementation (Jiang et al., 2020).
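For reference, the standard implicit-function-theorem form of this step can be written with the symbols of the bi-level objective. The derivation below is the textbook argument rather than a transcription of the paper's notation, and it assumes that $L_{\rm epi}$ is twice differentiable and that the Hessian block $\nabla^2_{TT}L_{\rm epi}$ is invertible at $T^*(V)$.

$\nabla_T L_{\rm epi}\bigl(V,T^*(V)\bigr)=0 \;\Longrightarrow\; \nabla^2_{TT}L_{\rm epi}\,\frac{\partial T^*}{\partial V}+\nabla^2_{TV}L_{\rm epi}=0 \;\Longrightarrow\; \frac{\partial T^*}{\partial V}=-\bigl(\nabla^2_{TT}L_{\rm epi}\bigr)^{-1}\nabla^2_{TV}L_{\rm epi}.$

Only derivatives of $L_{\rm epi}$ evaluated at the returned solution appear, so the internal iterations of the solver never need to be differentiated.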

Several reconstruction systems adopt alternating or staged procedures. JOGS decomposes pose estimation and 3D Gaussian Splatting into two interleaved phases: Gaussian parameters are updated through differentiable rendering with fixed poses, and poses are refined by a customized 3D optical flow algorithm solved by Gauss-Newton normal equations (Li et al., 30 Oct 2025). ProJo4D explicitly rejects both purely sequential estimation and unrestricted full optimization from scratch. It introduces a fixed three-stage schedule that follows an initial Stage 0 for 4D reconstruction: Stage 1 optimizes the initial velocity only, Stage 2 the initial state plus materials, and Stage 3 performs full joint optimization, with the order guided by an empirical sensitivity ranking of the parameters (Rho et al., 5 Jun 2025). GPJA uses a multiscale gradient-based solver with coarse-to-fine remeshing and Laplacian-smoothed descent to handle large deformations robustly (Wang et al., 2024).

Classical geometric solvers remain important. Joint spatial-temporal stereo tracking is optimized as a non-linear least squares problem with Huber robustification and per-frame Schur-complement marginalization, keeping per-frame memory and time bounded while propagating historical information (Li et al., 2020). Joint pose and curvature refinement forms block normal equations with a pose-only block, a quadric-only block, and a coupling block, then solves by Schur complement because the quadric-only block is block-diagonal (Spek et al., 2017). Joint fitting and matching alternates between PEARL-based multi-label fitting and min-cost-max-flow assignment solves, with flow recycling used to accelerate repeated re-matching (Isack et al., 2013).
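The Schur-complement step for block normal equations can be sketched in a few lines of NumPy. The block names `A` (pose-only), `C_blocks` (per-quadric diagonal blocks), and `B` (coupling), as well as the block sizes and random test data, are assumptions for illustration and are not the notation or dimensions of the cited papers.

```python
import numpy as np

def schur_solve(A, B, C_blocks, a, c):
    """Solve [[A, B], [B.T, C]] [dp; dq] = [a; c] where C = blockdiag(C_blocks)."""
    k = C_blocks[0].shape[0]
    n = len(C_blocks) * k
    C_inv = np.zeros((n, n))
    for i, C_i in enumerate(C_blocks):            # invert each small block independently
        C_inv[i * k:(i + 1) * k, i * k:(i + 1) * k] = np.linalg.inv(C_i)
    S = A - B @ C_inv @ B.T                       # Schur complement: reduced pose system
    dp = np.linalg.solve(S, a - B @ C_inv @ c)    # pose increment
    dq = C_inv @ (c - B.T @ dp)                   # back-substitute the remaining increments
    return dp, dq

# Random, well-conditioned test system: one 6-DoF pose and three 2-parameter blocks.
rng = np.random.default_rng(1)
G = rng.normal(size=(6, 6))
A = G @ G.T + 6.0 * np.eye(6)
C_blocks = []
for _ in range(3):
    G_i = rng.normal(size=(2, 2))
    C_blocks.append(G_i @ G_i.T + 2.0 * np.eye(2))
B = 0.1 * rng.normal(size=(6, 6))
a, c = rng.normal(size=6), rng.normal(size=6)

dp, dq = schur_solve(A, B, C_blocks, a, c)

# Reference: solve the full dense system directly.
C = np.zeros((6, 6))
for i, C_i in enumerate(C_blocks):
    C[2 * i:2 * i + 2, 2 * i:2 * i + 2] = C_i
H = np.block([[A, B], [B.T, C]])
reference = np.linalg.solve(H, np.concatenate([a, c]))
```

Because the eliminated block is block-diagonal, its inverse costs only a handful of small inversions, and the dense solve is confined to the low-dimensional pose system.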

This diversity of solvers is itself informative. Some joint formulations are differentiable end-to-end, some are alternating-direction methods, some are Gauss-Newton or Levenberg-Marquardt systems, and some are network-flow problems. A plausible implication is that the defining feature of joint geometric optimization is not the numerical method but the retention of cross-variable dependence in the objective.

4. Vision and graphics applications

The most concentrated use of joint geometric optimization appears in computer vision and graphics. In rigid-scene optical flow and egomotion, globally enforced geometric constraints improve both outputs: on KITTI-12, the reported “Ours” model achieves mean EPE ≈ 3.3 px (all) versus 4.1 px for the baseline and outperforms all prior unsupervised methods by 15–25%; on RGB-D SLAM, EPE is reduced from ≈ 9.5 px to 7.6 px; and pose estimates obtained by decomposing the estimated essential matrix have average translational error ≈ 4% and rotational error ≈ 0.7°/100 m, approaching ORB-SLAM (2.5%, 0.3°/100 m) (Jiang et al., 2020).

Sparse-view 3D reconstruction has produced several distinct joint formulations. JointSplat trains depth and flow together and then back-propagates their effects into Gaussian centers, covariances, opacities, and colors; on RealEstate10K and ACID it consistently outperforms state-of-the-art methods, with the probabilistic matching weight used to suppress misleading gradients in uncertain areas (Xiao et al., 4 Jun 2025). GloSplat carries the joint idea further by preserving explicit SfM feature tracks as first-class entities during 3DGS training. GloSplat-F achieves PSNR 27.77 dB versus 26.40 dB for VGGT-X among COLMAP-free methods, and GloSplat-A reaches PSNR 28.86 dB versus 28.19 dB for Improved-GS. In ablations, removing the BA loss changes PSNR by –0.81 dB and freezing poses after SfM causes a collapse of –8.59 dB. GloSplat-F is also reported as 13.3× faster than COLMAP+3DGS on 1000 images in the Courthouse scene (Xiong et al., 5 Mar 2026). JOGS uses a different coupling, alternating 3DGS updates with LK3D pose refinement; on Tanks & Temples it reports mean PSNR=26.91, SSIM=0.88, LPIPS=0.13, compared with 26.80/0.87/0.13 for COLMAP-based 3DGS and much lower scores for COLMAP-free CFGS and GSHT, while on LLFF its two relative pose error (RPE) components are 0.093 and 0.049 and its ATE is 0.011 (Li et al., 30 Oct 2025).

Mesh-based registration and refinement exhibit the same pattern. GPJA aligns textured facial meshes with multi-view scans by combining differentiable color, depth, and normal losses with Laplacian-regularized geometry evolution. It reports geometric accuracy of 0.181 mm overall average, versus 0.341 mm for NPHM and 0.797 mm for FaceScape TU, together with PSNR 24.40 dB, SSIM 0.7538, and LPIPS 0.0746 (Wang et al., 2024). Texture-guided Gaussian-mesh joint optimization simultaneously updates mesh geometry and vertex colors via Gaussian-guided mesh differentiable rendering; on DTU, Chamfer distance improves from 1.63 to 1.45 for a 3DGS baseline, from 0.78 to 0.72 for GOF, from 0.77 to 0.70 for 2DGS, and from 0.53 to 0.51 for PGSR, while rendering metrics improve for GOF from 24.81/0.858/0.194 to 25.63/0.897/0.160 and for 2DGS from 23.82/0.853/0.199 to 26.21/0.906/0.148 (Cai et al., 6 Nov 2025).

Tracking and localization provide another set of examples. Joint spatial-temporal stereo tracking integrates dense object cues, centroid-associated local depth and local coordinates, stereo photometric alignment, temporal reprojection, and marginalization-based history compression, and it is reported to outperform previous image-based 3D tracking methods by significant margins on KITTI tracking (Li et al., 2020). Joint optimization of visual points and lines refines a coarse point-based camera pose with a point-line objective and reports gains on InLoc: on duc1 the fraction of queries localized within the benchmark's three successively looser pose-error thresholds becomes 50.5/72.7/86.9% for the joint method versus 47.0/71.2/84.8% for the point-only baseline; on duc2 the corresponding figures are 61.8/79.4/84.0% versus 61.1/77.9/80.2% (Gao et al., 2021).

5. Communications and sensing formulations

Joint geometric optimization is not confined to spatial reconstruction. In coded modulation, PS-GS jointly optimizes constellation geometry, point probabilities, bit labeling, and demapper for a given channel model and SNR range. On the AWGN channel, PS-GS achieves virtually identical BMI to MB-QAM across 0–20 dB and about 0.1 bit gain over uniform GS, and when combined with IEEE 802.11n channel coding it reaches the reported target BER at ≈1.1 dB lower SNR than 64-QAM; on mismatched Rayleigh block fading, PS-GS(2/3) outperforms MB-QAM by up to 0.4 bit in BMI (Aoudia et al., 2020). Here the “geometry” being optimized is constellation geometry rather than spatial scene structure, but the joint treatment of geometry and probability is explicit.
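When point locations and point probabilities are both free, the average power constraint mentioned for PS-GS has to account for the probabilities. One simple way to enforce it, sketched below, is to rescale the points by the square root of the probability-weighted mean energy. This normalization, the softmax parameterization of the probabilities, and the toy 8-point constellation are assumptions for illustration, not a reproduction of the published training setup.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax that maps free logits to a probability vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def normalize_constellation(points, logits):
    """Rescale complex points so that sum_i p_i |x_i|^2 = 1 under p = softmax(logits)."""
    p = softmax(logits)
    avg_power = np.sum(p * np.abs(points) ** 2)
    return points / np.sqrt(avg_power), p

# Toy constellation: four inner points (favored) and four outer points.
points = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j,
                   3 + 0j,  0 + 3j, -3 + 0j, 0 - 3j])
logits = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

x, p = normalize_constellation(points, logits)
avg_power = float(np.sum(p * np.abs(x) ** 2))   # equals 1 up to rounding
```

The rescaling depends on both the geometry and the probabilities, which makes the coupling concrete: shifting probability mass toward inner points lets the whole constellation expand at the same average power.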

The nonlinear-fiber literature adopts a related joint-shaping perspective. The multi-dimensional joint probabilistic and geometric shaping strategy parameterizes a constellation by shell amplitudes and group probabilities, then greedily maximizes a mismatched-decoding mutual information over these coupled degrees of freedom. In the reported 250 km unrepeated WDM setting, the proposed joint shaping achieves a rate, measured in bits per 4D symbol, that matches brute-force MB in the table with only 784 points and improves by +0.25 over uniform QAM; a gain in dB over the uniform constellation is also reported, and the scheme yields up to +0.5 bits/4D at high launch power when it collapses to a single amplitude (Yankov, 2019).

In movable-antenna DOA estimation, the variables are explicitly geometric in the spatial sense. The problem is to place a set of movable antennas inside a region of fixed area, subject to a minimum pairwise spacing, so as to minimize a weighted 2D-DOA CRB. The analysis shows that an equilateral triangle yields the minimum overlap area for three equal-radius disks under the spacing constraint, motivating an equilateral triangular movable region. Under the resulting symmetry, the nonconvex CRB objective reduces to maximizing the average squared radial distance of the antennas, and the final rule is to select the candidate locations farthest from the centroid (Ye et al., 5 Apr 2026). The reported MUSIC power spectrum is sharpest for PMA, the main-lobe width is reduced by ~15–20%, at SNR=10 dB the PMA RMSE is ~30% lower than SMA and ~50% lower than UCA, and PMA resolves two sources separated by only 3°, whereas URA needs >10° (Ye et al., 5 Apr 2026).
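The "farthest from the centroid" rule can be illustrated with a small greedy sketch. Two details here are assumptions rather than statements from the cited paper: the centroid is computed from the candidate set itself, and the minimum-spacing constraint is handled by skipping any candidate that is too close to one already chosen. The function name and the sample coordinates are hypothetical.

```python
import math

def select_farthest(candidates, n, d_min):
    """Greedily pick up to n candidates, farthest from the centroid first, at least d_min apart."""
    cx = sum(x for x, _ in candidates) / len(candidates)
    cy = sum(y for _, y in candidates) / len(candidates)
    ranked = sorted(candidates,
                    key=lambda p: math.hypot(p[0] - cx, p[1] - cy),
                    reverse=True)
    chosen = []
    for p in ranked:
        if all(math.dist(p, q) >= d_min for q in chosen):   # enforce minimum spacing
            chosen.append(p)
        if len(chosen) == n:
            break
    return chosen

# Vertices of an (approximately) equilateral triangle plus a few interior candidates.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
interior = [(0.5, 0.289), (0.5, 0.2), (0.4, 0.3)]
chosen = select_farthest(vertices + interior, n=3, d_min=0.5)
```

In this toy case the rule selects the three triangle vertices, which maximizes the mean squared distance from the centroid among the candidates while respecting the spacing constraint.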

6. Limitations, misconceptions, and open directions

A recurring misconception is that “joint” means simply freeing all variables simultaneously. Several papers argue against this. ProJo4D states that directly optimizing all parameters at the same time fails because the problem is highly non-convex and often non-differentiable; fully joint optimization from scratch converges to poor geometry, while the three-stage schedule reduces the final Chamfer distance (CD) by >90% relative to sequential estimation and by >50% relative to full-joint optimization (Rho et al., 5 Jun 2025). GloSplat makes a related point from a different angle: BARF, NeRF--, and 3RGS rely purely on photometric gradients, but when radiance primitives are sparse or misaligned, photometric loss alone permits “catastrophic drift,” which is why explicit SfM tracks are retained as geometric anchors (Xiong et al., 5 Mar 2026).

Another misconception is that end-to-end training requires differentiating through every inner-loop iteration. The bi-level flow/egomotion model instead uses implicit differentiation through a black-box RANSAC+IRLS solver, and modest variations in the relevant hyperparameter are reported to have only a small impact on final EPE (Jiang et al., 2020). This suggests that robust joint coupling can be obtained without rewriting every geometric module as an unrolled differentiable program.

The literature is also explicit about unresolved difficulties. GPJA notes that small features such as tiny moles and freckles yield weak forces, teeth/tongue occlusions can still mislead mouth-contour fitting if the template view mask is imperfect, and the current formulation is limited to per-expression static scans without temporal smoothing (Wang et al., 2024). JOGS reports reliance on initial SfM, difficulty with non-Lambertian or textureless surfaces, possible gimbal locking from Euler angles, and computational cost from repeated pose updates and rendering passes (Li et al., 30 Oct 2025). Joint fitting and matching states that poor sampling of the initial model pool can reduce quality, that block-coordinate descent can get stuck in local minima, and that no theoretical approximation bound is given (Isack et al., 2013). ProJo4D likewise states that it does not provide a formal convergence proof for the non-convex joint problem (Rho et al., 5 Jun 2025).

Taken together, these works indicate that joint geometric optimization is most effective when the coupling reflects actual observability structure: geometry should be tied to photometric, reprojection, flow, rendering, or information-theoretic evidence by residuals that are informative at the same scales as the unknowns. A plausible implication is that future progress will depend less on making objectives ever more monolithic than on deciding which variable groups should be optimized jointly, which should be staged, and which require explicit geometric anchors to remain well conditioned.