Optical-Domain Joint Projection Overview

Updated 4 July 2026

Optical-domain joint projection is a design principle that jointly encodes multi-source inputs into a single constrained operator for optical and computational imaging.
It spans diverse applications such as acousto-optical deflection, broadband diffractive optics, and projection-domain techniques in tomography and quality assessment.
This unified approach optimizes data fusion, improves reconstruction fidelity, and reduces artifacts by replacing sequential pairwise comparisons with a pre-encoded joint representation.

Optical-domain joint projection denotes, in the supplied literature, a family of constructions in which multiple inputs are combined before or during projection so that the resulting field, intensity pattern, similarity score, or projection-domain representation is jointly constrained rather than assembled from isolated pairwise relations. In physical optics, this includes orthogonal acousto-optical deflectors driven by coordinated multi-tone spectra, broadband diffractive-optical elements that map wavelength and propagation distance to different images, and opto-electronic joint transform correlators that project a joint power spectrum into a second Fourier stage. In computational imaging and representation learning, analogous formulations appear as target-to-subspace similarity in a shared embedding space, sinogram-domain generative modeling in CT and PET, and projection-based quality assessment of point clouds (Decruyenaere et al., 21 Apr 2026, Meem et al., 2019, Gamboa et al., 18 Mar 2025, Hu et al., 21 Jun 2026, Chen et al., 16 Jun 2025, Chen et al., 20 Jun 2025, Javaheri et al., 2021).

1. Joint projection as a general operator

Across these works, joint projection replaces a sequence of independent comparisons or scans with a single object that already encodes multi-source structure. In vision-language geo-localization, the retrieval problem is written as

$\hat{R}=\arg\max_{R_k\in\mathcal{R}} \operatorname{Sim}_{vl}(\mathbf{r}_k; \mathbf{v}, \mathbf{t}),$

where compatibility must reflect the image feature $\mathbf{v}$ and text feature $\mathbf{t}$ simultaneously rather than through separate cosine scores. In two-dimensional AOD projection, the joint mapping is

$\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$

so the final intensity pattern is determined by the paired frequency content of two orthogonal deflectors. In broadband diffractive optics, one static height map implements a joint mapping from wavelength $\lambda$, plane $z$, and illumination geometry to a target image. In opto-electronic correlation, the jointly encoded object is the joint power spectrum $|R(u,v)+Q(u,v)|^2$, which is then re-projected to an SLM and Fourier transformed again. In CT and PET, the corresponding construction is projection-domain synthesis, in which the modeled random object is the sinogram rather than the reconstructed image. In point-cloud assessment, the joint object is a set of aligned 2D projections constructed from geometry and color under two geometry conditions (Hu et al., 21 Jun 2026, Decruyenaere et al., 21 Apr 2026, Meem et al., 2019, Gamboa et al., 18 Mar 2025, Chen et al., 16 Jun 2025, Chen et al., 20 Jun 2025, Javaheri et al., 2021).

A plausible implication is that “joint projection” is less a single device class than a recurrent design principle: encode coupled constraints before the decisive optical transform, similarity computation, or reconstruction step. The supplied papers differ sharply in physics and objectives, but they repeatedly replace late fusion with a directly optimized joint operator.

2. Multi-anchor subspace projection in shared feature spaces

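In vision-language geo-localization, the paper "MAPS: Multi-Anchor Projection Similarity for Joint Vision-Language Geo-Localization" formulates a joint image-text query as a multi-anchor geometric alignment problem. A ground-level image $G$, a natural-language description $T$, and a geo-referenced candidate $R_k$ are mapped to a shared $d$-dimensional embedding space,

$G\mapsto\mathbf{v},\quad T\mapsto\mathbf{t},\quad R_k\mapsto\mathbf{r}_k,\qquad \mathbf{v},\mathbf{t},\mathbf{r}_k\in\mathbb{R}^{d},$

with all features $\ell_2$-normalized to the unit hypersphere. Existing methods are described as a point-to-point alignment paradigm: they compute pairwise cosine similarities $\cos(\mathbf{r}_k,\mathbf{v})$ and $\cos(\mathbf{r}_k,\mathbf{t})$, then fuse them. MAPS instead treats the visual and textual features as anchors defining a joint subspace (Hu et al., 21 Jun 2026).

The anchor matrix is

$\mathbf{A}=[\mathbf{v},\ \mathbf{t}]\in\mathbb{R}^{d\times 2},$

and the anchor plane is $\operatorname{span}\{\mathbf{v},\mathbf{t}\}$. With candidate feature $\mathbf{r}_k$, the squared orthogonal distance to that plane is

$\operatorname{dist}^2(\mathbf{r}_k)=\|\mathbf{r}_k\|^2-\mathbf{r}_k^{\top}\mathbf{A}(\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{r}_k,$

which yields the projection magnitude

$\|\mathbf{A}(\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}\mathbf{r}_k\|=\sqrt{1-\operatorname{dist}^2(\mathbf{r}_k)}\quad\text{for}\ \|\mathbf{r}_k\|=1.$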

This is presented as the multi-anchor extension of cosine similarity: cosine is a scalar projection onto one direction, whereas MAPS measures the length of the projection onto the 2D subspace spanned by both anchors.
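The contrast between a single-direction cosine score and the length of a projection onto the two-anchor plane can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: the function names, the random test vectors, and the choice to apply a small regularizer `eps` to the 2×2 Gram matrix are assumptions made for illustration, and the orientation weighting is omitted.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector to unit Euclidean norm."""
    return x / np.linalg.norm(x)

def projection_magnitude(r, v, t, eps=1e-8):
    """Length of the orthogonal projection of r onto span{v, t}.

    eps is a hypothetical regularizer on the 2x2 Gram matrix that keeps
    the solve stable when v and t are nearly collinear.
    """
    A = np.stack([v, t], axis=1)           # d x 2 anchor matrix
    gram = A.T @ A + eps * np.eye(2)       # regularized Gram matrix
    coeffs = np.linalg.solve(gram, A.T @ r)
    return float(np.linalg.norm(A @ coeffs))

rng = np.random.default_rng(0)
d = 16
v = l2_normalize(rng.normal(size=d))       # stand-in image anchor
t = l2_normalize(rng.normal(size=d))       # stand-in text anchor

r_in = l2_normalize(0.6 * v + 0.4 * t)     # candidate lying in the anchor plane
r_out = l2_normalize(rng.normal(size=d))   # unrelated candidate

mag_in = projection_magnitude(r_in, v, t)
mag_out = projection_magnitude(r_out, v, t)
cos_max = max(abs(r_out @ v), abs(r_out @ t))

print(f"in-plane candidate:  {mag_in:.4f}")
print(f"unrelated candidate: {mag_out:.4f} (largest single cosine {cos_max:.4f})")
```

A candidate inside the anchor plane scores close to one, and for any unit candidate the subspace projection is never shorter than its cosine against either anchor alone, which is the sense in which the measure generalizes cosine similarity.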

The method also introduces an orientation constraint. Writing the projection as

$\mathbf{t}$ 0

the valid sector is defined by $\mathbf{t}$ 1 and $\mathbf{t}$ 2. Out-of-sector projections are penalized through an angular deviation $\mathbf{t}$ 3 and orientation weight

$\mathbf{t}$ 4

The final similarity is

$\mathbf{t}$ 5

The same geometry is used during training through a MAPS-based contrastive loss,

$\mathbf{t}$ 6

which is structurally identical to an InfoNCE or CLIP loss but replaces pairwise cosine similarity with target-to-subspace similarity. The implementation uses CLIP ViT-L/14@336 as the shared visual backbone for ground-view query images and reference images, the CLIP text encoder for $\mathbf{t}$ 7, and a small numerical regularizer on $\mathbf{t}$ 8 when $\mathbf{t}$ 9 and $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 0 are nearly collinear. Adam, a small learning rate, cosine decay, batch size 64, and 40 epochs are reported. On CORE and CVG-Text, MAPS with UniMAG consistently outperforms pairwise cosine-based alignment and geometry-focused alternatives such as GRAM and PMRL, with improvements of $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 1 $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 2 on CORE and up to $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 3 $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 4 improvement on some subsets when used for both training and retrieval (Hu et al., 21 Jun 2026).

The key misconception addressed in this formulation is that a joint query can be treated as two independent references. The paper’s argument is that joint image-text queries are intrinsically multi-anchor: a correct candidate must be consistent with both modalities in the joint subspace, not merely score highly against each one separately.

3. Two-dimensional light-pattern projection with orthogonal AODs

In the AOD setting, optical-domain joint projection is a literal optical process. A narrowband CW laser at $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 5 nm passes through two perpendicular AODs, each driven by an independently programmable RF waveform from a 2-channel AWG. In the linear diffraction regime, only zeroth and first orders are present, and only the $\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 6 order is relayed forward. Relay optics and a microscope objective perform an optical Fourier transform and image the diffracted field onto a glass window and then onto an sCMOS camera. For a single tone, spot position is linear in RF frequency,

$\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 7

with experimentally calibrated constant

$\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 8

With multi-tone driving, the two one-dimensional deflectors jointly generate an array of spots in the image plane (Decruyenaere et al., 21 Apr 2026).
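The joint frequency-to-position mapping can be sketched as a Cartesian product of two one-dimensional tone sets. The snippet below is an illustrative sketch only: the calibration constant, the center frequency, the offset form of the linear law, and the tone values are hypothetical placeholders, not the values reported in the paper. It also assumes that, in the linear regime, the intensity of each spot is proportional to the product of the two tone powers.

```python
import numpy as np

KAPPA_UM_PER_MHZ = 1.0   # hypothetical calibration constant (position per unit frequency)
F_CENTER_MHZ = 100.0     # hypothetical RF center frequency

def spot_positions(f_mhz):
    """Linear frequency-to-position law for one deflector (assumed offset form)."""
    return KAPPA_UM_PER_MHZ * (np.asarray(f_mhz, dtype=float) - F_CENTER_MHZ)

def joint_spot_grid(fx_mhz, fy_mhz):
    """Every tone pair (f^x_k, f^y_p) addresses one spot (x_k, y_p)."""
    x = spot_positions(fx_mhz)
    y = spot_positions(fy_mhz)
    return np.meshgrid(x, y, indexing="ij")

fx = [95.0, 100.0, 105.0]          # tones on the x deflector
fy = [98.0, 102.0]                 # tones on the y deflector
X, Y = joint_spot_grid(fx, fy)     # both of shape (3, 2)

# Assumed separable weighting: spot intensity ~ product of tone powers.
power_x = np.array([1.0, 0.5, 1.0])
power_y = np.array([1.0, 2.0])
spot_intensity = np.outer(power_x, power_y)

for k in range(X.shape[0]):
    for p in range(X.shape[1]):
        print(f"spot ({X[k, p]:+.1f}, {Y[k, p]:+.1f}) um, weight {spot_intensity[k, p]:.1f}")
```

Because the weights form an outer product, a single pair of multi-tone waveforms can only address separable intensity patterns directly.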

Each AOD is driven by

$\{f^x_k\}_k \times \{f^y_p\}_p \rightarrow \{(x_k,y_p)\}_{k,p},$ 9

and the field in the image plane is

$\lambda$ 0

where $\lambda$ 1. Time averaging produces

$\lambda$ 2

The incoherent term,

$\lambda$ 3

is the desired sum of spot intensities. The crucial technical point is that in 2D there are additional coherent artifact terms satisfying

$\lambda$ 4

which survive time averaging. The paper explicitly identifies these as persistent intermodulation products in the optical domain.

Artifact suppression is achieved through an incommensurately staggered frequency lattice,

$\lambda$ 5

For effective pitch

$\lambda$ 6

the minimum diagonal separation between interfering spots is

$\lambda$ 7

Increasing $\lambda$ 8 increases $\lambda$ 9, thereby reducing coherent artifacts because PSF overlap decays with distance. This directly contradicts the common assumption that rapid time averaging alone is sufficient in two dimensions; the paper shows that specific joint spectral design is required.

For separable target patterns,

$z$ 0

a single periodic joint multi-tone state suffices and no scanning is required. The paper demonstrates a uniform $z$ 1m square and a parabolic pattern $z$ 2 over a $z$ 3m region. As $z$ 4 increases, RMSE relative to an artifact-free reference decreases toward the shot-noise limit. For non-separable patterns, the method uses a nonnegative SVD decomposition and scans only the rank- $z$ 5 components. A $z$ 6 gray-scale Mondrian image is projected using rank- $z$ 7 and $z$ 8 approximations with $z$ 9s and total times $|R(u,v)+Q(u,v)|^2$ 0s. The reported comparison with line scanning is that the incommensurate joint projection is consistently faster at the same RMSE (Decruyenaere et al., 21 Apr 2026).
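The idea of approximating a non-separable pattern by a short sum of nonnegative separable components can be illustrated as follows. This sketch substitutes a generic nonnegative matrix factorization with Lee-Seung multiplicative updates for the paper's decomposition, whose details are not given here; the block test image, the iteration count, and all function names are assumptions.

```python
import numpy as np

def nonneg_rank_k(image, k, iters=500, seed=0, eps=1e-12):
    """Approximate image ~ W @ H with W, H >= 0 (multiplicative updates).

    Each of the k terms W[:, j] (outer) H[j, :] is a separable, nonnegative
    pattern that could be projected as one joint multi-tone state.
    """
    rng = np.random.default_rng(seed)
    n, m = image.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ image) / (W.T @ W @ H + eps)
        W *= (image @ H.T) / (W @ H @ H.T + eps)
    return W, H

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical block image loosely in the spirit of a gray-scale Mondrian pattern.
target = np.zeros((32, 32))
target[:16, :16] = 1.0
target[:16, 16:] = 0.2
target[16:, :12] = 0.5
target[16:, 12:] = 0.8

W1, H1 = nonneg_rank_k(target, k=1)
W3, H3 = nonneg_rank_k(target, k=3)
err1 = rmse(target, W1 @ H1)
err3 = rmse(target, W3 @ H3)
print(f"RMSE with 1 component: {err1:.4f}, with 3 components: {err3:.4f}")
```

Each additional component costs one more projected state, so the rank sets the trade-off between total projection time and approximation error.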

4. Broadband diffractive optics as static joint projectors

Broadband diffractive-optical elements implement joint projection in a static diffractive surface. A single multilevel relief pattern of pixel heights $|R(u,v)+Q(u,v)|^2$ 1 is designed so that, under broadband illumination, it projects different images in different spectral bands, different images in different image planes, and different magnifications when the source–BDOE distance changes. The device therefore realizes a joint mapping from wavelength $|R(u,v)+Q(u,v)|^2$ 2, propagation distance $|R(u,v)+Q(u,v)|^2$ 3, and illumination geometry to target intensity patterns (Meem et al., 2019).

For wavelength $|R(u,v)+Q(u,v)|^2$ 4, the complex transmission of a pixel is

$|R(u,v)+Q(u,v)|^2$ 5

and propagation to a plane at distance $|R(u,v)+Q(u,v)|^2$ 6 is modeled by scalar Fresnel diffraction,

$|R(u,v)+Q(u,v)|^2$ 7

A central point in the paper is that these devices are designed in the Fresnel regime rather than as narrow far-field diffractive orders. The optimized quantity is imaging efficiency, averaged over design wavelengths and, for multi-plane devices, over planes: $|R(u,v)+Q(u,v)|^2$ 8
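A scalar forward model of this kind can be sketched in a few lines. The code below is an assumption-laden illustration rather than the paper's design code: it uses the standard thin-element phase $2\pi(n-1)h/\lambda$ with a constant, dispersion-free refractive index, a Fresnel transfer-function propagator on a periodic FFT grid, and hypothetical values for pixel pitch, maximum height, wavelengths, and distance. The `fraction_in_mask` helper is only a stand-in for an efficiency metric, since the paper's exact definition is not reproduced here.

```python
import numpy as np

def transmission(height, wavelength, n_index=1.65):
    """Thin-element complex transmission; n_index is a hypothetical constant."""
    return np.exp(1j * 2.0 * np.pi * (n_index - 1.0) * height / wavelength)

def fresnel_propagate(field, wavelength, z, pitch):
    """Fresnel transfer-function propagation (constant phase exp(ikz) dropped)."""
    ny, nx = field.shape
    fy = np.fft.fftfreq(ny, d=pitch)
    fx = np.fft.fftfreq(nx, d=pitch)
    FY, FX = np.meshgrid(fy, fx, indexing="ij")
    H = np.exp(-1j * np.pi * wavelength * z * (FX ** 2 + FY ** 2))
    return np.fft.ifft2(np.fft.fft2(field) * H)

def fraction_in_mask(intensity, mask):
    """Share of total power landing inside a target region (illustrative metric)."""
    return float((intensity * mask).sum() / intensity.sum())

rng = np.random.default_rng(0)
n = 64
pitch = 20e-6                                  # hypothetical pixel width [m]
heights = rng.uniform(0.0, 2.5e-6, (n, n))     # hypothetical relief map [m]
mask = np.zeros((n, n))
mask[16:48, 16:48] = 1.0                       # hypothetical target region

results = {}
for wavelength in (450e-9, 550e-9, 650e-9):    # hypothetical design wavelengths
    field = transmission(heights, wavelength)  # unit-amplitude illumination
    out = fresnel_propagate(field, wavelength, z=50e-3, pitch=pitch)
    intensity = np.abs(out) ** 2
    results[wavelength] = (intensity.sum(), fraction_in_mask(intensity, mask))

for wavelength, (power, frac) in results.items():
    print(f"{wavelength * 1e9:.0f} nm: total power {power:.1f}, in-mask fraction {frac:.3f}")
```

An optimizer would adjust `heights` so that the in-mask fraction, averaged over wavelengths and planes, is as large as possible; the same relief map is reused for every wavelength and distance, which is where the crosstalk trade-off comes from.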

The reported demonstrations include a two-plane element that projects “+” at $|R(u,v)+Q(u,v)|^2$ 9 mm and “ $G$ 0” at $G$ 1 mm under $G$ 2 nm collimated white light, and a three-plane element with distinct images at $G$ 3, $G$ 4, and $G$ 5 mm. Crosstalk is quantified with SSIM. For the two-plane device, global SSIM between planes is $G$ 6 in simulation and $G$ 7 experimentally; for the three-plane device, simulation SSIM values are $G$ 8, $G$ 9, and $T$ 0, while experimental values are $T$ 1, $T$ 2, and $T$ 3. The paper presents this crosstalk as an intrinsic trade-off of optical-domain joint projection along $T$ 4: one relief pattern must satisfy conflicting constraints across multiple planes (Meem et al., 2019).

The multispectral example uses one BDOE to project a visible “rainbow heart” under $T$ 5 nm illumination and a lion silhouette under $T$ 6 nm illumination. Crosstalk between the visible and NIR images is low, with simulation SSIM $T$ 7 and measured SSIM $T$ 8. Average imaging efficiency for this visible+NIR device is $T$ 9 in simulation and $R_k$ 0 in measurement. Relative transmission efficiency is $R_k$ 1 over $R_k$ 2 nm for four BDOEs and $R_k$ 3 in $R_k$ 4 nm for the VIS/NIR BDOE. The paper also interprets the device as a lens whose point-spread function is the target image, obeying the thin-lens equation

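$\frac{1}{f}=\frac{1}{d_o}+\frac{1}{d_i},$

where $d_o$ is the source–BDOE distance, $d_i$ is the BDOE–image distance, and $f$ is the effective focal length.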

Fabrication details matter because they define the practical operating regime: pixel width $R_k$ 6m, maximum height $R_k$ 7m, 100 quantized height levels, and grayscale optical lithography in S1813 photoresist. Stylus profilometry shows height error standard deviation $R_k$ 8 nm. The paper argues that imprint-based replication makes such BDOEs suitable for large-area passive projectors and security elements (Meem et al., 2019).

5. Joint power-spectrum projection in opto-electronic correlators

Opto-electronic joint transform correlators form one of the most literal versions of joint projection. In a conventional OJTC, an input SLM displays a reference image $r(x,y)$ and a query image $q(x,y)$ side by side. A Fourier-transform lens produces

$R(u,v)+Q(u,v),$

where $R$ and $Q$ are the Fourier transforms of the two displayed images, and an FPA measures the joint power spectrum

$|R(u,v)+Q(u,v)|^2=|R|^2+|Q|^2+R^{*}Q+RQ^{*}.$

This intensity is then re-displayed on a second SLM and optically Fourier transformed again, producing correlation outputs (Gamboa et al., 18 Mar 2025).

The bias problem arises because the self-intensity terms $|R(u,v)|^2$ and $|Q(u,v)|^2$ are typically much stronger than the conjugate-product terms. The paper states that “the magnitude of the intensity terms will always be equal to or greater than the conjugate product terms.” In the numerical example used to illustrate finite-bit-depth limitations, after 0–255 scaling of the full JPS, the useful conjugate-product information occupies only the range $\mathbf{v}$ 05, or approximately $\mathbf{v}$ 06 of the 8-bit SLM range. A misconception directly addressed by the paper is that the whole JPS contributes equally to pattern recognition. In fact, the cross-correlation peaks are produced only by the conjugate-product terms; the self-intensity terms contribute background and consume dynamic range.

The proposed debiased or balanced architecture measures three intensity maps, $|R+Q|^2$, $|R|^2$, and $|Q|^2$, and computes

$\mathrm{JPS}_{\mathrm{bal}}=|R+Q|^2-|R|^2-|Q|^2=R^{*}Q+RQ^{*}.$

This balanced JPS is rescaled to the SLM range and Fourier transformed. Because only the cross-correlation-carrying terms are projected, the full available bit depth is devoted to the useful signal. The implementation discussed in the paper uses TI DLP471te DMD modules with resolution $\mathbf{v}$ 11, 8-bit depth, and up to $\mathbf{v}$ 12 fps, together with 10-bit Thorlabs Zelux CS165MU cameras. The subtraction and rescaling are described as simple arithmetic suitable for FPGA or GPU realization (Gamboa et al., 18 Mar 2025).
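The subtraction step can be verified with a small simulation. The sketch below is illustrative only: the random test patch, the array sizes, the patch positions, and the `to_slm` rescaling helper are assumptions, and the optics are replaced by ideal discrete Fourier transforms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
patch = rng.random((16, 16))

# Reference and query occupy separate regions of the input plane.
r_plane = np.zeros((n, n))
r_plane[24:40, 8:24] = patch        # reference, left
q_plane = np.zeros((n, n))
q_plane[24:40, 28:44] = patch       # query, shifted 20 columns to the right

R = np.fft.fft2(r_plane)
Q = np.fft.fft2(q_plane)

# Three measured intensity maps and the balanced combination.
jps = np.abs(R + Q) ** 2
balanced = jps - np.abs(R) ** 2 - np.abs(Q) ** 2
cross = 2.0 * np.real(np.conj(R) * Q)           # R*Q + RQ*, for comparison

def to_slm(img, bits=8):
    """Rescale an intensity map to integer SLM gray levels (hypothetical helper)."""
    levels = 2 ** bits - 1
    lo, hi = img.min(), img.max()
    return np.round((img - lo) / (hi - lo) * levels).astype(int)

slm_conventional = to_slm(jps)       # cross terms share the range with the bias
slm_balanced = to_slm(balanced)      # full range devoted to the cross terms

# Second Fourier stage: correlation peaks appear at the patch separation.
corr = np.abs(np.fft.fft2(balanced))
share = (cross.max() - cross.min()) / (jps.max() - jps.min())
print(f"cross terms span {share:.1%} of the conventional JPS range")
print(f"correlation at column offset 20: {corr[0, 20]:.3e}, global max: {corr.max():.3e}")
```

The balanced map equals the conjugate-product terms up to floating-point error, so rescaling it spends every available gray level on the part of the spectrum that produces the correlation peaks.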

The reported gains are large under finite quantization. In simulation, autocorrelation of a USAF chart yields a conventional JTC peak of approximately $\mathbf{v}$ 13 versus $\mathbf{v}$ 14 for BOJTC, feature extraction of a square gives approximately $\mathbf{v}$ 15 enhancement, and extracting the digit “1” from “1951” gives approximately $\mathbf{v}$ 16 enhancement. The abstract summarizes the overall effect as a nearly two orders of magnitude signal-to-noise ratio improvement under some conditions. In hardware, extracting digit “4” yields an output peak of $\mathbf{v}$ 17 for BOJTC and $\mathbf{v}$ 18 for JTC, an enhancement of approximately $\mathbf{v}$ 19. When the output Fourier transform is performed in software with a “perfect” SLM, BOJTC and JTC both reach approximately $\mathbf{v}$ 20, showing that the practical benefit derives from better use of limited SLM dynamic range. The cost is alignment sensitivity: deliberate misalignment drops the BOJTC peak to approximately $\mathbf{v}$ 21, while the JTC peak remains $\mathbf{v}$ 22 (Gamboa et al., 18 Mar 2025).

6. Projection-domain generalizations in tomography and visual quality assessment

A related extension of joint projection appears in computational imaging, where the primary random object is not a reconstructed image but the projection data itself. In CT, PRO models sinograms with a latent diffusion framework conditioned by anatomical text prompts. Instead of learning $\mathbf{v}$ 23, it learns

$\mathbf{v}$ 24

with latent encoding

$\mathbf{v}$ 25

a diffusion loss

$\mathbf{v}$ 26

and downstream reconstruction through

$\mathbf{v}$ 27

The stated motivation is that Radon-space data encode “attenuation properties, geometric structures, and view dependent anatomical information” that are degraded or lost during reconstruction. On 1,000 generated CT images, PRO reports FID $\mathbf{v}$ 28, IS $\mathbf{v}$ 29, and KID $\mathbf{v}$ 30, outperforming image-domain baselines on IS and matching or improving FID and KID. PRO-generated data also improve sparse-view and low-dose reconstruction: for GMSD, training on PRO data gives PSNR $\mathbf{v}$ 31 and SSIM $\mathbf{v}$ 32, compared with $\mathbf{v}$ 33 and $\mathbf{v}$ 34 using real AAPM data; for OSDM at noise level $\mathbf{v}$ 35, PSNR improves from $\mathbf{v}$ 36 to $\mathbf{v}$ 37 while MSE remains $\mathbf{v}$ 38 (Chen et al., 16 Jun 2025).

PET tracer conversion adopts a more explicitly joint diffusion construction in projection space. PJDM uses paired $^{18}$F-FDG and $^{18}$F-DOPA sinograms in a coarse estimation stage based on a denoising diffusion bridge model, followed by a prior refinement stage using a conditional DDPM. The bridge defines endpoints

$\mathbf{v}$ 41

while the refinement stage conditions on a degraded prior $\mathbf{v}$ 42 and optimizes

$\mathbf{v}$ 43

The training uses $\mathbf{v}$ 44 paired FDG–DOPA scans for the bridge and $\mathbf{v}$ 45 unpaired DOPA scans for refinement. CE uses maximum sampling steps $\mathbf{v}$ 46, and the prior is introduced at timestep $\mathbf{v}$ 47. On reconstructed images, PJDM reports PSNR $\mathbf{v}$ 48, SSIM $\mathbf{v}$ 49, and NRMSE $\mathbf{v}$ 50, compared with PSNR $\mathbf{v}$ 51, SSIM $\mathbf{v}$ 52, and NRMSE $\mathbf{v}$ 53 for the cold diffusion baseline. Ablation shows that using both CE and PR is superior to either alone: PSNR $\mathbf{v}$ 54 versus $\mathbf{v}$ 55 without PR and $\mathbf{v}$ 56 without CE (Chen et al., 20 Jun 2025).

Projection-domain reasoning also appears in perceptual quality assessment of point clouds. JGC-ProjQM projects reference and degraded point clouds onto six cube faces,

$\mathbf{v}$ 57

using occupancy maps and near/far depth maps to form aligned 2D optical-domain views. Geometry and color are handled jointly through two branches: one uses reference geometry and recolors degraded attributes onto it, the other uses degraded geometry and recolors reference attributes onto it. Projection visibility is enforced through rules such as

$\mathbf{v}$ 58

and

$\mathbf{v}$ 59

After filtering, cropping, and Navier–Stokes inpainting, 2D IQA metrics are applied to the projections and fused as

$\mathbf{v}$ 60

With DISTS as the 2D metric, the method reports PLCC $\mathbf{v}$ 61, SROCC $\mathbf{v}$ 62, and RMSE $\mathbf{v}$ 63, compared with PLCC $\mathbf{v}$ 64 and $\mathbf{v}$ 65 for D1-PSNR and D2-PSNR, respectively. The paper summarizes the Pearson correlation gains regarding D1-PSNR and D2-PSNR as $\mathbf{v}$ 66 and $\mathbf{v}$ 67 when all coding degradations are considered (Javaheri et al., 2021).
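The cube-face projection step described above can be sketched for a single face. This is a simplified illustration, not the JGC-ProjQM implementation: it assumes an orthographic projection along one coordinate axis, bounding-box normalization to a fixed resolution, and a nearest-point rule for the near layer, and it leaves out the far layer, recoloring, filtering, cropping, and inpainting.

```python
import numpy as np

def project_to_face(points, colors, res, axis=2):
    """Orthographic projection along `axis`, keeping the nearest point per pixel.

    Returns an occupancy map, a near-depth map, and a color image.
    """
    uv_axes = [a for a in range(3) if a != axis]
    mins = points.min(axis=0)
    maxs = points.max(axis=0)
    scale = (res - 1) / np.maximum(maxs - mins, 1e-12)
    uv = np.round((points[:, uv_axes] - mins[uv_axes]) * scale[uv_axes]).astype(int)
    depth = points[:, axis] - mins[axis]

    occupancy = np.zeros((res, res), dtype=bool)
    near = np.full((res, res), np.inf)
    image = np.zeros((res, res, 3))

    # Visit points from far to near so the nearest point is written last.
    for i in np.argsort(-depth):
        u, v = uv[i]
        occupancy[u, v] = True
        near[u, v] = depth[i]
        image[u, v] = colors[i]
    return occupancy, near, image

# Toy cloud: two points share a pixel at different depths, one point stands apart.
points = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.5]])
colors = np.array([[1.0, 0.0, 0.0],    # red, nearest at pixel (0, 0)
                   [0.0, 1.0, 0.0],    # green, hidden behind red
                   [0.0, 0.0, 1.0]])   # blue

occupancy, near, image = project_to_face(points, colors, res=4)
print("occupied pixels:", int(occupancy.sum()))
print("color at (0, 0):", image[0, 0])
```

Repeating the projection for the six axis directions, once per geometry condition, gives aligned image pairs on which a 2D metric can be evaluated.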

Taken together, these projection-domain methods extend optical-domain joint projection beyond literal light-field synthesis. They preserve the same structural commitment: operate where joint constraints are physically or geometrically native—subspace, spectrum, sinogram, or projected view—rather than after those constraints have been weakened by pairwise fusion, sequential scanning, or inverse reconstruction.