PlückeRF: Line-Based 3D Reconstruction
- PlückeRF is a feed-forward 3D reconstruction method that represents both image pixel rays and 3D volume elements as lines using Plücker coordinates for explicit geometric alignment.
- It employs a learnable, closed-form line-to-line attention bias that promotes spatially correlated information sharing, reducing blurring in occluded regions.
- Empirical evaluations demonstrate improved PSNR and novel-view synthesis fidelity on benchmarks compared to traditional triplane and voxel-based methods.
PlückeRF is a feed-forward method for few-view 3D reconstruction that introduces a line-based 3D representation utilizing Plücker coordinates to implement an explicit, geometry-grounded connection between image pixel rays and the learned 3D scene representation. PlückeRF advances state-of-the-art feed-forward NeRF approaches by encoding both 3D volume elements and input image pixels as lines, allowing for preferential information sharing between spatially proximate regions in the image and scene via a learnable, closed-form line-to-line attention bias. This geometric inductive bias enables improved multi-view consistency and novel-view synthesis fidelity over triplane and voxel-based methods (Bahrami et al., 4 Jun 2025).
1. Motivation and Geometric Foundations
Feed-forward few-view 3D reconstruction models, such as triplane-based NeRF architectures, predict 3D scene structure directly from input images in a single forward pass, foregoing per-scene optimization. While these models leverage learned priors for shape and appearance, they typically treat multiple input views using feature concatenation or averaging, lacking explicit mechanisms to exploit projective geometry. Consequently, these methods cannot encode the association between 3D points and image pixels whose rays intersect those points, resulting in nondiscriminative information flow. This can cause blurring in occluded/unobserved regions and a weaker encoding of multi-view stereo cues.
PlückeRF introduces a mechanism whereby both the 3D scene and the image pixel rays are represented as lines (rather than points or voxels) using Plücker coordinates, a compact, origin-invariant representation of spatial lines. This facilitates the computation of a closed-form, differentiable distance measure between any two lines—image pixel rays and 3D grid-plane rays—which can then be introduced as a geometric bias in the model's attention mechanism. This bias anchors network attention to physically meaningful correspondences, promoting information sharing between spatially correlated regions while suppressing spurious cross-view connections.
2. Structured Line-Based Scene Representation
A 3D line in Plücker coordinates is defined as

$$\ell = (d,\, m) = (d,\, p \times d) \in \mathbb{R}^6,$$

where $p$ is a point on the line and $d$ is the unit direction. The homogeneity of Plücker coordinates is resolved by fixing $\|d\|_2 = 1$.

The minimal distance between lines $\ell_1 = (d_1, m_1)$ and $\ell_2 = (d_2, m_2)$ is given by

$$d(\ell_1,\ell_2) = \begin{cases} \dfrac{\left| d_1^\top m_2 + d_2^\top m_1 \right|}{\| d_1 \times d_2 \|_2} & \text{if } d_1 \times d_2 \neq 0 \\[2ex] \left\| d_1 \times \left( m_1 - (d_1^\top d_2)\,m_2 \right) \right\|_2 & \text{otherwise.} \end{cases}$$
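Both the line construction and the two-branch distance above can be sketched directly in numpy (a minimal illustration; `eps` is an assumed numerical threshold for the parallel case, not from the paper):

```python
import numpy as np

def plucker(p, d):
    """Build a Plücker line (d, m) from a point p on the line and a direction d."""
    d = d / np.linalg.norm(d)          # resolve homogeneity: unit direction
    return d, np.cross(p, d)           # moment m = p x d

def line_distance(l1, l2, eps=1e-9):
    """Closed-form minimal distance between two Plücker lines."""
    d1, m1 = l1
    d2, m2 = l2
    cross = np.cross(d1, d2)
    n = np.linalg.norm(cross)
    if n > eps:                        # skew or intersecting lines
        return abs(d1 @ m2 + d2 @ m1) / n
    # (anti)parallel lines
    return np.linalg.norm(np.cross(d1, m1 - (d1 @ d2) * m2))
```

For example, the $x$-axis and a line through $(0,0,1)$ along $y$ are skew at distance 1, while two intersecting lines return distance 0.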
Within the PlückeRF model, the internal representation retains a triplane layout with a fixed resolution per plane. For each grid cell of each plane, the grid-plane ray orthogonal to the plane at that cell's location is converted to Plücker coordinates, and a set of "3D tokens" is constructed by applying a learned linear projection to each 6D Plücker vector. These line tokens, paired with their geometric metadata, are more storage-efficient than standard voxel grids while explicitly encoding the spatial relationships needed for accurate line-to-line attention computation.
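The grid-plane ray construction can be sketched as follows (a toy sketch, assuming axis-aligned planes centered at the origin; the resolution and extent values are placeholders, not the paper's settings):

```python
import numpy as np

def triplane_plucker_lines(res=4, extent=1.0):
    """6D Plücker lines for rays orthogonal to each of the three axis-aligned
    planes, one per grid cell. Returns an array of shape (3*res*res, 6)."""
    coords = (np.arange(res) + 0.5) / res * 2 * extent - extent  # cell centers
    u, v = np.meshgrid(coords, coords, indexing="ij")
    lines = []
    for axis in range(3):                  # plane normal = ray direction
        d = np.zeros(3)
        d[axis] = 1.0
        for ui, vi in zip(u.ravel(), v.ravel()):
            p = np.zeros(3)                # a point on the ray, in the plane
            p[(axis + 1) % 3], p[(axis + 2) % 3] = ui, vi
            lines.append(np.concatenate([d, np.cross(p, d)]))  # (d, m = p x d)
    return np.stack(lines)
```

In the model, each of these 6D vectors would then pass through the learned linear projection to become a 3D token.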
3. Pixel-Ray Encoding and Attention Coupling
To couple image features with the 3D scene representation, each input image (with known camera intrinsics and extrinsics) is tokenized with a pre-trained DINOv2 ViT to yield patch features. Each patch's ray is obtained by back-projecting the patch center through the camera and is encoded as a Plücker vector $(d,\, o \times d)$, where $o$ is the camera center and $d$ the unit ray direction.
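The back-projection step is standard pinhole geometry; a sketch under the usual world-to-camera convention $x_{\text{cam}} = Rx + t$ (the paper's exact parameterization may differ):

```python
import numpy as np

def pixel_ray_plucker(u, v, K, R, t):
    """Plücker ray for pixel (u, v) of a pinhole camera with intrinsics K
    and world-to-camera extrinsics [R|t]."""
    cam_center = -R.T @ t                        # camera origin in world frame
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d = R.T @ d_cam                              # rotate direction to world frame
    d = d / np.linalg.norm(d)
    return np.concatenate([d, np.cross(cam_center, d)])
```

A camera at the origin looking down $+z$ maps its principal point to the ray $(0,0,1,\,0,0,0)$, i.e., zero moment since the ray passes through the origin.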
Transformer attention is biased by geometric proximity: for queries $Q$ (scene lines) and keys/values $K, V$ (image lines), cross-attention is

$$\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} - \lambda D\right)V,$$

where $D_{ij} = d(\ell_i, \ell_j)$ is the line distance between query line $i$ and key line $j$, and $\lambda \ge 0$ is a learnable weight. When two lines intersect, $D_{ij} = 0$ and attention is unpenalized; for distant lines, the negative bias suppresses attention.
Self-attention among 3D tokens uses the line distance matrix to bias information flow to spatially proximate lines within the scene representation, maintaining locality while retaining global transformer connectivity.
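A minimal numpy sketch of the distance-biased attention described above (the scalar `lam` stands in for the learned bias weight; the real model operates on learned token features, not these toy arrays):

```python
import numpy as np

def biased_cross_attention(Q, K, V, D, lam=1.0):
    """Cross-attention whose logits are penalized by lam * D,
    where D[i, j] is the Plücker line distance between query i and key j."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) - lam * D
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # softmax over keys
    return A @ V
```

With identical content features, a key at line distance 0 dominates one at large distance, which is exactly the intended geometric preference.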
4. Network Design and Training Regimen
The PlückeRF architecture comprises the following components:
- Image Encoder: A DINOv2 ViT encodes input images into patch features, each concatenated with its Plücker vector as positional encoding; the CLS token is assigned zero distance (no bias).
- 3D Token Initialization: Each grid-plane Plücker line is projected via a learned linear layer into the transformer's token dimension for subsequent processing.
- Transformer Decoder: A stack of 8 transformer blocks implements distance-biased self-attention among 3D tokens, distance-biased cross-attention to image tokens, feedforward MLPs, and normalization.
- Volume Rendering Decoder: 3D tokens are arranged into planes and upsampled. For novel view synthesis, per-pixel camera rays are cast, sample points along each ray are projected to the three planes, and the gathered features are summed and decoded by an MLP into density $\sigma$ and color $c$, followed by standard NeRF-style volume rendering.
- Losses: The model is supervised by a photometric MSE loss and an LPIPS perceptual loss, $\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_{\text{LPIPS}}\,\mathcal{L}_{\text{LPIPS}}$, with a small weight (e.g., $\lambda_{\text{LPIPS}} = 0.01$ late in training).
- Optimization: AdamW with weight decay 0.05; key hyperparameters include batch size 32, fixed numbers of input and comparison views, 500K–800K iterations per object class, and a cosine LR schedule with per-class peak values for Chairs and Cars. Inference for novel view synthesis takes approximately 15 s per object on an A100 GPU.
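The NeRF-style compositing step used by the volume rendering decoder can be sketched for a single ray (standard alpha compositing, not code from the paper):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Volume-render one ray: alpha-composite per-sample densities sigma_i
    and colors c_i with inter-sample spacings delta_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity per sample
    # transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                         # w_i = T_i * alpha_i
    return (weights[:, None] * colors).sum(axis=0), weights
```

An opaque first sample absorbs all the weight, so the rendered color equals that sample's color and the weights sum to one.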
5. Empirical Evaluation and Ablation
Empirical results on ShapeNet-SRN Chairs and Cars benchmarks demonstrate that PlückeRF achieves higher reconstruction fidelity than prior feed-forward and triplane architectures. Table 1 summarizes main results (all metrics directly from (Bahrami et al., 4 Jun 2025)):
| Method (2 views) | PSNR (Chairs) | PSNR (Cars) | LPIPS (Chairs/Cars) | SSIM (Chairs/Cars) |
|---|---|---|---|---|
| pixelNeRF | 25.97 | 25.66 | - | - |
| SplatterImg | 25.72 | 26.01 | - | - |
| Ours w/o bias | 27.67 | 25.26 | - | - |
| Ours (PlückeRF) | 28.22 | 25.54 | 0.045/0.070 | 0.96/0.94 |
On extrapolated views (90° from both inputs), PlückeRF's gains are larger (Chairs PSNR = 27.97 vs pixelNeRF 25.33). Qualitatively, reconstructions show sharper outlines and details in unseen/occluded regions compared to strong blurring in baseline methods.
Ablation studies reveal that removing the line-to-line attention bias, the learnable bias weight, the Plücker positional encoding, DINOv2 fine-tuning, or the LPIPS loss each leads to a measurable drop in reconstruction quality, confirming the critical role of the geometric bias.
6. Discussion and Implications
The core inductive bias of PlückeRF is the explicit coupling of 3D structure to image observations via transformer attention weighted by Plücker line distances, reflecting actual physical intersections. This geometric approach preserves the efficiency and speed of feed-forward NeRF-like models while overcoming prior limitations in spatial feature association.
Limitations persist: despite the geometric bias, inference remains slower than 2D image-based methods due to NeRF-style volume rendering, and blurring can still occur in highly unobserved regions. Future directions suggested include integrating the Plücker-based bias with diffusion models or adopting alternative sparse 3D bases, such as Gaussian mixtures, to improve efficiency and expressiveness.
In summary, PlückeRF establishes a mechanism for feed-forward 3D reconstruction frameworks that respects the underlying projective geometry of multi-view observations, achieving improved novel-view synthesis and generalization beyond the capabilities of triplane- and voxel-based attention mechanisms (Bahrami et al., 4 Jun 2025).