- The paper presents a novel approach that combines transformer-based 3D line segment detection with analytic geometric regression for efficient 6DoF pose estimation using limited data.
- The method decomposes pose estimation into detecting four key top-edge line segments and applying robust geometric fitting to accurately infer bin orientation and position.
- Empirical results show mean translation and rotation errors of 3.1 cm and 8.3°, outperforming previous methods without requiring instance-specific CAD models.
Detecting 3D Line Segments for 6DoF Pose Estimation with Limited Data
Problem Overview and Motivation
Estimating the 6DoF pose of objects in 3D vision remains central to applications in industrial automation, where accurate localization of objects—such as bins used in material handling—is required for tasks like bin picking and robot arm trajectory planning. Existing deep learning methods for 6DoF pose estimation generally require extensive annotated datasets or CAD models for each object instance, imposing strong data and modeling constraints that hinder practical deployment across variable, real-world settings. This work specifically tackles the industrial bin pose estimation scenario, characterized by limited real data and the absence of CAD models, by leveraging geometric priors inherent to bin-like objects.
Methodology
The proposed approach decomposes the 6DoF pose estimation task into two stages: (i) robust 3D line segment detection and (ii) geometric regression for pose inference. The core hypothesis is that intermediate geometric primitives, specifically the four 3D top-edge line segments of cuboidal bins, provide a compact, data-efficient basis for downstream pose computation.
The pipeline is shown schematically below.
Figure 1: Overview of the bin pose estimation pipeline, illustrating detection of 3D line segments from the structured point cloud and subsequent geometric regression to 6DoF pose.
3D Line Segment Detection via LeTR Adaptation
At the heart of the system is a novel adaptation of LeTR, a transformer-based model originally formulated for 2D line segment detection. The network is modified for structured point clouds: the final prediction head is reparameterized for six outputs representing 3D endpoints instead of four for 2D, and normalization schemes are altered due to unknown global depth bounds.
Training is conducted in three stages, following the original LeTR protocol but leveraging pretraining on 2D inputs for improved convergence. The loss uses bipartite matching between predicted and annotated segments and combines a binary cross-entropy classification term with an L1 endpoint regression term. Notably, the experiments indicate that using exactly four prediction queries, one per top edge, yields the lowest pose regression error.
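The matching-based loss described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `match_and_loss`, the weight `l1_weight`, and the form of the matching cost are assumptions, and the brute-force search over assignments stands in for the Hungarian algorithm that set-prediction models typically use (it is only practical here because there are so few segments).

```python
from itertools import permutations

import numpy as np


def match_and_loss(pred_segments, pred_probs, gt_segments, l1_weight=5.0):
    """Match predictions to annotations one-to-one, then compute BCE + L1.

    pred_segments: (n_pred, 6) predicted 3D endpoints (x1, y1, z1, x2, y2, z2)
    pred_probs:    (n_pred,) predicted probability that a query is a real line
    gt_segments:   (n_gt, 6) annotated segments, with 1 <= n_gt <= n_pred
    l1_weight is a hypothetical weighting, not a value taken from the paper.
    """
    pred_segments = np.asarray(pred_segments, dtype=float)
    pred_probs = np.asarray(pred_probs, dtype=float)
    gt_segments = np.asarray(gt_segments, dtype=float)
    n_pred, n_gt = len(pred_segments), len(gt_segments)

    # Pairwise L1 distance between every prediction and every annotation.
    l1 = np.abs(pred_segments[:, None, :] - gt_segments[None, :, :]).sum(axis=-1)
    # Assumed matching cost: close endpoints and high confidence are cheap.
    cost = l1_weight * l1 - pred_probs[:, None]

    # Brute-force bipartite matching: perm[g] is the prediction assigned to gt g.
    best_perm, best_cost = None, np.inf
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p, g] for g, p in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total

    # Classification target: 1 for matched queries, 0 for the rest.
    labels = np.zeros(n_pred)
    labels[list(best_perm)] = 1.0
    p = np.clip(pred_probs, 1e-7, 1 - 1e-7)
    bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    # Regression term: L1 over matched pairs only.
    l1_loss = np.mean([l1[p_idx, g] for g, p_idx in enumerate(best_perm)])
    return best_perm, float(bce + l1_weight * l1_loss)
```

Because the matching is recomputed for every sample, the network is free to emit the four edges in any order; only the set of segments matters.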

Figure 2: Visualization of detected 3D line segments and annotated lines (top), and corresponding estimated vs. ground-truth bin pose (bottom).
Pose Regression via Geometric Fitting
Given the four most confident line segment detections, pose estimation proceeds as follows:
- Endpoints are centered, and a plane is robustly fitted via SVD to infer bin orientation.
- Line directions are exploited to resolve the bin’s principal axes, utilizing the two longest predicted lines.
- Pose rotation is established using Gram-Schmidt orthogonalization on the orientation normal and averaged direction vector.
- Translation estimation applies iterative merging to correct for possible clustering of endpoints, followed by alignment with bin height priors for accurate centroid prediction.
This hybrid of learned line detection and analytic geometric fitting sidesteps common pitfalls of direct pose regression, such as inappropriate rotation representations or misalignment due to symmetries.
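The geometric fitting steps can be illustrated with a short NumPy sketch. It is written under several assumptions that are not taken from the paper: the function name `pose_from_top_edges`, the `up_hint` vector used to resolve the sign of the plane normal, and the translation rule (rim centroid shifted down by half the bin height) are all hypothetical, and the iterative endpoint-merging step is omitted.

```python
import numpy as np


def pose_from_top_edges(segments, bin_height, up_hint=(0.0, 0.0, 1.0)):
    """Estimate (rotation, translation) from four 3D top-edge segments.

    segments:   (4, 2, 3) array of segment endpoints
    bin_height: known bin height prior, in the same units as the points
    up_hint:    assumed rough 'up' direction used to orient the normal
    """
    segments = np.asarray(segments, dtype=float)
    points = segments.reshape(-1, 3)
    centroid = points.mean(axis=0)

    # Plane fit: the right singular vector with the smallest singular
    # value of the centered endpoints is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    if np.dot(normal, up_hint) < 0:
        normal = -normal

    # Principal in-plane axis from the two longest segments.
    directions = segments[:, 1] - segments[:, 0]
    lengths = np.linalg.norm(directions, axis=1)
    i, j = np.argsort(lengths)[-2:]
    d_i = directions[i] / lengths[i]
    d_j = directions[j] / lengths[j]
    if np.dot(d_i, d_j) < 0:  # make both directions point the same way
        d_j = -d_j
    mean_dir = d_i + d_j

    # Gram-Schmidt: remove the normal component, then normalize.
    x_axis = mean_dir - np.dot(mean_dir, normal) * normal
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(normal, x_axis)  # completes a right-handed frame

    rotation = np.column_stack([x_axis, y_axis, normal])
    # Assumed convention: the bin origin lies half a bin height below the rim.
    translation = centroid - 0.5 * bin_height * normal
    return rotation, translation
```

Since the rotation is assembled from orthonormalized vectors rather than regressed directly, it is a valid rotation matrix by construction. The sign of the in-plane axis stays ambiguous, which reflects the symmetry of a rectangular bin.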
Dataset Construction and Augmentation
A critical contribution is the creation and public release of an extended dataset comprising both real and synthetic 3D scans of industrial bins. The real scans are obtained using a dual-camera robotic arm fixture, covering multiple views and bin types. Synthetic scans are generated to introduce variability and address the limited size of real data, a strategy shown empirically to improve downstream generalization.

Figure 3: Diverse samples from the new dataset; green lines depict annotated top edges. Left: real scans, right: synthetic data.
The dataset is stratified into train, validation, and test splits by entire scenes, ensuring no scene-level leakage and maintaining strong reproducibility for benchmarking.
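A scene-level split of this kind can be sketched as below. The function name `split_by_scene`, the scan and scene identifiers, and the split fractions are illustrative assumptions, not the released dataset's actual tooling or proportions.

```python
import random


def split_by_scene(scan_to_scene, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole scenes to train/val/test so no scene spans two splits.

    scan_to_scene: dict mapping a scan id to the id of the scene it came from
    fractions:     hypothetical train/val/test proportions over scenes
    """
    scenes = sorted(set(scan_to_scene.values()))
    random.Random(seed).shuffle(scenes)  # a fixed seed keeps the split reproducible

    n_train = int(round(fractions[0] * len(scenes)))
    n_val = int(round(fractions[1] * len(scenes)))
    groups = {
        "train": set(scenes[:n_train]),
        "val": set(scenes[n_train:n_train + n_val]),
        "test": set(scenes[n_train + n_val:]),
    }
    # Every scan follows its scene into exactly one split.
    return {
        name: sorted(scan for scan, scene in scan_to_scene.items() if scene in group)
        for name, group in groups.items()
    }
```

Splitting by scan instead of by scene would let near-identical views of the same bin arrangement land in both train and test, which inflates test accuracy.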
Experimental Results and Analysis
The evaluation benchmarks the proposed method (LeTR 3D) against prior deep CNN-based regression ("Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans" [prepravky]), state-of-the-art category-level methods (HS-Pose [hspose]), and recent foundation models (FoundationPose [wen2024foundationpose]). All baselines except FoundationPose are retrained on the new dataset, using the available CAD models or bin dimensions where a method requires them.
A series of ablation studies examines the impact of:
- Number of prediction queries: 4 yields minimal regression error.
- Cutout augmentation: negligible effect on translation error but visible reduction in rotation error at increased occlusion rates (c_max = 0.8).
- Synthetic data: Inclusion boosts both translation and rotation accuracy, underlining the utility of simulated scans for model robustness.
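As a rough illustration of the cutout augmentation in the ablation, the sketch below blanks a random rectangle in a structured point cloud. The interpretation of the maximum occlusion parameter as an upper bound on the fraction of each image side that may be removed is an assumption, as are the function name `cutout` and the choice of zero as the fill value for missing points.

```python
import numpy as np


def cutout(cloud, c_max, rng, fill_value=0.0):
    """Blank out one random rectangle of a structured point cloud.

    cloud: (H, W, 3) array of per-pixel 3D points
    c_max: assumed upper bound on the cut fraction along each side
    rng:   a numpy.random.Generator
    """
    h, w = cloud.shape[:2]
    frac_h, frac_w = rng.uniform(0.0, c_max, size=2)
    cut_h, cut_w = int(frac_h * h), int(frac_w * w)

    # Place the rectangle uniformly so that it stays inside the grid.
    top = rng.integers(0, h - cut_h + 1)
    left = rng.integers(0, w - cut_w + 1)

    out = cloud.copy()  # leave the input scan untouched
    out[top:top + cut_h, left:left + cut_w] = fill_value
    return out
```

Under this reading, a bound of 0.8 removes at most 64% of the pixels in a single call, which forces the detector to infer edges it cannot see directly.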

Figure 4: Predicted line segments (test set examples); note inference of occluded/missing bin corners, highlighting the model's geometric reasoning.
On the test set, LeTR 3D achieves mean translation and rotation errors of 3.1 cm and 8.3°, respectively, outperforming all baselines, and it does so without instance-specific CAD models or segmentation masks at inference time. The baselines show substantially higher errors: direct regression yields up to 25.2° rotation error, while FoundationPose and HS-Pose degrade markedly because of imperfect bin models and a lack of adaptation to scarce-data, CAD-free scenarios.
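For reference, translation and rotation errors of this kind are commonly computed as the Euclidean distance between positions and the geodesic angle between rotation matrices. The sketch below assumes those standard definitions; the paper's exact protocol (for example, how bin symmetries are handled) may differ, and the function names are illustrative.

```python
import numpy as np


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth positions."""
    return float(np.linalg.norm(np.asarray(t_pred, float) - np.asarray(t_gt, float)))


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_pred, float).T @ np.asarray(R_gt, float)
    # trace(R) = 1 + 2 cos(theta); clip to guard against rounding error.
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

With positions expressed in metres, a translation error of 3.1 cm corresponds to a returned value of 0.031.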
Implications and Limitations
This work validates a modular approach where geometric prior structure and intermediate representation learning provide strong inductive biases, facilitating accurate pose estimation with limited domain data. The findings suggest that, for structured objects exhibiting geometric regularity (e.g., industrial bins), direct regression approaches can be suboptimal, especially when data or modeling resources are restricted.
This strategy has direct implications for rapid deployment of pose estimation systems in dynamic industrial environments, where collecting extensive annotated scans or CAD models remains infeasible. Additionally, the dataset and code release contribute valuable resources for benchmarking and method development in practical 6DoF pose estimation.
Nevertheless, the method's design presupposes that the bins have parallel, cuboidal top edges and are oriented upright. It cannot currently generalize to bins with substantial geometric deviations (e.g., oval corners, overturned orientation) or resolve pose in ambiguous or heavily occluded contexts. Extending the pipeline to arbitrary orientations and more complex bin geometries, and integrating voting-based filtering for robust line selection, remain open research questions.
Conclusion
By combining transformer-based 3D line segment detection with analytic geometric regression, this work establishes a data-efficient and annotation-light solution for 6DoF pose estimation of cuboidal bins. The approach is empirically validated to outperform established and recent state-of-the-art methods, particularly in the real-world regime without instance-specific CAD models. Future research can build upon this foundation by enhancing synthetic data realism, expanding to generalized polyhedral shapes, and introducing orientation-agnostic priors for broader applicability.