Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images (2404.16423v1)

Published 25 Apr 2024 in cs.CV and cs.RO

Abstract: Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.
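The abstract describes learning an object graph whose vertices are recognized components and whose edges encode the 3D model's topology, from which an assembly plan is derived. The final step of that pipeline can be illustrated as ordering components by their support relations; the sketch below is a minimal illustration (not the paper's actual method), where component names and the `assembly_order` helper are hypothetical, and "a feasible order adhering to physical rules" is approximated as a topological sort of a support graph.

```python
from collections import defaultdict, deque

def assembly_order(components, support_edges):
    """Derive one feasible assembly order from an object graph.

    components: list of component ids (graph vertices).
    support_edges: list of (below, above) pairs meaning `below`
        must be placed before `above` (graph edges / topology).
    Returns a placement order via Kahn's topological sort.
    """
    indegree = {c: 0 for c in components}
    children = defaultdict(list)
    for below, above in support_edges:
        children[below].append(above)
        indegree[above] += 1

    # Start with components that rest directly on the ground plane.
    queue = deque(c for c in components if indegree[c] == 0)
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for nxt in children[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(components):
        raise ValueError("support graph contains a cycle")
    return order

# Hypothetical four-block model: two pillars on a base, roof on top.
blocks = ["base", "pillar_left", "pillar_right", "roof"]
edges = [("base", "pillar_left"), ("base", "pillar_right"),
         ("pillar_left", "roof"), ("pillar_right", "roof")]
print(assembly_order(blocks, edges))
# → ['base', 'pillar_left', 'pillar_right', 'roof']
```

The real model must additionally recognize components and estimate their poses from multi-view images before any such graph exists; this sketch only shows why an explicit graph makes the ordering step straightforward.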

