Semantic Flow: Learning Semantic Field of Dynamic Scenes from Monocular Videos (2404.05163v1)
Abstract: In this work, we pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos. In contrast to previous NeRF methods that reconstruct dynamic scenes from the colors and volume densities of individual points, Semantic Flow learns semantics from continuous flows that contain rich 3D motion information. Since a 2D-to-3D ambiguity arises along the viewing direction when extracting 3D flow features from 2D video frames, we treat the volume densities as opacity priors that describe the contributions of flow features to the semantics on the frames. More specifically, we first learn a flow network to predict flows in the dynamic scene, and propose a flow feature aggregation module to extract flow features from video frames. Then, we propose a flow attention module to extract motion information from flow features, followed by a semantic network that outputs semantic logits of flows. We integrate the logits with volume densities along the viewing direction to supervise the flow features with semantic labels on video frames. Experimental results show that our model is able to learn from multiple dynamic scenes and supports a series of new tasks such as instance-level scene editing, semantic completion, dynamic scene tracking, and semantic adaptation to novel scenes. Code is available at https://github.com/tianfr/Semantic-Flow/.
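The integration of semantic logits with volume densities along the viewing direction follows the standard volume-rendering composite used in NeRF-style methods. Below is a minimal sketch of that step, assuming per-sample densities `sigma`, per-sample semantic logits `logits`, and inter-sample distances `deltas` for one ray (these names are illustrative, not the authors' API); the rendered 2D logits would then be supervised with semantic labels on the video frames.

```python
# Minimal sketch (not the authors' code): composite per-sample semantic logits
# along a ray using volume densities as opacity weights, NeRF-style.
import torch

def composite_semantics(sigma, logits, deltas):
    """Render 2D semantic logits for one ray.

    sigma:  (N,)   volume densities of the N samples along the ray
    logits: (N, C) per-sample semantic logits over C classes
    deltas: (N,)   distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                         # accumulated transmittance T_i
    weights = alpha * trans                                   # contribution of each sample
    return (weights[:, None] * logits).sum(dim=0)             # (C,) rendered logits for the pixel
```

The rendered logits can then be compared against the frame's 2D semantic labels (e.g. with a cross-entropy loss), so that supervision on images propagates back to the 3D flow features.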