Video Object Segmentation with Dynamic Query Modulation (2403.11529v1)
Abstract: Spatial-temporal memory-based methods, which store intermediate frame segmentations as memory for long-range context modeling, have recently shown impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) they rely on non-local pixel-level matching to read memory, which yields noisy retrieved features for segmentation; 2) they segment each object independently, without interaction between objects. As a result, memory-based methods struggle when objects look similar or when multiple objects must be segmented. To address these issues, we propose a query modulation method, termed QMVOS. It summarizes object features into dynamic queries and then uses them as dynamic filters for mask prediction, giving the model high-level descriptions and object-level perception. Inter-query attention enables efficient and effective multi-object interaction. Extensive experiments demonstrate that our method brings significant improvements to memory-based SVOS methods and achieves competitive performance on standard SVOS benchmarks. The code is available at https://github.com/zht8506/QMVOS.
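To make the idea in the abstract concrete, the following is a minimal NumPy sketch of its three ingredients: summarizing object features into per-object queries, letting the queries interact through attention, and applying each query as a dynamic filter to produce mask logits. It is an illustration under stated assumptions, not the QMVOS implementation: the function names, the use of masked average pooling, and the projection-free single-head attention are all hypothetical simplifications chosen for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def summarize_queries(features, masks, eps=1e-6):
    """Summarize each object's features into one query vector.

    Assumption: masked average pooling stands in for the summarization step.
    features: (C, H, W) frame features; masks: (N, H, W) soft or binary masks.
    Returns queries of shape (N, C).
    """
    c = features.shape[0]
    flat_feat = features.reshape(c, -1)               # (C, H*W)
    flat_mask = masks.reshape(masks.shape[0], -1)     # (N, H*W)
    weights = flat_mask / (flat_mask.sum(axis=1, keepdims=True) + eps)
    return weights @ flat_feat.T                      # (N, C)


def inter_query_attention(queries):
    """Let object queries exchange information.

    Assumption: single-head self-attention with a residual connection and
    no learned projections. queries: (N, C) -> (N, C).
    """
    d = queries.shape[1]
    attn = softmax(queries @ queries.T / np.sqrt(d), axis=-1)  # (N, N)
    return queries + attn @ queries


def predict_masks(queries, features):
    """Apply each query as a 1x1 dynamic filter over the feature map.

    queries: (N, C); features: (C, H, W). Returns logits of shape (N, H, W).
    """
    return np.einsum("nc,chw->nhw", queries, features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 4, 4))        # toy feature map
    prev_masks = rng.random(size=(2, 4, 4))   # toy soft masks for 2 objects
    q = inter_query_attention(summarize_queries(feats, prev_masks))
    logits = predict_masks(q, feats)          # one logit map per object
    labels = logits.argmax(axis=0)            # per-pixel object assignment
```

Because every object is reduced to a single vector, the interaction step costs on the order of N² operations for N objects rather than scaling with the number of pixels, which is one way to read the abstract's claim that multi-object interaction is efficient.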
Authors: Hantao Zhou, Runze Hu, Xiu Li