
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding (2311.01091v2)

Published 2 Nov 2023 in cs.CV

Abstract: Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, and these contexts tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment because it lacks object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. We also propose a Phrase-Object Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and pushes unmatched ones apart, so that more precise object contexts are aggregated from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance by large margins.
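
The abstract describes the phrase-object contrastive idea only at a high level, so the following is a minimal sketch of one possible reading rather than the paper's actual formulation. It assumes an InfoNCE-style loss over cosine similarities; the function name `phrase_object_contrastive_loss`, the `match` matrix, and the `temperature` value are all hypothetical and do not come from the source.

```python
import numpy as np


def phrase_object_contrastive_loss(phrase_feats, object_feats, match, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's definition).

    phrase_feats: (P, D) array of phrase features.
    object_feats: (O, D) array of object-token features.
    match:        (P, O) binary array, 1 where a phrase and an object are matched.
    """
    # L2-normalize so the dot product is a cosine similarity.
    p = phrase_feats / np.linalg.norm(phrase_feats, axis=1, keepdims=True)
    o = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)

    # Scaled similarity between every phrase and every object token.
    logits = (p @ o.T) / temperature

    # Numerically stable softmax over object tokens for each phrase.
    logits = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=1, keepdims=True)

    # Probability mass that each phrase places on its matched objects.
    match = np.asarray(match, dtype=bool)
    pos = (prob * match).sum(axis=1)

    # Skip phrases that have no matched object at all.
    valid = match.any(axis=1)
    if not valid.any():
        return 0.0

    # Minimizing -log(pos) pulls matched pairs together and pushes others apart.
    return float(-np.log(np.maximum(pos[valid], 1e-12)).mean())
```

With this reading, the loss is small when each phrase is most similar to its matched object tokens and grows when the similarity mass falls on unmatched ones.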

Authors (8)
  1. Tianrui Hui (15 papers)
  2. Zihan Ding (38 papers)
  3. Junshi Huang (24 papers)
  4. Xiaoming Wei (44 papers)
  5. Xiaolin Wei (42 papers)
  6. Jiao Dai (17 papers)
  7. Jizhong Han (48 papers)
  8. Si Liu (130 papers)
Citations (4)