Cycle-Consistency Learning for Captioning and Grounding (2312.15162v1)
Abstract: We show that visual grounding and image captioning, two mutually inverse processes, can be bridged for collaborative training through careful design. Building on this idea, we introduce CyCo, a cycle-consistent learning framework that improves upon the independent training pipelines of visual grounding and image captioning. The proposed framework (1) enables semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; and (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one is competitive with its fully supervised counterparts. Our image captioning model can freely describe image regions while also performing strongly on prevalent captioning benchmarks.
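To make the cycle-consistency idea concrete, below is a minimal PyTorch sketch: a region is captioned, the caption is re-grounded, and the disagreement between the original and re-grounded boxes becomes a training signal for both models. The `ToyCaptioner`/`ToyGrounder` modules, the continuous caption embedding (which keeps the cycle differentiable, sidestepping discrete text), and the GIoU loss are all illustrative assumptions; the abstract does not specify CyCo's actual architecture or objectives.

```python
# Hedged sketch of a captioning <-> grounding cycle-consistency objective.
# Every interface and loss choice here is an assumption, not CyCo's design.
import torch
import torch.nn as nn
from torchvision.ops import generalized_box_iou_loss


class ToyCaptioner(nn.Module):
    """Stand-in captioner: encodes a region box into a soft 'caption' embedding.
    A real captioner would condition on image features and emit tokens; a
    continuous embedding keeps this toy cycle end-to-end differentiable."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.encode = nn.Linear(4, dim)

    def forward(self, box: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.encode(box))


class ToyGrounder(nn.Module):
    """Stand-in grounder: decodes a caption embedding back to a box."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.decode = nn.Linear(dim, 4)

    def forward(self, caption: torch.Tensor) -> torch.Tensor:
        cx, cy, w, h = self.decode(caption).sigmoid().unbind(-1)
        # Center/size parameterization guarantees x1 <= x2 and y1 <= y2.
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], -1)


captioner, grounder = ToyCaptioner(), ToyGrounder()
region = torch.tensor([[0.2, 0.3, 0.6, 0.8]])       # annotated region (x1,y1,x2,y2)
regrounded = grounder(captioner(region))            # region -> caption -> box
loss = generalized_box_iou_loss(regrounded, region, reduction="mean")
loss.backward()                                     # updates both models jointly
print(f"cycle-consistency loss: {loss.item():.3f}")
```

In a full pipeline the caption would be discrete text, so back-propagating through it would require a relaxation (e.g., soft token embeddings) or a reinforcement-style objective; the sketch avoids that complication to show only the cycle itself.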
Authors: Ning Wang, Jiajun Deng, Mingbo Jia