The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

Published 26 Mar 2024 in cs.CV and cs.AI (arXiv:2403.17342v1)

Abstract: In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
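The mention-based filtering step described in the abstract can be illustrated with a simplified, LLM-free sketch. The authors query LLaMA with the paper text and the figure's mentions; the toy version below instead keeps only the sentences that explicitly reference the target figure. All names here (`filter_figure_context`, the regex patterns) are illustrative assumptions, not taken from the authors' code.

```python
import re

def filter_figure_context(paper_text: str, figure_id: int) -> str:
    """Keep only the sentences that mention the target figure.

    Toy stand-in for the paper's LLaMA-based filtering: the real system
    prompts an LLM with the figure mentions, while this heuristic simply
    matches 'Figure N' / 'Fig. N' patterns.
    """
    mention = re.compile(rf"\b(?:Figure|Fig\.?)\s*{figure_id}\b", re.IGNORECASE)
    # Naive sentence split on ./!/? followed by whitespace; the negative
    # lookbehind avoids splitting inside abbreviations like "Fig. 2".
    sentences = re.split(r"(?<=[.!?])(?<!Fig\.)\s+", paper_text)
    return " ".join(s for s in sentences if mention.search(s))

text = ("We train on ImageNet. Figure 2 shows the loss curve. "
        "Results appear in Table 1. As shown in Fig. 2, loss decreases.")
print(filter_figure_context(text, 2))
# → Figure 2 shows the loss curve. As shown in Fig. 2, loss decreases.
```

In the actual pipeline this filtered context, rather than the full paper text, is fed to the summarization model, which is what reduces the noise from passages about unrelated figures.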

References (15)
  1. Meta GenAI. Llama 2: Open foundation and fine-tuned chat models, 2023. https://arxiv.org/pdf/2307.09288.pdf.
  2. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022. https://arxiv.org/pdf/2201.12086.pdf.
  3. Yang Liu. Fine-tune BERT for extractive summarization, 2019. https://arxiv.org/pdf/1903.10318.pdf.
  4. BRIO: Bringing order to abstractive summarization, 2022. https://arxiv.org/pdf/2203.16804.
  5. OpenAI. Language models are few-shot learners, 2020. https://arxiv.org/pdf/2005.14165.pdf.
  6. Investigating efficiently extending transformers for long input summarization, 2022. https://arxiv.org/pdf/2208.04347.
  7. Sequence level training with recurrent neural networks, 2015. https://arxiv.org/pdf/1511.06732.
  8. Attention is all you need, 2017. https://arxiv.org/pdf/1706.03762.pdf.
  9. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022. https://arxiv.org/pdf/2202.03052.
  10. Semi-supervised multi-modal multi-instance multi-label deep network with optimal transport. IEEE Transactions on Knowledge and Data Engineering, 33(2):696–709, 2019.
  11. Exploiting cross-modal prediction and relation consistency for semi-supervised image captioning. IEEE Transactions on Cybernetics, 2022.
  12. Semi-supervised multi-modal clustering and classification with incomplete modalities. IEEE Transactions on Knowledge and Data Engineering, 33(2):682–695, 2019.
  13. Rethinking label-wise cross-modal retrieval from a semantic sharing perspective. In IJCAI, pages 3300–3306, 2021.
  14. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, 2019. https://arxiv.org/pdf/1912.08777.
  15. Extractive summarization as text matching, 2020. https://aclanthology.org/2020.acl-main.552.pdf.