MedICaT: A Dataset of Medical Images, Captions, and Textual References (2010.06000v1)

Published 12 Oct 2020 in cs.CV and cs.CL

Abstract: Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at https://github.com/allenai/medicat.

Citations (57)

Summary

  • The paper introduces MedICaT, a large dataset linking medical images with descriptive captions and inline references to enhance figure-text analysis.
  • It introduces the task of subfigure-subcaption alignment in compound figures, with manual annotations and a transformer-based baseline that reaches an F1 score of 0.674.
  • Inline references improve image-text matching (Recall@1 rises from 7.6% to 9.4%), paving the way for advances in medical image captioning, visual question answering, and literature search tools.

MedICaT: A Dataset of Medical Images, Captions, and Textual References

The paper "MedICaT: A Dataset of Medical Images, Captions, and Textual References" addresses a significant challenge in scientific document understanding, specifically in the medical domain, by introducing a comprehensive dataset designed to facilitate the paper of relationships between medical figures and related textual elements within biomedical literature. The authors present MedICaT, a dataset encompassing 217,060 images extracted from 131,410 open-access biomedical papers, providing a resource that links medical figures with descriptive captions, inline references, and detailed subfigure annotations.

Dataset Composition and Features

MedICaT focuses primarily on extracting and annotating compound figures, which constitute a substantial portion of scientific illustrations, particularly in medical publications. Approximately 75% of the figures in the dataset are compound, containing multiple subfigures, each potentially described by its own subcaption. To support work on this structure, the dataset includes manual subfigure-subcaption alignment annotations for 2,069 figures, amounting to 7,507 annotated pairs.

The dataset stands out for its integration of inline references, present for 74% of the figures. These references offer contextual information that is not captured by the figure captions alone, which is critical for medical image analysis and retrieval tasks where understanding the nuanced scientific context is paramount.
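To make the dataset's organization concrete, here is a minimal sketch of how a single MedICaT entry might bundle an image with its caption, inline references, and subfigure annotations. All field names and values below are hypothetical stand-ins for illustration; the actual release format is documented at https://github.com/allenai/medicat.

```python
# Illustrative layout of one MedICaT entry. Every field name and value here
# is a hypothetical stand-in, not the schema of the released files.
record = {
    "paper_id": "PMC0000000",             # hypothetical open-access source paper
    "figure_id": "fig_2",
    "image_path": "figures/PMC0000000_fig_2.png",
    "caption": "Chest CT. (A) Axial view. (B) Coronal view.",
    # Sentences in the paper body that cite the figure (available for 74% of figures).
    "inline_refs": ["As shown in Figure 2A, the lesion is visible in the right lobe."],
    # Present only for the manually annotated subset (2,069 figures).
    "subfigures": [
        {"label": "A", "bbox": [10, 20, 400, 350], "subcaption": "Axial view."},
        {"label": "B", "bbox": [420, 20, 810, 350], "subcaption": "Coronal view."},
    ],
}
```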

Tasks and Baseline Models

One of the core tasks proposed is subfigure-to-subcaption alignment, which is challenging because of substantial variability in how subfigures are referenced and described in the text. The authors frame subcaption extraction as sequence tagging and introduce a baseline built on a pre-trained transformer with a conditional random field (CRF) layer. The baseline achieves an F1 score of 0.674 against an inter-annotator agreement of 0.89, a solid starting point that still leaves considerable room for improvement.
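To illustrate this kind of tagger, the sketch below pairs a pre-trained transformer encoder with a CRF layer over BIO tags marking subcaption spans. The encoder choice (SciBERT), the tag set, and the wiring are assumptions made for the sketch, not the authors' exact configuration.

```python
# A minimal BIO sequence tagger with a CRF head, in the spirit of the
# paper's subcaption-extraction baseline. Model name, tag set, and
# hyperparameters are assumptions, not the authors' configuration.
# Requires: pip install torch transformers pytorch-crf
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF

TAGS = ["O", "B-SUBCAP", "I-SUBCAP"]  # assumed BIO scheme for subcaption spans

class SubcaptionTagger(nn.Module):
    def __init__(self, encoder_name="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.emit = nn.Linear(self.encoder.config.hidden_size, len(TAGS))
        self.crf = CRF(len(TAGS), batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask, reduction="mean")  # training loss
        return self.crf.decode(emissions, mask=mask)  # best tag path per caption

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = SubcaptionTagger()
batch = tokenizer(["(A) Axial CT of the chest. (B) Coronal view."], return_tensors="pt")
paths = model(batch["input_ids"], batch["attention_mask"])  # list of tag-index sequences
```

Decoded B/I spans can then be matched to subfigure labels (e.g., "(A)", "(B)") to produce the alignment pairs evaluated with F1.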

Additionally, the paper explores image-text matching, training models on captions both with and without inline references. The empirical results show that incorporating inline references improves retrieval accuracy, raising Recall@1 from 7.6% to 9.4%.
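Recall@1 here measures how often the correct text is the single top-ranked match for a given image. A minimal sketch of the metric, assuming unit-normalized embeddings where row i of each matrix forms a matched image-text pair (the embedding model itself is out of scope):

```python
# Recall@K for image-to-text retrieval: the fraction of images whose true
# text lands in the top K ranked matches. Illustrative, not the authors'
# evaluation code; assumes embeddings are already L2-normalized.
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int = 1) -> float:
    sims = image_emb @ text_emb.T                       # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]             # top-k text indices per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(100, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(f"Recall@1 = {recall_at_k(img, txt):.3f}")        # ~0.01 for random embeddings
```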

Implications and Future Directions

MedICaT represents a substantial step forward in the context of multimodal scientific data resources. By providing a rich dataset that aligns figures with comprehensive textual information, the work paves the way for advancements in several areas, such as automatic medical image captioning, visual question answering (VQA), and enhanced academic search tools.

The methodology for dataset construction is extensible to other scientific domains, suggesting broader applicability beyond medicine. As tools and models that leverage MedICaT's rich data mature, there is potential for significant impact on clinical decision-support tools that rely on precise figure-text alignment.

Conclusion

The MedICaT dataset fills a notable gap in the availability of resources that closely link medical imagery with pertinent textual data. By advancing tasks such as subfigure-subcaption alignment and image-text matching, it enables more nuanced research into vision-language interactions in scientific discourse. The baseline models presented, while effective, indicate opportunities for further model development to better exploit the dataset's full potential. This work represents a crucial step towards more integrative and contextually aware analyses of medical scientific literature, with implications that hold promise for both theoretical advancements and practical applications in artificial intelligence and medical informatics.