Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation (2403.02707v1)

Published 5 Mar 2024 in cs.CV and cs.MM

Abstract: Leveraging pre-trained vision-language models has become a widely adopted approach for improving performance in downstream visual question answering (VQA) applications. However, in the specialized field of medical VQA, the scarcity of available data poses a significant barrier to achieving reliable model generalization. Numerous methods have been proposed to enhance model generalization, addressing the issue from data-centric and model-centric perspectives. Data augmentation techniques are commonly employed to enrich the dataset, while various regularization approaches aim to prevent model overfitting, especially when training on limited data samples. In this paper, we introduce a method that applies gradient-guided parameter perturbations to the visual encoder of the multimodal model during both the pre-training and fine-tuning phases, to improve model generalization for downstream medical VQA tasks. The small perturbation is adaptively generated by aligning with the direction of the moving-average gradient in the optimization landscape, which is opposite to the direction of the optimizer's historical updates. It is then injected into the model's visual encoder. The results show that, even with a significantly smaller pre-training image-caption dataset, our approach achieves competitive outcomes on both the VQA-RAD and SLAKE datasets.
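The perturbation rule described in the abstract can be sketched as follows. This is a minimal plain-Python illustration, not the paper's implementation: the hyperparameter names (`beta`, `alpha`, `eps`) and the exact normalization are assumptions for the sake of the example.

```python
def ema_update(ema, grads, beta=0.9):
    """Exponential moving average of historical gradients (assumed decay beta)."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(ema, grads)]

def perturb(params, ema, alpha=1e-3, eps=1e-12):
    """Shift parameters a small step along the moving-average gradient direction.

    A first-order optimizer moves parameters *against* its averaged gradient,
    so adding +alpha * m / ||m|| nudges the weights opposite to the optimizer's
    historical updates, matching the abstract's description. The perturbed
    weights would be injected into the visual encoder for the forward pass.
    """
    norm = sum(m * m for m in ema) ** 0.5 + eps
    return [p + alpha * (m / norm) for p, m in zip(params, ema)]

# Toy walk-through on a two-parameter "visual encoder":
params = [0.5, -0.3]
ema = [0.0, 0.0]
ema = ema_update(ema, grads=[0.2, -0.1])   # -> [0.02, -0.01]
perturbed = perturb(params, ema)
```

In a real training loop the perturbation would be recomputed each step from the optimizer's gradient statistics, applied before the forward pass, and removed (or regenerated) before the actual weight update.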
