LT-ViT: A Vision Transformer for multi-label Chest X-ray classification

Published 13 Nov 2023 in cs.CV and cs.LG | arXiv:2311.07263v1

Abstract: Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed toward vision-language training for Chest X-rays (CXRs). However, we posit that there is still room for improvement in vision-only training of ViTs for CXRs by aggregating information from multiple scales, which has proven beneficial for non-transformer networks. Hence, we develop LT-ViT, a transformer that uses combined attention between image tokens and randomly initialized auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT (1) surpasses state-of-the-art performance among pure ViTs on two publicly available CXR datasets, (2) generalizes to other pre-training methods and is therefore agnostic to model initialization, and (3) enables model interpretability without Grad-CAM and its variants.
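The abstract's core idea, learnable label tokens attending jointly with image patch tokens, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name LabelTokenViT, the dimensions, the single-stage token injection, and the per-label linear head are all assumptions for illustration; the paper's actual architecture (including its multi-scale aggregation) may differ.

```python
# Minimal sketch of the label-token idea, assuming PyTorch.
# Names, dimensions, and injection point are hypothetical, not the paper's.
import torch
import torch.nn as nn

class LabelTokenViT(nn.Module):
    """ViT-style encoder whose input sequence carries learnable label
    tokens alongside image patch tokens, so self-attention mixes both."""

    def __init__(self, num_labels: int, dim: int = 768, depth: int = 12,
                 num_heads: int = 12, num_patches: int = 196):
        super().__init__()
        # Randomly initialized auxiliary tokens, one per label.
        self.label_tokens = nn.Parameter(torch.randn(1, num_labels, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # One binary logit per label token for multi-label classification.
        self.head = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, num_patches, dim), e.g. from a patch-embedding stem.
        b = patch_tokens.size(0)
        x = patch_tokens + self.pos_embed
        labels = self.label_tokens.expand(b, -1, -1)
        # Combined attention: label tokens and image tokens attend jointly.
        x = self.encoder(torch.cat([labels, x], dim=1))
        label_states = x[:, : labels.size(1)]        # (B, num_labels, dim)
        return self.head(label_states).squeeze(-1)   # (B, num_labels) logits

logits = LabelTokenViT(num_labels=14)(torch.randn(2, 196, 768))
assert logits.shape == (2, 14)
```

Under this reading, claim (3) follows naturally: the attention weights from each label token to the patch tokens can be visualized directly as a per-label saliency map, with no need for Grad-CAM or its variants.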
