Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset (2404.05183v1)
Abstract: Traditional defect classification approaches face two barriers. (1) Insufficient training data and unstable data quality: collecting enough defective samples is expensive and time-consuming, which leads to variance across the dataset and makes recognition and learning difficult. (2) Over-dependence on the visual modality: when image patterns and textures are monotonous across all defect classes in a dataset, the performance of a conventional AOI system cannot be guaranteed. Likewise, when image quality is compromised by mechanical failures, or when defect information is inherently difficult to discern, the performance of deep models cannot be guaranteed. The main question is how to solve these two problems when they occur at the same time. A feasible strategy is to explore additional features within the dataset and to combine a strong vision-language model (VLM) with a large language model (LLM), exploiting their zero-shot capability. In this work, we first propose the ASE dataset for defect classification, which includes rich data descriptions recorded on the images but whose defect features are difficult to learn directly. Second, we present prompting for VLM-LLM on the proposed ASE dataset to activate extra-modality features from images and enhance defect classification performance. Third, we design a novel progressive feature alignment (PFA) block that refines image-text features to alleviate the difficulty of alignment in the few-shot scenario. Finally, the proposed cross-modality attention fusion (CMAF) module effectively fuses features from the different modalities. Experimental results demonstrate the effectiveness of our method over several defect classification methods on the ASE dataset.
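The abstract names a cross-modality attention fusion (CMAF) module but does not specify its internals. As a rough illustration of the general idea of fusing image and text features with attention, the following is a minimal sketch of generic bidirectional scaled dot-product cross-attention in NumPy. It is not the authors' CMAF design. All function names, dimensions, the residual connection, the mean pooling, and the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Tokens of one modality (queries) attend to tokens of the other (context).
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def init_params(dim, rng):
    # Hypothetical random projections; a trained model would learn these.
    def make():
        return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    return {"img_from_txt": make(), "txt_from_img": make()}

def fuse(image_tokens, text_tokens, params):
    # Image tokens query the text tokens, and vice versa.
    img_att, _ = cross_attention(image_tokens, text_tokens, *params["img_from_txt"])
    txt_att, _ = cross_attention(text_tokens, image_tokens, *params["txt_from_img"])
    # Residual connection, then mean-pool each modality (an assumed design choice).
    img_vec = (image_tokens + img_att).mean(axis=0)
    txt_vec = (text_tokens + txt_att).mean(axis=0)
    # Concatenate into one fused feature for a downstream classifier.
    return np.concatenate([img_vec, txt_vec])

rng = np.random.default_rng(0)
dim = 16
image_tokens = rng.normal(size=(9, dim))  # e.g. 9 image patch features (assumed)
text_tokens = rng.normal(size=(5, dim))   # e.g. 5 text token features (assumed)
params = init_params(dim, rng)
fused = fuse(image_tokens, text_tokens, params)
print(fused.shape)  # (32,)
```

In this sketch the fused vector has twice the token dimension because the two pooled modality vectors are concatenated. Other fusion choices, such as summation or gating, would fit the same interface.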