- The paper introduces FaultGPT, a model that leverages vision-language models to fuse vibration time-frequency images and text for industrial fault diagnosis.
- The methodology integrates a visual encoder based on CLIP, a multi-scale cross-modal image decoder (MCID), and a prompt learner to enhance diagnostic accuracy.
- Experimental evaluations using few-shot and zero-shot tests on benchmark datasets demonstrate robust performance and practical industrial applicability.
FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models
Introduction
The paper presents FaultGPT, a model that uses large vision-language models (LVLMs) to automate industrial fault diagnosis through question answering. Traditional methods rely on classification confidence scores and unimodal data sources; FaultGPT instead integrates multimodal data for deeper semantic understanding. It is trained on a large-scale instruction dataset of vibration time-frequency image-text label pairs and human instruction-ground truth pairs, which improves fault diagnosis in complex mechanical systems.
Figure 1: Inference process of FaultGPT compared to traditional fault diagnosis methods.
Methodology
FaultGPT is designed with a visual encoder, a multi-scale cross-modal image decoder (MCID), and a prompt learner. The visual encoder employs a pre-trained CLIP model with adapter modules for efficient multimodal fusion, projecting vibration signal features onto a semantic embedding space compatible with LLMs. MCID extracts fine-grained fault semantics, capturing localized fault information by leveraging cross-attention mechanisms on visual inputs. The prompt learner aligns extracted visual features with language generation prompts, enhancing the accuracy of fault diagnosis reports.
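The cross-attention step that MCID relies on can be illustrated with a minimal sketch. This is plain scaled dot-product attention, not the paper's implementation: the choice of text embeddings as queries and image patch features as keys and values is an assumption, the learned projection matrices are omitted, and the name `cross_attention` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    queries: n_q vectors of size d (assumed here: text/prompt embeddings)
    keys:    n_k vectors of size d (assumed here: image patch features)
    values:  n_k vectors of any fixed size
    Returns (outputs, weights); weights[i][j] says how strongly query i
    attends to patch j, which is what a feature map would visualize.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy usage: one query, two "patches"; the query matches the first patch.
outs, attn = cross_attention([[1.0, 0.0]],
                             [[10.0, 0.0], [0.0, 10.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case nearly all attention weight lands on the first patch, which mirrors the idea of localizing the image region that carries the fault information.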
Figure 2: The overall training framework of the proposed FaultGPT. ①: Visual encoder, ②: MCID, ③: Prompt learner.
FDQA Instruction-Following Dataset
FaultGPT's instruction dataset was compiled from three major bearing fault datasets: CWRU, SCUT-FD, and Ottawa. This comprehensive dataset includes descriptions of time-frequency images, specifying fault types and characteristics, facilitating accurate fault detection. The instruction-following format enables LLMs to process multimodal inputs and generate relevant responses, optimizing fault diagnosis workflows.
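As a rough illustration of what one instruction-following record might look like, the sketch below pairs a time-frequency image with a human instruction and a ground-truth response. The field names, the `FDQASample` class, and the JSON Lines serialization are all assumptions made for illustration; the summary does not specify the actual schema, and the example values are placeholders rather than real dataset entries.

```python
import json
from dataclasses import dataclass, asdict

# The three source datasets named in the text.
KNOWN_SOURCES = {"CWRU", "SCUT-FD", "Ottawa"}


@dataclass
class FDQASample:
    """One hypothetical instruction-following record."""
    image_path: str   # path to a vibration time-frequency image (placeholder)
    instruction: str  # human question about the image
    response: str     # ground-truth diagnostic answer
    source: str       # which bearing dataset the signal came from

    def __post_init__(self):
        if self.source not in KNOWN_SOURCES:
            raise ValueError(f"unknown source dataset: {self.source!r}")


def to_jsonl_line(sample):
    """Serialize a sample as one JSON Lines record."""
    return json.dumps(asdict(sample), ensure_ascii=False)


# Placeholder values only; not taken from the actual dataset.
example = FDQASample(
    image_path="images/example_0001.png",
    instruction="What fault, if any, does this time-frequency image show?",
    response="<ground-truth diagnostic report goes here>",
    source="CWRU",
)
line = to_jsonl_line(example)
```

Keeping the source dataset on each record makes it straightforward to hold out one dataset entirely when building zero-shot splits.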
Figure 3: Example of fault diagnosis instruction data.
Experimental Evaluation
FaultGPT was evaluated against several open-source LVLMs across datasets and produced better fault diagnosis reports. Few-shot and zero-shot evaluations tested the model's robustness and adaptability to unseen scenarios, and ablation studies confirmed that core components such as MCID and the prompt learner contribute to performance.
Ablation Studies
The ablation study confirmed the significance of instruction tuning and the effectiveness of various loss functions (cross-entropy, focal, and dice) in training. The choice of wavelet basis in time-frequency transformations was assessed, demonstrating robust performance across different bases, with Morlet selected for primary experiments due to its stability.
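For reference, the three loss functions named above have standard textbook forms, sketched here for the binary case. The binary, per-element formulation, the default `gamma` and `smooth` values, and the function names are assumptions; the summary does not say how the paper applies or weights these losses.

```python
import math


def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma.

    p_t is the probability assigned to the true class, so well-classified
    examples (p_t near 1) are down-weighted. gamma=0 recovers cross-entropy.
    """
    p = min(max(p, eps), 1 - eps)
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


def dice_loss(probs, labels, smooth=1.0):
    """Soft Dice loss over a flattened mask: 1 - (2|P∩G| + s) / (|P| + |G| + s)."""
    inter = sum(p * y for p, y in zip(probs, labels))
    total = sum(probs) + sum(labels)
    return 1 - (2 * inter + smooth) / (total + smooth)


# Toy usage: a confident correct prediction is penalized less by focal loss.
ce = cross_entropy(0.9, 1)
fl = focal_loss(0.9, 1)
dl = dice_loss([1.0, 0.0], [1, 0])
```

Cross-entropy treats every element equally, focal loss concentrates on hard examples, and Dice loss measures region overlap, which is why the three are often compared or combined when the target region is small relative to the image.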
Figure 5: Ablation Study of Instruction Tuning on CWRU Dataset.
User Interface Design
The FaultGPT system is equipped with a user-friendly interface enabling real-time interaction and fault diagnosis for non-expert users, showcasing its practical application in industry. Users can input time-frequency images and receive detailed diagnostic reports based on the model's analysis.
Figure 6: System demo showcasing an outer ring 2 mm crack fault. The user interface is divided into four main sections: ① input area, ② user instruction area, ③ report generation area, and ④ MCID feature maps.
Conclusion
FaultGPT applies LVLMs to industrial fault diagnosis, producing detailed fault assessments that go beyond the class labels and confidence scores of conventional methods. Future research will extend it to compound fault diagnosis and to other industrial tasks, such as remaining-useful-life prediction.
Figure 7: Mean Loss and Mean Token Accuracy for the training process.