CoTSRF: Implementing Stealthy and Robust Fingerprinting for LLMs
The paper "CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of LLMs" presents a novel methodology designed for fingerprinting LLMs through the utilization of the Chain of Thought (CoT). The research addresses inherent vulnerabilities in LLMs, particularly when they are used in malicious or unethical applications. It proposes a fingerprinting method that is both stealthy and robust, significantly improving upon the limitations of previous approaches.
Methodology
The authors introduce CoTSRF, a framework that leverages CoT for fingerprinting. The key insight is to treat an LLM's logical reasoning pattern as its fingerprint. CoTSRF operates in three steps:
- Response Collection: This module gathers responses to crafted CoT queries from both the source LLM and benign LLMs. A High-Temperature Data Augmentation (HTDA) strategy prompts the source LLM to produce responses that differ in surface form yet remain logically consistent. The diverse positive responses from the source LLM, together with distinct negative responses from benign LLMs, form the training data for contrastive learning (a sketch of this sampling step follows the list).
- CoT Feature Extraction: A contrastive learning framework trains a CoT extractor so that the extracted features separate responses of the source LLM from those of benign LLMs. A triplet margin loss drives this separation (see the second sketch below).
- Fingerprint Verification: The CoT features of the source and suspect LLMs are compared. By computing the Kullback-Leibler divergence between them and testing it against an empirical threshold, CoTSRF decides whether the suspect LLM infringes on the source model (see the third sketch below).
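The response-collection step can be pictured with a short sampling loop. The sketch below uses a Hugging Face causal LM as a stand-in for the source model; the model name, query text, and sampling parameters are illustrative assumptions, not the paper's settings. The HTDA idea reduces to sampling the same CoT query several times at a high temperature:

```python
# Minimal sketch of HTDA-style response collection, assuming a Hugging Face
# causal LM as the source model. Sampling one CoT query repeatedly at a high
# temperature yields varied but (ideally) logically consistent reasoning traces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper targets larger source/benign LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def collect_responses(query: str, n_samples: int = 8, temperature: float = 1.3):
    """Sample n_samples responses to a single CoT query at high temperature."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # stochastic decoding enables diversity
        temperature=temperature,   # high temperature = HTDA-style augmentation
        max_new_tokens=128,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
            for o in outputs]

cot_query = ("Q: A train travels 60 km in 1.5 hours. What is its average "
             "speed? Think step by step.")
positives = collect_responses(cot_query)  # diverse positives from the source LLM
```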
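For the feature-extraction step, a minimal PyTorch sketch of one training step with a triplet margin loss is shown below. It assumes responses have already been embedded into fixed-size vectors (random tensors stand in here); the extractor architecture, dimensions, and margin are assumptions for illustration, not the paper's configuration:

```python
# Minimal sketch of training a CoT extractor with a triplet margin loss.
# Anchor and positive come from the source LLM (two HTDA samples of the same
# query); the negative comes from a benign LLM.
import torch
import torch.nn as nn

class CoTExtractor(nn.Module):
    """Maps a response embedding to a compact CoT feature vector."""
    def __init__(self, in_dim: int = 768, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )

    def forward(self, x):
        return self.net(x)

extractor = CoTExtractor()
criterion = nn.TripletMarginLoss(margin=1.0)  # pulls positives together, pushes negatives apart
optimizer = torch.optim.Adam(extractor.parameters(), lr=1e-4)

# Placeholder batch: 16 anchor/positive pairs from the source LLM,
# 16 negatives from benign LLMs (random tensors stand in for real embeddings).
anchor, positive, negative = (torch.randn(16, 768) for _ in range(3))

optimizer.zero_grad()
loss = criterion(extractor(anchor), extractor(positive), extractor(negative))
loss.backward()
optimizer.step()
```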
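Finally, the verification step reduces to a divergence test. The sketch below assumes CoT feature vectors are normalized into distributions with a softmax before computing the Kullback-Leibler divergence; the threshold value is a placeholder for the empirically chosen one:

```python
# Minimal sketch of fingerprint verification via KL divergence between
# softmax-normalized CoT feature vectors; threshold is a hypothetical value.
import torch
import torch.nn.functional as F

def kl_divergence(p_feat: torch.Tensor, q_feat: torch.Tensor) -> float:
    """KL(P || Q) between softmax-normalized CoT feature vectors."""
    p = F.softmax(p_feat, dim=-1)
    log_q = F.log_softmax(q_feat, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean").item()

THRESHOLD = 0.05  # hypothetical; the paper sets this empirically

def is_infringing(source_feat: torch.Tensor, suspect_feat: torch.Tensor) -> bool:
    """Small divergence means the suspect reasons like the source LLM."""
    return kl_divergence(source_feat, suspect_feat) < THRESHOLD

source_feat = torch.randn(16, 128)                          # source CoT features
suspect_feat = source_feat + 0.01 * torch.randn(16, 128)    # near-identical suspect
print(is_infringing(source_feat, suspect_feat))             # expected: True
```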
Experimental Findings
The paper reports comprehensive experiments that underscore the advantages of CoTSRF, highlighting its effectiveness, reliability, stealthiness, and robustness:
- Effectiveness: CoTSRF achieves a 100% True Positive Rate (TPR) across varying configurations, outperforming existing approaches such as TRAP, whose detection rate degrades when conditions change.
- Reliability: The method maintains a 0% False Positive Rate (FPR) on both training and unseen benign LLMs, demonstrating strong generalization.
- Stealthiness: CoTSRF queries exhibit lower perplexity than those of existing methods, meaning they read as natural text and are therefore less likely to be detected and filtered by malicious users (a perplexity-measurement sketch follows the list).
- Robustness: Under simulated output-perturbation attacks, CoTSRF consistently yields high TPR values, demonstrating resilience to modifications such as fine-tuning and temperature adjustments.
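Perplexity, the metric behind the stealthiness claim, is straightforward to measure. The sketch below scores a query with a reference LM (gpt2 here, as an illustrative assumption): lower perplexity means the fingerprint query looks like ordinary text and is harder to filter out:

```python
# Minimal sketch of measuring query perplexity with a reference LM,
# in the spirit of the paper's stealthiness comparison. The reference
# model choice (gpt2) is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Exponentiated mean token cross-entropy under the reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

query = ("A train travels 60 km in 1.5 hours. What is its average speed? "
         "Think step by step.")
print(f"query perplexity: {perplexity(query):.1f}")
```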
Implications and Future Work
The practical implications of CoTSRF are significant. By offering a stealthy and robust approach to LLM fingerprinting, the framework provides LLM providers with a reliable means of safeguarding their models against misuse. Theoretically, it contributes to the understanding of LLM behavior and architecture, offering insights into model reasoning capabilities as characterized by CoT.
Future developments could include validating the approach on larger and more diverse LLM architectures, tuning its parameters as model paradigms evolve, and exploring reinforcement learning to strengthen fingerprint verification.
In conclusion, this paper presents a sophisticated approach to LLM fingerprinting, promising both academic and practical advancements in securing AI models against unauthorized usage.