Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models (2506.06180v1)

Published 6 Jun 2025 in cs.CL

Abstract: We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source, small LLM (LM). In the prompt, we provide carefully-designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing or new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. In the experiment, our results show that the Llama3-8B model, fine-tuned with a dataset that includes a prompt with VP evaluation criteria, yields the best performance among small LMs and is comparable to that of a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.

Summary

  • The paper demonstrates that fine-tuning small language models with VP-specific prompts significantly boosts detection performance, nearing that of GPT-4-based systems.
  • It leverages parameter-efficient methods like LoRA and 8-bit quantization on the Llama3-8B model to overcome data scarcity and privacy concerns.
  • The study highlights that integrating domain-specific evaluation criteria and expert inputs can create secure, cost-effective voice phishing prevention solutions.

Analyzing Fine-Tuned Small LLMs for Voice Phishing Detection

The paper "Detecting Voice Phishing with Precision: Fine-Tuning Small LLMs" by Ju Yong Sim and Seong Hwan Kim, explores using open-source, small LLMs (SLMs) for detecting voice phishing (VP). The research aims to address the increasing global problem of VP scams through improved LLM techniques. The authors opt for an SLM approach to mitigate the high operational costs and privacy concerns associated with proprietary LLMs like OpenAI's GPT-3.5.

The primary focus of the research is fine-tuning the Llama3-8B model, a representative SLM, to function as a VP detector. The authors design prompts that integrate domain-specific VP evaluation criteria and, in some variants, Chain-of-Thought (CoT) reasoning instructions. They contend that incorporating human expert knowledge into the prompt improves SLM outcomes more than applying the CoT technique alone. The fine-tuned models are then evaluated on an adversarial test dataset comprising both actual and artificially generated VP and non-VP transcripts.

The reported results indicate that the Llama3-8B model, when fine-tuned with prompts containing the VP evaluation criteria, performs comparably to a GPT-4-based VP detector. Notably, models whose prompts included the VP criteria outperformed those using only CoT reasoning, as well as the hybrid variant combining both the criteria and CoT; a rough illustration of these prompt variants follows.
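
The sketch below shows one way the compared prompt variants could be assembled. The criteria wording, instruction text, and function names are illustrative assumptions, not the authors' actual prompt.

```python
# Illustrative sketch only: assembling the prompt variants compared in the paper.
# The criteria text below is a hypothetical stand-in, not the authors' wording.

VP_CRITERIA = (
    "1. Does the caller impersonate a bank, prosecutor, or government agency?\n"
    "2. Does the caller request transfers, card numbers, or one-time codes?\n"
    "3. Does the caller create urgency or threaten legal consequences?\n"
)

COT_INSTRUCTION = "Think step by step before giving your final answer."

def build_prompt(transcript: str, use_criteria: bool, use_cot: bool) -> str:
    """Assemble a VP-detection prompt for one call transcript."""
    parts = ["You are a voice phishing detector."]
    if use_criteria:
        parts.append("Evaluate the transcript against these criteria:\n" + VP_CRITERIA)
    if use_cot:
        parts.append(COT_INSTRUCTION)
    parts.append("Transcript:\n" + transcript)
    parts.append("Answer 'phishing' or 'normal'.")
    return "\n\n".join(parts)

# Criteria-only variant, the configuration reported to perform best.
print(build_prompt("Hello, this is your bank's security team...",
                   use_criteria=True, use_cot=False))
```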

Experimental Approach

  1. Data Acquisition and Pre-processing: Data scarcity remains a primary hurdle in VP detection. The authors address this challenge by sourcing public VP transcripts and generating artificial ones. The inclusion of an adversarial dataset, which encompasses everyday conversations potentially misclassified as VP, provides a rigorous evaluation benchmark.
  2. Fine-Tuning and Testing: The Llama3-8B model undergoes parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA) and 8-bit quantization. The authors experiment with prompt designs that include or exclude the VP evaluation criteria and CoT reasoning (a fine-tuning sketch follows this list).
  3. Accuracy and Performance Evaluation: Model performance is assessed on both standard and adversarial datasets. The analysis underscores that fine-tuning with VP-specific prompts significantly boosts the SLM's efficacy, achieving results comparable to the far larger GPT-4-based detector, whereas models relying solely on CoT techniques without the VP criteria performed worse (see the evaluation sketch below).
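
As a concrete but hedged illustration of step 2, the following sketch outlines LoRA fine-tuning of Llama3-8B loaded in 8-bit precision with Hugging Face transformers, peft, and bitsandbytes. The dataset file (vp_train.jsonl), hyperparameters, and target modules are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' code): parameter-efficient fine-tuning of
# Llama3-8B with 8-bit weights and LoRA adapters.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 8-bit precision to fit on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections (illustrative choices).
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Hypothetical JSONL of {"text": prompt + label} training examples.
dataset = load_dataset("json", data_files="vp_train.jsonl", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama3-vp-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, fp16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```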
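
For step 3, a minimal scoring sketch along these lines could report accuracy and F1 separately for the standard and adversarial test splits; the label lists here are placeholders, not the paper's data.

```python
# Hedged sketch of the evaluation step: score the detector's predictions
# separately on the standard and adversarial splits (1 = phishing, 0 = normal).
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def report(split_name, gold, pred):
    """Print accuracy, F1, and the confusion matrix for one test split."""
    print(f"{split_name}: accuracy={accuracy_score(gold, pred):.3f}, "
          f"F1={f1_score(gold, pred):.3f}")
    print(confusion_matrix(gold, pred))  # rows: true class, cols: predicted class

# Placeholder predictions from the fine-tuned detector on each split.
standard_gold, standard_pred = [1, 1, 0, 0], [1, 1, 0, 0]
adversarial_gold, adversarial_pred = [0, 0, 1, 0], [0, 1, 1, 0]  # benign calls that resemble VP

report("standard test set", standard_gold, standard_pred)
report("adversarial test set", adversarial_gold, adversarial_pred)
```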

Implications and Future Direction

This research offers insights with practical implications. By demonstrating that fine-tuning with well-designed, domain-specific prompts can maximize SLMs' capabilities, the paper advances language-model applications in security-sensitive settings such as VP detection. Deploying lightweight models also reduces the risk of private data leaking to third-party services and cuts API-related expenditure.

For future work, the paper suggests that continued effort on SLM capacity limitations and prompt optimization strategies could further narrow the gap between SLM and LLM performance. Such progress could also enable integration of these models into mobile platforms, addressing the wider need for scalable, on-device AI solutions.

In essence, this research supports the broader AI community in using accessible, open-source models to build effective, sustainable, and secure solutions to real-world problems such as voice phishing.
