- The paper introduces a specialized cybersecurity LLM that leverages continued pretraining on 5.1 billion tokens to overcome the limitations of general LLMs in cyber threat intelligence.
- It employs a custom data pipeline with advanced filtering, deduplication, and a transformer-based relevancy classifier to curate over 4 TiB of high-quality cybersecurity content.
- The model outperforms larger LLMs on CTI benchmarks and demonstrates practical applications in SOC automation, proactive threat defense, and secure engineering enablement.
This technical report introduces Foundation-Sec-8B, an LLM specifically tailored for cybersecurity applications. Built upon the Llama 3.1-8B architecture, the model is enhanced through continued pretraining on a large, carefully curated cybersecurity corpus. The paper addresses the limitations of general-purpose LLMs in cybersecurity, such as restrictive safety features, scarcity of high-quality specialized data, hallucinations, and distribution shifts.
To develop Foundation-Sec-8B, the authors collected a vast dataset using a two-pronged approach: a wide-net internet crawler with a relevancy filter, and custom scrapers targeting known high-quality cybersecurity sources. This process yielded over 4 TiB of raw data, which was processed through a scalable pipeline covering text extraction from various formats, language filtering, and, importantly, a custom-trained transformer-based relevancy classifier to identify cybersecurity-specific content, addressing a limitation of general web datasets like FineWeb, whose quality filters can discard cybersecurity terms. Additional filtering removed low-quality data. The processed data then underwent preparation steps including deduplication with an n-gram Bloom filter, PII replacement, and upsampling of high-quality Tactics, Techniques, and Procedures (TTP) data. The final training dataset comprised approximately 5.1 billion tokens.
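The n-gram Bloom-filter deduplication step can be sketched as follows. This is a hedged illustration of the general technique, not the report's implementation: the n-gram size, filter size, hash count, and overlap threshold below are all illustrative assumptions.

```python
# Sketch of n-gram Bloom-filter near-duplicate detection.
# All parameters (n-gram size, filter size, threshold) are assumptions,
# not values from the report.
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(text: str, n: int = 5) -> list[str]:
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def is_near_duplicate(doc: str, seen: BloomFilter,
                      n: int = 5, threshold: float = 0.8) -> bool:
    """Flag a document if most of its n-grams were already seen,
    then register its n-grams for future checks."""
    grams = ngrams(doc, n)
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in seen) / len(grams)
    for g in grams:
        seen.add(g)
    return overlap >= threshold
```

A document whose word 5-grams mostly match previously seen text is dropped; Bloom filters keep memory bounded at the cost of a small false-positive rate.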
Foundation-Sec-8B was trained using continued pretraining on this dataset. Training utilized DeepSpeed on a multi-node cluster, employing the AdamW optimizer with a cosine-decay learning-rate schedule. Sequences were packed to a length of 4096 tokens for efficiency.
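The two pieces of training plumbing named above, fixed-length sequence packing and a cosine-decay schedule, can be sketched minimally. This is a hedged illustration of common conventions (EOS separator, dropping the trailing partial block, linear warmup), none of which the report specifies.

```python
# Minimal sketches of sequence packing and a cosine-decay LR schedule.
# The EOS id, warmup handling, and partial-block policy are assumptions.
import math
from typing import Iterable, Iterator


def pack_sequences(docs: Iterable[list[int]], block_size: int = 4096,
                   eos_id: int = 2) -> Iterator[list[int]]:
    """Concatenate tokenized documents (separated by EOS) and emit
    fixed-size blocks; a trailing partial block is dropped."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine-decay learning rate with optional linear warmup."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Packing avoids wasting compute on padding tokens: every position in every 4096-token block is a real training token.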
The model's performance was evaluated on several cybersecurity benchmarks: CTIBench (MCQA and Root Cause Mapping), CyberMetric-500, and SecBench (English section). The evaluation methodology differed for pretrained models (5-shot prompting to infer task and format) and instruction-finetuned models (zero-shot with specific instructions, using regex for answer extraction), acknowledging that base models don't strictly follow instructions. The vLLM framework was used for efficient inference during evaluation. Baselines included Llama 3.1 variants and other cybersecurity-focused LLMs like SecurityLLM and Primus.
Results showed significant improvements over the base Llama 3.1-8B model on cybersecurity tasks. Foundation-Sec-8B achieved competitive performance with much larger models like Llama 3.1-70B and GPT-4o-mini, particularly on CTI-specific tasks like CTIBench-RCM, where it outperformed GPT-4o-mini and matched Llama 3.1-70B and WhiteRabbitNeo-V2-70B. While there was a small drop in general knowledge as measured by the MMLU benchmark (2.4 points compared to Llama 3.1-8B), this was consistent with observations in prior domain adaptation work, indicating that specialization didn't lead to severe catastrophic forgetting.
The report highlights practical, real-world use cases where Foundation-Sec-8B is being applied or piloted:
- SOC Acceleration: Automating tasks like summarizing alerts, generating incident timelines, and drafting analyst reports to improve triage efficiency.
- Proactive Threat Defense: Modeling attacker behavior by extracting TTPs from threat intelligence, prioritizing vulnerabilities, generating attack path hypotheses, and drafting penetration test reports. A specific example notes over 10% improvement in MITRE ATT&CK Technique extraction after fine-tuning Foundation-Sec-8B compared to a non-security model.
- Engineering Enablement: Assisting security and platform teams with secure development guidance, configuration validation against best practices, compliance assessment, and policy analysis.
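The TTP-extraction use case above has a trivial baseline worth contrasting with the fine-tuned model: a regex pass for explicit MITRE ATT&CK technique IDs. This is a hedged illustration only; the report's >10% improvement comes from a fine-tuned Foundation-Sec-8B extracting techniques from free text, which a regex over literal IDs cannot do.

```python
# Naive baseline: pull explicit MITRE ATT&CK technique IDs
# (e.g. T1566 or T1059.001) out of threat-intel text.
# The fine-tuned model in the report handles free-text descriptions;
# this sketch only catches IDs written out verbatim.
import re

ATTACK_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_technique_ids(report_text: str) -> list[str]:
    """Return deduplicated ATT&CK technique IDs in order of first appearance."""
    seen: list[str] = []
    for match in ATTACK_ID.findall(report_text):
        if match not in seen:
            seen.append(match)
    return seen
```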
The authors conclude that Foundation-Sec-8B serves as a strong foundation for AI-driven tools in cybersecurity, demonstrating that targeted domain-aware training can enable smaller models to rival larger general-purpose ones on specialized tasks. They plan to release the model publicly to foster further research and adoption in the community and suggest future work includes scaling up the model and training data, extending capabilities to cybersecurity coding tasks, and integrating it into agentic systems.