Fine-tuned LLaMA3.2-3B Models
- Fine-tuned LLaMA3.2-3B is a dense Transformer model optimized with LoRA/QLoRA and quantization techniques for efficient domain adaptation.
- It incorporates advanced features like SwiGLU, rotary positional encodings, and grouped query attention to support long-context processing and multilingual specialization.
- The model achieves robust results in tasks such as AMR parsing, legal and medical adaptations, and vision-language integration while reducing computational resource demands.
Fine-tuned LLaMA3.2-3B refers to a range of approaches and results involving the adaptation of the 3B-parameter LLaMA 3.2 Transformer-based LLM to specific domains and tasks through techniques such as supervised fine-tuning, parameter-efficient adaptation, quantization, and retrieval-augmented pipelines. The following sections synthesize the technical foundations, methodology, and empirical performance of fine-tuned LLaMA3.2-3B variants, highlighting quantization and efficiency, multilingual and specialized domain adaptation, benchmarking, and practical considerations for deployment.
1. Core Architecture and Training Features
LLaMA3.2-3B is a dense Transformer model based on the Llama 3 series, with standard self-attention, SwiGLU activations, and Grouped Query Attention (GQA) to optimize inference efficiency and key-value cache size. The model features a 128K-token vocabulary (tiktoken with an enlarged lexicon for non-English coverage), rotary positional encodings (RoPE, base freq=500,000), and a context window of up to 128K tokens following continued pre-training (Grattafiori et al., 31 Jul 2024). The architecture is designed for computational stability with minimal deviation from Llama 2 at the parameterization level, enabling robust transfer and compatibility with state-of-the-art fine-tuning and quantization strategies.
| Component | Detail |
|---|---|
| Base Model | Dense Transformer, SwiGLU, RoPE, GQA w/ 8 KV heads |
| Params / vocab / context | ~3B parameters, 128K-token vocabulary, 128K-token context window even at 3B scale |
| Training Data Mix | General (50%), Math/Reasoning (25%), Programming (17%), Multilingual (8%) |
| Fine-tuning Recipe | SFT + DPO, Reward Modeling for post-training |
| Quantization | 4- and 8-bit quantization, with LoRA/QLoRA for efficiency |
This model architecture allows for effective scaling of both parameter size and input context, providing a robust substrate for fine-tuning across various domains and methods.
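As an illustration of the rotary-encoding detail above, the following minimal sketch (an illustrative re-derivation, not the reference implementation) computes RoPE rotation factors with the Llama 3 base frequency of 500,000; the head dimension and sequence length are assumed values:

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 500_000.0) -> torch.Tensor:
    """Return complex rotation factors e^{i*theta} applied pairwise to query/key channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)            # shape: (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)  # unit-magnitude complex rotations

# Assumed head dimension of 128; the larger base slows the rotation rate,
# which helps keep position signals distinguishable over long contexts.
freqs = rope_frequencies(head_dim=128, max_seq_len=4096)
print(freqs.shape)  # torch.Size([4096, 64])
```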
2. Parameter-Efficient Fine-Tuning and Quantization
Parameter-efficient fine-tuning (PEFT) is widely adopted in practice for LLaMA3.2-3B, primarily to mitigate compute and storage bottlenecks. The leading approaches are Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA).
LoRA/QLoRA mechanism (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025, Hou et al., 20 Aug 2024):
- LoRA inserts trainable low-rank adapter matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ (rank $r \ll \min(d, k)$) alongside the frozen attention weights $W_0 \in \mathbb{R}^{d \times k}$, so the adapted weight is $W = W_0 + BA$.
- QLoRA stores in 4-bit quantized form (NF4 or similar), further reducing VRAM requirements to approximately 0.5 GB per 1B parameters.
- Training is restricted to the adapters; the base model remains frozen and in quantized form.
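A minimal sketch of this recipe using the Hugging Face transformers, peft, and bitsandbytes libraries is shown below; the Hub identifier, adapter rank, and target modules are illustrative assumptions rather than settings reported in the cited papers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.2-3B"  # assumed Hub identifier

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; only these parameters are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 2% of the 3B base
```

From here, any standard causal-LM training loop (e.g., the transformers Trainer) can be applied to the adapter-augmented model while the quantized base remains frozen.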
| Method | Training VRAM (3B model) | Trainable Params | Notable Setting |
|---|---|---|---|
| Full fine-tuning | >35 GB | All | Standard |
| QLoRA/LoRA (rank 8-16) | 0.9–16 GB | 0.75–2% | 4-bit or mixed-precision |
This enables local or commodity-GPU fine-tuning of LLaMA3.2-3B in resource-limited research or production settings. QLoRA has been shown to preserve generalization and, in domain-specific settings (e.g., medicine), can yield substantial performance gains on downstream benchmarks (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025).
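A back-of-envelope check of the memory figure above, counting quantized weights only (KV cache, activations, and adapter/optimizer state are ignored in this rough estimate):

```python
params = 3.2e9          # approximate LLaMA3.2-3B parameter count
bits_per_param = 4      # NF4 quantization
weight_bytes = params * bits_per_param / 8
print(f"4-bit weights: {weight_bytes / 1e9:.2f} GB")  # ~1.6 GB, i.e., ~0.5 GB per 1B params

# Rank-16 LoRA adapters over the attention projections add only tens of MB of
# trainable state, which is why fine-tuning fits on a single commodity GPU.
```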
3. Multilingual and Domain-Specific Adaptation
3.1 Translation and Instruction-Following
A two-stage adaptation strategy is effective for aligning LLaMA3.2-3B to translation and instruction-following tasks (Zan et al., 21 Mar 2024):
- MLE Fine-Tuning: Standard supervised translation for base ability.
- Unlikelihood Training: Samples whose translation contradicts the direction requested in the instruction (off-target outputs) are generated, and the model is trained to suppress their token probabilities via an unlikelihood loss mixed into the MLE objective.
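A hedged sketch of such a combined MLE-plus-unlikelihood objective is given below; the mixing coefficient `lam` and the way the "negative" (off-target) token sequence is constructed are illustrative assumptions, not the cited paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def mle_plus_unlikelihood(logits, target_ids, negative_ids, lam=0.5, ignore_index=-100):
    """logits: (B, T, V); target_ids: gold translation; negative_ids: off-target translation."""
    log_probs = F.log_softmax(logits, dim=-1)

    # Standard MLE (cross-entropy) term on the correct translation
    mle = F.nll_loss(log_probs.transpose(1, 2), target_ids, ignore_index=ignore_index)

    # Unlikelihood term: penalize probability mass assigned to the off-target sequence
    neg_mask = negative_ids.ne(ignore_index)
    neg_log_p = log_probs.gather(-1, negative_ids.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    unlikelihood = -(torch.log1p(-neg_log_p.exp() + 1e-6) * neg_mask).sum() / neg_mask.sum().clamp(min=1)

    return mle + lam * unlikelihood

# Shape-only smoke test with random tensors
B, T, V = 2, 8, 128
loss = mle_plus_unlikelihood(torch.randn(B, T, V), torch.randint(0, V, (B, T)), torch.randint(0, V, (B, T)))
```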
Empirical results indicate a mean 53.3% reduction in off-target translations, along with gains of +5.7 SacreBLEU and +16.4 BLEURT over vanilla translation-finetuned LLaMA models. The strategy is robust with respect to the mixing coefficient and effective even in small-model regimes, making it suitable for LLaMA3.2-3B (Zan et al., 21 Mar 2024).
3.2 Legal, Medical, and Cultural Domains
Fine-tuning in legal (Qasem et al., 19 Dec 2024), medical (Ansari et al., 6 May 2025, Hou et al., 20 Aug 2024, Mansha, 6 Oct 2025), and culture-specific (e.g., Traditional Chinese, Baltic/Nordic history) (Research et al., 23 Jan 2025, Kostiuk et al., 15 Jan 2025) domains commonly relies on quantization, low-rank PEFT, and instruction-following datasets.
- Legal adaptation: Quantized 1B-parameter models fine-tuned with LoRA (rank 64) on 243k+ synthetic Q&A pairs, with loss converging to ~0.31 and strong performance in narrative and explanatory queries (Qasem et al., 19 Dec 2024).
- Medical adaptation: QLoRA achieves accuracy improvements (e.g., MedMCQA: +5.5 points, MMLU-Anatomy: +3 points) and enables lightweight deployment with RAG for context-aware decision support (Ansari et al., 6 May 2025); a minimal retrieval sketch follows this list. Chain-of-thought reasoning can be enhanced with LoRA/QLoRA-parameterized adapters, providing qualitative interpretability gains on constrained GPUs (15–16 GB of memory) (Mansha, 6 Oct 2025).
- Cultural adaptation: Multilingual QA results for Baltic/Nordic tasks reveal that LLaMA3.2-3B achieves 0.34–0.50 accuracy (random guessing: 0.25) but trails larger open-source and closed models (Kostiuk et al., 15 Jan 2025). Neither domain/cultural proximity nor Nordic-specific fine-tuning by itself substantially improves performance at this scale.
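For the retrieval-augmented medical setting mentioned above, the following is a minimal retrieve-then-prompt sketch; the embedding model, toy corpus, and prompt template are illustrative assumptions, not the pipeline from the cited work:

```python
from sentence_transformers import SentenceTransformer, util

# Toy in-memory "corpus" standing in for a clinical knowledge base
documents = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "ACE inhibitors are contraindicated in pregnancy.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retriever choice
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 1) -> str:
    """Retrieve the most similar document(s) and prepend them as context for the LLM."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(documents[h["corpus_id"]] for h in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What is the first-line drug for type 2 diabetes?"))
```

The resulting prompt is then passed to the fine-tuned model for generation, grounding its answer in retrieved context rather than parametric memory alone.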
4. Task-Specific Fine-Tuning: AMR Parsing and Vision-Language
Fine-tuned LLaMA3.2-3B has demonstrated strong performance on semantic parsing tasks such as Abstract Meaning Representation (AMR):
- AMR Parsing: PEFT LoRA (rank 8, α=32, dropout 0.05) combined with 4-bit quantization enables fine-tuning the full 3B model on the LDC2020T02 AMR 3.0 corpus. The model achieves SMATCH F1=0.804 on the official test set, comparable to strong classical parsers (APT+Silver (IBM)) and approaching Graphene Smatch (MBSE) at 0.854 (Ho, 7 Aug 2025). LLaMA3.2-3B leads in semantic match, though models like Phi-3.5 produce fewer structurally invalid graphs at the greatest graph depths. An inference sketch follows the table below.
| Model | SMATCH F1 (AMR3.0) | Structural Error Rate (Deep Graphs) |
|---|---|---|
| LLaMA3.2-3B | 0.804 | 0.7 |
| SOTA Classical | 0.804–0.854 | – |
| Phi-3.5 | 0.779 | 0.3 |
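Below is a hypothetical inference sketch for such a parser: a trained LoRA adapter (assumed to live at a local path) is attached to the base model and prompted to emit a graph; the adapter path and prompt template are illustrative, not artifacts of the cited work:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B"
adapter_path = "./amr-lora-adapter"  # hypothetical local adapter directory

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path)  # attach the trained LoRA weights

prompt = "Parse the following sentence into AMR:\nThe boy wants to go.\nAMR:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```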
- Multimodal (Traditional Chinese, Vision-Language, Function Calling): The Breeze2 family extends LLaMA3.2-3B with ViT-based vision encoding, MLP projection, and extensive continued pre-training (up to ~900 GB of Traditional Chinese text/data). Reported metrics for Breeze2-3B include MT-Bench-TW 4.33 and TMMBench (vision) 38.3, comparable to or exceeding other open models for the region/language, along with competitive function-calling scores (82 overall, 59% relevance detection) (Research et al., 23 Jan 2025).
5. Implementation, Resource, and Deployment Considerations
Fine-tuned LLaMA3.2-3B can be deployed on commodity hardware, edge/mobile devices, or as part of hybrid pipelines:
- Efficiency: With LoRA/QLoRA and 4-bit quantization, memory usage during fine-tuning and inference drops to roughly 0.5 GB per 1B parameters, making it practical to run 3B models on an 8 GB GPU (Qasem et al., 19 Dec 2024) or within 15–16 GB for medical reasoning tasks (Mansha, 6 Oct 2025); a minimal adapter-merge deployment sketch follows this list.
- Trade-offs: PEFT approaches (QLoRA/LoRA) preserve base fluency and factuality while minimizing the risk of catastrophic forgetting and avoiding the memory/compute spikes of full fine-tuning (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025, Hou et al., 20 Aug 2024).
- Mobile/embedded deployment: Breeze2-3B's mobile app deployment via ExecuTorch on MediaTek NPUs demonstrates on-device feasibility at 6.87 GB RAM with a prefill rate of 17 tokens/s (Research et al., 23 Jan 2025).
- Privacy: Local fine-tuning on hospital/government hardware preserves data sovereignty, crucial for medical/legal settings, GDPR compliance, and controlled access to domain-specialized models (Qasem et al., 19 Dec 2024, Hou et al., 20 Aug 2024).
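A hedged deployment sketch, assuming a locally stored LoRA adapter, that folds the adapter weights into the base model so the result can be served on local infrastructure without a peft dependency at inference time (all paths are hypothetical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B"
adapter_path = "./legal-lora-adapter"       # hypothetical fine-tuned adapter
export_dir = "./llama3.2-3b-legal-merged"   # hypothetical output directory

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()  # fold BA into W0

merged.save_pretrained(export_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(export_dir)
# The merged checkpoint can subsequently be quantized (e.g., to 4-bit) for edge or CPU serving.
```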
6. Limitations and Comparative Performance
While fine-tuned LLaMA3.2-3B models perform robustly on a variety of tasks and benchmark favorably against other open small models, several limitations are evident:
- Model size constraints: Accuracy, world knowledge, and cultural/factual alignment for low-resource domains (e.g., Baltic history) are significantly below larger open/closed models; performance is only moderately above the 0.25 random baseline on four-choice MCQ (0.34–0.50) (Kostiuk et al., 15 Jan 2025).
- Structured output (AMR, calculation, formatting): Semantic fidelity can match SOTA parsers (SMATCH F1), but structural validity may degrade with graph complexity (Ho, 7 Aug 2025). Calculation-heavy or format-specific queries remain weak in domain-tuned settings (e.g., calculation-based legal queries, complex list formatting) (Qasem et al., 19 Dec 2024).
- Instruction/data quality: Robustness to imperfect data is improved with specialized PEFT approaches but may still require additional engineering (e.g., unlikelihood loss for translation direction (Zan et al., 21 Mar 2024)).
- Resource for scaling: Performance gains from fine-tuning saturate past moderate dataset sizes; further scaling (parameter or dataset) is needed to close the gap with frontier models (Kostiuk et al., 15 Jan 2025).
7. Outlook and Application Guidance
Fine-tuned LLaMA3.2-3B serves as a practical platform for cost-efficient, domain- and language-adapted models, especially in constrained environments. Key guidelines include:
- Use QLoRA/LoRA with appropriate rank (8–16) for target task efficiency and knowledge retention.
- Incorporate domain-specific data preprocessing, careful data-mixing, and, where relevant, exploration of specialized loss functions (e.g., unlikelihood, contrastive) to mitigate off-target behaviors.
- Quantize to 4-bit where hardware resources or deployment constraints demand.
- For compliance-critical domains (healthcare, law), train and deploy models entirely on local infrastructure to ensure privacy.
- Expect superior structural results with larger models; for best cultural knowledge, supplement fine-tuning with rich, domain-specific or augmented datasets.
- Additional retrieval-augmentation or prompt engineering may be beneficial for high-stakes clinical, legal, or knowledge-heavy applications.
Fine-tuned LLaMA3.2-3B stands as an efficient, modular, and empirically validated foundation for domain specialization, robust instruction following, and deployment in scenarios that preclude the use of larger or cloud-based LLMs, within the empirical and practical bounds verified by current research (Grattafiori et al., 31 Jul 2024, Ansari et al., 6 May 2025, Qasem et al., 19 Dec 2024, Mansha, 6 Oct 2025, Hou et al., 20 Aug 2024, Ho, 7 Aug 2025, Kostiuk et al., 15 Jan 2025, Zan et al., 21 Mar 2024, Research et al., 23 Jan 2025).