Fine-tuned LLaMA3.2-3B Models

Updated 4 November 2025
  • Fine-tuned LLaMA3.2-3B is a dense Transformer model optimized with LoRA/QLoRA and quantization techniques for efficient domain adaptation.
  • It incorporates advanced features like SwiGLU, rotary positional encodings, and grouped query attention to support long-context processing and multilingual specialization.
  • The model achieves robust results in tasks such as AMR parsing, legal and medical adaptations, and vision-language integration while reducing computational resource demands.

Fine-tuned LLaMA3.2-3B refers to a range of approaches and results involving the adaptation of the LLaMA 3.2-3B parameter Transformer-based LLM for specific domains and tasks through techniques such as supervised fine-tuning, parameter-efficient adaptation, quantization, and retrieval-augmented pipelines. The following sections synthesize the technical foundations, methodology, and empirical performance of fine-tuned LLaMA3.2-3B variants, highlighting quantization and efficiency, multilingual and specialized domain adaptation, benchmarking, and practical considerations for deployment.

1. Core Architecture and Training Features

LLaMA3.2-3B is a dense Transformer model based on the Llama 3 series, with standard self-attention, SwiGLU activations, and Grouped Query Attention (GQA) to optimize inference efficiency and key-value cache size. The model features a 128K-token vocabulary (tiktoken with an enlarged lexicon for non-English coverage), rotary positional encodings (RoPE, base freq=500,000), and a context window of up to 128K tokens following continued pre-training (Grattafiori et al., 2024). The architecture is designed for computational stability with minimal deviation from Llama 2 at the parameterization level, enabling robust transfer and compatibility with state-of-the-art fine-tuning and quantization strategies.

| Component | Detail |
| --- | --- |
| Base model | Dense Transformer, SwiGLU, RoPE, GQA with 8 KV heads |
| Params / vocab / context | ~3B parameters, 128K-token vocabulary, 128K-token context even at 3B scale |
| Training data mix | General (50%), math/reasoning (25%), programming (17%), multilingual (8%) |
| Fine-tuning recipe | SFT + DPO, with reward modeling for post-training |
| Quantization | 4- and 8-bit quantization, with LoRA/QLoRA for efficiency |

This model architecture allows for effective scaling of both parameter size and input context, providing a robust substrate for fine-tuning across various domains and methods.
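
To make the grouped query attention design concrete, the following is a minimal, self-contained sketch (not Meta's implementation) of how a small number of KV heads, here the 8 noted above, is shared across a larger set of query heads, which is what shrinks the KV cache; the query-head count, sequence length, and head dimension are illustrative assumptions.

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads=8):
    """Minimal GQA sketch: each of the n_kv_heads key/value heads is shared by a
    group of query heads, so the KV cache stores n_kv_heads rather than n_q_heads."""
    b, n_q_heads, seq, d = q.shape                # q: (batch, query heads, seq, head_dim)
    group = n_q_heads // n_kv_heads               # query heads served by each KV head
    k = k.repeat_interleave(group, dim=1)         # broadcast KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # scaled dot-product attention
    return torch.softmax(scores, dim=-1) @ v

# Illustrative shapes: 24 query heads sharing 8 KV heads, head_dim assumed 128.
q = torch.randn(1, 24, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)            # -> (1, 24, 16, 128)
```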

2. Parameter-Efficient Fine-Tuning and Quantization

Parameter-efficient fine-tuning (PEFT) is widely adopted in practice for LLaMA3.2-3B, primarily to mitigate compute and storage bottlenecks. The leading approaches are Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA).

LoRA/QLoRA mechanism (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025, Hou et al., 2024):

  • LoRA inserts trainable low-rank adapter matrices $A, B$ into the attention weights $W$, so $y = (W + BA)x + b$.
  • QLoRA stores $W$ in 4-bit quantized form (NF4 or similar), further reducing VRAM requirements to approximately 0.5 GB per 1B parameters.
  • Training is restricted to the adapters; the base model remains frozen and in quantized form.

| Method | Training VRAM (3B) | Trainable params | Notable setting |
| --- | --- | --- | --- |
| Full fine-tuning | >35 GB | All | Standard |
| QLoRA/LoRA (rank 8–16) | 0.9–16 GB | 0.75–2% | 4-bit or mixed-precision |

This enables local or commodity-GPU fine-tuning of LLaMA3.2-3B in resource-limited research or production settings. QLoRA has been shown to preserve generalization and, in domain-specific settings (e.g., medicine), can yield substantial performance gains on downstream benchmarks (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025).
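
As a concrete reference point, the sketch below shows one common way to set up 4-bit QLoRA training for LLaMA3.2-3B with Hugging Face transformers, bitsandbytes, and peft; the model ID, rank, and target modules are illustrative assumptions, not a recipe prescribed by the cited papers.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4 (QLoRA-style); model ID assumed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters to the attention projections; only these are trained.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # typically on the order of 1% of all parameters
```

Training can then proceed with any standard causal-LM trainer (for example transformers.Trainer or TRL's SFTTrainer), with only the adapter weights updated while the quantized base remains frozen.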

3. Multilingual and Domain-Specific Adaptation

3.1 Translation and Instruction-Following

A two-stage adaptation strategy is effective for aligning LLaMA3.2-3B to translation and instruction-following tasks (Zan et al., 2024):

  1. MLE Fine-Tuning: Standard supervised translation fine-tuning to establish basic translation ability.
  2. Unlikelihood Training: Samples with an incorrect translation direction in the instruction are generated; the model is trained to minimize $P(\text{target output} \mid \text{wrong instruction})$ via an unlikelihood loss.

$$\mathcal{L}_{\mathcal{D}}(\theta) = \mathcal{L}^{\mathrm{MLE}}_{\mathcal{D}}(\theta) + \alpha\,\mathcal{L}^{\mathrm{UL}}_{\mathcal{D}}(\theta)$$

Empirical results indicate a mean 53.3% reduction in off-target translations, along with improvements of +5.7 SacreBLEU and +16.4 BLEURT over vanilla translation-finetuned LLaMA models. The strategy is robust to the choice of the mixing coefficient $\alpha$ and remains effective in small-model regimes, making it suitable for LLaMA3.2-3B (Zan et al., 2024).
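
A minimal PyTorch sketch of the combined objective follows, under the assumption that the same target sequence is scored twice, once under the correct instruction and once under a wrong-direction instruction; the function and argument names are illustrative, not taken from the cited work.

```python
import torch
import torch.nn.functional as F

def mle_plus_unlikelihood(logits_correct, logits_wrong, targets, alpha=1.0, pad_id=-100):
    """Combined loss L = L_MLE + alpha * L_UL for a batch of target sequences.

    logits_correct / logits_wrong: (batch, seq, vocab) logits for the same target
    tokens under the correct and the wrong-direction instruction, respectively.
    targets: (batch, seq) token ids, with pad_id marking positions to ignore.
    """
    vocab = logits_correct.size(-1)
    # MLE term: standard cross-entropy under the correct instruction.
    l_mle = F.cross_entropy(
        logits_correct.reshape(-1, vocab), targets.reshape(-1), ignore_index=pad_id
    )
    # Unlikelihood term: push down p(target token | wrong instruction).
    log_p = F.log_softmax(logits_wrong, dim=-1)
    p_wrong = log_p.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1).exp()
    mask = (targets != pad_id).float()
    l_ul = -(torch.log((1.0 - p_wrong).clamp_min(1e-6)) * mask).sum() / mask.sum().clamp_min(1.0)
    return l_mle + alpha * l_ul

# Toy shapes only; real use would take logits from two forward passes of the model.
B, T, V = 2, 6, 32000
loss = mle_plus_unlikelihood(torch.randn(B, T, V), torch.randn(B, T, V),
                             torch.randint(0, V, (B, T)), alpha=0.5)
```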

Fine-tuning in legal (Qasem et al., 2024), medical (Ansari et al., 6 May 2025, Hou et al., 2024, Mansha, 6 Oct 2025), and culture-specific (e.g., Traditional Chinese, Baltic/Nordic history) (Research et al., 23 Jan 2025, Kostiuk et al., 15 Jan 2025) domains commonly relies on quantization, low-rank PEFT, and instruction-following datasets.

  • Legal adaptation: Quantized 1B-parameter models fine-tuned with LoRA (rank 64) on 243k+ synthetic Q&A pairs, with loss converging to ~0.31 and strong performance in narrative and explanatory queries (Qasem et al., 2024).
  • Medical adaptation: QLoRA achieves accuracy improvements (e.g., MedMCQA: +5.5 points, MMLU-Anatomy: +3 points) and enables lightweight deployment with retrieval-augmented generation (RAG) for context-aware decision support (Ansari et al., 6 May 2025); a minimal retrieval sketch follows this list. Chain-of-thought reasoning can be enhanced with LoRA/QLoRA-parameterized adapters, yielding qualitative interpretability gains on constrained GPUs (15–16 GB of memory) (Mansha, 6 Oct 2025).
  • Cultural adaptation: Multilingual QA results for Baltic/Nordic tasks show that LLaMA3.2-3B achieves 0.34–0.50 accuracy (random guessing: 0.25) but trails larger open-source and closed models (Kostiuk et al., 15 Jan 2025). Neither domain/cultural proximity nor Nordic-specific fine-tuning by itself substantially improves performance at this scale.
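
For the RAG-based decision support mentioned above, a minimal retrieval sketch might look as follows; the embedding model, index choice, example documents, and prompt format are assumptions for illustration, not the pipeline used in the cited studies.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-level encoder would do.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "ACE inhibitors are commonly used to treat hypertension.",
]
doc_emb = encoder.encode(documents, normalize_embeddings=True)

# Cosine similarity via inner product on normalized vectors.
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(np.asarray(doc_emb, dtype="float32"))

def build_prompt(question: str, k: int = 1) -> str:
    """Retrieve the top-k passages and prepend them to the question."""
    q_emb = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_emb, dtype="float32"), k)
    context = "\n".join(documents[i] for i in ids[0])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("What is a first-line drug for type 2 diabetes?")
# `prompt` is then passed to the fine-tuned LLaMA3.2-3B model for generation.
```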

4. Task-Specific Fine-Tuning: AMR Parsing and Vision-Language

Fine-tuned LLaMA3.2-3B has demonstrated strong performance on semantic parsing tasks such as Abstract Meaning Representation (AMR):

  • AMR Parsing: PEFT with LoRA (rank 8, $\alpha = 32$, dropout 0.05) and 4-bit quantization enables fine-tuning of the full 3B model on the LDC2020T02 AMR 3.0 corpus. The model achieves SMATCH F1 = 0.804 on the official test set, comparable to strong classical parsers (APT+Silver (IBM)) and approaching Graphene Smatch (MBSE) at 0.854 (Ho, 7 Aug 2025). LLaMA3.2-3B leads in semantic match, though models like Phi-3.5 produce fewer structurally invalid graphs at the highest depths; an inference sketch follows this list.

| Model | SMATCH F1 (AMR 3.0) | Structural error rate (deep graphs) |
| --- | --- | --- |
| LLaMA3.2-3B | 0.804 | 0.7 |
| SOTA classical | 0.804–0.854 | – |
| Phi-3.5 | 0.779 | 0.3 |

  • Multimodal (Traditional Chinese, vision-language, function calling): The Breeze2 family extends LLaMA3.2-3B with ViT-based vision encoding, MLP projection, and extensive continued pre-training (up to ~900 GB of Traditional Chinese text/data). Breeze2-3B reports MT-Bench-TW 4.33 and TMMBench (vision) 38.3, comparable to or exceeding other open models for the language/region, along with competitive function-calling scores (82 overall, 59% relevance detection) (Research et al., 23 Jan 2025).
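
To illustrate how a fine-tuned adapter of this kind is used at inference time, the sketch below loads a hypothetical LoRA adapter on top of the 4-bit base model and prompts it to emit an AMR graph; the base model ID, adapter path, and prompt template are assumptions, not those of the cited work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B"      # assumed base checkpoint
adapter_path = "./amr-lora-adapter"      # hypothetical local adapter directory

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_path)   # attach the trained LoRA weights

# Assumed prompt template: sentence in, PENMAN-notation AMR out.
prompt = "Parse the following sentence into an AMR graph:\nThe boy wants to go.\nAMR:\n"
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```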

5. Implementation, Resource, and Deployment Considerations

Fine-tuned LLaMA3.2-3B can be deployed on commodity hardware, edge/mobile devices, or as part of hybrid pipelines:

  • Efficiency: With LoRA/QLoRA and 4-bit quantization, memory usage during fine-tuning and inference is drastically reduced (roughly 0.5 GB per 1B parameters for the quantized weights), making it practical to run 3B models on an 8 GB GPU (Qasem et al., 2024) or within 15–16 GB for medical reasoning tasks (Mansha, 6 Oct 2025); a back-of-the-envelope estimate follows this list.
  • Trade-offs: PEFT approaches (QLoRA/LoRA) preserve base fluency and factuality while minimizing the risk of catastrophic forgetting and avoiding the memory/compute spikes seen in full fine-tuning (Ansari et al., 6 May 2025, Mansha, 6 Oct 2025, Hou et al., 2024).
  • Mobile/embedded deployment: Breeze2-3B's mobile app deployment via ExecuTorch on MediaTek NPUs demonstrates feasibility at 6.87 GB of RAM and a prefill rate of 17 tokens/s (Research et al., 23 Jan 2025).
  • Privacy: Local fine-tuning on hospital/government hardware preserves data sovereignty, crucial for medical/legal settings, GDPR compliance, and controlled access to domain-specialized models (Qasem et al., 2024, Hou et al., 2024).
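
The ~0.5 GB per 1B parameters figure follows from simple arithmetic. The sketch below is a rough back-of-the-envelope estimator for 4-bit weights plus a 16-bit KV cache; it ignores activation memory and framework overhead, and the layer/head/dimension values are assumptions for illustration, not measured numbers.

```python
def estimate_memory_gb(n_params_b=3.0, bits_per_weight=4,
                       n_layers=28, n_kv_heads=8, head_dim=128,
                       context_len=8192, kv_bytes=2):
    """Rough VRAM estimate: quantized weights plus a bf16 KV cache."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9            # ~0.5 GB per 1B params at 4-bit
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9  # keys + values
    return weights_gb, kv_gb

w, kv = estimate_memory_gb()
print(f"4-bit weights: ~{w:.1f} GB, KV cache at 8K context: ~{kv:.2f} GB")
```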

6. Limitations and Comparative Performance

While fine-tuned LLaMA3.2-3B models perform robustly on a variety of tasks and benchmark favorably against other open small models, several limitations are evident:

  • Model size constraints: Accuracy, world knowledge, and cultural/factual alignment for low-resource domains (e.g., Baltic history) are significantly below larger open/closed models; four-choice MCQ accuracy is only modestly above random (0.34–0.50 vs. 0.25 chance) (Kostiuk et al., 15 Jan 2025).
  • Structured output (AMR, calculation, formatting): Semantic fidelity can match SOTA parsers (SMATCH F1), but structural validity may degrade with graph complexity (Ho, 7 Aug 2025). Calculation-heavy or format-specific queries remain weak in domain-tuned settings (e.g., calculation-based legal queries, complex list formatting) (Qasem et al., 2024).
  • Instruction/data quality: Robustness to imperfect data is improved with specialized PEFT approaches but may still require additional engineering (e.g., unlikelihood loss for translation direction (Zan et al., 2024)).
  • Scaling limits: Performance gains from fine-tuning saturate beyond moderate dataset sizes; further scaling (parameters or data) is needed to close the gap with frontier models (Kostiuk et al., 15 Jan 2025).

7. Outlook and Application Guidance

Fine-tuned LLaMA3.2-3B serves as a practical platform for cost-efficient, domain- and language-adapted models, especially in constrained environments. Key guidelines include:

  • Use QLoRA/LoRA with appropriate rank (8–16) for target task efficiency and knowledge retention.
  • Incorporate domain-specific data preprocessing, careful data-mixing, and, where relevant, exploration of specialized loss functions (e.g., unlikelihood, contrastive) to mitigate off-target behaviors.
  • Quantize to 4-bit where hardware resources or deployment constraints demand.
  • For compliance-critical domains (healthcare, law), train and deploy models entirely on local infrastructure to ensure privacy.
  • Expect superior structural results with larger models; for best cultural knowledge, supplement fine-tuning with rich, domain-specific or augmented datasets.
  • Additional retrieval-augmentation or prompt engineering may be beneficial for high-stakes clinical, legal, or knowledge-heavy applications.

Fine-tuned LLaMA3.2-3B stands as an efficient, modular, and empirically validated foundation for domain specialization, robust instruction following, and deployment in scenarios that preclude the use of larger or cloud-based LLMs, within the empirical and practical bounds verified by current research (Grattafiori et al., 2024, Ansari et al., 6 May 2025, Qasem et al., 2024, Mansha, 6 Oct 2025, Hou et al., 2024, Ho, 7 Aug 2025, Kostiuk et al., 15 Jan 2025, Zan et al., 2024, Research et al., 23 Jan 2025).
