
Vicuna: Instruction-Tuned Open LLMs

Updated 2 June 2026
  • Vicuna models are open-source, instruction-tuned LLMs derived from Meta's LLaMA, optimized for versatile dialogue and interactive use.
  • They use parameter-efficient fine-tuning with LoRA adapters and quantization, enabling deployment on resource-constrained hardware.
  • The models achieve competitive benchmark scores, improve further through iterative self-refinement, and support both English and Chinese language tasks.

Vicuna is a family of open-source, instruction-tuned LLMs derived from Meta’s LLaMA architecture, designed to closely emulate the conversational abilities of proprietary systems like ChatGPT while maintaining accessibility and adaptability for a diverse set of research and deployment scenarios. The development of Vicuna and its variants emphasizes parameter-efficient adaptation, quantization for resource-constrained environments, and instruction-following capabilities across English and Chinese, making it central to ongoing research into democratized, high-performing LLMs (Ghosal et al., 2023, Shashidhar et al., 2023, Fan et al., 17 Apr 2025).

1. Foundation and Model Architecture

Vicuna-7B and Vicuna-13B are instruction-tuned derivatives of Meta's LLaMA-7B and LLaMA-13B, retaining the transformer-decoder backbone characterized by full-attention layers, rotary positional embeddings, and architecture scaling proportional to parameter count. For instance, the 7B variant uses 32 transformer layers, a hidden size of 4096, and 32 attention heads per layer. These models are further enhanced for instruction following through system-specific fine-tuning strategies and the integration of low-rank adaptation (LoRA) modules. LoRA injects two low-rank matrices into each adapted projection matrix $W$, yielding $W' = W + \Delta W$ with $\Delta W = AB$ for $A\in\mathbb{R}^{d\times r}$ and $B\in\mathbb{R}^{r\times k}$, $r \ll \min(d,k)$, thus enabling efficient parameter adaptation with only about 0.05–0.1% additional parameters relative to the full model (Ghosal et al., 2023, Fan et al., 17 Apr 2025).
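The low-rank update can be illustrated with a minimal, dependency-free sketch. The helper names (`matmul`, `lora_merge`, `lora_param_ratio`) and the toy matrices are illustrative assumptions, not code from any of the cited projects.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b):
    """Return W' = W + A @ B for W (d x k), A (d x r), B (r x k)."""
    delta = matmul(a, b)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_ratio(d, k, r):
    """Extra parameters LoRA adds to one d x k matrix, as a fraction of d * k."""
    return r * (d + k) / (d * k)


# Toy example with d = 3, k = 2, r = 1 (values chosen only for illustration).
W = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
A = [[1.0], [0.0], [-1.0]]
B = [[0.5, 2.0]]
W_merged = lora_merge(W, A, B)

# For a single 4096 x 4096 projection with an assumed rank of 8:
ratio_4096 = lora_param_ratio(4096, 4096, 8)
```

The per-matrix ratio is larger than the whole-model percentage quoted above because only some of the model's weight matrices receive adapters.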

Subsequent variants, such as Chinese-Vicuna, extend token embeddings for Chinese, implement domain transfer via continued fine-tuning, and employ quantization-aware adaptation using QLoRA (4-bit and 8-bit group-wise quantization), accommodating deployment on resource-constrained devices (e.g., RTX-2080Ti) while maintaining performance (Fan et al., 17 Apr 2025).

2. Instruction Tuning and Data Sources

Instruction tuning in Vicuna centers on exposure to large-scale, diverse, instruction-formatted dialogue corpora. The foundational English Vicuna models were tuned on multi-turn dialogue datasets distilled from ChatGPT, including ShareGPT and other conversational logs. The Flacuna variant introduced by Ghosal et al. (2023) demonstrates enhanced problem-solving capacities by fine-tuning Vicuna-13B on a curated FLANMINI dataset comprising ~1.34 million examples. This dataset includes:

  • ~1.008M FLAN subset instructions (FLAN2021, Public Pool of Prompts, Natural Instructions v2, Chain-of-Thought reasoning).
  • ~200K code-centric tasks from CodeSearchNet, CodeContests, and APPS.
  • ~132K distilled dialogues from GPT-4-Alpaca, Code-Alpaca, and ShareGPT.

Prompt diversity and style are maintained through randomization and conversion to Vicuna’s conversational format (“USER: … ASSISTANT: …”). The Chinese-Vicuna methodology merges BELLE (500K+ Chinese instruction pairs) and Guanaco (534K multilingual instructions), with domain-specific extensions for medical and legal prompts (Fan et al., 17 Apr 2025).
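As a rough illustration of the "USER: … ASSISTANT: …" conversational format, the sketch below flattens a multi-turn dialogue into a single prompt string. The function name `build_vicuna_prompt`, the optional system preamble, and the single-space separator are assumptions made for this example; the exact template used in training may differ.

```python
def build_vicuna_prompt(turns, system=""):
    """Flatten (user, assistant) turns into one 'USER: ... ASSISTANT: ...' string.

    `turns` is a list of (user_message, assistant_message) pairs; pass None as
    the assistant message of the final turn to leave it open for generation.
    The separator (a single space) is an assumption of this sketch.
    """
    parts = [system] if system else []
    for user_msg, assistant_msg in turns:
        parts.append(f"USER: {user_msg}")
        if assistant_msg is None:
            parts.append("ASSISTANT:")
        else:
            parts.append(f"ASSISTANT: {assistant_msg}")
    return " ".join(parts)


prompt = build_vicuna_prompt([("Hi", "Hello!"), ("What is LoRA?", None)])
```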

3. Parameter-Efficient Fine-Tuning and Quantization

Vicuna fine-tuning employs LoRA adapters on the targeted projection modules, with a fixed rank $r$, scaling factor $\alpha$, and dropout = 0.05. Training is conducted using mixed precision (bfloat16) and moderate compute configurations (4 × NVIDIA A6000 or RTX-2080Ti). In the Flacuna experiment (Ghosal et al., 2023), parameter-efficient fine-tuning makes a single epoch over 1.34M examples practical while training only about 6.55 million parameters for Vicuna-13B (0.05% of its total).
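The reported figure of roughly 6.55 million trainable parameters can be sanity-checked with a short calculation. The sketch below assumes the published LLaMA-13B shape (40 layers, hidden size 5120) and, as a further assumption not taken from the text above, a rank of 8 applied to two square projection matrices per layer; under those assumptions the arithmetic lands on the reported number.

```python
def lora_trainable_params(num_layers, hidden_size, rank, modules_per_layer):
    """Count LoRA parameters when each adapted module is hidden_size x hidden_size.

    Each adapter holds A (hidden_size x rank) and B (rank x hidden_size).
    """
    per_module = rank * (hidden_size + hidden_size)
    return num_layers * modules_per_layer * per_module


# Assumed configuration: 40 layers, hidden size 5120, rank 8, two modules per layer.
trainable = lora_trainable_params(num_layers=40, hidden_size=5120,
                                  rank=8, modules_per_layer=2)

# Share of a nominal 13-billion-parameter model, in percent.
share_percent = 100 * trainable / 13e9
```

With these inputs `trainable` is 6,553,600, about 0.05% of 13 billion parameters.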

The quantization pipeline in Chinese-Vicuna employs group-wise QLoRA. Parameters are quantized to 8-bit (7B) or 4-bit (13B), with quantization parameters computed per group of weights to minimize storage and peak VRAM requirements. In practice, this enables quantized Vicuna-7B to run in 4.13 GB of VRAM and Vicuna-13B in 7.41 GB, supporting real-time inference (e.g., 8 tokens/s on an RTX-2080Ti, 3 tokens/s on CPU) (Fan et al., 17 Apr 2025, Shashidhar et al., 2023).
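The idea of per-group quantization parameters can be sketched as follows. This is a simplified uniform affine scheme (one scale and one offset per group), chosen only for illustration; it is not the quantizer used by QLoRA or Chinese-Vicuna, and the function names are hypothetical.

```python
def quantize_groupwise(weights, group_size, bits):
    """Quantize a flat list of floats in groups, storing (scale, offset, codes)."""
    levels = (1 << bits) - 1  # highest integer code, e.g. 15 for 4-bit
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # Guard against a constant group, where hi == lo.
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((w - lo) / scale) for w in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Reconstruct approximate float weights from the per-group parameters."""
    out = []
    for scale, lo, codes in groups:
        out.extend(lo + scale * c for c in codes)
    return out


weights = [0.0, 0.5, 1.0, 1.5, -2.0, -1.0, 0.0, 1.0]
packed = quantize_groupwise(weights, group_size=4, bits=4)
restored = dequantize_groupwise(packed)
```

Because each group has its own scale, a group with a narrow value range is reconstructed more precisely than it would be under a single global scale.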

4. Performance, Benchmarking, and Refinement

Performance evaluation spans general NLP and specialized tasks, leveraging metrics from INSTRUCTEVAL, the Vicuna-Benchmark, and domain benchmarks:

| Model | Zero-shot Vicuna-Benchmark (%) | External Benchmarks (ARC, HellaSwag, MMLU, TruthfulQA) | Post-Refinement Gain | VRAM (4-bit, GB) |
|---|---|---|---|---|
| Vicuna-7B | 89.31 | 52.2 | +11.74 pp / +25.4 pp* | 4.13 |
| Vicuna-13B | 94.53 | 53.7 | +7.61 pp / >ChatGPT | 7.41 |

*Open-ended/high-creativity tasks (Shashidhar et al., 2023).

Flacuna (Vicuna+FLANMINI) demonstrates 4–16 percentage-point improvements on held-out reasoning benchmarks (MMLU, BBH, CRASS) and a 10.8 pp gain on the HHH "alignment" metric, at the cost of slightly reduced HumanEval and open-ended writing scores. On Chinese-specific evaluations, Chinese-Vicuna outperforms the base Vicuna-7B on translation (+2.2 BLEU), code generation (+2.6% Pass@1), and medical QA (+15.6% accuracy) (Fan et al., 17 Apr 2025).

Iterative, domain-agnostic self-refinement (generating a response, critiquing it, then producing a refined version based on the critique) yields substantial further improvements. For example, Vicuna-13B attains a mean score of 101.72% post-refinement (relative to ChatGPT's 100%), and Vicuna-7B improves from 84.18% to 92.96% (Shashidhar et al., 2023).
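The generate, critique, refine loop described above can be sketched in a few lines. The `generate` callable stands in for any text-generation backend, and the prompt wording is a hypothetical choice for this example rather than the prompts used by Shashidhar et al.

```python
from typing import Callable


def self_refine(prompt: str, generate: Callable[[str], str], rounds: int = 1) -> str:
    """Generate a response, then critique and rewrite it `rounds` times."""
    response = generate(prompt)
    for _ in range(rounds):
        critique = generate(
            "Critique the response below.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique:"
        )
        response = generate(
            "Rewrite the response using the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nImproved response:"
        )
    return response


# Stub model so the sketch runs without any LLM backend.
calls = []


def fake_generate(text: str) -> str:
    calls.append(text)
    return f"draft-{len(calls)}"


final = self_refine("Explain LoRA briefly.", fake_generate, rounds=1)
```

Each refinement round costs two extra model calls, which is why inference cost matters when weighing the gains reported above.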

5. Cost-Performance Analysis and Deployment

Vicuna models are expressly engineered for deployment flexibility. PeRFICS (Performance, Refinement, and Inference Cost Score) is proposed as a composite metric that balances baseline and post-refinement task performance, external benchmark scores, and VRAM-based inference cost.

This enables practitioners to make informed model selections for contexts ranging from high-throughput email response automation (Vicuna-7B on RTX 3060) to privacy-respecting on-premise code analysis (Vicuna-13B). Chinese-Vicuna's integration with llama.cpp and its conversion tools further extend deployment to CPU-only and multi-GPU setups (Fan et al., 17 Apr 2025, Shashidhar et al., 2023).

6. Domain Adaptation, Community Ecosystem, and Future Directions

Chinese-Vicuna expands the Vicuna paradigm through hybrid, Chinese-focused datasets, continued fine-tuning for specialized domains, and community-driven modular adapters for specific fields (healthcare, law, etc.). The project roadmap includes reinforcement learning from human feedback (RLHF) for alignment and retrieval-augmented generation for knowledge updating. A full-parameter fine-tuning cookbook is also disseminated for research scenarios requiring maximal adaptation fidelity (Fan et al., 17 Apr 2025).

A plausible implication is that the Vicuna model family and associated toolkits will remain foundational for downstream LLM research seeking a balance between transparent provenance, resource-adaptive deployment, and high-level instruction following, in both English and non-English settings. Results so far suggest that, with iterative self-refinement and judicious data curation, open LLMs can close the gap to leading proprietary systems, or even surpass them, under cost and privacy constraints.


References:

(Ghosal et al., 2023) "Flacuna: Unleashing the Problem Solving Power of Vicuna using FLAN Fine-Tuning"
(Shashidhar et al., 2023) "Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models"
(Fan et al., 17 Apr 2025) "Chinese-Vicuna: A Chinese Instruction-following Llama-based Model"
