
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (2503.01743v2)

Published 3 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter LLM trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-LLMs on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.


Summary

  • The paper introduces Phi-4-Mini and Phi-4-Multimodal, compact yet powerful models achieving high performance through synthetic data, an expanded vocabulary, GQA, and a Mixture-of-LoRAs approach for multimodal fusion.
  • Phi-4-Mini, a 3.8B parameter language model, demonstrates strong performance on math, coding, and reasoning tasks, often matching larger models and outperforming others in its size class.
  • Phi-4-Multimodal integrates text, vision, and speech/audio seamlessly using modality-specific LoRAs, achieving state-of-the-art results on multimodal benchmarks and ranking first on the OpenASR leaderboard.

The paper introduces Phi-4-Mini and Phi-4-Multimodal, compact language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter LLM trained on high-quality web and synthetic data. It outperforms similarly sized open-source models on math and coding tasks that require complex reasoning, a result driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to Phi-3.5-Mini, Phi-4-Mini has an expanded vocabulary of 200K tokens for better multilingual support and employs group query attention (GQA) for more efficient long-sequence generation.

Phi-4-Multimodal integrates text, vision, and speech/audio modalities into a single model. Its modality-extension approach uses LoRA adapters and modality-specific routers, allowing multiple inference modes that combine modalities without interference. The model ranks first on the OpenASR leaderboard even though its speech/audio LoRA component has only 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-LLMs on a wide range of tasks.

The paper also explores further training Phi-4-Mini to enhance reasoning capabilities. This experimental version achieves reasoning performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.

Key contributions:

  • Unified Multi-Modality Support: Phi-4-Multimodal uses a mixture of LoRAs to extend multimodal capabilities while minimizing interference between modalities.
  • Language Performance: Phi-4-Mini achieves strong natural language understanding and generation performance for its size category.
  • Code Understanding and Generation: Phi-4-Mini performs well on code-related tasks within its size category.
  • Multi-Modal Capabilities: Phi-4-Multimodal delivers strong performance across multi-modal tasks for its size category, demonstrating integration of images with text and speech.
  • Speech and Audio Performance: The model achieves strong performance on multilingual speech recognition and translation tasks, along with speech summarization capability.
  • Reasoning Capabilities: The reasoning-optimized version of Phi-4-Mini demonstrates strong reasoning abilities.

Model Architecture

The Phi-4-Mini series includes an LLM (Phi-4-Mini) and a multimodal model (Phi-4-Multimodal). All models in the series use the o200k_base tiktoken tokenizer with a vocabulary size of 200,064, are based on a decoder-only Transformer, and support a 128K context length via LongRoPE.
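As a quick, hedged illustration of the tokenizer choice (not code from the paper), the snippet below loads the o200k_base encoding via the tiktoken library; note that the 200,064 figure refers to the model's embedding table, which may differ slightly from the token count tiktoken reports for the raw encoding.

```python
# Minimal sketch: inspecting the o200k_base encoding with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "Phi-4-Mini supports a 200K-token vocabulary."
tokens = enc.encode(text)

print(f"vocab size reported by tiktoken: {enc.n_vocab}")
print(f"token ids: {tokens}")
print(f"round-trip: {enc.decode(tokens)}")
```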

LLM Architecture

Phi-4-Mini and Phi-4-Multimodal share the same LLM backbone. Phi-4-Mini consists of 32 Transformer layers with a hidden state size of 3,072. Each Transformer block uses a GQA-based attention mechanism, which reduces key/value (KV) cache usage for long-context generation. The model employs 24 query heads and 8 key/value heads. Additionally, a fractional RoPE dimension is used, so that 25% of the attention head dimension remains position-agnostic. The peak learning rate is determined using $LR^*(D) = B \cdot D^{-0.32}$ (illustrated in the sketch after the symbol list below), where:

  • $LR^*(D)$: Optimal peak learning rate as a function of $D$
  • $B$: Constant tuned for the specific model
  • $D$: Total number of training tokens
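As a rough illustration of this scaling rule (a sketch, not the paper's code; the constant $B$ below is a placeholder, since the paper tunes it per model), the peak learning rate for a given token budget can be computed as:

```python
# Hypothetical sketch of the peak-LR scaling rule LR*(D) = B * D**(-0.32).
# B is a placeholder; the actual constant is tuned per model and not assumed here.
def peak_learning_rate(num_training_tokens: float, B: float = 1.0) -> float:
    """Return the scaled peak learning rate for a training-token budget D."""
    return B * num_training_tokens ** -0.32

# Example: relative peak LR when scaling the corpus from 1T to 5T tokens.
lr_1t = peak_learning_rate(1e12)
lr_5t = peak_learning_rate(5e12)
print(f"LR(5T) / LR(1T) = {lr_5t / lr_1t:.3f}")  # ~0.60, i.e. a smaller peak LR for a larger token budget
```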

Multimodal Model Architecture

The Phi-4-Multimodal architecture adopts a mixture of LoRAs to support multi-modality use cases. Different LoRAs are trained to handle interactions between different modalities.
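As a minimal, hypothetical sketch of this idea (illustrative PyTorch only; module names and the vision rank are assumptions, not the paper's implementation), a frozen base projection can be paired with modality-specific LoRA adapters that a simple router selects at inference time:

```python
# Hypothetical Mixture-of-LoRAs sketch (illustrative; not the paper's implementation).
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank update x -> up(down(x)), added to the frozen base output."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MixtureOfLoRALinear(nn.Module):
    """A frozen base projection plus modality-routed LoRA adapters."""
    def __init__(self, dim: int, ranks: dict):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # base LLM weights stay frozen
        self.base.bias.requires_grad_(False)
        self.adapters = nn.ModuleDict({name: LoRAAdapter(dim, r) for name, r in ranks.items()})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        out = self.base(x)
        if modality in self.adapters:       # e.g. "vision" or "audio"
            out = out + self.adapters[modality](x)
        return out                          # pure-text requests skip every adapter

# Rank 320 for audio follows the figure quoted later in this summary; the vision rank is made up.
layer = MixtureOfLoRALinear(dim=3072, ranks={"vision": 256, "audio": 320})
tokens = torch.randn(1, 8, 3072)
vision_out = layer(tokens, modality="vision")
text_out = layer(tokens, modality="text")   # uses only the frozen base weights
```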

Modality Details

  • Vision modality: Implemented with a SigLIP-400M image encoder fine-tuned with LLM2CLIP, a projector, and a LoRA adapter. The projector is a 2-layer MLP. The image encoder and projector introduce 440M parameters, while the vision adapter $LoRA_V$ consumes another 370M parameters. A dynamic multi-crop strategy is used to process images with diverse resolutions effectively.

    Given a target image, the crop number is computed as $\lceil \frac{H}{C} \rceil \times \lceil \frac{W}{C} \rceil$, where $H$, $W$, and $C$ are the image height, width, and crop size, respectively (a worked example follows this list).

    • $H$: Image height
    • $W$: Image width
    • $C$: Crop size
  • Speech and Audio Modality: The speech/audio inputs are 80-dim log-Mel filter-bank features with a 10ms frame rate. To enable Phi-4-Multimodal speech and audio functions, a pre-trained audio encoder and Phi-4-Mini are connected through an audio adapter. LoRA is applied on the language decoder. The introduced modules for the speech/audio modality include:
    • An audio encoder consisting of 3 convolution layers and 24 conformer blocks with 1024 attention dimensions, 1536 feed-forward dimensions, and 16 attention heads.
    • An audio projector, which is a 2-layer MLP that maps the 1024-dim speech features to the text embedding space of 3072 dimensions.
    • $LoRA_A$, applied to all attention and MLP layers in Phi-4-Mini with a rank of 320.
    • The audio encoder and projector introduce 460M parameters, while $LoRA_A$ consumes another 460M parameters. The speech token rate is one token per 80 ms, i.e., 750 tokens per minute of audio.
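The short sketch below (illustrative only; the 448-pixel crop size is a made-up example, while the 80 ms token rate comes from the text above) works through both bookkeeping formulas: the dynamic multi-crop count and the implied audio token count.

```python
# Illustrative bookkeeping for the figures quoted above (not code from the paper).
import math

def num_crops(height: int, width: int, crop_size: int) -> int:
    """Dynamic multi-crop count: ceil(H/C) * ceil(W/C)."""
    return math.ceil(height / crop_size) * math.ceil(width / crop_size)

def speech_tokens(duration_s: float, token_rate_ms: float = 80.0) -> int:
    """Number of audio tokens at one token per `token_rate_ms` milliseconds."""
    return int(duration_s * 1000 / token_rate_ms)

# A 1024x768 image with a hypothetical 448-pixel crop size -> 3 * 2 = 6 crops.
print(num_crops(1024, 768, crop_size=448))
# One minute of audio at an 80 ms token rate -> 750 tokens, matching the text above.
print(speech_tokens(60.0))
```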

Training Pipeline

The multimodal training stages include vision training, speech/audio training, and vision-speech joint training.

  • Vision Training: Vision training follows a four-stage process:

    1. Projector Alignment stage
    2. Joint Vision Training stage
    3. Generative Vision-Language Training stage
    4. Multi-Frame Training stage
  • Speech and Audio Training: A two-stage paradigm is used for speech and audio training: speech/audio pre-training and post-training. In the pre-training stage, large-scale automatic speech recognition (ASR) data is used to align the audio encoder and Phi-4-Mini in the semantic space. After pre-training, the model is trained with speech and audio SFT samples in the speech post-training stage.

  • Vision-speech Joint Training: The vision-speech joint training is conducted after vision post-training and speech post-training. The vision adapter $LoRA_V$, the vision encoder, and the vision projector are fine-tuned.
  • Reasoning Training: The continued training of Phi-4-Mini for reasoning proceeds in three stages:

    1. Pre-training on approximately 60 billion reasoning CoT tokens generated by frontier reasoning LLMs, after which rejection sampling is employed to filter the generated traces (a filtering sketch follows this list).
    2. Fine-tuning on a dataset of around 200K high-quality CoT samples.
    3. DPO training using a dataset of 300K preference samples.
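As a loose illustration of the rejection-sampling step (hypothetical helper names; the paper's actual filtering criteria are not reproduced here), one common pattern keeps only sampled CoT traces that pass a correctness check against a reference answer:

```python
# Hypothetical rejection-sampling filter for CoT data (illustrative; not the paper's pipeline).
from typing import Callable, Dict, Iterable, List, Tuple

def rejection_sample(
    prompts: Iterable[str],
    generate: Callable[[str], List[str]],     # samples several CoT candidates per prompt
    is_correct: Callable[[str, str], bool],   # checks a candidate against a reference answer
    references: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Keep only (prompt, CoT) pairs whose candidate passes the correctness check."""
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt):
            if is_correct(candidate, references[prompt]):
                kept.append((prompt, candidate))
    return kept
```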

Data and Training Details

Language Training Data

Compared with Phi-3.5-Mini, the quality of the pre-training data has been improved through better data filtering, better math and coding data, better synthetic data, and a better data mixture. The resulting 5-trillion-token pre-training corpus is larger and of higher quality than that used for Phi-3.5-Mini.

Vision-Language Training Data

The Phi-4-Multimodal model's pre-training phase encompasses interleaved image-text documents, image-text pairs, image grounding data, synthetic datasets from OCR of PDFs and images, and synthesized datasets for chart comprehension, totaling 0.5T tokens. For supervised fine-tuning (SFT), a combination of a text SFT dataset, multimodal instruction tuning datasets, and large-scale in-house multimodal instruction tuning datasets was used.

Vision-Speech Training Data

For vision-speech data, the Phi-4-Multimodal model is trained on a set of synthetic vision-speech data covering single-frame and multi-frame scenarios. A subset of the vision-language SFT data is reused, and an in-house text-to-speech (TTS) engine converts the user queries from text to audio.

Speech and Audio Training Data

The training data for speech/audio functions includes: pre-training data with ASR transcriptions, and post-training data to unlock the instruction-following capability of Phi-4-Multimodal with the speech/audio modality involved.

Evaluation

Multimodal Benchmarks

Phi-4-Multimodal shows improvements over Phi-3.5-Vision and outperforms baseline models of similar sizes. In chart understanding and science reasoning tasks, Phi-4-Multimodal surpasses Gemini and GPT-4o. On vision-speech benchmarks, Phi-4-Multimodal outperforms InternOmni and Gemini-2.0-Flash.

Speech and Audio Benchmarks

Phi-4-Multimodal achieves strong ASR and AST performance, surpassing WhisperV3 and SeamlessM4T-large-v2 on the CommonVoice, FLEURS, OpenASR, and CoVoST2 test sets. Its WER is 5.5% better, in relative terms, than that of the best model on the Hugging Face OpenASR leaderboard. Speech summarization quality is close to that of GPT-4o.
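For reference, relative WER improvement is computed as shown below (the baseline and model WERs here are made-up example values, not the leaderboard's actual scores):

```python
# Relative WER improvement, with made-up example values (not the leaderboard's numbers).
def relative_wer_improvement(wer_baseline: float, wer_model: float) -> float:
    """Fractional reduction in WER relative to the baseline."""
    return (wer_baseline - wer_model) / wer_baseline

print(f"{relative_wer_improvement(wer_baseline=6.50, wer_model=6.14):.1%}")  # ~5.5%
```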

Language Benchmarks

Language

Phi-4-Mini outperforms similarly sized models and is on par with models that have twice as many parameters. It excels on math- and reasoning-related benchmarks and shows significantly improved instruction following and function calling, along with strong coding performance.

Coding

Across coding benchmarks, Phi-4-Mini outperforms all 3B-scale models and the 8B-scale models except for Qwen2.5.

CoT Reasoning

The reasoning-enhanced model outperforms DeepSeek-R1-Distill-Llama-8B, Bespoke-Stratos-7B, and OpenThinker-7B, and achieves performance comparable to DeepSeek-R1-Distill-Qwen-7B.

Safety

Phi-4-Mini and Phi-4-Multimodal were developed following Microsoft’s responsible AI principles. The approach consisted of safety alignment in post-training, red-teaming, automated testing, and evaluations.

Text Safety

An independent red team identified areas for improvement during the post-training process. Systematic safety evaluations were carried out using Microsoft's Azure AI Evaluation SDK; the models are on par with other models of similar size.

Audio Safety

For the audio safety alignment of Phi-4-Multimodal, audio safety datasets were obtained by performing TTS synthesis on the text safety datasets. For audio safety evaluations, Microsoft's Speech Fairness evaluation was used to verify that speech-to-text transcription worked across a variety of demographics, and a custom evaluation was implemented to assess whether the model would infer sensitive attributes (SAs) from a user's voice.

Vision Safety

To assess model safety in scenarios involving both text and images, Microsoft's Azure AI Evaluation SDK was utilized. Vision safety metrics of Phi-4-Multimodal were compared with Phi-3.5-Vision, the open-source models LLaVA-1.6 and Qwen-VL-Chat, as well as GPT-4V.
