Apple Intelligence Foundation Language Models: Tech Report 2025 (2507.13575v1)
Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
Summary
- The paper introduces two key models: a 3B-parameter on-device model using 2-bit quantization and a scalable server model with a novel Parallel-Track Mixture-of-Experts architecture.
- It details innovative optimizations such as KV cache sharing, interleaved global-local attention, and adaptive compression techniques that reduce computation and memory usage.
- The study emphasizes practical deployment in multilingual and multimodal environments, enabling low-latency, privacy-centric AI integration and enhanced developer tooling.
Technical Overview and Implications of "Apple Intelligence Foundation Language Models: Tech Report 2025"
The "Apple Intelligence Foundation Language Models: Tech Report 2025" presents a comprehensive account of the design, training, optimization, and deployment of Apple's new generation of multilingual, multimodal foundation models. The report details two principal models: a highly efficient ~3B-parameter on-device model and a scalable server-side model based on a novel Parallel-Track Mixture-of-Experts (PT-MoE) architecture, both powering Apple Intelligence features across devices and services.
Model Architectures and Innovations
On-Device Model
The on-device model is engineered for low-latency, resource-constrained environments, leveraging several architectural and systems-level optimizations:
- KV Cache Sharing: The transformer is partitioned into two blocks, with the second block omitting its own key/value projections and reusing the KV cache produced by the first. This reduces KV-cache memory and prefill computation by 37.5%, directly improving time-to-first-token and overall inference efficiency (a short worked calculation follows this list).
- 2-bit Quantization-Aware Training (QAT): The model is quantized to 2 bits per weight, with QAT and a learnable scaling factor to mitigate quantization-induced degradation. The report introduces an iterative initialization for the scaling factor, improving stability and convergence in the low-bit regime (a generic fake-quantization sketch also follows this list).
- Vision Encoder: The on-device model incorporates a ViTDet-L backbone with a Register-Window mechanism, enabling efficient extraction of both local and global visual features. The vision-language adaptation module compresses visual features into a fixed number of image tokens, aligning them with the LLM's token space.
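The 37.5% figure follows from simple proportionality: if a fraction r of the transformer layers reuse the first block's KV cache instead of producing their own, the KV entries that must be stored and prefilled shrink by exactly that fraction. The layer split itself is not stated in this summary, so treating the shared block as 3/8 of the depth is an assumption made here to match the quoted number:

$$
\text{KV-cache reduction} = r = \frac{\text{layers reusing the shared cache}}{\text{total layers}} = \frac{3}{8} = 37.5\%
$$

The 2-bit QAT item can likewise be illustrated generically. The Swift sketch below shows symmetric fake-quantization to the four 2-bit levels with a learnable scale, which is the standard shape of such schemes; it is a minimal illustration under generic assumptions, not the report's implementation (which additionally uses an iterative scale initialization and, per the compression section, EMA smoothing).

```swift
import Foundation

/// Generic 2-bit symmetric fake-quantization with a learnable scale.
/// The 2-bit signed levels are {-2, -1, 0, 1}. Illustrative only; not the
/// exact scheme from the report.
struct TwoBitFakeQuantizer {
    /// Learnable scale (an optimizer would update this during QAT).
    var scale: Float

    /// Simple max-abs initialization; the report instead describes an
    /// iterative initialization for better low-bit stability.
    init(weights: [Float]) {
        let maxAbs = weights.map { abs($0) }.max() ?? 1.0
        scale = maxAbs / 2.0
    }

    /// Forward pass: round to integer levels, clamp to the 2-bit range,
    /// then dequantize. Training would back-propagate through the rounding
    /// with a straight-through estimator.
    func fakeQuantize(_ weights: [Float]) -> [Float] {
        weights.map { w in
            let level = min(max((w / scale).rounded(), -2), 1)
            return level * scale
        }
    }
}

// Usage on a toy weight vector.
let toyWeights: [Float] = [0.31, -0.87, 0.02, 1.20, -0.45]
let quantizer = TwoBitFakeQuantizer(weights: toyWeights)
print(quantizer.fakeQuantize(toyWeights))
```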
Server Model
The server-side model is designed for high-throughput, scalable inference on Apple's Private Cloud Compute (PCC) platform:
- Parallel-Track Transformer: The model is partitioned into multiple independent "tracks," each a stack of transformer layers, with synchronization only at block boundaries. This enables efficient parallel execution and reduces synchronization overhead compared to tensor parallelism.
- PT-MoE: Mixture-of-Experts layers are introduced within each track, with local experts and top-k routing implemented via grouped GEMM. This design allows for efficient scaling and increased sparsity, reducing compute cost while maintaining model quality (a minimal routing sketch follows this list).
- Interleaved Global-Local Attention: The architecture alternates between local (sliding window) and global attention layers, with the latter omitting positional embeddings to improve length generalization and reduce KV cache size for long-context inference.
- Vision Encoder: The server model uses a ViT-g backbone, pre-trained with CLIP and further aligned with the LLM decoder for robust multimodal understanding.
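The track layout, expert counts, and grouped-GEMM kernels are not detailed in this summary, but the top-k routing step itself is standard and can be sketched for a single token. The Swift snippet below is a generic illustration, not the report's implementation: a router scores each expert, the top-k experts are selected, and their outputs are combined with softmax-normalized weights.

```swift
import Foundation

/// Minimal top-k mixture-of-experts routing for a single token.
/// Each "expert" stands in for a per-expert feed-forward network.
func routeTopK(
    token: [Float],
    routerLogits: [Float],
    experts: [([Float]) -> [Float]],
    k: Int
) -> [Float] {
    // Rank experts by router score and keep the top k.
    let ranked = routerLogits.enumerated().sorted { $0.element > $1.element }
    let selected = Array(ranked.prefix(k))

    // Softmax over the selected logits only (a common renormalization choice).
    let maxLogit = selected.map { $0.element }.max() ?? 0
    let expWeights = selected.map { Float(exp(Double($0.element - maxLogit))) }
    let total = expWeights.reduce(0, +)

    // Combine the chosen experts' outputs, weighted by the normalized scores.
    var output = [Float](repeating: 0, count: token.count)
    for (slot, choice) in selected.enumerated() {
        let gate = expWeights[slot] / total
        let expertOut = experts[choice.offset](token)
        for i in output.indices { output[i] += gate * expertOut[i] }
    }
    return output
}

// Usage: four toy experts that simply scale the token by 1...4.
let toyExperts: [([Float]) -> [Float]] = (1...4).map { factor in
    { (x: [Float]) -> [Float] in x.map { $0 * Float(factor) } }
}
print(routeTopK(token: [1, 2, 3],
                routerLogits: [0.1, 2.0, -1.0, 1.5],
                experts: toyExperts,
                k: 2))
```

In a batched implementation, grouped GEMM performs the same per-expert computation by gathering each expert's assigned tokens and running them through one matrix multiply per expert, which is what makes the sparse computation efficient on accelerators.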
Data Pipeline and Training
The models are trained on a diverse, large-scale multilingual and multimodal corpus, sourced via responsible web crawling (Applebot), licensed datasets, and high-quality synthetic data. Notable aspects include:
- Advanced Web Crawling: Applebot employs headless rendering and interaction simulation to extract high-fidelity content, including dynamic and text-rich web pages.
- LLM-Assisted Extraction and Filtering: LLMs are used in the data pipeline to improve main content extraction and filtering, reducing reliance on heuristics and increasing retention of high-quality tokens.
- Synthetic Data Generation: In-house models generate detailed image captions and domain-specific QAs, enhancing the richness of multimodal training data.
- Multilingual Expansion: The tokenizer vocabulary is extended to 150k tokens, and data mixture weights are adjusted to increase representation for low-resource languages.
Pre-training and Post-training
- Pre-training: The server model is trained on 13.4T tokens using AXLearn, with a combination of data, sharded, and track parallelism. The on-device model is pre-trained with a distillation loss; its teacher is produced by a sparse-upcycling pipeline that converts a pre-trained dense model into an MoE variant, reducing the cost of teacher training by roughly 90% (a generic distillation-loss sketch follows this list).
- Continued Pre-training: Both models undergo further pre-training for code, math, multilingual, and long-context capabilities, as well as multimodal adaptation.
- Supervised Fine-Tuning (SFT): SFT is scaled with a mix of human and synthetic data, emphasizing vision, reasoning, OCR, and grounding. High-resolution image understanding is enabled via a tiled input strategy.
- RLHF: A distributed, asynchronous RL infrastructure is introduced, supporting diverse reward signals and reducing compute time by 75% compared to synchronous systems. Prompt selection is improved via a novel cohesion-based algorithm, yielding measurable gains on both automatic and human benchmarks.
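The exact distillation recipe is not given in this summary, but token-level distillation is conventionally a temperature-softened KL term between the teacher's and student's output distributions. The Swift snippet below is a generic sketch under that assumption, not the report's formulation.

```swift
import Foundation

/// Numerically stabilized softmax over logits.
func softmax(_ logits: [Double]) -> [Double] {
    let m = logits.max() ?? 0
    let exps = logits.map { exp($0 - m) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

/// Generic distillation loss for one token position:
/// KL(teacher || student) over the vocabulary, softened by a temperature.
func distillationLoss(teacherLogits: [Double],
                      studentLogits: [Double],
                      temperature: Double = 1.0) -> Double {
    let p = softmax(teacherLogits.map { $0 / temperature })   // teacher distribution
    let q = softmax(studentLogits.map { $0 / temperature })   // student distribution
    var kl = 0.0
    for i in p.indices where p[i] > 0 {
        kl += p[i] * log(p[i] / q[i])
    }
    return kl * temperature * temperature   // standard temperature-squared scaling
}

// Usage on a toy 4-token vocabulary.
print(distillationLoss(teacherLogits: [2.0, 0.5, -1.0, 0.0],
                       studentLogits: [1.5, 0.7, -0.5, 0.1],
                       temperature: 2.0))
```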
Model Compression and Optimization
- On-Device Model: 2-bit QAT is used, with learnable scaling and EMA smoothing. The embedding table and KV cache are quantized to 4 and 8 bits, respectively.
- Server Model: Weights are compressed post-training using Adaptive Scalable Texture Compression (ASTC), leveraging Apple GPU hardware for on-the-fly decompression at 3.56 bits-per-weight. LoRA adapters are used to recover quality lost during compression, with singular vectors separated out prior to ASTC to minimize error (a worked bit-rate calculation follows this list).
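The 3.56 bits-per-weight figure follows from ASTC's fixed block size: every ASTC block occupies 128 bits regardless of its footprint, so the rate is 128 bits divided by the number of values packed into one block. The 6×6 footprint below is an assumption inferred from the arithmetic, not a detail stated in this summary:

$$
\text{bits per weight} = \frac{128\ \text{bits per ASTC block}}{6 \times 6\ \text{weights per block}} = \frac{128}{36} \approx 3.56
$$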
Developer Framework
The Foundation Models framework exposes the on-device model to developers via a Swift-centric API, supporting:
- Guided Generation: Constrained decoding is integrated with the Swift type system, allowing developers to generate structured data directly from model outputs (see the Swift sketch after this list).
- Tool Calling: The framework guarantees structural correctness of tool calls, with post-training on tool-use data to improve reliability.
- LoRA Adapter Fine-Tuning: A Python toolkit enables adapter training for advanced use cases, with seamless integration into the framework.
- Integrated Tooling: Xcode support includes prompt engineering playgrounds, performance profiling, and simulator integration.
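To make the developer-facing surface concrete, here is a minimal guided-generation sketch in the spirit of the Swift-centric framework described above. The type and method names (`LanguageModelSession`, `@Generable`, `@Guide`, `respond(to:generating:)`) reflect Apple's publicly documented Foundation Models API as of 2025 rather than anything quoted in the report, so treat them as an approximation.

```swift
import FoundationModels

// A Swift type the model is constrained to produce via guided generation.
@Generable
struct FlightSummary {
    @Guide(description: "IATA code of the departure airport")
    var origin: String

    @Guide(description: "IATA code of the arrival airport")
    var destination: String

    @Guide(description: "Number of stops, 0 for nonstop")
    var stops: Int
}

func summarizeFlight() async throws {
    // A session wraps the on-device foundation model.
    let session = LanguageModelSession()

    // Constrained decoding guarantees the response parses into FlightSummary.
    let response = try await session.respond(
        to: "Summarize: nonstop from San Francisco to Tokyo Haneda.",
        generating: FlightSummary.self
    )
    print(response.content.origin, response.content.destination, response.content.stops)
}
```

The tool-calling guarantee mentioned above rests on the same constrained-decoding machinery, applied to tool names and argument structures rather than to a developer-defined output type.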
Evaluation and Results
The models are evaluated on standard benchmarks (MMLU, MMMLU, MGSM) and human preference studies:
- On-Device Model: Outperforms Qwen-2.5-3B and Gemma-3-4B on MMLU/MMMLU, is competitive on MGSM, but lags behind the larger Qwen-3-4B.
- Server Model: Slightly behind LLaMA 4 Scout and further behind Qwen-3-235B and GPT-4o, but achieves strong efficiency and cost metrics.
- Compression Impact: 2-bit quantization and ASTC introduce minor quality drops (e.g., MMLU from 67.8 to 64.4 for on-device), but with substantial gains in throughput and memory footprint.
- Multimodal Evaluation: The models demonstrate competitive or superior performance to comparably sized open models on image understanding tasks, with robust handling of text-rich and high-resolution images.
Responsible AI and Privacy
The report emphasizes a Responsible AI approach, with:
- Data Privacy: No private user data is used for training; all training data is filtered for PII and unsafe content.
- Locale-Specific Evaluation: Human evaluation is tailored to locale and language, with specific attention to cultural and linguistic appropriateness.
- Safety Taxonomy and Guardrails: Built-in safety mechanisms, including content filtering and override lists, are integrated at both the model and framework levels.
- Continuous Monitoring: User and developer feedback is actively incorporated into ongoing model and feature improvements.
Implications and Future Directions
Practical Implications
- On-Device AI: The deployment of a performant, 2-bit quantized, ~3B-parameter model on consumer hardware sets a new bar for private, low-latency AI inference, enabling a broad range of applications without server dependency.
- Developer Enablement: The Foundation Models framework lowers the barrier for integrating advanced generative AI into apps, with strong guarantees on output structure and safety.
- Multimodal and Multilingual Reach: The models' robust handling of images and support for 16 languages position them for global, cross-modal applications, including accessibility, productivity, and creative tools.
Theoretical and Research Implications
- PT-MoE Architecture: The Parallel-Track Mixture-of-Experts design offers a new paradigm for scaling transformers with reduced synchronization overhead, potentially influencing future large-scale model architectures.
- Compression Techniques: The combination of QAT, ASTC, and LoRA-based quality recovery demonstrates a practical path for deploying large models in resource-constrained settings without prohibitive quality loss.
- RLHF Infrastructure: The asynchronous, distributed RLHF system and cohesion-based prompt selection may inform best practices for scalable, efficient preference optimization in LLMs.
Future Developments
- Further Model Scaling: The PT-MoE approach and hardware-aware compression techniques could be extended to even larger models as device and server hardware evolve.
- Expanded Modalities: The vision-language adaptation pipeline is extensible to other modalities (e.g., audio, video), suggesting a path toward more general foundation models.
- Federated and On-Device Training: With privacy-preserving infrastructure in place, future work may explore federated or on-device continual learning.
- Safety and Alignment: The integration of locale-specific safety evaluation and guardrails will likely become standard as models are deployed globally and in sensitive domains.
Conclusion
The Apple Intelligence Foundation LLMs represent a significant advance in the practical deployment of efficient, multilingual, and multimodal foundation models. The technical innovations in architecture, training, compression, and developer tooling are tightly coupled with a responsible, privacy-centric approach, setting a strong precedent for future AI systems deployed at scale. The report's detailed exposition of both engineering and responsible AI practices provides a valuable reference for researchers and practitioners aiming to balance capability, efficiency, and safety in real-world AI deployments.
Follow-up Questions
- How does the on-device model achieve low latency with 2-bit quantization and KV cache sharing?
- What are the key benefits of the Parallel-Track Mixture-of-Experts architecture in scalable inference?
- How do the compression techniques impact model quality and efficiency for both on-device and server models?
- In what ways does the integrated Swift-centric API facilitate safe tool calling and developer enablement?