
Apple Intelligence Foundation Language Models: Tech Report 2025 (2507.13575v1)

Published 17 Jul 2025 in cs.LG and cs.AI

Abstract: We introduce two multilingual, multimodal foundation LLMs that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

Summary

  • The paper introduces two key models: a 3B-parameter on-device model using 2-bit quantization and a scalable server model with a novel Parallel-Track Mixture-of-Experts architecture.
  • It details innovative optimizations such as KV cache sharing, interleaved global-local attention, and adaptive compression techniques that reduce computation and memory usage.
  • The study emphasizes practical deployment in multilingual and multimodal environments, enabling low-latency, privacy-centric AI integration and enhanced developer tooling.

Technical Overview and Implications of "Apple Intelligence Foundation Language Models: Tech Report 2025"

The "Apple Intelligence Foundation LLMs: Tech Report 2025" presents a comprehensive account of the design, training, optimization, and deployment of Apple's new generation of multilingual, multimodal foundation models. The report details two principal models: a highly efficient ~3B-parameter on-device model and a scalable server-side model based on a novel Parallel-Track Mixture-of-Experts (PT-MoE) architecture, both powering Apple Intelligence features across devices and services.

Model Architectures and Innovations

On-Device Model

The on-device model is engineered for low-latency, resource-constrained environments, leveraging several architectural and systems-level optimizations:

  • KV Cache Sharing: The transformer is partitioned into two blocks, with the second block omitting key/value projections and reusing the KV cache from the first. This reduces memory usage and prefill computation by 37.5%, directly improving time-to-first-token and overall inference efficiency (sketched in code after this list).
  • 2-bit Quantization-Aware Training (QAT): The model is quantized to 2 bits-per-weight, with QAT and a learnable scaling factor to mitigate quantization-induced degradation. The report introduces an iterative initialization for the scaling factor, improving stability and convergence in low-bit regimes (also sketched after this list).
  • Vision Encoder: The on-device model incorporates a ViTDet-L backbone with a Register-Window mechanism, enabling efficient extraction of both local and global visual features. The vision-language adaptation module compresses visual features into a fixed number of image tokens, aligning them with the LLM's token space.
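
The KV-cache sharing idea can be illustrated with a minimal PyTorch sketch. All class names, dimensions, and layer counts here are hypothetical, and which first-block layer's cache gets reused is simplified (the last owning layer); this is not the report's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Attention layer that either produces a KV cache or reuses a shared one.

    Layers in the first block (owns_kv=True) compute and expose K/V; layers
    in the second block (owns_kv=False) have no K/V projections at all and
    attend over the cache shared by the first block.
    """

    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.owns_kv = owns_kv
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, shared_cache: dict) -> torch.Tensor:
        B, T, _ = x.shape

        def split(t):  # (B, T, d_model) -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(x))
        if self.owns_kv:
            k, v = split(self.k_proj(x)), split(self.v_proj(x))
            shared_cache["k"], shared_cache["v"] = k, v   # expose to later layers
        else:
            k, v = shared_cache["k"], shared_cache["v"]   # reuse; no own K/V
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

# Toy stack: the first two layers own their KV cache, the last two reuse it.
layers = [SharedKVAttention(64, 4, owns_kv=(i < 2)) for i in range(4)]
x, cache = torch.randn(1, 8, 64), {}
for layer in layers:
    x = layer(x, cache)
```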
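
A minimal sketch of 2-bit quantization-aware training with a learnable scale and a straight-through estimator follows. It assumes a simple max-abs scale initialization rather than the report's iterative procedure, and the names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class FakeQuant2Bit(nn.Module):
    """Simulated 2-bit symmetric weight quantization for QAT.

    A learnable per-output-channel scale maps weights onto the 2-bit grid
    {-2, -1, 0, 1}; a straight-through estimator lets gradients flow through
    the rounding step to both the weights and the scale.
    """

    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # Simple max-abs initialization of the scale (the report describes a
        # more careful iterative initialization; this is just a placeholder).
        init = weight.detach().abs().amax(dim=1, keepdim=True) / 2.0
        self.scale = nn.Parameter(init.clamp(min=1e-8))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        w_s = weight / self.scale
        # Straight-through estimator applied to the non-differentiable rounding.
        q = w_s + (torch.clamp(torch.round(w_s), -2, 1) - w_s).detach()
        return q * self.scale

# Toy usage on a small linear layer.
lin = nn.Linear(16, 8, bias=False)
fq = FakeQuant2Bit(lin.weight)
x = torch.randn(4, 16)
y = x @ fq(lin.weight).t()   # forward pass with simulated 2-bit weights
y.sum().backward()           # gradients reach both lin.weight and fq.scale
```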

Server Model

The server-side model is designed for high-throughput, scalable inference on Apple's Private Cloud Compute (PCC) platform:

  • Parallel-Track Transformer: The model is partitioned into multiple independent "tracks," each a stack of transformer layers, with synchronization only at block boundaries. This enables efficient parallel execution and reduces synchronization overhead compared to tensor parallelism.
  • PT-MoE: Mixture-of-Experts layers are introduced within each track, with local experts and top-k routing implemented via grouped GEMM. This design allows for efficient scaling and increased sparsity, reducing compute cost while maintaining model quality (a routing sketch follows this list).
  • Interleaved Global-Local Attention: The architecture alternates between local (sliding window) and global attention layers, with the latter omitting positional embeddings to improve length generalization and reduce KV cache size for long-context inference (also sketched after this list).
  • Vision Encoder: The server model uses a ViT-g backbone, pre-trained with CLIP and further aligned with the LLM decoder for robust multimodal understanding.
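
Top-k expert routing in a mixture-of-experts feed-forward layer can be sketched as follows. The loop over experts stands in for the grouped-GEMM execution the report describes, and the sizes are toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts FFN with top-k routing.

    Each token is routed to its top-k experts; outputs are combined with the
    softmax-normalized router weights. Production systems batch tokens per
    expert and use grouped GEMMs; the explicit loop keeps the sketch readable.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # (N, d_model)
        logits = self.router(tokens)                        # (N, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)          # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (idx == e).nonzero(as_tuple=True)
            if token_pos.numel():
                out[token_pos] += weights[token_pos, slot, None] * expert(tokens[token_pos])
        return out.reshape_as(x)

moe = TopKMoE(d_model=64, d_ff=256, n_experts=4, k=2)
y = moe(torch.randn(2, 8, 64))   # (batch, seq, d_model) in, same shape out
```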
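
Interleaving local and global attention can be illustrated by how the attention masks differ between the two layer types. The local/global ratio and window size below are hypothetical, and the positional-encoding handling is only indicated, not implemented.

```python
from typing import Optional

import torch

def causal_mask(seq_len: int, window: Optional[int] = None) -> torch.Tensor:
    """Boolean mask where True marks key positions a query may attend to.

    window=None gives full causal (global) attention; an integer window
    restricts each query to the most recent `window` keys (local attention).
    """
    i = torch.arange(seq_len)[:, None]  # query positions
    j = torch.arange(seq_len)[None, :]  # key positions
    mask = j <= i                       # causal constraint
    if window is not None:
        mask &= (i - j) < window        # sliding-window constraint
    return mask

# Hypothetical schedule: three local (sliding-window, RoPE) layers for every
# global layer, with the global layers using no positional encoding at all.
layer_plan = [("local", 8, "rope")] * 3 + [("global", None, "none")]
for kind, window, pos_enc in layer_plan:
    m = causal_mask(seq_len=16, window=window)
    print(kind, "| keys visible to last query:", int(m[-1].sum()), "| positional encoding:", pos_enc)
```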

Data Pipeline and Training

The models are trained on a diverse, large-scale multilingual and multimodal corpus, sourced via responsible web crawling (Applebot), licensed datasets, and high-quality synthetic data. Notable aspects include:

  • Advanced Web Crawling: Applebot employs headless rendering and interaction simulation to extract high-fidelity content, including dynamic and text-rich web pages.
  • LLM-Assisted Extraction and Filtering: LLMs are used in the data pipeline to improve main content extraction and filtering, reducing reliance on heuristics and increasing retention of high-quality tokens (see the filtering sketch after this list).
  • Synthetic Data Generation: In-house models generate detailed image captions and domain-specific QAs, enhancing the richness of multimodal training data.
  • Multilingual Expansion: The tokenizer is extended to 150k tokens, and mixture weights are adjusted to increase representation for low-resource languages.
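
A minimal sketch of the two-stage, model-assisted filtering pattern described above: cheap heuristics first, then a model-based quality score. The fields, threshold, and stand-in scorer are assumptions for illustration, not details from the report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Document:
    url: str
    text: str

def heuristic_prefilter(doc: Document) -> bool:
    """Cheap rule-based gate applied before any model-based scoring."""
    return len(doc.text) > 200 and not doc.text.isspace()

def filter_corpus(
    docs: Iterable[Document],
    quality_scorer: Callable[[Document], float],
    threshold: float = 0.7,
) -> List[Document]:
    """Two-stage filtering: heuristics first, then an LLM-based quality score."""
    return [
        d for d in docs
        if heuristic_prefilter(d) and quality_scorer(d) >= threshold
    ]

# Stand-in scorer so the sketch runs; a real pipeline would call a small
# quality-classification model here instead of a constant.
docs = [Document("https://example.com/a", "Some extracted main content. " * 20)]
kept = filter_corpus(docs, quality_scorer=lambda d: 0.9)
```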

Pre-training and Post-training

  • Pre-training: The server model is trained on 13.4T tokens using AXLearn, with a combination of data, sharded, and track parallelism. The on-device model employs a sparse-upcycling pipeline, distilling a dense model into a MoE variant and retraining with distillation loss, reducing training cost by 90% (a minimal distillation-loss sketch follows this list).
  • Continued Pre-training: Both models undergo further pre-training for code, math, multilingual, and long-context capabilities, as well as multimodal adaptation.
  • Supervised Fine-Tuning (SFT): SFT is scaled with a mix of human and synthetic data, emphasizing vision, reasoning, OCR, and grounding. High-resolution image understanding is enabled via a tiled input strategy.
  • RLHF: A distributed, asynchronous RL infrastructure is introduced, supporting diverse reward signals and reducing compute time by 75% compared to synchronous systems. Prompt selection is improved via a novel cohesion-based algorithm, yielding measurable gains on both auto and human benchmarks.
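
In its generic form, the distillation loss referenced above mixes hard-label cross-entropy with a KL term against the teacher's distribution; the weighting and temperature below are illustrative defaults, not values from the report.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """Weighted sum of hard-label cross-entropy and KL to the teacher.

    alpha balances the two terms and T is the softmax temperature; both are
    illustrative defaults rather than reported hyperparameters.
    """
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl

# Toy example: a batch of 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```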

Model Compression and Optimization

  • On-Device Model: 2-bit QAT is used, with learnable scaling and EMA smoothing. The embedding table and KV cache are quantized to 4 and 8 bits, respectively.
  • Server Model: Weights are compressed post-training using Adaptive Scalable Texture Compression (ASTC), leveraging Apple GPU hardware for on-the-fly decompression at 3.56 bits-per-weight. LoRA adapters are used to recover quality lost during compression, with singular vectors separated prior to ASTC to minimize error (a recovery-adapter sketch follows below).
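
A minimal sketch of quality recovery with a LoRA adapter on top of a frozen, lossily compressed weight: the compressed matrix stands in for ASTC-decoded weights (simulated here by coarse rounding, since ASTC itself is a GPU texture codec), and only the low-rank factors are trained.

```python
import torch
import torch.nn as nn

class LoRARecoveryLinear(nn.Module):
    """Frozen (lossily compressed) weight plus a trainable low-rank correction.

    Only the LoRA factors A and B are trained, to recover quality lost to
    the compression of the base weight.
    """

    def __init__(self, compressed_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_f, in_f = compressed_weight.shape
        self.register_buffer("w_c", compressed_weight)        # frozen base weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))  # start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.w_c + self.lora_b @ self.lora_a).t()

# Simulated compression: coarse rounding of an original weight matrix.
w = torch.randn(32, 64)
w_compressed = torch.round(w * 4) / 4
layer = LoRARecoveryLinear(w_compressed, rank=4)
x = torch.randn(8, 64)

# Train the adapter to match the original layer's outputs on sample data.
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
for _ in range(100):
    loss = ((layer(x) - x @ w.t()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```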

Developer Framework

The Foundation Models framework exposes the on-device model to developers via a Swift-centric API, supporting:

  • Guided Generation: Constrained decoding is integrated with the Swift type system, allowing developers to generate structured data directly from model outputs (the underlying idea is sketched after this list).
  • Tool Calling: The framework guarantees structural correctness of tool calls, with post-training on tool-use data to improve reliability.
  • LoRA Adapter Fine-Tuning: A Python toolkit enables adapter training for advanced use cases, with seamless integration into the framework.
  • Integrated Tooling: Xcode support includes prompt engineering playgrounds, performance profiling, and simulator integration.
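
The framework itself implements guided generation in Swift against the type system; as a language-agnostic illustration of the underlying constrained-decoding idea, the toy Python sketch below keeps the generated string a valid prefix of an allowed set of values. The option set and sampling function are stand-ins for a schema derived from a Swift type and for the model's masked token distribution.

```python
import random

def constrained_generate(options, sample_char):
    """Generation constrained so the output always remains a valid prefix.

    `options` is the set of strings the output must match; `sample_char`
    picks among the allowed next characters, standing in for sampling from
    the model's (masked) token distribution.
    """
    out = ""
    while out not in options:
        allowed = {o[len(out)] for o in options if o.startswith(out) and len(o) > len(out)}
        out += sample_char(sorted(allowed))
    return out

# Toy schema: an enum-like field that must be one of three values.
choices = {"small", "medium", "large"}
print(constrained_generate(choices, sample_char=random.choice))
```

In the actual framework, the analogous constraint comes from a Swift type annotated for guided generation, so structural validity of the output is guaranteed by construction rather than checked after the fact.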

Evaluation and Results

The models are evaluated on standard benchmarks (MMLU, MMMLU, MGSM) and human preference studies:

  • On-Device Model: Outperforms Qwen-2.5-3B and Gemma-3-4B on MMLU/MMMLU and is competitive on MGSM, but lags behind the larger Qwen-3-4B.
  • Server Model: Trails LLaMA 4 Scout slightly, and Qwen-3-235B and GPT-4o by a larger margin, but achieves strong efficiency and cost characteristics.
  • Compression Impact: 2-bit quantization and ASTC introduce minor quality drops (e.g., MMLU from 67.8 to 64.4 for on-device), but with substantial gains in throughput and memory footprint.
  • Multimodal Evaluation: The models demonstrate competitive or superior performance to comparably sized open models on image understanding tasks, with robust handling of text-rich and high-resolution images.

Responsible AI and Privacy

The report emphasizes a Responsible AI approach, with:

  • Data Privacy: No user data is used for training; all data is filtered for PII and unsafe content.
  • Locale-Specific Evaluation: Human evaluation is tailored to locale and language, with specific attention to cultural and linguistic appropriateness.
  • Safety Taxonomy and Guardrails: Built-in safety mechanisms, including content filtering and override lists, are integrated at both the model and framework levels.
  • Continuous Monitoring: User and developer feedback is actively incorporated into ongoing model and feature improvements.

Implications and Future Directions

Practical Implications

  • On-Device AI: The deployment of a performant, 2-bit quantized, ~3B-parameter model on consumer hardware sets a new bar for private, low-latency AI inference, enabling a broad range of applications without server dependency.
  • Developer Enablement: The Foundation Models framework lowers the barrier for integrating advanced generative AI into apps, with strong guarantees on output structure and safety.
  • Multimodal and Multilingual Reach: The models' robust handling of images and support for 16 languages position them for global, cross-modal applications, including accessibility, productivity, and creative tools.

Theoretical and Research Implications

  • PT-MoE Architecture: The Parallel-Track Mixture-of-Experts design offers a new paradigm for scaling transformers with reduced synchronization overhead, potentially influencing future large-scale model architectures.
  • Compression Techniques: The combination of QAT, ASTC, and LoRA-based quality recovery demonstrates a practical path for deploying large models in resource-constrained settings without prohibitive quality loss.
  • RLHF Infrastructure: The asynchronous, distributed RLHF system and cohesion-based prompt selection may inform best practices for scalable, efficient preference optimization in LLMs.

Future Developments

  • Further Model Scaling: The PT-MoE approach and hardware-aware compression techniques could be extended to even larger models as device and server hardware evolve.
  • Expanded Modalities: The vision-language adaptation pipeline is extensible to other modalities (e.g., audio, video), suggesting a path toward more general foundation models.
  • Federated and On-Device Training: With privacy-preserving infrastructure in place, future work may explore federated or on-device continual learning.
  • Safety and Alignment: The integration of locale-specific safety evaluation and guardrails will likely become standard as models are deployed globally and in sensitive domains.

Conclusion

The Apple Intelligence Foundation LLMs represent a significant advance in the practical deployment of efficient, multilingual, and multimodal foundation models. The technical innovations in architecture, training, compression, and developer tooling are tightly coupled with a responsible, privacy-centric approach, setting a strong precedent for future AI systems deployed at scale. The report's detailed exposition of both engineering and responsible AI practices provides a valuable reference for researchers and practitioners aiming to balance capability, efficiency, and safety in real-world AI deployments.
