
Apriel-1.5-15B-Thinker: Efficient Multimodal Reasoning

Updated 4 October 2025
  • Apriel-1.5-15B-Thinker is a 15-billion parameter multimodal reasoning model that integrates vision and language using an innovative three-stage training process.
  • Its training combines depth upscaling, staged continual pretraining, and supervised fine-tuning to achieve frontier-level performance under modest computational demands.
  • The model employs a realigned projection network and targeted synthetic data to bridge performance gaps with larger models, making it ideal for resource-constrained deployments.

Apriel-1.5-15B-Thinker is a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through a three-stage, data-centric training methodology rather than mere parameter scale. Designed to enable advanced multimodal reasoning under modest computational and deployment constraints, it closes the capability gap with larger models by coupling rigorous architectural upscaling with targeted continual pre-training and high-quality supervised fine-tuning. The following sections detail its architectural framework, training protocol, empirical performance, application scope, and open-source contributions.

1. Architectural Framework

Apriel-1.5-15B-Thinker leverages the Pixtral-12B architecture as its base, which integrates a vision encoder and a multimodal decoder analogous to LLaVA-style models. The architecture features:

  • Vision Encoder: Processes image inputs and generates visual feature representations.
  • Projection Network: A two-layer fully connected network aligns the vision encoder’s outputs to the decoder’s embedding space (see the sketch after this list). Critically, this projection network is realigned after depth upscaling to optimize multimodal integration.
  • Multimodal Decoder: Initially 40 hidden layers, expanded to 48 through depth upscaling, increasing reasoning depth and abstraction without the cost of full-scale retraining.
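
A two-layer projector of this kind typically looks like the following minimal sketch. The dimensions, attribute names, and GELU activation are illustrative assumptions, not values taken from the report.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer fully connected projector mapping vision-encoder features
    into the decoder's embedding space (dimensions illustrative)."""
    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 5120, text_dim: int = 5120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),  # a common choice; the report's activation is not specified here
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, vision_features):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.net(vision_features)
```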

This pipeline facilitates efficient combination of vision and language features, minimizing redundancy and compute overhead. If H is the original layer count, the upscaling step implements H ↦ H + ΔH to expand reasoning expressivity.
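
One way to realize H ↦ H + ΔH is to duplicate existing decoder layers, copying their weights so the expanded stack starts from learned features. The sketch below is a hypothetical PyTorch implementation assuming evenly spaced duplication; the report’s exact layer-selection recipe is not reproduced here.

```python
import copy
import torch.nn as nn

def depth_upscale(decoder_layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Expand a decoder stack from H to H + dH layers by duplicating
    evenly spaced layers (hypothetical scheme)."""
    current_depth = len(decoder_layers)   # H = 40 in Apriel's case
    extra = target_depth - current_depth  # dH = 8
    assert extra > 0, "target depth must exceed current depth"

    # Indices of layers to duplicate, spread evenly across the stack.
    step = current_depth / extra
    dup_indices = {int(i * step) for i in range(extra)}

    upscaled = []
    for idx, layer in enumerate(decoder_layers):
        upscaled.append(layer)
        if idx in dup_indices:
            upscaled.append(copy.deepcopy(layer))  # copy weights, not references
    return nn.ModuleList(upscaled)
```

On a 40-layer stack, `depth_upscale(layers, 48)` inserts eight duplicated layers, giving the 48-layer decoder described above.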

2. Progressive Training Methodology

Apriel-1.5-15B-Thinker departs from end-to-end massive pretraining by adopting a progressive, three-stage training regimen:

Stage 1: Depth Upscaling & Initial Training

  • Begins by augmenting decoder depth from 40 to 48 layers.
  • Trained on heterogeneous text corpora with sequence lengths up to 8,192 tokens and a linearly decaying learning rate (peak 5 × 10⁻⁵).
  • Intermediate checkpoints (sampled six times during upscaling) are averaged to stabilize the transition to the next stage; a minimal averaging sketch follows this list.
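
Checkpoint averaging of this kind amounts to a parameter-wise mean over saved state dictionaries. The sketch below assumes PyTorch checkpoints with identical keys and floating-point tensor values; it is an illustration, not the report’s exact code.

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Equal-weight average of model state dicts, e.g. the six
    intermediate checkpoints sampled during depth upscaling."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Assumes all entries are floating-point tensors.
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    n = len(paths)
    return {k: v / n for k, v in avg_state.items()}
```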

Stage 2: Staged Continual Pretraining (CPT)

  • CPT Stage 1: Establishes foundational mathematical, scientific, and code reasoning using a mixture of text-only reasoning tokens (50%), replayed Stage 1 data (20%), and multimodal reasoning tasks (30%), including document and chart understanding. Sequence lengths are expanded to 32,768 tokens.
  • CPT Stage 2: Emphasizes visual reasoning via a synthetic data generation pipeline containing:
    • Image reconstruction (masking for scene priors)
    • Visual matching (fine-grained discrimination tasks)
    • Object detection and counting, with curriculum-driven difficulty adjustment
  • Training updates only the projection network and decoder during this stage (the vision encoder is frozen), as sketched below.
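
Freezing the encoder while training only the projector and decoder is straightforward to express. The sketch below assumes a model exposing `vision_encoder`, `projector`, and `decoder` submodules (hypothetical attribute names) and an illustrative learning rate.

```python
import torch

def configure_cpt_stage2(model, lr: float = 1e-5):
    """Freeze the vision encoder; optimize only projector and decoder."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False  # encoder weights stay fixed

    trainable = list(model.projector.parameters()) + list(model.decoder.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```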

Stage 3: Supervised Fine-Tuning (SFT)

  • Curated high-quality text-only, instruction-response datasets emphasizing explicit reasoning traces in mathematics, coding, science, and tool use.
  • Sequence length up to 49,152 tokens with cosine learning rate decay (set up as in the sketch after this list).
  • Multiple SFT runs are merged by equal-weight averaging, optimizing for both global performance and long-context tasks.
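
A cosine decay schedule can be set up with standard PyTorch tooling; the hyperparameters below are illustrative, not taken from the report.

```python
import torch

# Stand-in model and optimizer; real training would use the full multimodal model.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Cosine decay from the peak LR toward a small floor over total_steps updates.
total_steps = 10_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=2e-6
)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() elided ...
    scheduler.step()  # advance the cosine schedule once per update
```

The equal-weight merge of multiple SFT runs can reuse the same parameter-wise mean shown in the Stage 1 checkpoint-averaging sketch.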

Notably, the model eschews reinforcement learning and preference optimization, providing an isolated assessment of the data-driven continual pre-training and SFT’s impact on multimodal reasoning.

3. Empirical Performance and Benchmarking

Apriel-1.5-15B-Thinker demonstrates highly competitive empirical performance:

  • Artificial Analysis Intelligence Index: 52, matching DeepSeek-R1-0528 (52).
  • Ten multimodal vision tasks: mean score within 5 points of Gemini-2.5-Flash and Claude Sonnet-3.7, comparators of varied (larger) sizes.
  • Math (AIME, MathVerse, etc.): strong performance, including a +9.65 gain on MathVerse (Vision Dominant) after CPT.

The Artificial Analysis Intelligence Index aggregates a suite of heterogeneous reasoning and perception tasks, validating that the model’s performance is robust across diverse domains. On vision-centric tasks, difficulty modulation and targeted synthetic data generation in CPT Stage 2 close the performance gap with models two to three times its size, even on fine-grained spatial and compositional challenges.

4. Data-Centric Continual Pretraining

A defining feature of Apriel-1.5-15B-Thinker is its reliance on task-driven, synthetic, and replayed data during mid-training. This approach is characterized by:

  • Long-sequence training (up to 32,768 tokens in CPT and 49,152 in SFT), enabling accumulation and integration of extended reasoning chains.
  • Task-centric synthetic data in CPT Stage 2, systematically designed to address known deficits in spatial, compositional, and perception-related reasoning.
  • Replay of prior-stage tokens to reinforce stability and prevent catastrophic forgetting (see the mixture-sampling sketch below).
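
The replay-and-mixture scheme amounts to sampling training examples from weighted sources. The sketch below uses the CPT Stage 1 proportions quoted above; the source names and iterators are hypothetical.

```python
import random
from itertools import cycle

def mixture_stream(sources: dict, weights: dict, seed: int = 0):
    """Yield (source_name, example) pairs, drawing each example from a
    named source in proportion to its fixed mixture weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(sources[name])

# Illustrative sources, cycled so the iterators never exhaust.
streams = {
    "reasoning_text": cycle(["math/code/science document ..."]),
    "stage1_replay": cycle(["replayed stage-1 document ..."]),
    "multimodal": cycle([("image", "chart question ...")]),
}
mix = mixture_stream(streams, {"reasoning_text": 0.5, "stage1_replay": 0.2, "multimodal": 0.3})
name, example = next(mix)
```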

By freezing the vision encoder and incrementally refining only projection and decoder components during visual CPT, architectural invariants are preserved while selectively improving modality alignment. The supervised fine-tuning phase is strictly text-only, serving to “distill” explicit reasoning traces from diverse domains.

5. Applications and Implications

Apriel-1.5-15B-Thinker’s efficient design enables deployment in computationally constrained environments. The model’s multimodal reasoning capabilities make it applicable to:

  • On-premise and air-gapped installations requiring a balance of security, privacy, and performance.
  • Enterprise, scientific, and educational settings where both language and visual reasoning are essential.
  • Integration into open-source research platforms, rapid prototyping, and resource-limited infrastructure, as the model operates within single-GPU memory budgets.

A plausible implication is that the methodology used in Apriel-1.5-15B-Thinker offers a practical reference for organizations seeking to achieve strong multimodal performance without scaling to hundreds of billions of parameters or incurring massive compute costs.

6. Open-Source Release and Reproducibility

All model checkpoints, mid-training recipes, and evaluation scripts have been released under the MIT license, enabling unrestricted community audit and extension. The release includes:

  • Depth upscaling steps and engineering details.
  • Staged continual pretraining and synthetic data generation procedures.
  • SFT scripts and long-sequence handling techniques.
  • Comprehensive evaluation protocols for both the Artificial Analysis Intelligence Index and domain-specific vision benchmarks.

This open policy supports transparent benchmarking, critical ablations, and further research into architectural and curriculum innovations for efficient LLMs.

7. Context within Advanced Reasoning Models

Apriel-1.5-15B-Thinker’s training design echoes core findings from contemporary work on reasoning and multimodal LLMs: data-centric continual pretraining, curriculum structuring, and explicit reasoning trace integration are repeatedly shown to be essential for frontier-level performance in models of moderate scale (Seed et al., 10 Apr 2025, Wang et al., 16 Sep 2025, Zhang et al., 21 Jun 2025). Its avoidance of reinforcement learning and preference optimization further isolates the gains attributable solely to skillful dataset design and architectural scaling strategies. This places Apriel-1.5-15B-Thinker as a canonical example of efficient yet high-performing, open, multimodal reasoning models for the 2025 cycle.


In summary, Apriel-1.5-15B-Thinker represents an important open-source advance in multimodal reasoning LLMs, showing that staged, data-driven, and architecture-aware training can produce frontier-level capabilities in both text and vision tasks with moderate parameter budgets and compute resources (Radhakrishna et al., 1 Oct 2025).
