
Apriel-1.5-15b-Thinker (2510.01141v1)

Published 1 Oct 2025 in cs.AI

Abstract: We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.

Summary

  • The paper introduces a 15B-parameter multimodal reasoning model that achieves competitive performance using a data-centric mid-training pipeline.
  • It employs depth upscaling, staged continual pretraining with synthetic augmentation, and high-quality supervised fine-tuning to optimize both text and vision capabilities.
  • Empirical results across diverse benchmarks demonstrate a superior cost-to-intelligence trade-off, making advanced multimodal reasoning accessible on limited hardware.

Apriel-1.5-15B-Thinker: Data-Centric Mid-Training for Efficient Frontier-Level Multimodal Reasoning

Apriel-1.5-15B-Thinker is a 15B-parameter open-weights multimodal reasoning model that achieves performance competitive with state-of-the-art proprietary and open-source systems, emphasizing training methodology over scale. The model demonstrates that a carefully designed mid-training pipeline—comprising depth upscaling, staged continual pretraining (CPT), and high-quality supervised fine-tuning (SFT)—can close substantial capability gaps without resorting to massive parameter counts or expensive RL-based preference optimization. This approach enables deployment within single-GPU constraints, making frontier-level multimodal reasoning accessible to organizations with limited computational resources.

Figure 1: Apriel-1.5-15B-Thinker model architecture illustration.


Model Architecture and Upscaling

Apriel-1.5-15B-Thinker builds upon the Pixtral-12B base, which follows the LLaVA architecture: a vision encoder connected to a multimodal decoder via a two-layer projection network. The initial upscaling increases the decoder depth from 40 to 48 layers, balancing compute, latency, and performance for single-GPU deployability. The upscaling phase leverages a diverse corpus, including replay data and high-quality domain-specific tokens, followed by projection network realignment using multimodal datasets. Training employs long sequence lengths (up to 8192 tokens) and checkpoint averaging for stability.
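
The paper does not spell out the duplication recipe here, but the general pattern of depth upscaling can be sketched simply: new decoder blocks are created by copying existing ones and the expanded stack is then retrained. The sketch below assumes a PyTorch model whose decoder blocks live in an nn.ModuleList and that the topmost layers are the ones duplicated; both are assumptions for illustration, not the paper's exact procedure.

```python
import copy
import torch.nn as nn

def depth_upscale(decoder_layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Expand a stack of decoder blocks by duplicating existing layers.

    Assumption: new blocks are initialized as copies of the topmost existing
    blocks; the paper does not state which layers are duplicated or how they
    are initialized, so treat this as illustrative only.
    """
    current_depth = len(decoder_layers)      # 40 for the Pixtral-12B base
    extra = target_depth - current_depth     # 8 to reach the 48-layer target
    if extra <= 0:
        return decoder_layers

    expanded = list(decoder_layers)
    for layer in list(decoder_layers)[-extra:]:
        expanded.append(copy.deepcopy(layer))  # duplicate, then retrain jointly
    return nn.ModuleList(expanded)
```

The duplicated layers start as exact copies of their sources and only differentiate during the subsequent continual pretraining described below.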

This architectural strategy enables efficient capacity expansion without pretraining from scratch, preserving the strengths of the base model while facilitating advanced multimodal reasoning.


Staged Continual Pretraining (CPT)

The CPT pipeline is divided into two stages:

  • Stage 1: Enhances foundational text and vision capabilities using a mix of text-only, replay, and multimodal tokens. The training corpus covers mathematical/scientific reasoning, coding, document/chart/image understanding, and OCR tasks. All model components are unfrozen, and training is performed with long sequences (32768 tokens) and checkpoint averaging.
  • Stage 2: Focuses on visual reasoning via synthetic data augmentation. The pipeline generates task-centric samples targeting image reconstruction, visual matching, object detection, and counting, with controlled difficulty and data hygiene. Only the projection network and decoder are updated, with the vision encoder frozen. Training uses a sequence length of 16384 and loss is computed only on responses. A toy example of one such synthetic sample is sketched after this list.
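
As a concrete illustration of how such task-centric samples can be assembled, the sketch below builds a counting-style instruction pair from per-image object annotations. The prompt wording, answer format, and annotation schema are assumptions for illustration, not the paper's released templates.

```python
from collections import Counter
from typing import Dict, List

def make_counting_sample(image_path: str, objects: List[Dict]) -> Dict:
    """Turn object annotations for one image into a counting QA pair.

    `objects` is assumed to look like [{"category": "car"}, ...]; the real
    pipeline also modulates difficulty and filters low-quality images.
    """
    counts = Counter(obj["category"] for obj in objects)
    category, target = counts.most_common(1)[0]
    return {
        "image": image_path,
        "prompt": f"How many instances of '{category}' appear in this image?",
        "response": str(target),
    }
```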

Empirical ablations show that CPT Stage 2 yields substantial improvements on vision-dominant benchmarks (e.g., +9.65 points on MathVerse Vision-Dominant), confirming the efficacy of targeted synthetic augmentation for visual reasoning.
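
The Stage 2 training setup (frozen vision encoder, updates restricted to the projection network and decoder, loss on responses only) can be made concrete with a short PyTorch-style sketch. The submodule names vision_encoder, projector, and decoder are hypothetical; the released checkpoint may expose different attribute names.

```python
import torch
from typing import List

def configure_stage2(model, labels: torch.Tensor, prompt_lengths: List[int]) -> torch.Tensor:
    """Freeze the vision encoder and mask prompt tokens out of the loss.

    Assumes `model` exposes `vision_encoder`, `projector`, and `decoder`
    submodules (hypothetical names) and that `labels` holds token ids of
    shape (batch, seq_len).
    """
    # Stage 2 updates only the projection network and the decoder.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for module in (model.projector, model.decoder):
        for p in module.parameters():
            p.requires_grad = True

    # Loss is computed on response tokens only: -100 is the ignore index of
    # torch.nn.CrossEntropyLoss, so prompt positions contribute nothing.
    masked_labels = labels.clone()
    for i, prompt_len in enumerate(prompt_lengths):
        masked_labels[i, :prompt_len] = -100
    return masked_labels
```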


Supervised Fine-Tuning (SFT)

SFT is performed on millions of high-quality instruction-response pairs, each containing explicit reasoning traces. Data curation emphasizes diversity, correctness, and sample efficiency, employing LLM-as-Judge and execution-based verification, rigorous de-duplication, and format checks. Annotator ablations indicate minimal performance differences between DeepSeek-R1-0528 and gpt-oss-120b, with the latter chosen for efficiency.

Training involves multiple epochs at long sequence lengths (up to 49152 tokens), stratified sampling, and checkpoint averaging. Only the decoder is updated during SFT, with loss computed on responses. This process yields a model with robust reasoning, instruction-following, and domain-specific competence.
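
Checkpoint averaging appears in several phases of the recipe. A minimal sketch of the simplest variant, a uniform average over saved state dicts, is given below; the paper does not state which checkpoints are combined or with what weights, so uniform weighting is an assumption.

```python
import torch
from typing import List

def average_checkpoints(paths: List[str]) -> dict:
    """Uniformly average the floating-point tensors of several checkpoints.

    Assumes each file stores a plain state_dict with identical keys and
    shapes; non-float or non-tensor entries are taken from the first
    checkpoint unchanged.
    """
    averaged = torch.load(paths[0], map_location="cpu")
    for path in paths[1:]:
        other = torch.load(path, map_location="cpu")
        for key, tensor in other.items():
            if torch.is_tensor(averaged[key]) and torch.is_floating_point(averaged[key]):
                averaged[key] = averaged[key] + tensor
    for key, tensor in averaged.items():
        if torch.is_tensor(tensor) and torch.is_floating_point(tensor):
            averaged[key] = tensor / len(paths)
    return averaged
```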


Evaluation Methodology

Textual capabilities are assessed using the Artificial Analysis Intelligence Index, which aggregates ten heterogeneous benchmarks (MMLU-Pro, GPQA Diamond, AIME2025, IFBench, AA-LCR, TerminalBench-Hard, etc.) for a holistic measure of general intelligence. Vision capabilities are evaluated using VLMEvalKit across benchmarks such as MMMU, LogicVista, MathVision, MathVista, MathVerse, MMStar, CharXiv, AI2D, and BLINK, adhering to standardized protocols for reproducibility.
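
For intuition on how a composite index of this kind is formed, the sketch below computes an unweighted mean over per-benchmark scores. The official Artificial Analysis methodology may normalize or weight its ten components differently, so treat this as a simplification for illustration only.

```python
from typing import Dict

def composite_index(benchmark_scores: Dict[str, float]) -> float:
    """Equal-weight average of per-benchmark scores on a 0-100 scale.

    Equal weighting is an assumption; the published index may apply its own
    normalization and weighting to each component benchmark.
    """
    return sum(benchmark_scores.values()) / len(benchmark_scores)
```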


Results and Comparative Analysis

Apriel-1.5-15B-Thinker achieves a score of 52 on the Artificial Analysis Intelligence Index, matching DeepSeek-R1-0528 and closely tracking Gemini-2.5-Flash and Claude Sonnet-3.7, despite requiring significantly fewer computational resources.

Figure 2: Apriel-1.5-15B-Thinker compared to the best open source LLMs on the Artificial Analysis Intelligence Index.


Figure 3: Apriel-1.5-15B-Thinker compared with state-of-the-art LLMs.

On individual benchmarks, Apriel attains 87% on AIME2025, 62% on IFBench, and 68% on τ²-Bench (Telecom), outperforming larger open-source baselines. On TerminalBench-Hard, it achieves 10%, competitive with much larger proprietary models.

Figure 4: Artificial Analysis Intelligence Index vs. Total Parameters (log scale). Apriel-1.5-15B-Thinker lies in the "most attractive quadrant."

This placement highlights the model's superior cost-to-intelligence trade-off, offering robust capabilities at moderate scale.

On vision benchmarks, Apriel averages 64.7% across the suite, outperforming similarly sized and larger open-weight models (e.g., Kimi-VL-2506, Qwen-2.5-VL-3B-Instruct) and closely tracking Llama 4 Maverick (400B parameters). It demonstrates strong performance on document-centric and diagram understanding tasks (CharXiv descriptive: 88.2%, AI2D: 82.87%), competitive results on general multimodal reasoning (MMMU: 70.22%), and solid scores on visual mathematical tasks (MathVista: 75.5%). However, performance on vision-dominant tasks (MMMU-PRO Vision: 48.21%) and complex visual logic (LogicVista: 58.39%) indicates room for further improvement.

Figure 5: Average performance across the benchmark suite (higher is better). The chart aggregates scores from MMMU, MMMU-Pro, LogicVista, MathVision, MathVista, MathVerse, MMStar, CharXiv, AI2D, and BLINK.

The model exhibits a pattern of stronger performance on tasks combining visual inputs with substantial textual reasoning, while showing moderate results on purely visual reasoning tasks. The gap between surface-level document comprehension and deeper contextual reasoning (e.g., CharXiv descriptive vs. reasoning) remains a key area for future work.


Implications and Future Directions

Apriel-1.5-15B-Thinker demonstrates that strategic mid-training design—particularly staged continual pretraining with heterogeneous synthetic signals—can yield frontier-level multimodal reasoning in compact models. This approach challenges the prevailing assumption that state-of-the-art performance requires massive scale and expensive RL pipelines, instead highlighting the importance of data quality, curriculum design, and efficient architectural scaling.

Practically, the model's single-GPU deployability and open-source release lower operational barriers for privacy-preserving, cost-aware deployments in enterprise and research contexts. The findings suggest that further advances in mid-training curricula, synthetic data generation, and targeted alignment could continue to close the gap with larger proprietary systems, especially in agentic and interactive workflows.
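
To make the single-GPU claim concrete, the sketch below loads the released checkpoint with 4-bit quantization via Hugging Face Transformers and bitsandbytes. The repository id and auto classes are assumptions about how the checkpoint is published; on larger single GPUs an unquantized bf16 load may also fit.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# Hypothetical Hugging Face repository id for the released checkpoint.
MODEL_ID = "ServiceNow-AI/Apriel-1.5-15b-Thinker"

# 4-bit weights help keep a 15B multimodal model within one GPU's memory.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU automatically
)
```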

Future work will focus on extending multimodal capabilities, strengthening agentic abilities, and exploring targeted alignment techniques, guided by the principles of efficient scaling and high-quality data demonstrated in this paper.


Conclusion

Apriel-1.5-15B-Thinker provides compelling evidence that a data-centric, staged mid-training pipeline can deliver competitive multimodal reasoning at moderate scale, enabling efficient, accessible deployment without sacrificing capability. The model's performance across text and vision benchmarks validates the efficacy of continual pretraining and high-signal SFT, setting a precedent for future research on compact, open, and cost-effective foundation models.


Explain it Like I'm 14

A simple explanation of “Apriel-1.5-15B-Thinker: Mid-training is all you need”

Overview

This paper describes a new AI model called Apriel-1.5-15B-Thinker. It’s a “multimodal” model, which means it can understand both text and images. The big idea is that you don’t need a gigantic, super expensive model to get great reasoning skills. Instead, with smart “mid-training” (the practice steps between basic pretraining and final polishing), a mid-sized, open model can reach frontier-level performance and still run on a single high-end graphics card (GPU). The team releases the model and training recipes so others can use and study them.

What questions were the researchers trying to answer?

  • Can a compact, open model that reads text and images reason as well as much larger, expensive models?
  • Can careful training in the middle of the process (mid-training) beat brute-force scale?
  • How can we make the model better at understanding visual details, diagrams, charts, and math in pictures?
  • Can we keep costs low and make it practical for companies that need privacy or on-premise use?

How did they do it? Methods explained simply

The training approach has three main stages. Think of building a smart student who first gets a stronger brain, then practices widely, and finally gets coached with high-quality lessons.

  1. Depth Upscaling (making the “brain” deeper without starting over)
    • Analogy: Adding more floors to a building instead of tearing it down and rebuilding from scratch.
    • They started from an existing model (Pixtral-12B), then increased its “layers” (from 40 to 48) so it could handle harder reasoning.
    • They trained mostly on diverse text (like math problems, code, and tech articles) to strengthen its thinking.
  2. Continual Pretraining (CPT): two phases of focused practice
    • CPT Stage 1: Build strong basics in both text and vision
      • Mixed practice: 50% text (math, science, coding), 30% images (documents, charts, captions, OCR), and 20% replayed text.
      • All parts of the model (vision, the link between vision and text, and the text brain) were trained together.
    • CPT Stage 2: Targeted visual reasoning with synthetic tasks
      • Analogy: Custom drills made from real pictures to sharpen specific skills.
      • Four core practice types:
        • Image Reconstruction: fill in masked parts to learn whole-scene understanding.
        • Visual Matching: match cropped parts or changed views to improve fine-grained recognition.
        • Object Detection: find and locate things in images.
        • Counting: count objects or categories accurately.
      • Here, the “vision camera” part stayed frozen (unchanged) while the “translator” (projection network) and “storyteller” (text decoder) learned to reason better about images.
      • Result: Clear improvements on image-heavy math and diagram tasks after this stage.
  3. Supervised Fine-Tuning (SFT): high-quality coaching with step-by-step solutions
    • Analogy: A tutor gives carefully chosen problems and shows their reasoning steps.
    • The team curated millions of instruction–response pairs across math, coding, science, tools, safety, and general reasoning.
    • Each answer teaches the model the reasoning steps, not just the final result.
    • They cleaned the data thoroughly (remove duplicates, unsafe content, wrong answers, format errors) and verified correctness where possible (e.g., running code or checking math).
    • Importantly, they did not use reinforcement learning or “preference tuning”—so the gains mostly come from smart mid-training and great data.

Helpful terms (in everyday language):

  • Multimodal: handles text and images together.
  • Parameters: like adjustable “knobs” in the AI’s brain—more knobs can mean more capacity.
  • Mid-training: the middle phase of training combining extra practice (CPT) and tutoring (SFT).
  • Freeze/Unfreeze: whether a part of the model is allowed to change during training.
  • Projection network: the “translator” that connects vision features to the language brain.
  • Synthetic data: practice problems generated from or about real images to target specific skills.
  • Single GPU: runs on one high-end graphics card—important for practical deployment.

Main findings and why they matter

  • Strong overall performance without being huge:
    • On a respected combined score (Artificial Analysis Intelligence Index), Apriel scores about 52—matching the well-known DeepSeek-R1-0528—while using fewer resources.
    • Across ten image benchmarks, it averages within roughly 5 points of big proprietary models like Gemini-2.5-Flash and Claude Sonnet-3.7.
  • Especially good at math and document/chart understanding:
    • It does very well on math-related tasks and on reading charts and diagrams (for example, CharXiv descriptive tasks).
  • CPT Stage 2 (the visual drills) measurably boosted image reasoning:
    • After adding targeted visual practice, scores improved on several vision-heavy benchmarks (like MathVerse Vision-Dominant and AI2D).
  • Practical and accessible:
    • It runs within single-GPU limits, making it realistic for organizations with limited hardware or strict privacy requirements.
  • Open release:
    • The model weights, recipes, and evaluation methods are released under MIT license, enabling others to reproduce and build on this work.

What does this mean for the future?

  • Training design matters as much as (or more than) size:
    • Careful, staged mid-training and high-quality data can close the gap with frontier models, reducing costs and power needs.
  • More people and organizations can use advanced AI:
    • Because it’s compact, open, and deployable on a single GPU, this approach helps schools, startups, and privacy-focused teams adopt strong multimodal AI.
  • Next steps:
    • The authors plan to further strengthen multimodal abilities and interactive “agent” skills (using tools and acting in workflows), and apply targeted alignment techniques where helpful.

In short: Instead of making models endlessly bigger, this paper shows that smart practice in the middle—especially well-designed visual drills and high-quality tutoring—can produce a compact, open model that thinks clearly about both text and images and competes with much larger systems.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper leaves the following gaps and open questions that future researchers could address:

  • Compute transparency: No end-to-end training and inference cost disclosure (GPU types/counts, total GPU-hours per stage, tokens processed per stage, energy usage), making the “cost-effective” claims hard to verify and compare.
  • Data scale and composition: Absent exact dataset sizes, token counts, per-domain breakdowns, and source lists for CPT Stage 1 and SFT; percentages are given without concrete totals or per-category statistics.
  • Synthetic visual curriculum reproducibility: The Stage 2 synthetic augmentation pipeline is described conceptually but lacks released code, prompt templates, augmentation parameters, difficulty schedules, and dataset identifiers to reproduce the gains.
  • Vision encoder specifics: Missing architecture details (model family, input resolution, visual token count, patching scheme, preprocessing pipeline), preventing precise reproduction and analysis of vision-token to text-token interactions.
  • Ablations on training choices: No systematic ablation of key decisions (freezing vs. unfreezing vision encoder in Stage 2, loss-on-response vs. full-sequence loss, sequence lengths at each phase, checkpoint averaging) to isolate their individual contributions.
  • Depth upscaling efficacy: Depth-only scaling (40→48 layers) is used, but there is no comparison with width scaling, mixture-of-experts, or alternative capacity expansion strategies, nor analysis of training stability or diminishing returns.
  • Projection network realignment impact: The projection realignment stage is introduced without quantitative evaluation of its standalone effect on multimodal alignment or downstream metrics.
  • Checkpoint averaging methodology: Weight averaging of intermediate checkpoints and multi-run merges is used, but there is no analysis of when/why it helps, potential risks (mode interpolation failures), or comparisons with alternatives (e.g., SWA, EMA, LoRA merges).
  • Token-level training specifics: Absent optimizer details (optimizer type, betas, weight decay, gradient clipping, dropout), batch sizes (global/micro), number of steps/epochs in CPT stages, and token masking schemes—hindering reproducibility and stability assessment.
  • Long-context design: The architecture’s approach to long-context (e.g., rotary embeddings, attention scaling, memory mechanisms) is unspecified; AA-LCR scores are relatively low—raising questions about the model’s long-context retrieval and reasoning capabilities.
  • Inference footprint and deployment: Claims of single-GPU deployability lack concrete latency, throughput, memory footprint, quantization strategies, batch sizes, and hardware specifics to validate enterprise feasibility.
  • Safety and alignment evaluation: Safety is acknowledged but not deeply pursued; there is no structured red-teaming, jailbreak resistance, harmful content benchmarks, or calibration of refusal behavior despite inclusion of security and moderation data in SFT.
  • Preference optimization/RL synergy: The paper intentionally excludes RLHF/DPO to isolate mid-training effects but leaves open how preference optimization might further improve reasoning, faithfulness, or safety—and at what cost.
  • Chain-of-thought (CoT) effects: SFT uses explicit reasoning traces, yet the paper does not quantify impacts on correctness vs. verbosity, exposure of sensitive reasoning, or calibration; no CoT suppression/elicitation controls are analyzed.
  • Annotator quality and bias: Labeling by gpt-oss-120b vs. DeepSeek-R1 shows minimal differences on a small set, but there is no broader inter-annotator agreement, bias analysis, error taxonomy, or adjudication workflow to ensure label reliability at scale.
  • Verification coverage: Execution/LLM-as-judge verification is mentioned but not quantified (accept/reject rates, residual error rates, domains covered), nor is the judge model diversity or robustness discussed.
  • Decontamination rigor: Benchmark decontamination is claimed without reporting contamination rates, detection methodology, thresholds, or post hoc audits—risking optimistic scores.
  • Evaluation dependence and variance: Heavy reliance on the Artificial Analysis Intelligence Index; for self-reported scores, judging model and timeout differences exist—no statistical uncertainty (confidence intervals) or multiple-seed evaluations are provided.
  • Failure-mode analysis: Low TerminalBench-Hard score (≈10%) is noted, but there is no task-level error analysis (tool invocation failures, environment setup, state management) or remediation strategies for agentic workflows.
  • Vision-dominant weakness: The model lags on vision-dominant tasks (e.g., MMMU-Pro Vision, CharXiv reasoning); no targeted analysis of failure types (spatial reasoning, compositionality, fine-grained grounding) beyond aggregate metrics.
  • Robustness under distribution shift: No evaluation on adversarial or perturbed inputs (occlusions, noise, compression), out-of-distribution images, OCR errors, or calibration under uncertainty—despite synthetic augmentation suggesting robustness training.
  • Multilingual coverage: Language scope of training/evaluation is not specified; multilingual reasoning, OCR for non-Latin scripts, and cross-lingual transfer remain unassessed.
  • Modal breadth: The model is image-text only; capabilities on video, audio, and document sequences (multi-page PDFs) are untested, and the architecture’s extensibility to additional modalities is not demonstrated.
  • Tool-use breadth: SFT includes tool-calling data, but evaluation is limited (TerminalBench); there is no assessment across diverse APIs (retrieval, math solvers, browsers), tool reliability, and output validation pipelines.
  • Fairness and demographic bias: No measurement of bias across protected attributes in text or vision (e.g., demographic misclassification, stereotype propagation) or mitigation strategies.
  • License and dataset provenance: Weights are MIT-licensed, but underlying data sources (web, synthetic, third-party) and their licenses/provenance are not documented, complicating downstream compliance.
  • Generalization to other bases: Starting from Pixtral-12B (Unsloth variant no longer available) raises questions about portability of the mid-training recipe to other base models and vision encoders; no replication on multiple bases.
  • Curriculum scheduling: The staged data presentation is central but lacks detail on scheduling heuristics (interleaving ratios, pacing, difficulty progression), making it hard to reapply or tune.
  • Selective loss computation trade-offs: Loss on responses only (CPT Stage 2 and SFT) is used, but there’s no comparison with token-wise losses across prompts/contexts to understand signal-to-noise effects on reasoning and alignment.
  • Metrics beyond accuracy: No reporting on calibration (Brier score, ECE), uncertainty estimation, or abstention behavior, which are crucial for risk-sensitive deployments.
  • Human evaluation: Absent human studies for instruction-following quality, helpfulness, harmlessness, and coherence; third-party vision eval uses VLMEvalKit but lacks human adjudication on ambiguous items.
  • Domain transfer: Strong results in math and document understanding raise questions about transfer to other specialized domains (medical imaging, geospatial, industrial diagrams) not covered by the benchmarks.
  • Continual learning risks: Freezing patterns and text-only SFT may induce modality-specific forgetting; no analysis of catastrophic forgetting between CPT stages and SFT on vision tasks.
  • Scaling laws for mid-training: The paper claims mid-training closes capability gaps but provides no scaling law analysis (performance vs. tokens/parameters/stage durations) to guide budget allocation.
  • Transparency on hyperparameters: Beyond LR and sequence length, key hyperparameters (warmup step counts, scheduler specifics, regularization, gradient accumulation) are missing, limiting exact reproduction.
  • Statistical significance: Reported deltas (e.g., +9.65 on MathVerse Vision-Dominant) lack variance estimates or significance tests; robustness across seeds and inference settings is unknown.

Practical Applications

Practical Applications of Apriel-1.5-15B-Thinker

Below are actionable, real-world applications grounded in the paper’s reported capabilities, training design, and release artifacts (open weights under MIT, training recipes, evaluation protocols). Items are grouped by deployment horizon and tagged with sectors, candidate tools/products/workflows, and key assumptions/dependencies that affect feasibility.

Immediate Applications

The following can be piloted or deployed now, leveraging the model’s strengths in multimodal document/chart understanding, math/science reasoning, long-context inputs, and on-prem single-GPU deployment.

  • Enterprise document and chart intelligence copilot
    • Sectors: finance, insurance, legal, telecom, public sector, enterprise operations
    • What it does: parse PDFs/images, extract structured fields, explain chart trends, validate figures against text, answer “what-if” questions, summarize packets of documents (policies, contracts, RFPs) with citations
    • Tools/products/workflows: vLLM/TensorRT-LLM serving; Unstructured, pdfminer/pdfplumber, Apache Tika for ingestion; OCR fallback (PaddleOCR/Tesseract) + Apriel for reasoning; RAG with a vector DB (FAISS, Milvus); governance logging (a minimal retrieval sketch appears after the Immediate Applications list)
    • Assumptions/dependencies: benchmarked strength on CharXiv/AI2D supports chart/diagram tasks; add guardrails and citation/verification checks; ensure PDF-to-image fidelity; quantify latency/memory for target GPU (24–48 GB VRAM or quantized)
  • Privacy-preserving on-prem AI assistant for regulated data
    • Sectors: healthcare (admin), finance, government, defense, telecom
    • What it does: air‑gapped enterprise assistant for knowledge Q&A, policy interpretation, data classification, internal BI reporting; handles mixed text+image content under strict privacy
    • Tools/products/workflows: containerized deployment (NVIDIA NIM, vLLM, TGI); SSO/role-based access; audit trails; policy filters (PII/PHI redaction)
    • Assumptions/dependencies: model fits single high-end GPU; safety SFT present but not exhaustive—add policy checkers and allow-only tool lists
  • Claims, KYC, and workflow triage automation
    • Sectors: insurance (claims), banking (KYC/AML), HR/onboarding
    • What it does: extract entities from scanned documents, verify completeness, cross-check claims vs. evidence images, produce structured case summaries and discrepancy reports
    • Tools/products/workflows: schema-constrained output (JSON/XML); programmatic verifiers for required fields; human-in-the-loop review UI
    • Assumptions/dependencies: strong document/diagram reasoning; deploy validators for high-stakes decisions; maintain bias/fairness checks
  • STEM tutoring and classroom assistance with step-by-step reasoning
    • Sectors: education, edtech, community colleges, vocational programs
    • What it does: step-by-step math/science solutions, diagram-based problem walkthroughs, chart reading and interpretation, rubric-aligned hints
    • Tools/products/workflows: LMS plugins (Moodle, Canvas), Jupyter/Colab plugins; structured “reasoning then answer” UI; solution verification with CAS/unit tests where applicable
    • Assumptions/dependencies: high scores on AIME/MathVista/MathVerse support quality; add citation to textbooks or solution manuals where possible; enable teacher oversight
  • Secure coding and debugging assistant behind the firewall
    • Sectors: software, enterprise IT, embedded systems
    • What it does: generate/refactor code, write tests, explain diffs, reason about algorithms; summarize logs/screenshots from failing tests; propose shell commands (with review)
    • Tools/products/workflows: VS Code/JetBrains extensions; unit test scaffolding; sandboxed execution; policy rules for command generation
    • Assumptions/dependencies: strong coding/math benchmarks; TerminalBench score indicates limited autonomy—keep human-in-the-loop; instrument with execution verifiers
  • Telecom runbook and troubleshooting copilot
    • Sectors: telecom, network operations, MSPs
    • What it does: interpret diagrams/topologies, triage incidents, propose runbook steps; analyze logs, dashboards, and charts for anomaly narratives
    • Tools/products/workflows: API connectors to NOC tooling; structured recommendations with confidence and required evidence; change-control workflow integration
    • Assumptions/dependencies: strong τ²‑Bench (Telecom) results support domain viability; limit autonomous actions; incorporate observability verifiers
  • Business intelligence (BI) narrative and QA for analytics artifacts
    • Sectors: business operations, FP&A, sales ops
    • What it does: interpret dashboards and exported charts, write executive narratives, detect inconsistencies between visuals and text, create “explain this chart” drill-down
    • Tools/products/workflows: connectors to Tableau/Power BI/Looker APIs; “visual QA” checks on chart-image exports; report-generation templates
    • Assumptions/dependencies: high CharXiv descriptive performance; ensure chart export quality; add numeric cross-checkers against source tables
  • Content moderation and compliance review with image+text
    • Sectors: social platforms, internal communications, legal/compliance
    • What it does: policy classification, PII/PHI detection, content redaction, rationale for decisions; handle screenshots and scanned documents
    • Tools/products/workflows: policy schemas; redact/transform actions; dual-review queues for edge cases
    • Assumptions/dependencies: safety SFT present but limited—layer deterministic rules and secondary classifiers; document false-positive/negative rates
  • Long-context multi-document assistant
    • Sectors: legal, research, diligence/M&A, public records
    • What it does: ingest large packets (tens of thousands of tokens) spanning PDFs/images, produce summaries, timelines, and grounded answers; highlight contradictions
    • Tools/products/workflows: chunking + retrieval; “quote and cite” answer templates; section-level provenance tracking
    • Assumptions/dependencies: SFT at 32k–49k length supports long-context use; throughput must be engineered (paging/RAG + caching)
  • Open-weight baseline for academic courses and rapid enterprise prototyping
    • Sectors: academia, applied research labs, startups
    • What it does: teach efficient multimodal training (depth upscaling, staged CPT, SFT), replicate recipes, run ablations; build custom vertical copilots quickly
    • Tools/products/workflows: curriculum packs; Colab/Slurm job scripts; LoRA/QLoRA adapters for domain tuning
    • Assumptions/dependencies: MIT license allows commercial/academic use; retain data governance for added domain corpora
  • Targeted synthetic data generation for vision reasoning
    • Sectors: ML tooling vendors, teams training vertical VLMs
    • What it does: reuse the paper’s augmentation tasks (reconstruction, matching, counting, object presence) to boost visual reasoning for new domains
    • Tools/products/workflows: image augmentation pipelines; curriculum schedulers; difficulty modulation; validator tasks
    • Assumptions/dependencies: pipeline quality and data hygiene determine gains; avoid synthetic overfitting by mixing real data and verification tasks
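
Several of the document-intelligence workflows above pair the model with retrieval over a vector index. The sketch below shows the generic embed-index-retrieve step using FAISS and sentence-transformers, both named in the tooling lists; the embedding model, chunking strategy, and how retrieved passages are placed into Apriel's prompt are assumptions, not a prescribed integration.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Assumed off-the-shelf embedding model; any sentence encoder would work here.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks):
    """Embed document chunks and store them in an inner-product FAISS index."""
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve(index, chunks, query, k=5):
    """Return the k chunks most similar to the query (cosine via normalized IP)."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]
```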

Long-Term Applications

The following are enabled by the paper’s training strategy and open release but require additional research, scaling, or alignment (e.g., stronger pure visual reasoning, RLHF/RLAIF, real-time constraints, safety cases).

  • Agentic automation with reliable tool use and system control
    • Sectors: DevOps/IT ops, SecOps, enterprise back-office, RPA
    • What it could do: autonomously execute multi-step workflows (shell, API, database), verify outcomes, and recover from errors
    • Gaps to close: improve TerminalBench-like competence, add preference optimization/RL, robust action safeguards, programmatic verifiers
  • Industrial visual inspection and quality control
    • Sectors: manufacturing, logistics, retail inventory, construction
    • What it could do: count/locate parts, detect assembly errors, verify bill-of-materials from photos, audit shelf stock
    • Gaps to close: higher accuracy on vision-dominant tasks, real-time inference, calibrated uncertainty, integration with specialized detectors
  • Clinical document and imaging assistants
    • Sectors: healthcare providers, payers
    • What it could do: parse clinical notes/forms, link charts to narratives, assist with administrative coding; eventually multimodal Q&A on certain imaging types
    • Gaps to close: medical-domain fine-tuning with verified data, regulatory approval processes, strict bias/safety evaluation; avoid diagnostic claims until validated
  • Advanced, verifiable document-grounded QA with visual retrieval
    • Sectors: legal discovery, compliance audits, due diligence
    • What it could do: multi-hop reasoning over large corpora of mixed text/diagrams, with auto-citations, contradiction detection, and “proof of answer” artifacts
    • Gaps to close: stronger visual-logic skills, better retrieval over visual elements, formal verification pipelines
  • Multimodal program synthesis from UIs, sketches, or architecture diagrams
    • Sectors: software engineering, product design, data engineering
    • What it could do: convert screenshots or whiteboard diagrams into code scaffolds, infrastructure-as-code, or data pipelines
    • Gaps to close: improved compositional visual reasoning; rigorous executable verification and security constraints
  • Embodied AI and robotics task planning
    • Sectors: warehouse automation, field service, home robotics
    • What it could do: interpret scenes/diagrams, plan steps for manipulation, verify intermediate states
    • Gaps to close: real-time perception/action loops, closed-loop control, sim2real transfer; enhanced spatial reasoning
  • AR copilots for field technicians and learners
    • Sectors: utilities, telecom field ops, manufacturing MRO, education
    • What it could do: interpret manuals and diagrams in-context via camera stream, overlay next steps, count/verify components on-site
    • Gaps to close: low-latency edge deployment, robust vision under varied conditions, safety guardrails for procedural guidance
  • Government policy analysis and public records automation
    • Sectors: central/local governments, regulators, audit offices
    • What it could do: analyze large legislative/regulatory corpora, model impact narratives, automate FOIA processing with redaction and cross-document linking
    • Gaps to close: alignment for neutrality, transparent provenance, strong redaction guarantees, public-sector procurement/compliance requirements
  • Green AI via right-sizing models in production
    • Sectors: any enterprise replacing larger proprietary models
    • What it could do: reduce cost and energy footprint by swapping in 15B-class models where performance is sufficient
    • Gaps to close: formal energy/cost benchmarks, performance SLOs, confidence-estimation and fallback policies to larger models when needed
  • Mid-training-as-a-service platforms
    • Sectors: ML infrastructure, model providers
    • What it could do: productize depth upscaling + staged CPT for customer base models, offering efficient capability upgrades without full pretraining
    • Gaps to close: generalized curricula design, reliable evaluation harnesses per domain, automated data quality governance

Cross-Cutting Dependencies and Assumptions

  • Hardware and deployment
    • Single high-end GPU recommended; consider 4/8-bit quantization for smaller GPUs; test latency with long contexts (32k–49k tokens)
    • Prefer optimized backends (vLLM, TensorRT-LLM); batch carefully for multimodal inputs
  • Safety, alignment, and governance
    • The paper prioritizes performance; safety mitigations are present but not exhaustive
    • Add policy filters, jailbreak detection, and content moderation layers; for high-stakes domains, require human approval and auditable logs
  • Verification and evaluation-in-the-loop
    • Use execution/testing harnesses (for code/math), schema validators (for structured outputs), and citation/provenance requirements (for document QA)
    • Employ dual-model or tool-augmented adjudication (LLM-as-judge plus deterministic checkers)
  • Data, privacy, and compliance
    • Ensure compliant handling of sensitive inputs; consider on-prem and air-gapped deployments where necessary
    • Maintain decontamination for internal benchmarks and monitor for data leakage
  • Scope alignment with strengths/limitations
    • Strong on charts/diagrams/document understanding and math/science reasoning
    • Moderate on purely vision-dominant logic; design use cases accordingly or combine with specialized CV models

These application paths reflect the model’s demonstrated strengths, deployment profile, and the paper’s training innovations, balancing near-term value with a roadmap for higher autonomy and domain specialization as alignment and perception capabilities advance.


Glossary

  • AA-LCR: A benchmark assessing long-context reasoning ability. "AA-LCR -- long-context reasoning"
  • Ablations: Controlled comparative experiments to isolate the effect of a component or setting. "Small-scale ablations using DeepSeek-R1-0528 \cite{deepseekai2025deepseekr1incentivizingreasoningcapability} and gpt-oss-120b \cite{openai2025gptoss120bgptoss20bmodel} presented in Table \ref{tab:annotator_comparison} show minimal performance differences between annotators for our base model."
  • AI2D: A diagram-understanding benchmark for visual reasoning over diagrams. "AI2D\cite{kembhavi2016ai2d}: Diagram understanding benchmark."
  • AIME 2025: A competition-level mathematics benchmark testing advanced problem solving. "AIME 2025 -- competition-level mathematics"
  • Air-gapped deployments: Systems isolated from external networks for privacy and security. "organizations requiring on-premises or air-gapped deployments for privacy and compliance need compact models with predictable resource footprints"
  • Agentic capabilities: The capacity of a model to act autonomously in tool-using or task-executing workflows. "offering strong reasoning and agentic capabilities without the overhead of massive parameter counts."
  • Annotator model: A model used to generate or evaluate training labels for SFT. "We therefore adopt gpt-oss-120b as our annotator model due to its greater compute efficiency."
  • Artificial Analysis Intelligence Index: A composite metric aggregating multiple benchmarks to measure general intelligence. "On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52"
  • BLINK: An open-domain vision–language benchmark focused on visual perception tasks. "BLINK\cite{fu2024blink}: Benchmark measuring performance on various visual perception tasks."
  • Chain-of-thought competence: The ability to produce step-by-step reasoning traces that solve complex tasks. "careful data curation and training strategy can unlock sophisticated chain-of-thought competence without relying solely on extreme scale."
  • CharXiv: A chart understanding and reasoning benchmark. "CharXiv\cite{wang2024charxiv}: Benchmark measuring descriptive and reasoning question answering capabilities across basic and complex chart elements respectively."
  • Checkpoint averaging: Averaging weights across checkpoints to improve stability and performance. "Techniques such as depth upscaling (capacity expansion without pretraining from scratch), selective loss computation, and checkpoint averaging improve efficiency and stability."
  • Cosine decay: A learning-rate schedule that decays following a cosine curve. "a learning rate of 5e-5 with cosine decay and 10\% warmup."
  • Counting: Vision tasks requiring precise enumeration of objects or categories. "Counting: Enhance the ability to count and distinguish specific visual elements by querying total or category-specific counts."
  • Cross-modal training procedures: Methods that jointly train across text and vision modalities. "including depth upscaling and cross-modal training procedures."
  • Data Hygiene and Difficulty Control: Practices to ensure clean, well-structured data and calibrated task difficulty. "Data Hygiene and Difficulty Control"
  • De-duplication: Removing duplicate samples to enhance data diversity and quality. "rigorous de-duplication to enhance data diversity"
  • Decoder: The generative component of the model producing outputs from latent representations. "we first upscale the decoder by increasing the number of hidden layers from 40 to 48"
  • Decontamination: Filtering out samples that overlap with evaluation benchmarks. "and a decontamination stage to remove any samples overlapping with the benchmarks."
  • Depth upscaling: Increasing the number of layers to expand capacity without retraining from scratch. "we first upscale the base model via depth upscaling"
  • Document understanding: Vision–language tasks focused on extracting and reasoning over document content. "data on document understanding, chart understanding and reasoning"
  • Execution-based verification: Validating outputs by executing code or procedures to check correctness. "execution-based verification where applicable"
  • GPQA Diamond: A graduate-level science/engineering problem-solving benchmark. "GPQA Diamond -- graduate-level problem solving in science/engineering"
  • Grounding: Linking textual references to specific visual entities or regions. "Strengthen grounding and localization by identifying object presence and approximate location."
  • Heuristic filtering: Rule-based filtering to remove low-quality or malformed samples. "and heuristic filtering to remove low-quality samples."
  • Humanity’s Last Exam (HLE): A high-difficulty, multi-disciplinary reasoning benchmark. "Humanity’s Last Exam -- multi-disciplinary high-difficulty reasoning"
  • IFBench: An instruction-following and compliance benchmark. "IFBench -- instruction following and compliance"
  • Image Reconstruction: Tasks where parts of images are masked to learn holistic scene priors. "Image Reconstruction: Learn holistic scene priors and part–whole reasoning by masking image regions."
  • Instruction-response pairs: Training samples comprising an instruction and the model’s answer. "multimodal instruction-response pairs"
  • Linear decay: A learning-rate schedule that decreases linearly over training steps. "a learning rate of 5e-5 with linear decay."
  • LLaVA architecture: A vision–language model design connecting a vision encoder to a language decoder via a projector. "Pixtral follows the LLaVA architecture \cite{liu2023llava}, consisting of a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network."
  • LLM-as-Judge: Using an LLM to assess the correctness or quality of samples. "we verified the data's correctness using LLM-as-Judge and execution-based verification where applicable"
  • Localization: Identifying the location of objects within an image. "Strengthen grounding and localization by identifying object presence and approximate location."
  • LogicVista: A benchmark for multimodal logical reasoning. "LogicVista\cite{xiao2024logicvista}: Multi-modal logical reasoning benchmark targeting different reasoning skill types in visual contexts."
  • MathVerse: A benchmark testing mathematical reasoning across modalities and information content levels. "MathVerse\cite{zhang2024mathverse}: Mathematical benchmark measuring model performance across different levels of information content across multiple modalities."
  • MathVista: A benchmark combining visual and mathematical challenges. "MathVista\cite{lu2023mathvista}: Benchmark combining challenges from various visual and mathematical tasks."
  • MathVision: A benchmark for mathematical reasoning in visual contexts. "MathVision\cite{wang2024mathvision}: Mathematical reasoning within visual contexts."
  • Mid-training: The phase combining continual pretraining and SFT after initial pretraining. "We define mid-training as a combination of the continual pretraining and SFT stages"
  • MMMU: A general multimodal understanding benchmark emphasizing visual knowledge and reasoning. "MMMU\cite{yue2023mmmu}: Multi-modal understanding benchmark focusing on evaluating visual knowledge and reasoning."
  • MMMU-Pro: An enhanced version of MMMU for more rigorous visual knowledge and reasoning. "MMMU-Pro\cite{yue2024mmmu}: Enhanced Multi-modal understanding benchmark focusing on evaluating visual knowledge and reasoning."
  • MMStar: A vision-indispensable benchmark where images are necessary to solve tasks. "MMStar\cite{chen2024mmstar}: Vision-indispensable benchmark focusing on tasks that cannot be solved with only knowledge or without using the image."
  • Multimodal Continual Pretraining (CPT): Ongoing training to strengthen text and vision capabilities using staged data. "(2) Staged Multimodal Continual Pretraining (CPT): We adopt a two-phase CPT strategy."
  • Multimodal decoder: A language decoder adapted to process visual inputs via a projection layer. "consisting of a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network."
  • Object Detection: Identifying the presence and approximate location of objects in images. "Object Detection: Strengthen grounding and localization by identifying object presence and approximate location."
  • OCR-related tasks: Vision tasks involving recognition of text within images. "OCR-related tasks"
  • On-premises: Deployment in an organization’s own infrastructure rather than the cloud. "organizations requiring on-premises or air-gapped deployments for privacy and compliance"
  • Open-domain Vision–Language Reasoning: Tasks spanning broad visual topics requiring integrated vision and language reasoning. "Open-domain Vision–Language Reasoning"
  • Open-weights: Models with publicly released parameters enabling reproducibility and fine-tuning. "a 15-billion parameter open-weights multimodal reasoning model"
  • Pass@1: The metric of succeeding on the first attempt in code or QA tasks. "Evaluation (pass@1 or accuracy, as applicable) on multimodal benchmarks"
  • Preference optimization: Post-training methods (e.g., DPO) aligning outputs to human preferences. "without reinforcement learning or preference optimization"
  • Projection network: The module mapping vision encoder outputs into the LLM’s embedding space. "a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network."
  • Projection network realignment: A training stage to recalibrate the projector with multimodal data. "Projection network realignment"
  • Rejection sampling: Discarding generated samples that fail verification or quality checks. "implementing rejection sampling to discard incorrect or low-quality instruction-response pairs."
  • Replay data: Previously seen tokens re-used to stabilize training and retain capabilities. "Half of these tokens serve as replay data"
  • SciCode: A benchmark for scientific computing and reasoning. "SciCode -- scientific computing and reasoning tasks"
  • Selective loss computation: Computing loss only on chosen parts of the sequence (e.g., responses). "Techniques such as depth upscaling (capacity expansion without pretraining from scratch), selective loss computation, and checkpoint averaging improve efficiency and stability."
  • Sequence packing: Concatenating multiple samples into a single sequence to utilize context length efficiently. "sequence length of 8192 (with sequence packing)"
  • Single-GPU deployment constraints: Operating models within the memory/latency limits of one GPU. "operating within single-GPU deployment constraints."
  • Staged curriculum: Introducing data in phases to guide capability development. "all introduced through a staged curriculum."
  • Supervised Fine-Tuning (SFT): Tuning the model on curated instruction–response data with explicit reasoning steps. "(3) High-Quality Supervised Fine-Tuning (SFT): We curate a diverse, high-quality, and high-signal set of samples for supervised fine-tuning."
  • Targeted synthetic data generation: Creating tailored synthetic datasets to train specific skills (e.g., spatial reasoning). "targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception"
  • Terminal-Bench Hard: A benchmark of real-world Linux shell execution and system tool use. "Terminal-Bench Hard -- real-world Linux shell execution and system tool use in end-to-end tasks"
  • Tool calling: Invoking external tools/APIs within a model’s reasoning process. "tool calling"
  • VLMEvalKit: A toolkit standardizing evaluation for vision–language models. "VLMEvalKit\cite{duan2024vlmevalkit} toolkit"
  • Vision encoder: The component extracting features from images for multimodal processing. "consisting of a vision encoder connected to a multimodal decoder through a two-layer fully connected projection network."
  • Visual Matching: Tasks requiring correspondence and discrimination across image views or anchors. "Visual Matching: Improve correspondence, retrieval, and fine-grained discrimination by matching cropped or augmented anchors to candidates across views or images."
  • Warmup: A brief initial phase increasing the learning rate from near-zero to target value. "a learning rate of 1e-5 with cosine decay and 10\% warmup."
  • τ²-Bench Telecom: A specialized domain benchmark in applied telecom tasks. "τ²-Bench Telecom -- specialized domain evaluation in applied tasks"

Open Problems

We found no open problems mentioned in this paper.
