AndesVL: Mobile Multimodal LLMs

Updated 20 October 2025
  • AndesVL is a mobile multimodal language model framework integrating a Qwen3 backbone with modular 1+N LoRA adapters for efficient vision-language reasoning.
  • It employs a multi-phase training pipeline and quantization-aware techniques to reduce computational overhead while maintaining robust performance.
  • Its comprehensive evaluation across tasks like OCR, GUI comprehension, and mathematical reasoning demonstrates competitive results compared to cloud-based models.

AndesVL refers to the suite of mobile-side multimodal LLMs (MLLMs) introduced as a technical solution for resource-constrained edge devices, particularly mobile phones, where memory, power, and computation are fundamentally limited relative to cloud infrastructures. Targeting model sizes between 0.6B and 4B parameters, AndesVL leverages a Qwen3-based LLM backbone and contemporary visual encoders, presenting a competitive alternative to reference cloud models such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet, but at a fraction of the resource footprint (Jin et al., 13 Oct 2025). The technical and scientific innovations are reflected in its modular architecture, training pipeline, scenario-specific adaptation via “1+N LoRA,” and a carefully crafted quantization/deployment strategy.

1. Model Architecture and Integration

AndesVL adopts the canonical components of multimodal LLMs: a visual encoder, a vision-language fusion/projection layer, and an LLM backbone. The encoder options include modules designed for low-resolution and arbitrary-aspect-ratio images, with outputs mapped onto the language embedding space through a trainable projection.

  • The core LLM is based on Qwen3 variants from 0.6B to 4B parameters, pre-trained for multimodal instruction, chain-of-thought (CoT) reasoning, and cross-modal alignment.
  • A distinctive feature is the “1+N LoRA” (Low-Rank Adaptation) strategy (Editor's term), where “1” denotes the frozen base LLM and “N” consists of modular, lightweight adapters trained for specific downstream or scenario tasks (e.g., OCR, mobile UI reasoning, chart comprehension).
  • This approach obviates continual full-model fine-tuning by updating only a small fraction of parameters for each adaptation task, enhancing efficiency and reducing the risk of catastrophic forgetting. A minimal sketch of how these components compose follows this list.
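The sketch below is illustrative only: it assumes a ViT-style encoder, an MLP projector, and a HuggingFace-style Qwen3 causal-LM interface. Module names and dimensions are hypothetical, not the released implementation.

```python
import torch
import torch.nn as nn

class AndesVLSketch(nn.Module):
    """Illustrative composition of encoder, projector, and LLM backbone."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g., a ViT tolerant of arbitrary aspect ratios
        self.projector = nn.Sequential(               # maps visual features into the LLM embedding space
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                # Qwen3 backbone (frozen when LoRA adapters are attached)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        vis_feats = self.vision_encoder(pixel_values)              # (B, N_vis, vis_dim)
        vis_tokens = self.projector(vis_feats)                     # (B, N_vis, llm_dim)
        txt_embeds = self.llm.get_input_embeddings()(input_ids)    # (B, N_txt, llm_dim)
        inputs_embeds = torch.cat([vis_tokens, txt_embeds], dim=1) # prepend image tokens to the text
        return self.llm(inputs_embeds=inputs_embeds)
```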

2. Training Pipeline: Pretraining and Fine-tuning

Pre-training

The multi-phase training pipeline includes:

  • Initial vision-language alignment, where the visual encoder is tuned on synthetic and real datasets for captioning, OCR, and grounding, with staged increases in token sequence length (e.g., up to 16k tokens).
  • Joint multimodal pretraining on a mixture of interleaved image–text data and pure text, with token reordering that places image tokens at the start of the sequence with 50% probability, thus avoiding end-of-sequence “invisibility” (a data-ordering sketch follows this list).
  • Multi-task pretraining expands domain coverage to VQA, GUI-specific annotated datasets (notably the in-house AndesUI set), math and reasoning datasets, and chain-of-thought supervision.
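The reordering rule can be illustrated with a small data-preparation sketch. The chunked representation and helper name below are assumptions for clarity, not the paper's actual preprocessing code.

```python
import random

def maybe_front_load_images(chunks, p_image_first=0.5):
    """chunks: interleaved list of (kind, tokens) pairs with kind in {"image", "text"}.
    With probability p_image_first, move all image tokens to the start of the sequence,
    so images are not systematically pushed toward the end of long contexts."""
    if random.random() >= p_image_first:
        return [tok for _, toks in chunks for tok in toks]  # keep the original interleaving
    image_toks = [tok for kind, toks in chunks if kind == "image" for tok in toks]
    text_toks = [tok for kind, toks in chunks if kind == "text" for tok in toks]
    return image_toks + text_toks
```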

Post-training

Preference alignment during post-training optimizes a pairwise objective over chosen ($y_c$) and rejected ($y_r$) responses:

\mathcal{L}_p = -\log \sigma\Bigl(\beta \bigl(\log \pi_\theta(y_c \mid x) - \log \pi_\theta(y_r \mid x)\bigr)\Bigr)
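A minimal PyTorch sketch of this objective, assuming the per-sequence log-probabilities have already been computed; the omission of a reference policy mirrors the equation as written above.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss matching the displayed equation.
    logp_chosen / logp_rejected: log pi_theta(y_c | x) and log pi_theta(y_r | x),
    each summed over response tokens, one value per example."""
    margin = beta * (logp_chosen - logp_rejected)
    return -F.logsigmoid(margin).mean()
```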

For the “Thinking” variant, which focuses on math and complex reasoning, on-policy RL (e.g., Group Relative Policy Optimization, GRPO) and an “easy-to-hard” curriculum are used to strengthen multistep capabilities; a sketch of the group-relative advantage computation follows.
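For illustration, the group-relative advantage at the core of GRPO can be sketched as follows; this is the generic formulation, not AndesVL's specific reward design.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for responses sampled per prompt.
    GRPO standardizes rewards within each group to obtain per-response advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```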

3. Training Data and Benchmark Coverage

The AndesVL training set encompasses a diversity of open-source and proprietary datasets spanning:

  • Image captioning (Laion, Wukong), OCR (synthetic scene/document text and real data), visual grounding, VQA, GUI datasets (AndesUI), and multi-image comprehension.
  • Reasoning/math: curated long chain-of-thought samples from human annotation and distillation (e.g., GPT-4o output).
  • Data curation is rigorous, often leveraging LLM-based judge models to validate SFT and scenario-specific fine-tuning examples (a hypothetical judge-filtering loop is sketched after this list).
  • The benchmark evaluation suite exceeds 30 separate tasks, with notable coverage for document understanding, multi-image VQA, text extraction, hallucination mitigation, math reasoning, multilingual tasks (including MMMB/MMBench), and GUI comprehension (e.g., AndesUI-Bench).
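A judge-based curation step of this kind might look like the loop below; the `judge.score` interface and the rating scale are hypothetical, included only to make the filtering idea concrete.

```python
def filter_finetuning_examples(examples, judge, min_score=4):
    """Keep an example only if a judge model rates its response at or above min_score.
    `judge.score` is an assumed interface (e.g., a 1-5 quality rating), not a real API."""
    kept = []
    for ex in examples:
        score = judge.score(prompt=ex["prompt"], response=ex["response"])
        if score >= min_score:
            kept.append(ex)
    return kept
```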

4. Scenario-Specific Adaptation: 1+N LoRA Strategy

The 1+N LoRA adaptation framework is integral to the mobile deployment strategy:

  • The method freezes the base model (the “1”) post-generalist pretraining.
  • Task-specific adapters (“N”) are trained with scenario-adaptive loss functions, such as entity-weighted cross-entropy for domain-entity-rich outputs:

\mathcal{L}_{\text{entity}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \log P(w_{i,t} \mid x_i, w_{i,<t}, \theta)

where $\alpha_{i,t}$ is boosted for entity tokens (e.g., numbers, object labels).

  • Metric-based losses (BLEU/ROUGE) are combined with the entity-weighted term in a weighted total objective:

\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{entity}} + \lambda_2 \mathcal{L}_{\text{BLEU/ROUGE}}

  • The modularity facilitates rapid adaptation and minimizes memory/computation overhead for task transfer; because the shared base weights are never overwritten, catastrophic forgetting is further mitigated. A minimal sketch of these adapter losses follows this list.
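A minimal PyTorch sketch of the entity-weighted objective above; the boost factor and the way entity tokens are marked are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def entity_weighted_ce(logits, targets, entity_mask, entity_boost=2.0):
    """logits: (B, T, V); targets, entity_mask: (B, T), with entity_mask = 1 on entity tokens.
    Per-token weights alpha_{i,t} are boosted on entity tokens, as in L_entity above."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    alpha = 1.0 + (entity_boost - 1.0) * entity_mask.float()
    per_example = (alpha * ce).sum(dim=1)   # sum over the tokens of each example
    return per_example.mean()               # average over the batch (the 1/N factor)

def total_adapter_loss(l_entity, l_metric, lam1=1.0, lam2=0.5):
    """Weighted combination lambda_1 * L_entity + lambda_2 * L_BLEU/ROUGE (weights illustrative)."""
    return lam1 * l_entity + lam2 * l_metric
```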

5. Quantization and Mobile-side Acceleration

AndesVL includes targeted optimizations for edge deployment:

  • Quantization-Aware Training (QAT): experiments with 2-, 3-, 4-, and 8-bit weights/activations retain up to 95% of full-precision performance (a generic fake-quantization sketch follows this list).
  • The key–value cache eviction algorithm (“OKV”) and speculative decoding yield up to 6.7× decoding speedup and 30.9% memory savings on devices such as MediaTek Dimensity 9500.
  • The full 4B variant is demonstrated to run with minimal degradation, even under aggressive quantization regimes and device resource limits.
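As one example of the kind of building block QAT relies on, the sketch below fake-quantizes weights symmetrically with a straight-through estimator; it is a generic illustration, not AndesVL's exact quantization scheme.

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Symmetric per-tensor weight fake-quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax              # per-tensor scale
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: gradients flow to the full-precision master weights

# Illustrative usage during training: quantize weights in the forward pass only.
# w_q = FakeQuantize.apply(layer.weight, 4)
```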

6. Performance Analysis and Task Generalization

The model’s performance on open-source and custom benchmarks matches, and often surpasses, that of contemporaneous models of comparable scale and resource budget.

  • OCR: Robust extraction from documents and scene images.
  • Reasoning/math: Enhanced long-form CoT outputs and math problem resolution.
  • Multi-image comprehension: Top scores on tasks requiring joint reasoning across image sets.
  • GUI understanding: Superior grounding on mobile UI benchmarks (ScreenSpot, AndesUI-Bench), attributable to targeted AndesUI dataset usage and LoRA adaptation.
  • Multilingual understanding: High accuracy on MMMB/MMBench, driven by multilingual training data and curation.

7. Technical and Scientific Significance

AndesVL’s “generalist + adapter” paradigm offers a scalable, practically deployable framework for vision–language reasoning on mobile hardware. The separation of general and scenario-specific functions enables efficient task updates and reduces inference overhead. Quantization and acceleration innovations permit operation under edge constraints, responding to the limitations inherent in mobile environments (memory, power budget, compute throughput). The documented superiority on diverse tasks, including VQA, GUI reasoning, hallucination suppression, and mathematical problem-solving, supports its claim as a first-tier mobile MLLM within its scale domain (Jin et al., 13 Oct 2025).

Table: AndesVL Component Summary

Component          | Role                          | Special Features
Visual Encoder     | Extracts image features       | Low-resolution / arbitrary-aspect-ratio support
LLM (Qwen3)        | Language backbone             | Instruction following, CoT, multilingual
1+N LoRA Adapters  | Scenario/domain adaptation    | Entity-weighted loss, modular
Quantization       | Memory/computation reduction  | QAT, OKV cache eviction, speculative decoding

In conclusion, AndesVL exemplifies a state-of-the-art approach to deploying powerful multimodal LLMs on edge devices, integrating robust generalist competence, domain adaptability via parameter-efficient adapters, and mobile-optimized computation. This architecture, supported by extensive multimodal and multilingual pretraining and stringent benchmarking, substantiates its deployment as a versatile tool for mobile AI applications requiring advanced vision–language intelligence (Jin et al., 13 Oct 2025).
