Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

Published 31 Mar 2026 in cs.AI, cs.CL, and cs.CV | (2603.29211v1)

Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.


Authors (8)

Summary

The paper introduces Xuanwu VL-2B, a compact 2B-parameter model that balances fine-grained visual perception with robust business content moderation under compute constraints.
It uses an InternViT-300M visual backbone and a progressive three-stage training pipeline, reaching 94.38% average recall on business moderation tasks and 82.82% weighted recall on adversarial OCR.
The study offers practical insights for scalable multimodal architectures, highlighting low inference costs and potential future enhancements in policy compliance and explainable AI.

Xuanwu VL-2B: A Compact, Industrial-Grade Multimodal Foundation Model for Content Ecosystems

Motivation and Problem Setting

The paper addresses the operational deficiencies of mainstream multimodal LLMs (MLLMs) in industrial content moderation and adversarial scenarios. While models such as LLaVA, Qwen-VL, and InternVL demonstrate strong open-domain cross-modal capabilities, their utility in real deployments is compromised by inference overhead, catastrophic forgetting, insufficient granularity in moderation, and weak adversarial robustness. The key industrial requirements, high-precision moderation and adversarial-content defense, demand architectures capable of fine-grained perceptual discrimination and robust policy alignment under compute constraints. Xuanwu VL-2B is proposed as a compact, approximately 2B-parameter multimodal foundation model that balances fine-grained visual perception, language-semantic alignment, business performance, and inference cost.

Model Architecture

The core architecture adopts InternViT-300M as the visual backbone and Qwen3 1.7B as the language backbone, connected via a lightweight MLP projector. It avoids complex MoE or Q-Former connectors in order to maximize training stability and deployment efficiency. A quantitative comparison of vision encoder candidates (InternViT-300M, AIMv2-Huge, SAILViT-Huge) identifies InternViT-300M as the best trade-off between multimodal capability, OCR robustness, and inference cost. Dual-encoder fusion experiments (e.g., pairing with GOT-ViT) show only marginal gains at increased overhead, so a single vision backbone is retained.
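The "approximately 2B" budget can be checked with a back-of-the-envelope parameter count. The sketch below is illustrative only: the projector layout (LayerNorm, Linear, GELU, Linear), the hidden widths, and the pixel-unshuffle factor are assumptions not stated in the source, and the backbone sizes are the nominal figures in the component names.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return d_in * d_out + d_out


def projector_params(vit_dim: int, llm_dim: int, unshuffle: int = 2) -> int:
    """Parameter count of a two-layer MLP projector.

    Assumed layout (not confirmed by the source):
    LayerNorm -> Linear -> GELU -> Linear, applied after pixel unshuffle,
    which multiplies the channel width by unshuffle**2.
    """
    d_in = vit_dim * unshuffle ** 2
    layer_norm = 2 * d_in  # scale and shift vectors
    return layer_norm + linear_params(d_in, llm_dim) + linear_params(llm_dim, llm_dim)


# Nominal backbone sizes from the component names; widths are assumptions.
VIT_PARAMS = 300_000_000    # InternViT-300M
LLM_PARAMS = 1_700_000_000  # Qwen3 1.7B
VIT_DIM = 1024              # assumed vision hidden width
LLM_DIM = 2048              # assumed language hidden width

proj = projector_params(VIT_DIM, LLM_DIM)
total = VIT_PARAMS + LLM_PARAMS + proj
print(f"projector: {proj / 1e6:.1f}M parameters")
print(f"total:     {total / 1e9:.2f}B parameters")
```

Under these assumptions the projector contributes well under 1% of the total, which is consistent with describing it as lightweight.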

Figure 1: Xuanwu VL-2B utilizes InternViT-300M as the visual backbone and Qwen3 1.7B as the language backbone, interfaced through an MLP projector.

A Dynamic High-Resolution Perception mechanism is implemented: images are adaptively split into up to 12 tiles of $448 \times 448$ pixels according to aspect ratio, a global thumbnail is preserved, and visual features are pixel-unshuffled to limit token growth. This preserves both local detail and global context under a fixed compute budget.
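A minimal sketch of aspect-ratio-driven tiling follows. It is not the authors' implementation: the tie-breaking rule, the patch size of 14, the unshuffle factor of 2, and the choice to add a thumbnail only when the image is actually tiled are all assumptions made for illustration. Only the tile size (448) and the tile cap (12) come from the text.

```python
TILE = 448      # tile side in pixels (from the text)
MAX_TILES = 12  # upper bound on tiles (from the text)
PATCH = 14      # assumed ViT patch size
UNSHUFFLE = 2   # assumed pixel-unshuffle factor per spatial axis


def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple[int, int]:
    """Return the (cols, rows) grid whose aspect ratio best matches the image.

    Ties are broken by picking the grid whose pixel area is closest to the
    image area, so small images are not blown up into many tiles.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (
            abs(g[0] / g[1] - target),
            abs(g[0] * g[1] * TILE * TILE - width * height),
        ),
    )


def visual_tokens(width: int, height: int) -> int:
    """Count visual tokens after tiling and pixel unshuffle."""
    cols, rows = choose_grid(width, height)
    tiles = cols * rows
    views = tiles + 1 if tiles > 1 else 1  # thumbnail only when tiled (assumption)
    per_view = (TILE // PATCH) ** 2 // UNSHUFFLE ** 2
    return views * per_view


print(choose_grid(1920, 1080), visual_tokens(1920, 1080))
```

With these assumed constants each view yields a quarter of the raw patch count, which is how pixel unshuffle keeps the sequence length manageable as the tile count grows.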

Training Pipeline

The model is trained via a progressive three-stage pipeline:

Pre-training: Foundation image-text alignment (1.3M high-quality alignment samples; 17.33M paired image-text samples), optimizing multimodal autoregressive cross-entropy loss.
Mid-training: Injection of business knowledge, instruction-following enhancement, moderation, and adversarial spam data; includes base-retention sampling to mitigate catastrophic forgetting, quality filtering via an LLM judge and CLIP scoring, and diversity balancing with K-Means clustering.
Post-training: Supervised fine-tuning with high-fidelity SFT data curated by model-in-the-loop teacher verification, RL alignment using GRPO, and extensive Chain-of-Thought (CoT) annotation for explainable moderation.
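Base-retention sampling in the mid-training stage can be pictured as replaying a slice of general-purpose data alongside the business data. The sketch below is a hypothetical illustration: the function name `build_mid_training_mix` and the retention ratio of 0.3 are placeholders, not values or interfaces from the paper.

```python
import random


def build_mid_training_mix(business, general, retention_ratio=0.3, seed=0):
    """Blend business samples with a replayed slice of general-purpose data.

    retention_ratio is the target share of the final mix drawn from general
    data; 0.3 is a placeholder, not a figure from the paper.
    """
    if not 0 <= retention_ratio < 1:
        raise ValueError("retention_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_business + n_general) = retention_ratio.
    n_general = round(len(business) * retention_ratio / (1 - retention_ratio))
    n_general = min(n_general, len(general))  # cannot replay more than exists
    mix = list(business) + rng.sample(list(general), n_general)
    rng.shuffle(mix)
    return mix


# Toy data: tagged tuples standing in for real training samples.
business = [("biz", i) for i in range(70)]
general = [("gen", i) for i in range(100)]
mix = build_mid_training_mix(business, general)
share = sum(1 for tag, _ in mix if tag == "gen") / len(mix)
print(len(mix), round(share, 2))
```

Keeping the general share explicit makes the specialization versus retention trade-off a single tunable number, which is convenient for ablations.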

The RL reward formulation combines classification, output-format compliance, and OCR alignment, specifically penalizing missed and hallucinated characters. GRPO alignment improves adversarial weighted OCR recall from 76.42% to 82.82%. CoT systematically structures reasoning output for transparent policy violation attribution.
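A composite reward of this kind, together with the group-relative normalization that GRPO applies to rewards sampled for the same prompt, might look like the sketch below. The weights, the character-overlap definition of `ocr_score`, and the use of the population standard deviation are assumptions chosen for illustration; the source does not specify them.

```python
from collections import Counter
from statistics import mean, pstdev


def ocr_score(predicted: str, reference: str) -> float:
    """Character-level overlap that penalizes misses and hallucinations."""
    pred, ref = Counter(predicted), Counter(reference)
    matched = sum((pred & ref).values())
    missed = sum((ref - pred).values())        # in reference, not predicted
    hallucinated = sum((pred - ref).values())  # predicted, not in reference
    total = matched + missed + hallucinated
    return 1.0 if total == 0 else matched / total


def reward(pred_label, gold_label, format_ok, pred_text, gold_text,
           weights=(0.5, 0.2, 0.3)):
    """Weighted sum of classification, format, and OCR terms.

    The weights are hypothetical placeholders.
    """
    w_cls, w_fmt, w_ocr = weights
    return (w_cls * float(pred_label == gold_label)
            + w_fmt * float(format_ok)
            + w_ocr * ocr_score(pred_text, gold_text))


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to the same image: one exact, one with an OCR slip.
group = [
    reward("ad", "ad", True, "abc", "abc"),
    reward("ad", "ad", True, "abd", "abc"),
]
print(group_advantages(group))
```

Because advantages are computed relative to the group mean, a response is rewarded only for being better than its siblings, so the OCR term keeps providing signal even when every sample gets the label right.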

Evaluation and Results

General Capability Retention

Averaged over seven OpenCompass multimodal metrics, Xuanwu VL-2B outperforms InternVL 3.5 2B (67.90 vs. 64.27). On text-only benchmarks, the model remains competitive on the nine-benchmark average (58.38 vs. 59.02) but loses ground on C-Eval and arithmetic code tasks due to format non-compliance and business data compression.

Business Moderation

Average recall across seven moderation labels reaches 94.38%, a 46.4-point improvement over InternVL 3.5 2B. Recall is high across ad, high-risk, illegal, pornographic, vulgar, and normal categories, demonstrating robust specialization without sacrificing general perception.

Figure 2: Example of detecting cursive diversion codes in complex product images, demonstrating fine-grained adversarial OCR robustness.

Figure 3: Case of low-opacity contact information hidden in an image layer as a watermark, intercepted and decoded.

Adversarial OCR

Weighted overall recall in adversarial OCR scenarios (eight attack types) is 82.82%, significantly higher than commercial Gemini-2.5-Pro (76.72%) and InternVL baselines (64.79%). Notable improvements are shown in AIGC-fusion, noise, warp, and watermark attacks. Ablation studies confirm that SFT curation and GRPO alignment are critical for both recall and reduction in false positives.
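How a weighted overall recall differs from a plain average over attack types can be shown with a short sketch. It assumes the weights are per-attack sample counts, which the source does not confirm, and the numbers in the usage example are toy values, not results from the paper.

```python
def weighted_recall(per_attack):
    """Aggregate per-attack recall using sample counts as weights.

    per_attack maps an attack name to (recall, n_samples). Weighting by
    sample count is an assumption about how the overall figure is formed.
    """
    total_weight = sum(n for _, n in per_attack.values())
    if total_weight == 0:
        raise ValueError("no samples to aggregate")
    return sum(r * n for r, n in per_attack.values()) / total_weight


def macro_recall(per_attack):
    """Unweighted mean of per-attack recall, for comparison."""
    return sum(r for r, _ in per_attack.values()) / len(per_attack)


# Toy values only: a small easy category and a large hard one.
toy = {"watermark": (0.90, 100), "warp": (0.70, 300)}
print(round(weighted_recall(toy), 4), round(macro_recall(toy), 4))
```

The weighted figure is pulled toward the larger category, so it reflects the traffic mix rather than treating every attack type as equally common.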

Figure 4: Micro-font variant-character diversion—text shrunk to extreme scales for evasion.

Figure 5: Diversion keywords camouflaged in long prose paragraphs, demonstrating the model's context-resilient extraction.

Figure 6: Severe distortion and warping, breaking character morphology to obfuscate detection, handled by Xuanwu VL-2B.

Qualitative Explainability

Chain-of-Thought outputs illustrate the model's multi-step reasoning: careful extraction of concealed cues (handwriting, watermarks, micro-fonts, noise camouflage), attribution of a rationale, and explainable policy enforcement.

Figure 7: AIGC-forgery—contact info embedded into image textures via generative diffusion models.

Figure 8: Combination-camouflage—group numbers hidden in benign academic backgrounds using symbolic indirection.

Figure 9: Concealed diversion text distributed beneath long greeting images, challenging perceptual pipelines.

Implications and Future Directions

Practical Implications

Xuanwu VL-2B demonstrates that industrial-grade multimodal moderation can be achieved under tight parameter and compute budgets by integrating fine-grained perception, robust alignment, and business-oriented data curation. The architecture is deployable at scale with low VRAM/latency overhead, showing superior recall and explainability in adversarial settings. Immediate applications include content moderation, adversarial forgery interception, and explainable policy compliance on social platforms.

Theoretical Implications

The results challenge prevailing assumptions that capability and robustness in adversarial business scenarios require massive parameter expansion or complex MoE structures. Careful vision encoder selection, tailored pipeline progression, and reward-driven RL allow retention of general capabilities while achieving strong domain specialization and robustness.

Speculation on Future Developments

Plans include moving toward native multimodal architectures (unified token sequences), full sensory integration (audio and video), global multilingual expansion, and further quantized inference optimizations. This may enable more agile, agentic content ecosystem foundations, with continuous self-adversarial evolution, policy distillation, and adaptation to new regulatory requirements and emerging unethical threats.

Limitations

Despite strong performance, the model remains susceptible to extreme edge cases: ultra-dense overlapping watermarks, pixel-level hidden text beyond perceptual resolution, and logical shortcut errors in long-context chain-of-thought outputs. Further improvements could target receptive field expansion and format compliance.

Conclusion

The study establishes Xuanwu VL-2B as a robust, compact foundation for multimodal content ecosystems, achieving high recall in business content moderation and adversarial OCR scenarios while retaining general capability. The architectural and procedural methodologies offer a blueprint for balancing domain adaptation, robustness, and efficiency in future multimodal agent systems, pointing toward increasingly adaptive and explainable industrial moderation solutions (2603.29211).
