
Intern-S1: A Scientific Multimodal Foundation Model (2508.15763v2)

Published 21 Aug 2025 in cs.LG, cs.CL, and cs.CV

Abstract: In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in widely followed fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, these areas either still rely on expert models or see general foundation models progress far more slowly than in popular domains, which is far from sufficient for transforming scientific research and leaves a substantial gap between open-source and closed-source models. To mitigate this gap and explore a step further toward AGI, we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities along with the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.


Summary

  • The paper introduces Intern-S1, an open-source multimodal model with 28B activated parameters, achieving state-of-the-art scientific reasoning.
  • It leverages a Mixture-of-Experts architecture with dynamic tokenization and specialized encoders to integrate images, text, and time series data.
  • Optimized with both offline and online reinforcement learning, Intern-S1 outperforms closed-source models on a range of scientific benchmarks.

Intern-S1: A Scientific Multimodal Foundation Model

Introduction

Intern-S1 is designed as an open-source multimodal foundation model that aims to bridge the gap between open-source and closed-source models in scientific domains. The paper introduces Intern-S1 as a specialized generalist model equipped with advanced understanding and reasoning capabilities tailored for scientific data analysis across multiple modalities. The architecture is a Mixture-of-Experts (MoE) model with 28 billion activated parameters out of 241 billion total, continually pre-trained on 5T tokens, of which more than 2.5T come from scientific domains.
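
To make the distinction between activated and total parameters concrete, here is a minimal top-k MoE feed-forward layer in PyTorch. The expert count, layer sizes, and top-k value are illustrative assumptions, not Intern-S1's reported configuration; the point is only that each token runs through just k of the experts, so the activated parameter count is a fraction of the total.

```python
# Minimal sketch of a top-k Mixture-of-Experts feed-forward layer (PyTorch).
# Sizes, expert count, and k are illustrative placeholders, not Intern-S1's config:
# only the k experts selected per token contribute "activated" parameters,
# while all experts count toward the total parameter budget.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.router(x)                    # (batch, seq, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # route each token to its k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
y = layer(torch.randn(2, 16, 1024))                # only 2 of the 8 experts run per token
```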

Intern-S1 aims to address the inherent challenges of scientific research, such as comprehending diverse scientific modalities and performing rigorous reasoning. The model distinguishes itself by outperforming closed-source models in professional tasks specific to scientific domains, indicating its high efficacy in specialized settings.

Figure 1: Performance comparison among open-source and closed-source models on image-text and text-only benchmarks. The results demonstrate that Intern-S1 has top-tier general reasoning capability among open-source models and outperforms closed-source models in scientific domains.

Model Architecture

Intern-S1 employs a sophisticated architecture incorporating:

  • Vision Transformer (ViT): Utilized for processing images, providing fine-grained visual representations, critical in scientific data analysis.
  • Dynamic Tokenizer: Designed to optimize encoding efficiency by segmenting scientific data into orthogonal embedding spaces, mitigating issues such as reduced compression ratios and context sensitivity (a minimal routing sketch follows Figure 2 below).
  • Time Series Encoder: Facilitates the handling of sequential numerical data typical of scientific measurements.

    Figure 2: Architecture of Intern-S1, consisting of a MoE LLM with a vision encoder, a time-series encoder, and a dynamic tokenizer that switches the tokenization and embedding strategies between natural language and scientific inputs.
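
As a rough illustration of what such conditional tokenization could look like, the sketch below routes SMILES-like spans to a separate tokenizer whose IDs are shifted into their own range of the embedding table. The detection heuristic and the toy tokenizers are hypothetical stand-ins, not Intern-S1's actual implementation.

```python
# Hypothetical sketch of a dynamic tokenizer: natural-language text and
# SMILES-like scientific spans are handled by different tokenizers whose
# IDs occupy disjoint ("orthogonal") ranges of the embedding table.
import re

SMILES_CHARS = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$%/\\.]+$")

def looks_like_smiles(span: str) -> bool:
    """Crude heuristic stand-in; the paper's actual detection mechanism may differ."""
    return (len(span) > 5 and SMILES_CHARS.match(span) is not None
            and any(c in span for c in "=#()[]"))

class DynamicTokenizer:
    def __init__(self, text_tokenizer, sci_tokenizer, sci_id_offset: int):
        self.text_tok = text_tokenizer    # e.g. a BPE tokenizer for natural language
        self.sci_tok = sci_tokenizer      # e.g. a character-level tokenizer for SMILES
        self.offset = sci_id_offset       # shifts scientific IDs into their own embedding range

    def encode(self, text: str):
        ids = []
        for span in text.split():
            if looks_like_smiles(span):
                ids += [i + self.offset for i in self.sci_tok(span)]
            else:
                ids += self.text_tok(span)
        return ids

# Toy word/character tokenizers just to make the sketch runnable.
text_vocab, sci_vocab = {}, {}
toy_text_tok = lambda s: [text_vocab.setdefault(s, len(text_vocab))]
toy_sci_tok = lambda s: [sci_vocab.setdefault(c, len(sci_vocab)) for c in s]

tok = DynamicTokenizer(toy_text_tok, toy_sci_tok, sci_id_offset=100_000)
print(tok.encode("aspirin is CC(=O)Oc1ccccc1C(=O)O"))
```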

Data and Training Strategy

Intern-S1's training involves a meticulously curated data pipeline that emphasizes the contribution of scientific data:

  • PDF Document Parsing: A page-level parsing pipeline integrates low-cost and high-cost parsers to extract high-quality scientific data.
  • Domain-centric Web Data Parsing: Customized strategies for unique URL domains are employed to improve data extraction accuracy.
  • Batch Size Strategy: A batch-size warmup strategy balances training efficiency with model performance (a schedule sketch follows Figure 3 below).

Figure 3: Left: the workflow of the dynamic tokenizer. Right: the compression ratio of different tokenizers on scientific data (SMILES format), where Intern-S1 outperforms the others by over 70%.
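
The batch-size warmup mentioned above can be pictured as a simple staged schedule that grows the global batch as training progresses. The stage boundaries and batch sizes below are arbitrary placeholders for illustration, not the values used to train Intern-S1.

```python
# Illustrative batch-size warmup schedule: the global batch grows in stages
# as more tokens are consumed. Boundaries and sizes are placeholders only.
def batch_size_at(tokens_seen: int) -> int:
    schedule = [
        (0,             2_048),   # early training: smaller batches, faster feedback
        (500_000_000,   4_096),
        (2_000_000_000, 8_192),   # later training: large batches for stable gradients
    ]
    size = schedule[0][1]
    for boundary, batch in schedule:
        if tokens_seen >= boundary:
            size = batch
    return size

for t in (0, 1_000_000_000, 3_000_000_000):
    print(f"{t:>13,} tokens -> batch size {batch_size_at(t)}")
```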

Reinforcement Learning (RL) Optimization

Intern-S1 undergoes a two-stage RL optimization process:

  • Offline RL: Supervised fine-tuning on responses selected via best-of-N sampling, enhancing the model's performance across various linguistic and scientific domains.
  • Online RL: Utilizes the Mixture-of-Rewards (MoR) framework, facilitating simultaneous learning across more than 1000 tasks. This framework harmonizes diverse reward signals, which is crucial for optimizing both domain-specialized and general-purpose tasks (a minimal reward-mixing sketch follows Figure 4 below).

    Figure 4: The Mixture-of-Rewards framework.
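
To give a flavor of how heterogeneous reward signals can be combined across many tasks, the sketch below mixes verifiable (rule-checked) rewards with reward-model scores under per-task normalization. The task types, the toy reward model, and the normalization scheme are assumptions for illustration, not the MoR design specified in the paper.

```python
# Hypothetical sketch of mixing heterogeneous reward signals across tasks.
# Verifiable tasks (e.g. math answers checkable by a rule) yield 0/1 rewards,
# while open-ended tasks fall back to a reward-model score in [0, 1].
# Per-task normalization keeps the scales comparable before policy updates.
from statistics import mean, pstdev

def verifiable_reward(answer: str, reference: str) -> float:
    return 1.0 if answer.strip() == reference.strip() else 0.0

def reward_model_score(prompt: str, answer: str) -> float:
    # Stand-in for a learned reward model; here, longer answers simply score higher.
    return min(len(answer) / 200.0, 1.0)

def mixed_rewards(task_type: str, prompts, answers, references=None):
    if task_type == "verifiable":
        raw = [verifiable_reward(a, r) for a, r in zip(answers, references)]
    else:
        raw = [reward_model_score(p, a) for p, a in zip(prompts, answers)]
    mu, sigma = mean(raw), pstdev(raw) or 1.0     # normalize within the task batch
    return [(r - mu) / sigma for r in raw]

print(mixed_rewards("verifiable", ["2+2?"] * 2, ["4", "5"], references=["4", "4"]))
print(mixed_rewards("open_ended", ["explain MoE"] * 2, ["short", "a much longer answer " * 5]))
```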

Evaluation and Performance

Intern-S1 is evaluated on a comprehensive suite of benchmarks for general and scientific reasoning. The results demonstrate that Intern-S1 consistently outperforms previous open-source models and is competitive with proprietary APIs, especially in scientific reasoning.

  • General Reasoning: Achieves superior performance on text-only benchmarks like MMLU-Pro and AIME2025.
  • Scientific Reasoning: Sets a new standard on both text-only and multimodal benchmarks across diverse domains such as chemistry, materials science, and physics.

    Figure 5: Performance trend of LLMs across popular and low-resource (science) tasks.

Conclusion

Intern-S1 represents a significant step forward in enhancing open-source models' capabilities in scientific reasoning while maintaining competitiveness in general reasoning tasks. Through comprehensive integration of multimodal data and advanced training strategies, Intern-S1 is poised to accelerate scientific discovery and innovation. The release of the model and accompanying toolchains is anticipated to catalyze future explorations in AI-driven scientific research, pushing towards the goal of achieving AGI.
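
For readers who want to try the released checkpoint, below is a minimal loading sketch using Hugging Face transformers. It assumes the repository exposes a standard causal-LM interface via trust_remote_code; the exact entry points, chat template, and multimodal (image or time-series) input handling should be checked against the model card.

```python
# Minimal sketch of loading the released checkpoint with Hugging Face transformers.
# Assumes the repo works through the standard Auto* classes with trust_remote_code;
# see https://huggingface.co/internlm/Intern-S1 for the authoritative usage,
# especially for multimodal inputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/Intern-S1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)

prompt = "Propose a retrosynthesis route for aspirin."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```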
