Spec-o3: Vision-Language Agent for Spectral Vetting
- Spec-o3 is a tool-augmented vision-language agent that automates rare celestial object vetting by employing interactive spectral visualization combined with chain-of-thought reasoning.
- It integrates a large pretrained Qwen2.5-VL backbone with a functional API to enable focused analysis of wavelength intervals, addressing the scaling challenges of manual inspection.
- A two-stage training protocol using supervised fine-tuning and outcome-based reinforcement learning achieves state-of-the-art performance with up to 76.5% macro-F1 across diverse survey domains.
Spec-o3 is a tool-augmented vision-language agent engineered for astronomer-aligned, automated vetting of rare celestial object candidates through multimodal spectral inspection. It addresses the scaling bottleneck imposed by manual expert inspection in the context of modern spectroscopic surveys, which generate vast volumes of data unsuitable for legacy human-in-the-loop workflows. Spec-o3 combines a large pretrained vision-language backbone (Qwen2.5-VL) with a functional API for interactive spectral visualization, orchestrated via interleaved chain-of-thought reasoning and tool usage. Its training protocol leverages curated expert inspection demonstrations and outcome-based reinforcement learning, resulting in state-of-the-art performance and generalization across multiple survey domains (Jia et al., 10 Jan 2026).
1. Architecture and System Workflow
Spec-o3 is built upon the Qwen2.5-VL vision-LLM, further augmented with a "spectral_visualization_tool" through a function-call API, enabling the agent to interactively render and zoom into specified wavelength intervals for spectral plots. The policy dictates the sequence of internal deliberations ("thought" blocks) and tool calls, operating on a context comprising both text and images.
At inference, a spectrum (1D array: wavelength, flux) is first rendered as a global plot, and the agent receives a textual prompt that poses a binary vetting question and lists key diagnostics. Spec-o3 then alternates between the following steps:
- Generating intermediate "think" blocks,
- Optionally invoking the spectral visualization tool with a wavelength interval to obtain focused, zoomed-in images,
- Appending each tool call and its returned image to the context,
- Terminating with a final answer block.
Each plot is encoded as an image object in the context; tool calls are represented as JSON-style arguments and special tokens that are excluded from the training loss. The agent leverages both image encoder features and textual conditioning.
2. Multimodal Chain-of-Thought Reasoning
Spec-o3's reasoning engine formalizes an interactive multimodal chain-of-thought (iMCoT) protocol. At each step, the policy conditions on the accumulated context of text and images and emits an action that is either FinalAnswer or CallTool. If CallTool is invoked, the spectral_visualization_tool generates a new zoomed image that is appended to the context; otherwise the episode terminates with the final answer.
Representative pseudocode for the inference loop:
```
for each spectrum S:
    ctx = []                           # history of (T, I) pairs
    I0 = VizTool(S, FullRange)         # initial global plot
    ctx.append((PromptText, I0))
    done = False
    while not done:
        action, content = πθ(ctx)
        if action == "CALL_TOOL":
            λmin, λmax, label = parse(content)
            Inew = VizTool(S, [λmin, λmax], label)
            ctx.append((content, Inew))
        else:                          # FinalAnswer
            answer = content
            done = True
    return answer
```
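The loop can be exercised end to end with stand-ins for the policy and the tool. The following is a minimal sketch under stated assumptions, not the authors' implementation: `scripted_policy`, `fake_viz_tool`, `run_episode`, and the `max_turns` budget are hypothetical names, the policy is a hard-coded script rather than a model, and images are replaced by descriptive strings.

```python
import json
from typing import List, Tuple

Context = List[Tuple[str, str]]  # (text, image placeholder) pairs


def fake_viz_tool(spectrum, lam_min, lam_max, label=None):
    """Stand-in for the visualization tool: returns a string, not an image."""
    return f"<plot {lam_min:.0f}-{lam_max:.0f} A label={label}>"


def scripted_policy(ctx: Context):
    """Hypothetical policy: zoom once on the H-alpha region, then answer."""
    if len(ctx) == 1:
        call = {"lambda_min": 6500, "lambda_max": 6620, "label": "H-alpha"}
        return "CALL_TOOL", json.dumps(call)
    return "FINAL_ANSWER", "YES"


def run_episode(spectrum, prompt, policy, viz_tool, max_turns=8):
    """Alternate policy steps and tool calls until a final answer is given."""
    wavelengths = spectrum["wavelength"]
    # The context starts with the prompt and a full-range plot.
    ctx = [(prompt, viz_tool(spectrum, min(wavelengths), max(wavelengths)))]
    for _ in range(max_turns):
        action, content = policy(ctx)
        if action == "CALL_TOOL":
            args = json.loads(content)
            image = viz_tool(spectrum, args["lambda_min"],
                             args["lambda_max"], args.get("label"))
            ctx.append((content, image))
        else:  # final answer ends the episode
            return content, ctx
    return None, ctx  # no answer within the turn budget
```

The turn budget is an added safeguard against a policy that never answers; the pseudocode above has no such limit.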
3. Post-Training and Optimization
Spec-o3 utilizes a two-stage post-training recipe:
- Stage 1: Supervised Fine-Tuning (SFT). SFT is performed on a dataset of approximately 1,000 expert-verified iMCoT trajectories. The optimization minimizes a masked cross-entropy loss over generated tokens, with tool-returned image tokens excluded.
- Stage 2: Agentic Reinforcement Learning (RL). Agentic RL employs Group Relative Policy Optimization (GRPO) on thousands of spectra with yes/no outcome labels. Each trajectory receives a reward based on both answer correctness and output formatting, and training maximizes the expected reward over trajectories sampled from the policy.
As in SFT, loss masking applies to tool-returned tokens, targeting policy proficiency rather than memorization.
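The two objectives can be written out as follows. This is a sketch in assumed notation, not the paper's exact formulas: $x$ is the prompt with the initial plot, $y_t$ are trajectory tokens, $m_t$ is the loss mask, $\tau$ is a trajectory, and $G$ is the GRPO group size. The additive split of the reward into an accuracy term and a format term is an assumption; the text only states that both factors contribute. The last line is the standard group-relative advantage used by GRPO.

```latex
\begin{align}
\mathcal{L}_{\mathrm{SFT}}(\theta)
  &= -\frac{1}{\sum_{t} m_t} \sum_{t} m_t \,
     \log \pi_\theta\!\left(y_t \mid y_{<t},\, x\right),
  \qquad
  m_t = \begin{cases}
          0 & \text{if } y_t \text{ is a tool-returned token} \\
          1 & \text{otherwise}
        \end{cases} \\
R(\tau) &= R_{\mathrm{acc}}(\tau) + R_{\mathrm{fmt}}(\tau) \\
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right] \\
\hat{A}_i &= \frac{R(\tau_i) - \operatorname{mean}\{R(\tau_j)\}_{j=1}^{G}}
                  {\operatorname{std}\{R(\tau_j)\}_{j=1}^{G}}
\end{align}
```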
4. Spectral Visualization Tool Integration
The spectral_visualization_tool is defined via a Python-style API:
```python
from typing import Optional

def spectral_visualization_tool(
    session_id: str,
    lambda_min: float,
    lambda_max: float,
    label: Optional[str] = None,
) -> "Image":  # PNG plot; the concrete image type is not specified here
    ...
```
Inputs consist of a persistent session ID (which internally caches the spectrum), a wavelength interval, and an optional annotation label. Outputs are PNG visualizations of flux versus wavelength within the specified bounds. Each returned image is represented as "<image id=...>" in the LLM context; embeddings are processed using image encoder features aligned to textual tokens.
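The session caching and interval selection described above might look like the sketch below. It is illustrative only and uses the standard library: `open_session`, `select_window`, and the `_SESSIONS` cache are hypothetical names, and the PNG rendering step is omitted, so the function returns the data a zoomed plot would draw rather than an image.

```python
from typing import Dict, List, Optional, Tuple

Spectrum = Tuple[List[float], List[float]]  # (wavelengths, fluxes)
_SESSIONS: Dict[str, Spectrum] = {}          # hypothetical per-session cache


def open_session(session_id: str, wavelengths: List[float],
                 fluxes: List[float]) -> None:
    """Cache a spectrum so later calls only need the session ID."""
    if len(wavelengths) != len(fluxes):
        raise ValueError("wavelength and flux arrays must have equal length")
    _SESSIONS[session_id] = (list(wavelengths), list(fluxes))


def select_window(session_id: str, lambda_min: float, lambda_max: float,
                  label: Optional[str] = None) -> dict:
    """Return the samples a zoomed plot would draw for [lambda_min, lambda_max]."""
    if lambda_min >= lambda_max:
        raise ValueError("lambda_min must be smaller than lambda_max")
    wavelengths, fluxes = _SESSIONS[session_id]
    points = [(w, f) for w, f in zip(wavelengths, fluxes)
              if lambda_min <= w <= lambda_max]
    return {"label": label, "xlim": (lambda_min, lambda_max), "points": points}
```

Keeping the spectrum behind a session ID means the model only has to emit two numbers and an optional label per call, which keeps tool-call arguments short.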
5. Evaluation Metrics and Comparative Results
Spec-o3's performance is evaluated on five rare-object identification tasks (CV, CS, SS, MG, WD) from LAMOST using the macro-F1 score.
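For reference, macro-F1 is the unweighted mean of per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall. The sketch below computes it for a binary YES/NO task. The function names are hypothetical, and averaging over the two answer classes is an assumption, since this summary does not state how the paper aggregates within or across the five tasks.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             classes: Sequence[str] = ("YES", "NO")) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

Because every class counts equally, macro-F1 penalizes a model that ignores the rare positive class even when overall accuracy is high, which suits rare-object vetting.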
Key results indicate:
- Spec-o3-7B achieves 76.5% macro-F1, outperforming o3-proprietary (52.3%), Qwen2.5-VL-7B (28.3%), and specialist nets (CarbonNet, AstroCLIP, GaiaNet ≈64–65%).
- Cross-survey generalization (SDSS/DESI): F1 ≈81.1% (SDSS), 77.4% (DESI) for Spec-o3-7B, substantially ahead of o3.
- Cross-task generalization (unseen spectral types): Spec-o3-7B scores 76.4% F1, compared to o3 (60.9%) and Qwen2.5-VL-7B (30.5%).
- Ablation study: Removing SFT, RL, or tool access each severely degrades performance, confirming their necessity.
| Model | Macro-F1 |
|---|---|
| CarbonNet | 64.3% |
| AstroCLIP | 64.5% |
| GaiaNet | 64.9% |
| GPT-4.1 | 29.8% |
| o3 | 52.3% |
| Qwen2.5-VL-7B | 28.3% |
| Spec-o3-7B | 76.5% |
6. Interpretability, Expert Validation, and Transparency
Spec-o3 produces coherent multimodal reasoning traces, integrating spectral feature zooms (e.g., Hα at 6563 Å, FWHM > 1000 km/s) and physically consistent commentary, leading to explicit decisions (YES or NO). Example trace:
```
<image>
<think>"I see a broad Hα emission with FWHM≈1200 km/s—characteristic of a quiescent CV disk."</think>
<tool_call>{"lambda_min":6500,"lambda_max":6620}</tool_call>
<think>"Zoom confirms He II 4686 emission."</think>
<answer>\boxed{YES}</answer>
```
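Traces in this tagged format are straightforward to audit programmatically. The following sketch extracts the tool calls and the final decision from a trace string. It assumes the tag names and the `\boxed{...}` answer convention shown in the example; `parse_trace` and the two regular expressions are hypothetical helpers, not part of Spec-o3.

```python
import json
import re

# Non-greedy match so several tool calls in one trace are split correctly.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*\\boxed\{(\w+)\}\s*</answer>")


def parse_trace(trace: str):
    """Return (list of tool-call argument dicts, final answer or None)."""
    tool_calls = [json.loads(body) for body in TOOL_CALL_RE.findall(trace)]
    match = ANSWER_RE.search(trace)
    answer = match.group(1) if match else None
    return tool_calls, answer
```

A missing or malformed answer block yields `None`, which is one simple way an output-format check could be implemented.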
Six expert astronomers rated 100 trajectories for coherence and physical consistency, and the scores assigned by LLM judges (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet, Grok-4) were compared with the human scores using Spearman correlation. In paired preference tests, Spec-o3 was favored over o3-proprietary in at least 80% of cases.
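Spearman correlation measures how well two sets of scores agree in rank order, which makes it a natural check of judge-versus-human agreement. The sketch below computes it as the Pearson correlation of average ranks, using only the standard library. The function names are hypothetical, and any scores passed in are the caller's own; no values from the study are reproduced here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

The function raises `ZeroDivisionError` when either input is constant, since rank correlation is undefined in that case.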
All decisions are supported by explicit multimodal traces, facilitating error analysis and auditability. Physical consistency is maintained by stepwise anchoring to specific spectral features.
7. Prospects and Ongoing Development
Spec-o3 is the first tool-augmented vision-language agent to accurately and interpretably automate rare object vetting through spectral inspection, replicating expert workflows. The two-stage training protocol—cold-start supervised fine-tuning and outcome-based RL—is indispensable for domain-specific tool mastery. Notable strengths include survey adaptation and generalization to novel spectral categories.
Anticipated enhancements comprise:
- Extension to wider spectral classes (e.g., emission-line galaxies, quasars) and lower SNR domains,
- Integration of additional tool modalities for photometry, time-series, and catalog cross-matching,
- Implementation of confidence calibration and abstention mechanisms for risk control,
- Scaling to next-generation surveys (WEAVE, 4MOST) via instrument-specific tool adaptations.
A plausible implication is that tool-augmented agents like Spec-o3 will become standard in high-throughput astronomical survey pipelines, reducing the reliance on manual expert vetting and enhancing catalog reliability (Jia et al., 10 Jan 2026).