Spec-o3: Vision-Language Agent for Spectral Vetting

Updated 17 January 2026

Spec-o3 is a tool-augmented vision-language agent that automates rare celestial object vetting by employing interactive spectral visualization combined with chain-of-thought reasoning.
It integrates a large pretrained Qwen2.5-VL backbone with a functional API to enable focused analysis of wavelength intervals, addressing the scaling challenges of manual inspection.
A two-stage training protocol using supervised fine-tuning and outcome-based reinforcement learning achieves state-of-the-art performance, reaching 76.5% macro-F1 on five LAMOST rare-object tasks and generalizing across survey domains.

Spec-o3 is a tool-augmented vision-language agent engineered for astronomer-aligned, automated vetting of rare celestial object candidates through multimodal spectral inspection. It addresses the scaling bottleneck imposed by manual expert inspection in the context of modern spectroscopic surveys, which generate vast volumes of data unsuitable for legacy human-in-the-loop workflows. Spec-o3 combines a large pretrained vision-language backbone (Qwen2.5-VL) with a functional API for interactive spectral visualization, orchestrated via interleaved chain-of-thought reasoning and tool usage. Its training protocol leverages curated expert inspection demonstrations and outcome-based reinforcement learning, resulting in state-of-the-art performance and generalization across multiple survey domains (Jia et al., 10 Jan 2026).

1. Architecture and System Workflow

Spec-o3 is built on the Qwen2.5-VL vision-language model, augmented with a "spectral_visualization_tool" exposed through a function-call API, which lets the agent interactively render and zoom into specified wavelength intervals of a spectrum. The policy $\pi_\theta$ governs the sequence of internal deliberations ("thought" blocks) and tool calls, operating on a context that comprises both text and images.

At inference, a spectrum $S$ (1D arrays of wavelength and flux) is first rendered as a global plot $I_0$, and the agent receives a textual prompt $T_0$ that poses a binary vetting question and lists key diagnostics. Spec-o3 then alternates between the following steps:

Generating an intermediate "think" block ($T_n$),
Optionally invoking the spectral visualization tool (with $\Delta\lambda = (\lambda_\text{min}, \lambda_\text{max})$) to obtain a focused, zoomed-in image ($I_{n+1}$),
Appending $(T_n, I_{n+1})$ to the context,
Terminating with a final answer block.

Each plot is encoded as an image object in the context; tool calls are represented as JSON-style arguments wrapped in special tokens that are excluded from the training loss. The agent conditions on both image encoder features and the accumulated text.

2. Multimodal Chain-of-Thought Reasoning

Spec-o3's reasoning engine formalizes an interactive multimodal chain-of-thought (iMCoT) protocol. At each step, the agent's state $s_t = \{ (I_k, T_k) \}_{k=0..t}$ evolves as:

$a_t \sim \pi_\theta(a\mid s_t)$

where $a_t$ is either a FinalAnswer or a CallTool action. If CallTool is chosen, the spectral_visualization_tool generates $I_{t+1}$ and the episode continues; otherwise the episode terminates with answer $T_N$.
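The state transition described above can be sketched with a few typed records. This is a minimal illustration, not the authors' implementation: the class names `CallTool`, `FinalAnswer`, and `State`, the `step` helper, and the `render` callback are all hypothetical, and images are stood in for by string identifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class CallTool:
    """Hypothetical action: request a zoomed plot of [lambda_min, lambda_max]."""
    lambda_min: float
    lambda_max: float


@dataclass(frozen=True)
class FinalAnswer:
    """Hypothetical action: end the episode with an answer string."""
    text: str


Action = Union[CallTool, FinalAnswer]


@dataclass
class State:
    # Ordered (image_id, text) pairs, mirroring s_t = {(I_k, T_k)}.
    # image_id is None when no image accompanies the text.
    history: List[Tuple[Optional[str], str]] = field(default_factory=list)


def step(state: State, thought: str, action: Action,
         render: Callable[[float, float], str]) -> Tuple[State, bool]:
    """Apply one action and return (next_state, done) without mutating state."""
    if isinstance(action, CallTool):
        image_id = render(action.lambda_min, action.lambda_max)
        return State(state.history + [(image_id, thought)]), False
    return State(state.history + [(None, action.text)]), True
```

Keeping `step` free of side effects makes it straightforward to replay or audit a trajectory step by step.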

Representative pseudocode for the inference loop:

def vet_spectrum(S, policy, viz_tool, parse, prompt_text, full_range):
    """Run one iMCoT episode on spectrum S (call once per spectrum)."""
    ctx = []                                  # history of (T, I) pairs
    I0 = viz_tool(S, full_range)              # initial global plot
    ctx.append((prompt_text, I0))
    while True:
        action, content = policy(ctx)         # a_t ~ π_θ(a | s_t)
        if action == "CALL_TOOL":
            lam_min, lam_max, label = parse(content)
            I_new = viz_tool(S, (lam_min, lam_max), label)
            ctx.append((content, I_new))      # zoomed plot joins the context
        else:                                 # FinalAnswer ends the episode
            return content

3. Post-Training and Optimization

Spec-o3 utilizes a two-stage post-training recipe:

Stage 1: Supervised Fine-Tuning (SFT). SFT is performed on a dataset $\mathcal{D}_\text{exp}$ of approximately 1,000 expert-verified iMCoT trajectories. The optimization minimizes a masked cross-entropy loss over generated tokens, with tool-returned image tokens excluded:

$\mathcal{L}_\text{SFT}(\theta) = -\sum_{(x,y) \in \mathcal{D}_\text{exp}} \log \pi_\theta(y \mid x)$
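To make the masking concrete, here is a minimal sketch of a masked negative log-likelihood for a single trajectory. It assumes per-token log-probabilities are already available as plain floats; the function name `masked_nll` and the 0/1 mask convention are illustrative choices, not taken from the source.

```python
import math
from typing import Sequence


def masked_nll(token_log_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Sum -log p over tokens with mask 1; masked tokens (mask 0) contribute nothing."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    return -sum(lp for lp, m in zip(token_log_probs, loss_mask) if m)


# Four tokens: the two middle ones stand in for tool-returned tokens and are masked.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.5)]
mask = [1, 0, 0, 1]
loss = masked_nll(log_probs, mask)  # equals 2 * log(2): only the two unmasked tokens count
```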

Stage 2: Agentic Reinforcement Learning (RL). Agentic RL employs Group Relative Policy Optimization (GRPO) using thousands of spectra with yes/no outcome labels. A trajectory $\tau$ receives a reward $R(\tau)$ based on both correctness and output formatting:

$R(\tau) = \begin{cases} +1, & \text{correct data, well formatted} \\ 1-\alpha, & \text{correct data, bad format} \\ 0, & \text{incorrect data, well formatted} \\ -\alpha, & \text{incorrect data, bad format} \end{cases}$
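The four cases reduce to a correctness term minus a formatting penalty. A minimal sketch follows; the function name `outcome_reward` is hypothetical, and the value of $\alpha$ is not given in the source, so the caller must supply it.

```python
def outcome_reward(is_correct: bool, is_well_formatted: bool, alpha: float) -> float:
    """Return R(tau): 1 for a correct answer (else 0), minus alpha if badly formatted."""
    base = 1.0 if is_correct else 0.0
    penalty = 0.0 if is_well_formatted else alpha
    return base - penalty


# Enumerate the four cases with an illustrative alpha (assumed value, not from the source).
ALPHA = 0.25
cases = {
    (c, f): outcome_reward(c, f, ALPHA)
    for c in (True, False)
    for f in (True, False)
}
```

With `ALPHA = 0.25` the four rewards are 1.0, 0.75, 0.0, and -0.25, matching the piecewise definition row by row.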

The expected reward objective is:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[ R(\tau) ]$

Tool-returned tokens are masked out of the loss, so that optimization targets the policy's own reasoning and tool-use decisions rather than memorization of tool outputs.
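GRPO scores each sampled trajectory relative to the other trajectories drawn for the same spectrum. The sketch below shows the standard group-normalized advantage, assuming the population standard deviation and a small stabilizing constant `eps`; both choices, and the function name, are assumptions rather than details confirmed by the source.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group (assumed convention)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one spectrum: two correct (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.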

4. Spectral Visualization Tool Integration

The spectral_visualization_tool is defined via a Python-style API:

from typing import Optional

def spectral_visualization_tool(session_id: str,
                                lambda_min: float,
                                lambda_max: float,
                                label: Optional[str] = None
                                ) -> "Image":  # rendered PNG plot; concrete image type not specified
    ...

Inputs consist of a persistent session ID (under which $S$ is cached internally), a wavelength interval, and an optional annotation label. Outputs are PNG visualizations of flux versus wavelength within the specified bounds. Each returned image is represented as "<image id=...>" in the LLM context, and the image encoder maps it to features aligned with the text tokens.
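A minimal sketch of the session caching and wavelength windowing that such a tool would need before rendering is given below. Plotting is deliberately omitted, and `register_spectrum`, `select_window`, and the in-memory `_SESSIONS` dictionary are hypothetical names introduced for illustration; wavelengths are assumed to be sorted in ascending order.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Spectrum = Tuple[List[float], List[float]]  # (ascending wavelengths, fluxes)

_SESSIONS: Dict[str, Spectrum] = {}  # hypothetical in-memory session cache


def register_spectrum(session_id: str, wavelengths: List[float], fluxes: List[float]) -> None:
    """Cache a spectrum under a session ID so later tool calls need only the ID."""
    if len(wavelengths) != len(fluxes):
        raise ValueError("wavelengths and fluxes must have the same length")
    _SESSIONS[session_id] = (list(wavelengths), list(fluxes))


def select_window(session_id: str, lambda_min: float, lambda_max: float) -> Spectrum:
    """Return the samples with lambda_min <= wavelength <= lambda_max."""
    if lambda_min >= lambda_max:
        raise ValueError("lambda_min must be smaller than lambda_max")
    wavelengths, fluxes = _SESSIONS[session_id]
    lo = bisect_left(wavelengths, lambda_min)   # first index with wavelength >= lambda_min
    hi = bisect_right(wavelengths, lambda_max)  # one past last index with wavelength <= lambda_max
    return wavelengths[lo:hi], fluxes[lo:hi]


register_spectrum("demo", [6400.0, 6500.0, 6563.0, 6620.0, 6700.0],
                  [1.0, 1.1, 3.0, 1.2, 1.0])
window = select_window("demo", 6500.0, 6620.0)
```

The returned slice is what a plotting backend would draw; binary search keeps each zoom cheap even for long spectra.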

5. Evaluation Metrics and Comparative Results

Spec-o3's performance is evaluated on five rare-object identification tasks (CV, CS, SS, MG, WD) from LAMOST using the macro-F1 score:

$\mathrm{macro\text{-}F1} = \frac{1}{C}\sum_{c=1}^C \frac{2\,\mathrm{Prec}_c\,\mathrm{Rec}_c} {\mathrm{Prec}_c + \mathrm{Rec}_c}$
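The formula can be checked on a small example. The sketch below computes macro-F1 from label lists in plain Python; treating an undefined precision or recall as zero is a common convention assumed here, not something the source specifies.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = set(y_true) | set(y_pred)
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        total += 2 * precision * recall / denom if denom else 0.0
    return total / len(classes)


# Binary vetting example: F1(YES) = 2/3 and F1(NO) = 4/5, so macro-F1 = 11/15.
score = macro_f1(["YES", "YES", "NO", "NO"], ["YES", "NO", "NO", "NO"])
```

Because every class is weighted equally, a model that ignores the rare positive class is penalized heavily, which suits rare-object vetting.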

Key results indicate:

Spec-o3-7B achieves 76.5% macro-F1, outperforming o3-proprietary (52.3%), Qwen2.5-VL-7B (28.3%), and specialist nets (CarbonNet, AstroCLIP, GaiaNet ≈64–65%).
Cross-survey generalization (SDSS/DESI): F1 ≈81.1% (SDSS), 77.4% (DESI) for Spec-o3-7B, substantially ahead of o3.
Cross-task generalization (unseen spectral types): Spec-o3-7B scores 76.4% F1, compared to o3 (60.9%) and Qwen2.5-VL-7B (30.5%).
Ablation study: Removing SFT, RL, or tool access each severely degrades performance, confirming their necessity.

Model	Macro-F1
CarbonNet	64.3%
AstroCLIP	64.5%
GaiaNet	64.9%
GPT-4.1	29.8%
o3	52.3%
Qwen2.5-VL-7B	28.3%
Spec-o3-7B	76.5%

6. Interpretability, Expert Validation, and Transparency

Spec-o3 produces coherent multimodal reasoning traces, integrating spectral feature zooms (e.g., Hα at 6563 Å, FWHM > 1000 km/s) and physically consistent commentary, leading to explicit decisions ( $\boxed{YES}$ or $\boxed{NO}$ ). Example trace:

<image>
<think>"I see a broad Hα emission with FWHM≈1200 km/s—characteristic of a quiescent CV disk."</think>
<tool_call>{"lambda_min":6500,"lambda_max":6620}</tool_call>
<think>"Zoom confirms the broad Hα emission profile."</think>
<answer>\boxed{YES}</answer>
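Because the trace uses explicit tags, zoom windows and the final decision can be extracted mechanically for auditing or for format checks. The following is a minimal sketch that assumes the tag layout of the example trace above; the regular expressions and the `parse_trace` helper are illustrative and may not match the exact format used in training.

```python
import json
import re
from typing import List, Optional, Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*\\boxed\{(YES|NO)\}\s*</answer>")


def parse_trace(trace: str) -> Tuple[List[Tuple[float, float]], Optional[str]]:
    """Return (zoom windows in call order, final decision or None if malformed)."""
    windows = []
    for payload in TOOL_CALL_RE.findall(trace):
        args = json.loads(payload)  # tool arguments are JSON-style
        windows.append((float(args["lambda_min"]), float(args["lambda_max"])))
    match = ANSWER_RE.search(trace)
    return windows, (match.group(1) if match else None)


example = (
    "<think>Broad H-alpha emission.</think>\n"
    '<tool_call>{"lambda_min":6500,"lambda_max":6620}</tool_call>\n'
    "<answer>\\boxed{YES}</answer>"
)
windows, decision = parse_trace(example)
```

A `None` decision signals a badly formatted answer, which is the kind of condition a format-sensitive reward would penalize.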

Six expert astronomers rated 100 trajectories for coherence and physical consistency; LLM judges (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet, Grok-4) achieved Spearman $\rho \geq 0.7$ correlation with human scores. In paired preference tests, Spec-o3 was favored over o3-proprietary in at least 80% of cases.
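The agreement statistic quoted above is Spearman's rank correlation, which is the Pearson correlation of the ranks. A minimal pure-Python sketch follows, with ties given average ranks; the scores in the example are made up purely for illustration and are not data from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Illustrative (made-up) human and LLM-judge scores for five trajectories.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```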

All decisions are supported by explicit multimodal traces, facilitating error analysis and auditability. Physical consistency is maintained by stepwise anchoring to specific spectral features.

7. Prospects and Ongoing Development

Spec-o3 is the first tool-augmented vision-language agent to accurately and interpretably automate rare object vetting through spectral inspection, replicating expert workflows. The two-stage training protocol—cold-start supervised fine-tuning and outcome-based RL—is indispensable for domain-specific tool mastery. Notable strengths include survey adaptation and generalization to novel spectral categories.

Anticipated enhancements comprise:

Extension to wider spectral classes (e.g., emission-line galaxies, quasars) and lower SNR domains,
Integration of additional tool modalities for photometry, time-series, and catalog cross-matching,
Implementation of confidence calibration and abstention mechanisms for risk control,
Scaling to next-generation surveys (WEAVE, 4MOST) via instrument-specific tool adaptations.

A plausible implication is that tool-augmented agents like Spec-o3 will become standard in high-throughput astronomical survey pipelines, reducing the reliance on manual expert vetting and enhancing catalog reliability (Jia et al., 10 Jan 2026).


References (1)

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection (2026)
