
Spec-o3: Vision-Language Agent for Spectral Vetting

Updated 17 January 2026
  • Spec-o3 is a tool-augmented vision-language agent that automates rare celestial object vetting by employing interactive spectral visualization combined with chain-of-thought reasoning.
  • It integrates a large pretrained Qwen2.5-VL backbone with a functional API to enable focused analysis of wavelength intervals, addressing the scaling challenges of manual inspection.
  • A two-stage training protocol using supervised fine-tuning and outcome-based reinforcement learning achieves state-of-the-art performance with up to 76.5% macro-F1 across diverse survey domains.

Spec-o3 is a tool-augmented vision-language agent engineered for astronomer-aligned, automated vetting of rare celestial object candidates through multimodal spectral inspection. It addresses the scaling bottleneck imposed by manual expert inspection in the context of modern spectroscopic surveys, which generate vast volumes of data unsuitable for legacy human-in-the-loop workflows. Spec-o3 combines a large pretrained vision-language backbone (Qwen2.5-VL) with a functional API for interactive spectral visualization, orchestrated via interleaved chain-of-thought reasoning and tool usage. Its training protocol leverages curated expert inspection demonstrations and outcome-based reinforcement learning, resulting in state-of-the-art performance and generalization across multiple survey domains (Jia et al., 10 Jan 2026).

1. Architecture and System Workflow

Spec-o3 is built upon the Qwen2.5-VL vision-language model, augmented with a "spectral_visualization_tool" exposed through a function-call API, which lets the agent interactively render and zoom into specified wavelength intervals of a spectral plot. The policy $\pi_\theta$ dictates the sequence of internal deliberations ("thought" blocks) and tool calls, operating on a context comprising both text and images.

At inference, a spectrum $S$ (a 1D array of wavelength and flux values) is first visualized as a global plot $I_0$, and the agent receives a textual prompt $T_0$ that asks a binary vetting question and lists key diagnostics. Spec-o3 proceeds through alternating steps:

  • Generating intermediate "think" blocks ($T_n$),
  • Optionally invoking the spectral visualization tool (with $\Delta\lambda = (\lambda_\text{min}, \lambda_\text{max})$) to obtain focused zoomed-in images ($I_{n+1}$),
  • Appending $(T_n, I_{n+1})$ to context,
  • Terminating via a final answer block.

Each plot is encoded as a context image object; tool calls are represented as JSON-style arguments wrapped in special tokens, and tool-returned content is excluded from the training loss. The agent leverages both image encoder features and textual conditioning.

2. Multimodal Chain-of-Thought Reasoning

Spec-o3's reasoning engine formalizes an interactive multimodal chain-of-thought (iMCoT) protocol. At each step, the agent's state $s_t = \{ (I_k, T_k) \}_{k=0}^{t}$ evolves as:

$$a_t \sim \pi_\theta(a \mid s_t)$$

where $a_t$ is either a FinalAnswer or CallTool action. If CallTool is invoked, the spectral_visualization_tool generates $I_{t+1}$; otherwise the episode terminates with answer $T_N$.

Representative pseudocode for the inference loop:

for each spectrum S:
    ctx = []                     # history of (T,I) pairs
    I0 = VizTool(S, FullRange)   # initial global plot
    ctx.append((PromptText, I0))
    done = False
    while not done:
        action, content = πθ(ctx)
        if action == "CALL_TOOL":
            λmin, λmax, label = parse(content)
            Inew = VizTool(S, [λmin, λmax], label)
            ctx.append((content, Inew))
        else:  # FinalAnswer
            answer = content
            done = True
    return answer
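
The pseudocode above can be turned into a small runnable sketch. This is only an illustration under stated assumptions: `viz_tool` returns a text handle instead of a rendered plot, `toy_policy` is a hypothetical stand-in for $\pi_\theta$, and the `max_steps` budget with its `"UNDECIDED"` fallback is not described in the source.

```python
from typing import Callable, List, Tuple

Spectrum = List[Tuple[float, float]]   # (wavelength in Å, flux) samples
Context = List[Tuple[str, str]]        # history of (text, image handle) pairs
Policy = Callable[[Context], Tuple[str, object]]


def viz_tool(spectrum: Spectrum, lam_min: float, lam_max: float) -> str:
    """Stand-in for the plotting tool: returns a text handle, not a PNG."""
    n_points = sum(1 for lam, _ in spectrum if lam_min <= lam <= lam_max)
    return f"<image {lam_min:g}-{lam_max:g} A, {n_points} points>"


def run_episode(spectrum: Spectrum, prompt: str, policy: Policy,
                max_steps: int = 8) -> str:
    """Alternate policy steps and tool calls until a final answer is given."""
    lam_lo = min(lam for lam, _ in spectrum)
    lam_hi = max(lam for lam, _ in spectrum)
    # The context starts with the prompt and the full-range plot.
    ctx: Context = [(prompt, viz_tool(spectrum, lam_lo, lam_hi))]
    for _ in range(max_steps):
        action, content = policy(ctx)
        if action == "CALL_TOOL":
            lam_min, lam_max = content
            ctx.append((f"zoom {lam_min:g}-{lam_max:g}",
                        viz_tool(spectrum, lam_min, lam_max)))
        else:  # a final answer ends the episode
            return str(content)
    return "UNDECIDED"  # assumed fallback when the step budget runs out


def toy_policy(ctx: Context) -> Tuple[str, object]:
    """Hypothetical policy: zoom once around H-alpha, then answer YES."""
    if len(ctx) == 1:
        return "CALL_TOOL", (6500.0, 6620.0)
    return "FINAL_ANSWER", "YES"


spectrum = [(6000.0 + 10.0 * i, 1.0) for i in range(100)]
print(run_episode(spectrum, "Is this a CV candidate?", toy_policy))
```

Bounding the loop with a step budget is a design choice of this sketch; it guards against a policy that never emits a final answer.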

3. Post-Training and Optimization

Spec-o3 utilizes a two-stage post-training recipe:

  • Stage 1: Supervised Fine-Tuning (SFT). SFT is performed on a dataset $\mathcal{D}_\text{exp}$ of approximately 1,000 expert-verified iMCoT trajectories. The optimization minimizes a masked cross-entropy loss over generated tokens, with tool-returned image tokens excluded:

$$\mathcal{L}_\text{SFT}(\theta) = -\sum_{(x,y) \in \mathcal{D}_\text{exp}} \log \pi_\theta(y \mid x)$$

  • Stage 2: Outcome-Based Reinforcement Learning (RL). Each trajectory $\tau$ receives a reward that combines the correctness of the outcome with format adherence, where $\alpha$ is the penalty applied to badly formatted outputs:

$$R(\tau) = \begin{cases} +1, & \text{correct, well formatted} \\ 1-\alpha, & \text{correct, bad format} \\ 0, & \text{incorrect, well formatted} \\ -\alpha, & \text{incorrect, bad format} \end{cases}$$

The expected reward objective is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[ R(\tau) ]$$
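
As a rough illustration, the four reward cases collapse into "1 if correct, minus $\alpha$ if badly formatted", and $J(\theta)$ can be approximated by averaging rewards over sampled trajectories. The value `alpha=0.5` and the function names below are assumptions for the sketch; the source does not state the value of $\alpha$ or the RL algorithm used.

```python
from typing import Iterable, Tuple


def reward(correct: bool, well_formatted: bool, alpha: float = 0.5) -> float:
    """Outcome reward: +1 for a correct answer, minus alpha for a bad format."""
    base = 1.0 if correct else 0.0
    penalty = 0.0 if well_formatted else alpha
    return base - penalty


def estimate_objective(outcomes: Iterable[Tuple[bool, bool]],
                       alpha: float = 0.5) -> float:
    """Monte Carlo estimate of J: mean reward over sampled trajectories.

    Each outcome is a (correct, well_formatted) pair for one trajectory.
    """
    rewards = [reward(c, f, alpha) for c, f in outcomes]
    if not rewards:
        raise ValueError("need at least one sampled trajectory")
    return sum(rewards) / len(rewards)


# One made-up trajectory per reward case.
samples = [(True, True), (True, False), (False, True), (False, False)]
print(estimate_objective(samples))
```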

Loss masking applies to tool-returned tokens, targeting policy proficiency rather than memorization.
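
A minimal sketch of what such masking amounts to for one sequence, assuming per-token log-probabilities are already available: positions flagged as tool-returned contribute nothing to the summed negative log-likelihood. The function name and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def masked_nll(token_logprobs: Sequence[float],
               loss_mask: Sequence[int]) -> float:
    """Sum of -log p over tokens whose mask is 1 (model-generated tokens).

    Tokens with mask 0 (e.g. tool-returned image tokens) are skipped.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)


# Toy sequence: the middle token is tool-returned and therefore masked out.
logprobs = [math.log(0.5), math.log(0.1), math.log(0.25)]
mask = [1, 0, 1]
print(masked_nll(logprobs, mask))
```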

4. Spectral Visualization Tool Integration

The spectral_visualization_tool is defined via a Python-style API:

def spectral_visualization_tool(session_id: str,
                               lambda_min: float,
                               lambda_max: float,
                               label: Optional[str]=None
                              ) -> Image:
    ...

Inputs consist of persistent session IDs (internally caching $S$), wavelength intervals, and optional annotation labels. Outputs are PNG visualizations of flux versus wavelength within the specified bounds. Each returned image is represented as "<image id=...>" in LLM context; embeddings are processed using image encoder features aligned to textual tokens.
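
The session-caching and windowing behaviour described above might look roughly like the following. This is a sketch only: `_SESSIONS`, `register_spectrum`, and `crop_for_plot` are hypothetical names, and the function returns the cropped data that a plotting backend would draw rather than an actual PNG.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical in-memory cache mapping session_id -> (wavelengths, fluxes).
_SESSIONS: Dict[str, Tuple[List[float], List[float]]] = {}


def register_spectrum(session_id: str, wavelengths: Sequence[float],
                      fluxes: Sequence[float]) -> None:
    """Cache a spectrum so later tool calls only need the session ID."""
    if len(wavelengths) != len(fluxes):
        raise ValueError("wavelengths and fluxes must have the same length")
    _SESSIONS[session_id] = (list(wavelengths), list(fluxes))


def crop_for_plot(session_id: str, lambda_min: float, lambda_max: float,
                  label: Optional[str] = None) -> dict:
    """Select the samples inside [lambda_min, lambda_max] for plotting."""
    if lambda_min >= lambda_max:
        raise ValueError("lambda_min must be smaller than lambda_max")
    wavelengths, fluxes = _SESSIONS[session_id]
    pairs = [(w, f) for w, f in zip(wavelengths, fluxes)
             if lambda_min <= w <= lambda_max]
    return {
        "wavelength": [w for w, _ in pairs],
        "flux": [f for _, f in pairs],
        # Fall back to the wavelength window when no label is given.
        "title": label or f"{lambda_min:g}-{lambda_max:g} Å",
    }


register_spectrum("s1", [6400, 6500, 6563, 6620, 6700], [1, 1, 3, 1, 1])
print(crop_for_plot("s1", 6500, 6620))
```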

5. Evaluation Metrics and Comparative Results

Spec-o3's performance is evaluated on five rare-object identification tasks (CV, CS, SS, MG, WD) from LAMOST using the macro-F1 score:

$$\mathrm{macro\text{-}F1} = \frac{1}{C}\sum_{c=1}^{C} \frac{2\,\mathrm{Prec}_c\,\mathrm{Rec}_c}{\mathrm{Prec}_c + \mathrm{Rec}_c}$$
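
For concreteness, the formula can be computed directly from label lists. The sketch below assumes the class set is the union of labels seen in `y_true` and `y_pred` and assigns an F1 of 0 to a class with no true positives; the labels are made up and are not LAMOST data.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Binary vetting toy example: 1 = rare object, 0 = not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 4))
```

Because each class contributes equally regardless of its frequency, macro-F1 penalizes a model that ignores the rare positive class, which suits rare-object vetting.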

Key results indicate:

  • Spec-o3-7B achieves 76.5% macro-F1, outperforming o3-proprietary (52.3%), Qwen2.5-VL-7B (28.3%), and specialist nets (CarbonNet, AstroCLIP, GaiaNet ≈64–65%).
  • Cross-survey generalization (SDSS/DESI): F1 ≈81.1% (SDSS), 77.4% (DESI) for Spec-o3-7B, substantially ahead of o3.
  • Cross-task generalization (unseen spectral types): Spec-o3-7B scores 76.4% F1, compared to o3 (60.9%) and Qwen2.5-VL-7B (30.5%).
  • Ablation study: Removing SFT, RL, or tool access each severely degrades performance, confirming their necessity.

Model            Macro-F1
CarbonNet        64.3%
AstroCLIP        64.5%
GaiaNet          64.9%
GPT-4.1          29.8%
o3               52.3%
Qwen2.5-VL-7B    28.3%
Spec-o3-7B       76.5%

6. Interpretability, Expert Validation, and Transparency

Spec-o3 produces coherent multimodal reasoning traces, integrating spectral feature zooms (e.g., Hα at 6563 Å, FWHM > 1000 km/s) and physically consistent commentary, leading to explicit decisions ($\boxed{\text{YES}}$ or $\boxed{\text{NO}}$). Example trace:

<image>
<think>"I see a broad Hα emission with FWHM≈1200 km/s—characteristic of a quiescent CV disk."</think>
<tool_call>{"lambda_min":6500,"lambda_max":6620}</tool_call>
<think>"Zoom confirms He II 4686 emission."</think>
<answer>\boxed{YES}</answer>
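
A trace in this tagged format is straightforward to audit programmatically. The sketch below extracts the tool-call arguments and the boxed verdict with regular expressions; the tag names follow the example above, while `parse_trace` and the assumption that every tool call is valid JSON are illustrative choices.

```python
import json
import re
from typing import List, Optional, Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*\\boxed\{(YES|NO)\}\s*</answer>")


def parse_trace(trace: str) -> Tuple[List[dict], Optional[str]]:
    """Return (tool-call argument dicts, final verdict or None)."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(trace)]
    match = ANSWER_RE.search(trace)
    return calls, (match.group(1) if match else None)


example = r'''<image>
<think>"Broad emission near 6563 Å."</think>
<tool_call>{"lambda_min":6500,"lambda_max":6620}</tool_call>
<answer>\boxed{YES}</answer>'''

calls, verdict = parse_trace(example)
print(calls, verdict)
```

Returning `None` when no well-formed answer block is found gives a simple format check of the kind an outcome reward could use.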

Six expert astronomers rated 100 trajectories for coherence and physical consistency; LLM judges (GPT-5, Gemini-2.5-Pro, Claude-4-Sonnet, Grok-4) achieved Spearman $\rho \geq 0.7$ correlation with human scores. In paired preference tests, Spec-o3 was favored over o3-proprietary in at least 80% of cases.
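
To make the agreement statistic concrete, Spearman's $\rho$ is the Pearson correlation of the rank-transformed scores. The sketch below uses average ranks for ties; the score lists are invented for illustration and are not the study's ratings.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


human_scores = [1, 2, 3, 4, 5]   # hypothetical expert ratings
judge_scores = [2, 1, 4, 3, 5]   # hypothetical LLM-judge ratings
print(round(spearman_rho(human_scores, judge_scores), 3))
```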

All decisions are supported by explicit multimodal traces, facilitating error analysis and auditability. Physical consistency is maintained by stepwise anchoring to specific spectral features.

7. Prospects and Ongoing Development

Spec-o3 is presented as the first tool-augmented vision-language agent to accurately and interpretably automate rare object vetting through spectral inspection, replicating expert workflows. The ablations indicate that both stages of the training protocol (cold-start supervised fine-tuning and outcome-based RL) are needed for domain-specific tool mastery. Notable strengths include survey adaptation and generalization to novel spectral categories.

Anticipated enhancements comprise:

  • Extension to wider spectral classes (e.g., emission-line galaxies, quasars) and lower SNR domains,
  • Integration of additional tool modalities for photometry, time-series, and catalog cross-matching,
  • Implementation of confidence calibration and abstention mechanisms for risk control,
  • Scaling to next-generation surveys (WEAVE, 4MOST) via instrument-specific tool adaptations.

A plausible implication is that tool-augmented agents like Spec-o3 will become standard in high-throughput astronomical survey pipelines, reducing the reliance on manual expert vetting and enhancing catalog reliability (Jia et al., 10 Jan 2026).
