AMC: Automated Mission Classifier
- AMC is a tool that identifies and categorizes telescope references in astronomical literature by combining lightweight IR techniques with prompt-engineered LLMs.
- It features a modular pipeline—including text ingestion, snippet extraction, reranking, and structured LLM classification—that ensures high precision and transparent auditability.
- Its scalable, cost-effective design supports diverse astronomical archives and operational contexts, such as JWST preprints and NASA mission bibliographies.
The Automated Mission Classifier (AMC) is an LLM-powered tool designed to identify and categorize references to telescopes within astronomical literature. By combining lightweight information retrieval (IR) techniques with prompt-engineered LLMs, AMC enables robust and scalable annotation of telescope bibliographies, facilitating measurement of the scientific impact of facilities and archives. Originally developed to automate the identification of papers associated with NASA missions, the software has since been generalized for use across multiple telescopes and archives and demonstrates both high accuracy and practical efficiency in production environments (Wu et al., 12 Dec 2025).
1. System Structure and Processing Pipeline
AMC operates through a modular, retrieval-augmented pipeline optimized for both precision and interpretability. The workflow is hierarchically structured as follows:
- Text Ingestion: Documents are parsed as concatenations of title, abstract, and body (plain text). Sentences are split using the NLTK/Punkt tokenizer.
- Keyword-Based Snippet Extraction: Case-insensitive string search identifies telescope and instrument keywords. Each match generates a context window of $2n + 1$ sentences centered on the matching sentence, where $n$ is a small default context half-width. If no keywords are detected, AMC immediately classifies the document as “not_telescope.” A minimal sketch of this step appears after the pipeline summary below.
- Reranking (Second-Stage Retrieval): Each snippet undergoes scoring via gpt-4.1-nano, using a restricted “Yes/No” vocabulary. The prompt asks whether the snippet discusses the telescope in a manner relevant to paper classification. The top snippets are retained.
- LLM Classification: Reranked snippets and their scores are concatenated into a single prompt, which is processed by gpt-5-mini (zero/few-shot, no finetuning). The prompt includes TRACS-derived definitions of four paper types (science, instrumentation, mention, not_telescope) and a pydantic schema that enforces structured output: Booleans for each type, supporting quotes, and free-text reasoning. For TRACS, reasoning and labels are combined in one call; original AMC supports a two-stage reasoning→score flow as well as floating-point “science” scores.
- Prompt Engineering: In-context examples, refined definitions, and iterative development structure all prompts; no model finetuning is used. The TRACS definitions are condensed via claude-sonnet-4.1.
This retrieval-augmented and schema-constrained approach sharply reduces hallucinations and enables granular auditability through quoted evidence and explicit reasoning.
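The keyword-based snippet extraction stage can be illustrated with a short sketch. The function name, keyword list, and default half-width $n$ below are illustrative assumptions rather than AMC's actual interface; only the overall logic (NLTK/Punkt sentence splitting, case-insensitive matching, a $2n+1$-sentence window, and an early “not_telescope” exit) follows the pipeline described above.

```python
import nltk

nltk.download("punkt", quiet=True)  # Punkt sentence tokenizer, as used by AMC

def extract_snippets(text: str, keywords: list[str], n: int = 2) -> list[str]:
    """Return (2n+1)-sentence windows around case-insensitive keyword matches.

    The default n=2 is an assumption for illustration; AMC treats the window
    half-width as a tunable hyperparameter.
    """
    sentences = nltk.sent_tokenize(text)
    snippets = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if any(kw.lower() in lowered for kw in keywords):
            window = sentences[max(0, i - n): i + n + 1]
            snippets.append(" ".join(window))
    return snippets

# Hypothetical usage: concatenated title, abstract, and body as plain text.
document = "We present NIRCam imaging of a lensed galaxy. The JWST data were reduced with the standard pipeline."
snippets = extract_snippets(document, keywords=["JWST", "NIRCam", "MIRI"])
if not snippets:
    label = "not_telescope"  # early exit: no keyword hits, so the LLM stages are skipped
```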
2. Data Sources and Annotation Framework
AMC has been evaluated and deployed on several datasets:
- TRACS Shared Task Corpus (WASP/IJCNLP-AACL 2025): The training set consists of 80,385 entries and the test set of 9,194. Each paper is paired with a candidate telescope (CHANDRA, HST, JWST, or NONE) and annotated with four boolean type flags: science, instrumentation, mention, not_telescope (multiple true labels are allowed, except that science and not_telescope are mutually exclusive). The dataset exhibits substantial missingness: 3% of entries lack an abstract, 19% lack a body, and >90% lack grant information. No data augmentation is performed; irrelevant fields such as author lists are dropped.
- JWST Preprints Golden Sample: ~114 papers, balanced between science and non-science classes, used to validate AMC in production at STScI.
“Training” in AMC refers exclusively to prompt and hyperparameter tuning; there is no gradient-based optimization of the LLM backbone. All parameters, such as the snippet context half-width $n$, the reranker top-$k$ threshold, the boolean decision threshold ($0.5$ for the science score), and the absence of a reranker score cutoff, are heuristically selected. For the 9,194 test rows, runtime is under 24 hours of wall time at an approximate cost of \$10.
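As a rough illustration of this hyperparameter surface, the following dataclass gathers the tunables named above; the field names and any defaults not stated in the text are assumptions, not AMC's actual configuration object.

```python
from dataclasses import dataclass, field

@dataclass
class AMCConfig:
    """Hypothetical configuration sketch; only the 0.5 science threshold and the
    model names come from the text, the remaining defaults are assumed."""
    context_half_width: int = 2            # n in the 2n+1-sentence snippet window (assumed)
    rerank_top_k: int = 5                  # number of reranked snippets kept (assumed)
    science_threshold: float = 0.5         # boolean cutoff applied to the continuous science score
    reranker_model: str = "gpt-4.1-nano"   # second-stage Yes/No reranker
    classifier_model: str = "gpt-5-mini"   # final structured classifier
    keywords: list[str] = field(default_factory=lambda: ["JWST", "NIRCam", "MIRI"])
```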
3. Performance Evaluation and Metrics
AMC adopts standard multi-class classification metrics:
- For each label $c$: precision $P_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FP}_c)$, recall $R_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FN}_c)$, and $F_{1,c} = 2 P_c R_c / (P_c + R_c)$.
- Macro $F_1$: $F_1^{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}$, averaged over the $C$ label classes.
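For concreteness, the per-label and macro-averaged $F_1$ can be computed over the four boolean flags with scikit-learn; the label matrices below are dummy values for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

labels = ["science", "instrumentation", "mention", "not_telescope"]
# Dummy gold and predicted boolean flags (rows = papers, columns = the four TRACS types).
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1]])

per_label_f1 = f1_score(y_true, y_pred, average=None)    # one F1 per flag
macro_f1 = f1_score(y_true, y_pred, average="macro")     # unweighted mean over the flags
print(dict(zip(labels, per_label_f1)), macro_f1)
```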
Reported TRACS results include the initial submission's macro $F_1$, which subsequent prompt and reranker refinements improved further, securing third place on the Kaggle leaderboard. Confusion matrices for a random subset (25 papers per candidate telescope) highlight model strengths (e.g., accurate CHANDRA/science detection) and weaknesses (e.g., difficulty with CHANDRA/mention).
A plausible implication is that retrieval-augmented, prompt-tuned pipelines offer competitive performance in zero/few-shot bibliometric tasks with minimal cost.
4. Architectural Innovations and Error Analysis
AMC integrates several novel modifications:
- Retrieval-Augmented Pipeline: Layered IR and LLM processing, combining keyword filtering, reranking, and LLM classification to maximize interpretability and scalability.
- Structured Output via Pydantic: Enforces strict JSON schemas for output, substantially reducing hallucinations and facilitating downstream integration; a schema sketch follows this list.
- Continuous Scoring Option: The original AMC permits a “science_score” in $[0, 1]$, enabling threshold tuning for custom use cases.
- Ensembling and Adjudication: Dual prompt variants are reconciled using gpt-5-mini as “judge,” increasing robustness.
- Dataset-Error Detection: Systematically surfacing LLM-label disagreements uncovers annotation errors (e.g., conflicting science/mention tags, misclassified references). These inspections led to iterative improvements in prompts and guidance rules.
This evidence-based approach to error detection and correction supports the development of more reliable bibliographic tools.
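A minimal sketch of the schema-constrained classification call, assuming a pydantic model passed to the OpenAI structured-output helper; the class name, field names, and prompt text are illustrative rather than AMC's exact schema.

```python
from pydantic import BaseModel, Field
from openai import OpenAI

class PaperClassification(BaseModel):
    """Illustrative schema mirroring the four TRACS flags plus evidence fields."""
    science: bool
    instrumentation: bool
    mention: bool
    not_telescope: bool
    supporting_quotes: list[str] = Field(description="Verbatim snippets backing the labels")
    reasoning: str = Field(description="Free-text justification for the assigned labels")

client = OpenAI()  # requires OPENAI_API_KEY in the environment
completion = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Classify the paper's relationship to the target telescope."},
        {"role": "user", "content": "<reranked snippets and their scores go here>"},
    ],
    response_format=PaperClassification,  # enforces the JSON schema on the model output
)
result = completion.choices[0].message.parsed  # validated PaperClassification instance
```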
5. Scalability, Deployment, and Generalization
AMC is architected for high-throughput and low-cost operation:
- Throughput: Processes ~380 papers/hour on a standard laptop or VM without GPU acceleration.
- Cost: Annotates 9,194 papers for ≈\$10 (breakdown: 22% reranker, 37% classification prompt, 41% completion).
- Batch Mode: Asynchronous pipelines can further optimize per-unit cost.
- Software Stack: Python, NLTK, pydantic, OpenAI API (gpt-4.1-nano and gpt-5-mini). Source code is publicly available (github.com/jwuphysics/automated-mission-classifier).
- Generalization: AMC is adaptable by simply substituting keyword lists and prompt definitions, as illustrated in the sketch after this list. It has proven effective for JWST preprints (production), as well as bibliographies from the TESS, Pan-STARRS, and GALEX archives. For archival missions, analysis using full-body text is essential, as titles/abstracts provide insufficient granularity for detecting archival usage.
This modularity and efficiency facilitate integration into diverse bibliometric workflows.
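Adapting AMC to a new facility reduces to swapping the keyword list and prompt definitions, per the generalization point above. The mission names, keywords, and definitions below are illustrative placeholders, and the stub function only indicates where they enter the otherwise unchanged pipeline.

```python
# Hypothetical per-mission configuration: only the keyword list and the prompt
# definitions change when AMC is pointed at a different archive.
MISSION_CONFIGS = {
    "JWST": {
        "keywords": ["JWST", "James Webb Space Telescope", "NIRCam", "NIRSpec", "MIRI"],
        "definition": "A 'science' paper analyzes JWST data to reach a scientific result.",
    },
    "TESS": {
        "keywords": ["TESS", "Transiting Exoplanet Survey Satellite"],
        "definition": "A 'science' paper analyzes TESS photometry to reach a scientific result.",
    },
}

def classify_for_mission(text: str, mission: str) -> None:
    cfg = MISSION_CONFIGS[mission]
    # The same retrieval -> rerank -> structured-classification pipeline runs unchanged;
    # only cfg["keywords"] and cfg["definition"] are substituted into the prompts.
    ...
```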
6. Use Cases, Practical Impact, and Future Directions
AMC has been deployed in several operational contexts:
- Library Workflows: STScI utilizes AMC for automated flagging of JWST science preprints and DOI compliance checks. Its high-recall filters enable librarians to focus on true positives, while structured quote and reasoning fields enhance manual verification and debugging.
- Label Verification: The system’s ability to rapidly surface ambiguous or erroneous annotations provides feedback for improving both models and human guidance criteria.
- Extensions and Further Work: Section 5 of the reference paper outlines ongoing development paths:
- Meta-optimization of prompts via DSPy or GEPA-style reflective loops.
- Design of LLM agents that orchestrate pipeline stages with persistent memory.
- Replacement of the LLM reranker with ColBERT/SciBERT models, or classical TF–IDF, for reduced cost and support for local execution (a TF–IDF sketch follows this list).
- Expansion to address free-form bibliometric questions (e.g., quantifying archival-science output over time).
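One of the proposed extensions, replacing the LLM reranker with classical TF–IDF, can be sketched as follows. The query phrasing and top-$k$ value are assumptions, and this is a possible substitution rather than part of the current AMC release.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_rerank(snippets: list[str], query: str, top_k: int = 5) -> list[str]:
    """Rank candidate snippets by cosine similarity to a query in TF-IDF space."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(snippets + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(zip(scores, snippets), key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in ranked[:top_k]]

# Illustrative query mirroring the intent of the Yes/No LLM reranker prompt.
top_snippets = tfidf_rerank(
    snippets=[
        "JWST NIRCam imaging was reduced with the standard pipeline.",
        "We thank the anonymous referee for helpful comments.",
    ],
    query="Does this snippet discuss the JWST telescope in a way relevant to classifying the paper?",
    top_k=1,
)
```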
This suggests AMC is part of a broader trend toward semiautomated, interpretable, and cost-effective tools for scientific bibliography management. Its modular IR + LLM architecture, enforced schema, and reasoning output offer both classification utility and transparent audit trails, advancing the scope of scalable library sciences (Wu et al., 12 Dec 2025).