TRACS Kaggle Challenge

Updated 19 December 2025
  • TRACS Kaggle Challenge is a public machine learning competition that classifies astronomy papers by assigning multi-label categories to paper–telescope pairs.
  • It employs a three-stage pipeline of NLTK-based keyword filtering, snippet reranking with a lightweight LLM, and final classification via prompt-engineered LLM inference, achieving a macro F1 score of 0.84.
  • The approach demonstrates that retrieval-augmented LLM inference can efficiently generate telescope-specific bibliographies despite challenges with incomplete metadata and alias handling.

The TRACS Kaggle Challenge is a public machine learning competition focused on the automated classification of astronomy research papers by telescope usage. The challenge evaluates models on their ability to assign fine-grained, multi-label categories at the level of paper–telescope pairs, enabling scalable construction of telescope-specific bibliographies for research impact assessment and archival purposes. The strongest approaches rely on both targeted retrieval of relevant text and sophisticated LLM inference.

1. Problem Formulation and Dataset Construction

The TRACS Kaggle Challenge task involves, for each (paper, candidate telescope) pair, predicting four Boolean labels: science (the paper directly uses telescope data for scientific results), instrumentation (the paper describes telescope or instrument engineering), mention (the paper refers to the telescope without presenting new results), and not_telescope (a false positive, e.g., "Hubble diagram"). The candidate telescope in each pair takes one of four telescope-name labels: CHANDRA, HST, JWST, or NONE.
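
As a concrete illustration of the label space, the sketch below models a single (paper, candidate telescope) prediction record in Python. The field names follow the label definitions above; the class itself and its exact serialization are illustrative assumptions, not the official challenge schema.

```python
from dataclasses import dataclass

# Illustrative record for one (paper, candidate telescope) pair.
# Field names mirror the four Boolean labels defined above; the class
# is a hypothetical convenience wrapper, not the challenge's own schema.
@dataclass
class TracsLabels:
    telescope: str          # "CHANDRA", "HST", "JWST", or "NONE"
    science: bool           # paper directly uses telescope data for results
    instrumentation: bool   # paper describes telescope/instrument engineering
    mention: bool           # telescope referenced without new results
    not_telescope: bool     # false positive, e.g., "Hubble diagram"

example = TracsLabels(telescope="HST", science=True, instrumentation=False,
                      mention=False, not_telescope=False)
```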

The dataset includes a training set of 80,385 entries and a test set of 9,194 entries, with each entry providing paper metadata (authors, year, title), body fields (abstract, body, acknowledgments, grants), the candidate telescope, and ground-truth labels (training only). The test set is partially incomplete: approximately 3% of entries lack an abstract, about 19% lack a body field, and over 90% lack grant information. Input preprocessing concatenates the title, abstract, and body, then splits the result into sentences with the NLTK Punkt tokenizer. High-recall keyword filtering is a core strategy: a case-insensitive search for telescope names is performed, and each hit is expanded to include the ±3 neighboring sentences, forming "snippets". If keyword filtering yields no snippets, the model outputs not_telescope (Wu et al., 12 Dec 2025).
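
A minimal sketch of this preprocessing and high-recall filtering step, assuming NLTK's Punkt tokenizer and a hypothetical keyword list (the authors' actual alias list is not specified here, and alias handling is noted later as a limitation):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # Punkt sentence tokenizer

def extract_snippets(title, abstract, body, keywords, window=3):
    """Return snippets of up to 2*window + 1 sentences centered on each
    case-insensitive keyword hit; an empty list implies not_telescope."""
    text = " ".join(filter(None, [title, abstract, body]))
    sentences = sent_tokenize(text)
    snippets = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if any(kw.lower() in lowered for kw in keywords):
            lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
            snippets.append(" ".join(sentences[lo:hi]))
    return snippets

# Hypothetical keyword list for one telescope; real alias coverage matters.
hst_keywords = ["Hubble Space Telescope", "HST"]
```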

2. Model Pipeline and System Architecture

The leading solution, based on the Automated Mission Classifier (amc), operationalizes a three-stage pipeline:

  1. Keyword Filtering: For each paper–telescope input, NLTK-based sentence segmentation is applied, followed by case-insensitive keyword search; snippets comprising up to seven sentences centered on each keyword occurrence are extracted.
  2. Snippet Reranking: Each snippet is scored for relevance using a lightweight gpt-4.1-nano model, which outputs a Yes/No response to the prompt: “Does this snippet discuss the telescope in a way relevant for classification?” The top k = 15 highest-scoring snippets are retained for each input.
  3. Final Classification: Using the OpenAI gpt-5-mini API, a prompt is constructed comprising the selected snippets, explicit label definitions, and task examples; the LLM outputs binary values for the four categories, along with supporting quote strings and free-text reasoning. Output is structured via a pydantic-enforced schema (a condensed sketch of stages 2 and 3 appears below).
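
A condensed sketch of stages 2 and 3, assuming a recent openai Python SDK with the Responses API and pydantic structured outputs; the prompts, response handling, and the simple keep-until-k filter shown here are simplified stand-ins for the authors' reranking and prompt engineering:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # requires OPENAI_API_KEY in the environment

class Classification(BaseModel):
    # Pydantic-enforced output schema mirroring the four labels,
    # plus supporting quotes and free-text reasoning.
    science: bool
    instrumentation: bool
    mention: bool
    not_telescope: bool
    supporting_quotes: list[str]
    reasoning: str

def rerank(snippets, telescope, k=15):
    """Stage 2: retain up to k snippets a lightweight model judges relevant."""
    kept = []
    for snippet in snippets:
        resp = client.responses.create(
            model="gpt-4.1-nano",
            input=(f"Does this snippet discuss {telescope} in a way relevant "
                   f"for classification? Answer Yes or No.\n\n{snippet}"),
        )
        if resp.output_text.strip().lower().startswith("yes"):
            kept.append(snippet)
        if len(kept) >= k:
            break
    return kept

def classify(snippets, telescope):
    """Stage 3: structured classification over the retained snippets."""
    resp = client.responses.parse(
        model="gpt-5-mini",
        input=("Label definitions and task examples would go here.\n\n"
               f"Telescope: {telescope}\nSnippets:\n" + "\n---\n".join(snippets)),
        text_format=Classification,
    )
    return resp.output_parsed
```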

This approach involves neither gradient-based fine-tuning nor training epochs; all learning is realized through prompt engineering and zero/few-shot inference. The submitted version uses a single prompt across telescopes and no reranker parameter tuning beyond k and the decision of whether to require at least one snippet. The entire test set was processed in less than 24 wall-clock hours at a cost of approximately US$10 using hosted API infrastructure (Wu et al., 12 Dec 2025).

3. Evaluation Metrics and Performance

Performance is evaluated by macro $F_1$ score, averaged first over the four paper-type labels and then over the four telescope-name labels (including NONE). For each class $i$, precision and recall are defined as

$$\text{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \text{recall}_i = \frac{TP_i}{TP_i + FN_i},$$

and

$$F_{1,i} = 2\,\frac{\text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i}.$$

The leaderboard-reported macro $F_1$ (averaged over paper types and telescope names) for the best amc-based submission is 0.84 (Wu et al., 12 Dec 2025). Prompt tuning and removal of a reranker threshold (which formerly forced not_telescope if no snippets were found) boosted macro $F_1$ from 0.80 to 0.84. Per-class performance metrics on the held-out test set are not reported in the reference data.
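
A minimal sketch of this two-level macro averaging, assuming predictions and ground truth are stored as Boolean arrays keyed by (telescope, paper type); the exact grouping used by the competition scorer may differ in detail:

```python
import numpy as np
from sklearn.metrics import f1_score

PAPER_TYPES = ["science", "instrumentation", "mention", "not_telescope"]
TELESCOPES = ["CHANDRA", "HST", "JWST", "NONE"]

def tracs_macro_f1(y_true, y_pred):
    """y_true / y_pred: dict mapping (telescope, paper_type) -> Boolean array.
    F1 is averaged first over paper-type labels, then over telescope names."""
    per_telescope = []
    for tel in TELESCOPES:
        scores = [f1_score(y_true[(tel, pt)], y_pred[(tel, pt)], zero_division=0)
                  for pt in PAPER_TYPES]
        per_telescope.append(np.mean(scores))
    return float(np.mean(per_telescope))
```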

4. Ablations, Comparative Analysis, and Baseline Position

Removing the reranker threshold (so that empty snippet collections do not automatically force not_telescope) accounted for the observed performance gain. Attempts to meta-judge outputs via a second gpt-5-mini pass delivered a negligible improvement. No official public baseline beyond naive keyword matching is reported by the challenge organizers; amc represents one of the first LLM-based solutions for this telescope-bibliography-specific text-classification problem.

On the public leaderboard, the amc pipeline achieved third place. Limitations identified include a single prompt for all telescopes (no per-telescope optimization), lack of telescope alias handling (leading to recall loss), dependence on commercial API for LLM inference, and the use of discrete rather than probabilistic labels (Wu et al., 12 Dec 2025).

5. Error Analysis and Limitations

Analysis using confusion matrices on 100 random validation examples per telescope (excluding NONE) revealed higher false-negative rates for CHANDRA/mention (confusion between technical usage and mere mention), while HST/instrumentation was correctly predicted in most cases. Inspection of the training data (Table 1 in the source) surfaces label inconsistencies, such as simultaneous science and mention flags for a given paper–telescope pairing and mislabeling caused by telescope aliasing (e.g., “Next Generation Space Telescope” for JWST marked as not_telescope). A reliable estimate of the dataset's global label error rate is not possible in the absence of a gold-standard consensus sample.
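
The per-telescope error breakdown described above can be approximated with a confusion-matrix pass over a sampled validation slice; the sketch below is a hypothetical reconstruction with assumed data-frame column names and index alignment:

```python
from sklearn.metrics import multilabel_confusion_matrix

PAPER_TYPES = ["science", "instrumentation", "mention", "not_telescope"]

def per_telescope_confusion(val_df, preds_df, telescope, n=100, seed=0):
    """One binary confusion matrix per paper-type label, computed on n random
    validation rows for a single telescope (as in the error analysis above).
    Assumes val_df and preds_df share an index and the label columns."""
    sample = val_df[val_df["telescope"] == telescope].sample(n, random_state=seed)
    y_true = sample[PAPER_TYPES].to_numpy()
    y_pred = preds_df.loc[sample.index, PAPER_TYPES].to_numpy()
    return multilabel_confusion_matrix(y_true, y_pred)
```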

Entries with a missing body field place fundamental limits on achievable recall for science/instrumentation predictions, as does the use of discrete labels for intrinsically ambiguous cases, e.g., borderline science vs. mention papers. Rare telescope aliases missed during keyword filtering further degrade recall (Wu et al., 12 Dec 2025).

6. Reproducibility and Implementation Details

The codebase is available on GitHub at https://github.com/jwuphysics/automated-mission-classifier, with the TRACS Kaggle-specific fork at https://github.com/jwuphysics/tracs_wasp2025. The primary dependencies are openai (API client), pydantic (output schema enforcement), and nltk (sentence splitting), alongside standard Python libraries. Generating predictions requires only a CSV input file, an OpenAI API key, and a choice of LLM and reranker parameters. The full test set can be evaluated in roughly 24 hours on the OpenAI API (~US$10), and the process is parallelizable by batch submission (Wu et al., 12 Dec 2025).
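
A minimal sketch of the parallel batch-submission pattern mentioned above, with a placeholder for a hypothetical classify_pair helper that would wrap the full keyword-filter, rerank, and classify pipeline; the CSV filename and column names are illustrative:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def classify_pair(title, abstract, body, telescope):
    """Placeholder for the full per-pair pipeline sketched in Section 2
    (extract_snippets -> rerank -> classify); returns the four labels."""
    raise NotImplementedError("wire up the pipeline functions here")

def classify_row(row):
    return classify_pair(row.title, row.abstract, row.body, row.telescope)

test = pd.read_csv("test.csv")  # illustrative filename and schema
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(classify_row, test.itertuples(index=False)))
```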

7. Significance, Generalization, and Future Directions

The TRACS Kaggle Challenge formalizes a high-precision, multi-label classification task foundational for telescope bibliography management and scientific impact quantification. The top-performing amc method demonstrates that retrieval-augmented prompting with LLMs is viable, robust, and cost-effective for this domain, even with incomplete or noisy underlying metadata. The structure of amc generalizes naturally to other observatory bibliographies and offers a blueprint for broader applications in scientific literature mining.

Future improvements include per-class prompt optimization, explicit handling of telescope aliases for recall enhancement, and transitioning towards agentic, retrieval-augmented architectures for more granular evidence selection. The use of discrete labeling is an acknowledged bottleneck relative to soft or probabilistic label modeling. An expanded reference dataset with consensus-labeled gold samples would enable more detailed benchmarking and error analysis (Wu et al., 12 Dec 2025).

References

  • Wu et al., 12 December 2025. arXiv:2512.11202.
