
STELLA Benchmark Overview

Updated 14 January 2026
  • STELLA Benchmark is a collection of rigorously defined, reproducible datasets and evaluation protocols applied across diverse domains such as academic search, astrophysics, PDE solving, gyrokinetics, protein prediction, and time series forecasting.
  • It integrates specialized methodologies—from RESTful microservices and advanced query interleaving in IR to stencil-based computations and LLM-driven semantic forecasting—to provide direct, comparable performance metrics.
  • Standardized evaluation metrics and reproducible code frameworks ensure transparent benchmarking and robust comparisons, driving innovation and actionable insights in complex scientific and engineering applications.

STELLA Benchmark

The term "STELLA Benchmark" refers to several distinct, rigorously constructed benchmark datasets, evaluation protocols, and code frameworks across domains such as information retrieval, astrophysical radiative transfer, stencil-based PDE solving, gyrokinetics, protein function prediction, and time series forecasting. Each instantiation of STELLA provides a precisely defined methodology, a publicly available corpus, and robust evaluation metrics designed to facilitate direct, reproducible comparison among models and algorithms.

1. STELLA for Living-Lab Academic Search Evaluation

STELLA, as described in (Breuer et al., 2022), is a living-lab "wrapper" for online evaluation of academic search and recommendation systems. It enables direct integration of Dockerized, RESTful microservices as experimental retrieval or recommendation engines alongside production systems. The core architecture comprises:

  • Experimental Micro-services: Each participant submits a container exposing REST endpoints for ranking (/rank?query=q) and recommendation (/recommend?docid=d), operating on the same fully disclosed corpus as the production system.
  • Multi-Container Application (MCA): At runtime, the MCA orchestrates container execution, merges baseline and experimental results via Team-Draft Interleaving, and ensures fully comparable, randomized A/B testing without session disruption.
  • Central Server & Analytics Dashboard: Manages registration, logging of all requests and interactions, and provides real-time analytics (win/loss/tie, click distributions, session satisfaction, reward metrics) with fine-grained drill-down.

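The merging step the MCA performs can be sketched in a few lines. The following is a simplified, illustrative rendition of Team-Draft Interleaving (on ties, the team that picks next is chosen by a coin flip), not the production implementation; function names and the session-outcome helper are hypothetical:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two ranked lists, Team-Draft style (simplified sketch).

    Teams alternate picks (ties broken randomly); each shown document is
    credited to the team that contributed it, enabling click attribution.
    """
    rng = random.Random(seed)
    interleaved, teams = [], {}  # teams maps doc id -> 'A' or 'B'
    count_a = count_b = 0
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        if count_a < count_b or (count_a == count_b and rng.random() < 0.5):
            pool, label = ranking_a, 'A'
        else:
            pool, label = ranking_b, 'B'
        pick = next((d for d in pool if d not in teams), None)
        if pick is None:  # this team's list is exhausted; draw from the other
            pool, label = (ranking_b, 'B') if label == 'A' else (ranking_a, 'A')
            pick = next((d for d in pool if d not in teams), None)
            if pick is None:
                break
        teams[pick] = label
        interleaved.append(pick)
        if label == 'A':
            count_a += 1
        else:
            count_b += 1
    return interleaved, teams

def session_outcome(teams, clicked_docs):
    """Attribute clicks to teams; report win/loss/tie and the margin C_B - C_A."""
    c_a = sum(1 for d in clicked_docs if teams.get(d) == 'A')
    c_b = sum(1 for d in clicked_docs if teams.get(d) == 'B')
    outcome = 'tie' if c_a == c_b else ('win_A' if c_a > c_b else 'win_B')
    return outcome, c_b - c_a
```

Because each displayed document carries a team label, every click maps unambiguously to baseline or treatment, which is what makes the per-session win/loss/tie accounting below possible.
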
Experiments define a session as S = ⟨q₁, …, qₙ; clicks; dwell times⟩ and implement formal interleaving of ranked lists. Evaluation is based on:

  • Click Attribution & Win/Loss/Tie: Each click is attributed to baseline or treatment via interleaving team assignment, yielding per-session outcome margins, Δ = C_B − C_A.
  • Weighted Reward: Differential SERP-element weighting, R(S) = Σ_{clicks j ∈ S} w_element(j), is aggregated to yield system-level mean reward.
  • Key Metrics:
    • Click-Through Rate: CTR_X = Total Clicks_X / Total Impressions_X
    • Mean Dwell Time: MDT_X = (1/Clicks_X) Σ_c dwell_c
    • Session Satisfaction: SessionSat_X = (1/|Sessions_X|) Σ_S SS_S, where SS_S indicates whether session S received at least one click
    • Reward and interleaving-based statistics as above

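The four metrics above can be computed directly from per-session click logs. The data layout here (a list of session dicts with an impression count and (SERP element, dwell time) click tuples) is an assumption for illustration, not STELLA's actual log schema:

```python
def compute_metrics(sessions, weights):
    """Aggregate CTR, mean dwell time, session satisfaction, and mean reward.

    sessions: list of {'impressions': int, 'clicks': [(serp_element, dwell_s)]}
    weights:  SERP-element weight map used by the reward metric R(S).
    """
    impressions = sum(s['impressions'] for s in sessions)
    clicks = [c for s in sessions for c in s['clicks']]
    ctr = len(clicks) / impressions if impressions else 0.0
    mdt = sum(dwell for _, dwell in clicks) / len(clicks) if clicks else 0.0
    # SS_S = 1 if the session had at least one click, else 0.
    session_sat = sum(1 for s in sessions if s['clicks']) / len(sessions)
    # R(S) = sum of element weights over the session's clicks; mean over sessions.
    mean_reward = sum(
        sum(weights.get(el, 0.0) for el, _ in s['clicks']) for s in sessions
    ) / len(sessions)
    return {'CTR': ctr, 'MDT': mdt, 'SessionSat': session_sat, 'Reward': mean_reward}
```

Separating the raw log aggregation from the weight map is what lets the reward metric re-score the same sessions under different SERP-element weightings.
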
Logging infrastructure records all HTTP events, interactions, and click metadata at scale, allowing for high-resolution, multi-corpus analytic slicing. The initial CLEF LiLAS 2021 evaluation involved nine systems, demonstrating that production academic search is highly optimized (no significant click-rate win by experimental systems), but the reward metric exposed systems yielding longer dwell times through more appealing result presentations. The design enforces strict reproducibility: containers and code can be reused directly by other labs for subsequent evaluation (Breuer et al., 2022).

2. STELLA in the Aerospace Domain: Terminology-Aware IR Benchmark

The STELLA framework for aerospace IR, introduced in (Kim, 7 Jan 2026), addresses the lack of a public benchmark capturing terminology-driven query intent on aerospace documentation:

  • Data Construction Pipeline:
    • Terminology Concordant Query (TCQ): Embeds technical term(s) verbatim for lexical-matching evaluation.
    • Terminology Agnostic Query (TAQ): Substitutes descriptions for the terms, for semantic-matching evaluation.
    • Cross-lingual extension: Hybrid translation (terms preserved, context translated) generates 7×1,000 queries (En, Ko, Id, Th, Fr, Zh, Ja) in BEIR format.
  • Formal Task Definition:
    • Q_TC(P) targets lexical matching (BM25 relevance).
    • Q_TA(P) targets semantic matching (encoder cosine similarity).
    • Dual scoring functions:
      • S_lex(Q, P): BM25
      • S_sem(Q, P): embedding cosine similarity
  • Query Generation:
    • Chain-of-Density (CoD) yields progressively denser queries within passage context.
    • Self-Reflection: LLM-generated critiques enforce answerability, intent, and no external leakage at each generation stage.
  • Retrieval Metrics:
    • Precision@k, Recall@k, MAP, nDCG@10 over defined query-passage pairs.
  • Empirical Findings:
    • Decoder-only embedding models (Llama-Embed-Nemotron) minimize the gap between lexical and semantic matching, outperforming bi-encoders and BM25 on TAQ.
    • BM25 remains dominant for direct terminology matching (TCQ), but fails for TAQ.
    • Hybrid translation in cross-lingual settings preserves retrieval performance, justifying real-user query emulation (Kim, 7 Jan 2026).

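The two scoring functions in the task definition can be sketched in plain Python. This is a generic Okapi BM25 plus cosine similarity, with illustrative default parameters (k1, b); it stands in for, but is not, the benchmark's actual scorer:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """S_lex(Q, P): Okapi BM25 of each tokenized doc against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        s = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            # Smoothed idf, as in the standard Okapi formulation.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def cosine(u, v):
    """S_sem(Q, P): cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

TCQ queries reward the exact-term matching that BM25 performs, while TAQ queries, which paraphrase the terminology away, can only be resolved in the embedding space scored by the cosine function.
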
Reproducible code, dataset, and pipeline documentation are provided for complete transparency and extensibility.
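
Of the retrieval metrics listed above, nDCG@10 is the least self-explanatory; a minimal single-query implementation (graded relevance, log2 discount) looks like this:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: retrieved ordering of passage ids.
    relevance:  doc id -> graded relevance (ids not present count as 0).
    """
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    # Ideal DCG: the top-k relevance grades in descending order.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this over all TCQ or TAQ queries yields the benchmark-level nDCG@10 figure reported for each retriever.
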

3. STELLA in Astrophysical Radiative Transfer

STELLA (Kozyreva et al., 2020) is a time-dependent, multigroup radiation hydrodynamics code extensively benchmarked for SN Ia, SN II-peculiar, and SN II-P light curves (LCs), with particular focus on the role of the thermalisation parameter ε (the fraction of line opacity treated as absorption):

  • Governing Equation: Time-dependent RT equation with source function S_ν = (1 − ε) J_ν + ε B_ν, where ε = κ_ν/(κ_ν + σ_ν) is imposed globally for all bound–bound transitions.
  • Benchmarks:
    • Grid scan in ε (1, 0.9, 0.8, 0.5, 0.1, 0) and comparison to ARTIS and observed LCs.
    • U, B sensitivities: blue-band LCs are excessively broadened for ε < 0.8 in all SN types.
    • Best agreement in B/V (Type Ia) and U/B (II-P/II-pec) consistently occurs for 0.8 ≤ ε ≤ 0.9.
    • This strongly supports ε ≈ 0.9 as the default: it mimics partial fluorescence without excessive blue trapping.

Correlation, least-squares, and linear regression analyses confirm this setting across nearly all bands and SN types, providing a prescription for future STELLA-based LC modeling (Kozyreva et al., 2020).
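
The source function above is simple enough to evaluate directly; a small numeric sketch (opacity values chosen purely for illustration) shows how ε interpolates between pure scattering and full thermalisation:

```python
def source_function(J, B, kappa, sigma):
    """S_nu = (1 - eps) * J_nu + eps * B_nu with eps = kappa / (kappa + sigma).

    eps -> 1 means pure absorption (fully thermalised line opacity);
    eps -> 0 means pure scattering, leaving S_nu pinned to the mean intensity J.
    """
    eps = kappa / (kappa + sigma)
    return (1.0 - eps) * J + eps * B, eps

# With kappa = 9 and sigma = 1, eps = 0.9, the recommended default:
# S = 0.1 * J + 0.9 * B, i.e. the source function sits close to the Planck term.
S, eps = source_function(J=2.0, B=4.0, kappa=9.0, sigma=1.0)
```

This makes the benchmark's finding concrete: at ε ≈ 0.9 the source function is dominated by the thermal term B_ν while retaining a small scattering contribution, which is what tempers the blue-band broadening seen at lower ε.
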

4. STELLA in Stencil-Based PDE and Parareal Frameworks

In (Arteaga et al., 2014), STELLA (STEncil Loop LAnguage) is a C++ DSEL for high-performance stencil computations, explicitly benchmarked in the context of the Parareal time-parallelization scheme:

  • Problem: 3D advection–diffusion with time-dependent viscosity ν(t).
  • Implementation: Coarse (upwind+Euler) and fine (RK4+4th order FD) solvers as STELLA stencils; time parallelism by MPI, spatial within-node via OpenMP/CUDA.
  • Quantitative Results:
    • Parareal converges in K = 3 iterations (defect below the fine-mesh error).
    • Speedup closely follows the theoretical bound up to N_p = 128 time slices, with measured parallel efficiency E_meas ≈ 20–32%.
    • Energy-to-solution overhead matches the 1/E bound; communication costs are negligible at this scale.

This validates the Parareal-STELLA approach for large-scale PDEs, yielding energy-efficient, scalable time-parallel solutions (Arteaga et al., 2014).
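
The theoretical bound referenced above can be made explicit with the standard Parareal cost model (not the paper's own derivation; the cost ratio here is an assumed input): with N time slices, K iterations, and coarse-to-fine cost ratio r = τ_coarse/τ_fine, the speedup over serial fine integration is roughly S = 1 / ((1 + K)·r + K/N), bounded by both N/K and 1/((1+K)·r), so parallel efficiency S/N can never exceed 1/K:

```python
def parareal_speedup(n_slices, k_iters, cost_ratio):
    """Estimated Parareal speedup over serial fine integration.

    Standard cost model: serial coarse sweeps dominate between parallel
    fine solves, giving S = 1 / ((1 + K) * r + K / N) with
    r = tau_coarse / tau_fine. Efficiency S / N is capped at 1 / K.
    """
    return 1.0 / ((1.0 + k_iters) * cost_ratio + k_iters / n_slices)
```

For K = 3 the efficiency cap is 1/3 ≈ 33%, which is consistent with the measured 20–32% range once a realistic (nonzero) coarse-solver cost is included.
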

5. STELLA as a Gyrokinetic Benchmark Suite

STELLA (Barnes et al., 2018, González-Jerez et al., 2021, St-Onge et al., 2022) supplies comprehensive benchmarks for collisional and collisionless δf gyrokinetics in both axisymmetric and 3D stellarator geometry:

  • Physics: Mixed implicit–explicit time stepping, VMEC/Miller geometry support, flux-tube and global approaches.
  • Benchmarks:
    • Agreement with GS2 (axisymmetric) and GENE (stellarator, W7-X; see (González-Jerez et al., 2021)) to within a few percent in linear growth rates, frequencies, eigenfunctions, and nonlinear saturated flux.
    • New flux-tube–based global scheme (St-Onge et al., 2022) reproduces radial profiles and fluxes accurately, eliminating Dirichlet boundary spectral artifacts and restoring quantitative agreement with local theory.
  • Parallelization: Strong scaling up to 10^4 cores is feasible, with negligible communication overhead in the explicit/semi-Lagrangian steps and efficient Green's function solves for the implicit parallel terms.

STELLA in this context sets a de facto standard for stellarator/tokamak code validation (Barnes et al., 2018, González-Jerez et al., 2021, St-Onge et al., 2022).

6. STELLA for Multimodal Protein Function Prediction

STELLA (Xiao et al., 4 Jun 2025) represents a multimodal LLM framework, integrating sequence–structure embeddings (ESM3) with a Llama-3.1 backbone for protein function prediction via the OPI-Struc instruction dataset:

  • Model: Frozen ESM3 encoder; trainable linear connector; Llama-3.1-8B-Instruct LLM; two-stage MMIT for alignment and instruction following.
  • Tasks: Functional description generation (free-text and multiple-choice) and enzyme-catalyzed reaction (EC label) prediction.
  • Evaluation:
    • BLEU-4, BERTScore, ROUGE, accuracy.
    • STELLA advances state-of-the-art by +3 BLEU-4 and +2 ROUGE-L over prior models, with top enzyme-naming accuracy (88.85%).
  • Ablations: Show ESM3+Llama-3.1 yields optimal feature clustering and free-text generation; two-stage fine-tuning and dialogue augmentation further increase robustness.

This benchmark establishes the paradigm of LLM-augmented multimodal models for bioinformatics (Xiao et al., 4 Jun 2025).
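
Of the evaluation metrics listed, BLEU-4 is the most involved; a minimal single-reference sentence-level reimplementation (clipped n-gram precisions for n = 1..4, geometric mean, brevity penalty) looks like this. It is an illustrative sketch, not the benchmark's exact scoring script, which may apply smoothing or corpus-level aggregation:

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against a single reference (whitespace tokens)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # any zero n-gram precision zeroes unsmoothed BLEU
        log_prec += math.log(clipped / total) / 4
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

A reported gain of +3 BLEU-4 therefore means the generated functional descriptions share substantially more exact 1- to 4-gram overlap with the reference annotations.
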

7. STELLA for Time Series Forecasting with LLMs

STELLA (Fan et al., 4 Dec 2025) introduces semantic–temporal alignment for time series forecasting by dynamically abstracting trend, seasonality, and residual, then mapping these via hierarchical semantic anchors into a frozen LLM context:

  • Architecture: Neural STL decomposition, patch embeddings, CSP (Corpus-level) and FBP (Instance-specific) prompts as cross-attention prefix tokens for the transformer.
  • Datasets: Eight multivariate time-series datasets (ETTm1, ETTm2, ETTh1, ETTh2, Weather, Exchange, Illness, M4).
  • Metrics: MSE, MAE, SMAPE, MASE, OWA.
  • Performance:
    • STELLA surpasses prior art in both zero-shot and few-shot transfer across all benchmarks (e.g., ETTm2 few-shot: 0.291 MSE vs. 0.303 for TiDE).
    • Ablations demonstrate explicit decomposition (N-STL) as critical; both CSP and FBP non-redundantly improve accuracy by 1–3%.

STELLA thereby systematizes prompt-driven LLM forecasting from explicit component analysis, merging statistical and LLM paradigms (Fan et al., 4 Dec 2025).
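
The decomposition step at the heart of the pipeline can be illustrated with a classical additive decomposition (centred moving-average trend, per-phase seasonal means, residual). This is a stand-in for the paper's neural STL module, shown only to make the trend/seasonality/residual split concrete:

```python
def decompose(series, period):
    """Classical additive decomposition of a univariate series (sketch)."""
    n = len(series)
    half = period // 2
    # Centred moving average as trend; endpoints fall back to a clipped window.
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonal component: mean of detrended values at each phase, centred to 0.
    phase_means = [
        sum(detrended[p::period]) / len(detrended[p::period]) for p in range(period)
    ]
    mean_seasonal = sum(phase_means) / period
    seasonal = [phase_means[i % period] - mean_seasonal for i in range(n)]
    residual = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

Each of the three components would then be patch-embedded and mapped through the semantic anchors into the frozen LLM context, rather than feeding the raw series in directly.
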


In summary, "STELLA Benchmark" denotes a set of state-of-the-art, domain-specific benchmark frameworks, each exemplifying formalized protocols, public reproducibility, and fine-grained analysis for robust, transparent evaluation across IR, astrophysics, computational physics, molecular biology, and ML-based forecasting.
