
MASTEST System Overview

Updated 29 November 2025
  • MASTEST designates several multi-paradigm systems supporting automated test generation, adaptive psychometric testing, continuous integration for scientific computation, and device orchestration in astrophysics.
  • It employs rigorous methods including evolutionary algorithms, large language model agents, and sequential GLR tests to optimize coverage, fault detection, and test efficiency.
  • Applications range from RESTful API testing and computerized adaptive testing to CI in stellar astrophysics and hierarchical control in solar telescope operations.

MASTEST designates several distinct, technically rigorous systems for automated test generation, adaptive psychometric testing, continuous integration in scientific computation, and device orchestration in astrophysics. The acronym has appeared in multiple research contexts; this article delineates the key architectures, algorithms, workflows, and evaluation constructs for each paradigm, referencing the principal published implementations and evaluations.

1. Evolutionary Multi-Context Automated System Test Generation

MASTEST, as described in EvoMaster, refers to an open-source, white-box test generation system targeting RESTful web services on the JVM (Arcuri, 2019). The architecture is modular, partitioned into a Core Process (evolutionary search), Driver Process (harness for SUT execution and instrumentation), and an Analysis & Feedback Module.

  • The Core Process parses configuration options and drives the evolutionary algorithm (MIO by default), emits test classes in Java/JUnit using RestAssured, and ingests fitness feedback from the harness via JSON.
  • The Driver Process runs the SUT (either embedded or external), instruments bytecode (ASM library) to collect statement and branch coverage, and exposes “run this test case” over a minimal REST API.
  • The Analysis & Feedback Module computes fitness for each coverage target (e.g., branch), aggregates scores per individual suite, and guides subsequent search generations.

The evolutionary search applies the MIO, WTS, or MOSA algorithms. Individuals are encoded as sequences of HTTP calls, where each gene covers the method, path (with placeholders), headers, query parameters, and payload, including inter-call dependencies such as extraction of shared response fields. The fitness for each branch b is:

f_b(x) = \begin{cases} 0, & \text{if branch } b \text{ is covered} \\ \dfrac{d_{\text{raw}}(x, b)}{d_{\text{raw}}(x, b) + 1}, & \text{otherwise} \end{cases}

Selection proceeds via tournaments; crossover is single-point at call boundaries; mutation allows method switching and payload perturbation. Termination occurs on timeout or full coverage.
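The normalized branch-distance fitness above can be sketched in a few lines (a simplified illustration, not EvoMaster's actual implementation; `raw_distance` stands in for the instrumented distance reported by the Driver Process):

```python
def branch_fitness(raw_distance: float, covered: bool) -> float:
    """Normalized branch fitness f_b(x): 0 when the branch is covered,
    otherwise d/(d+1), which maps any raw distance into (0, 1]."""
    if covered:
        return 0.0
    return raw_distance / (raw_distance + 1.0)


def suite_fitness(targets):
    """Aggregate fitness over coverage targets (lower is better).

    `targets` is a list of (raw_distance, covered) pairs, one per branch.
    """
    return sum(branch_fitness(d, c) for d, c in targets)
```

The d/(d+1) normalization keeps every uncovered branch comparable on a common [0, 1) scale, so the search can aggregate heterogeneous distance measures across targets.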

Multi-context encoding incorporates request parameters, payload structures, session state (cookies, tokens), and environment (e.g., databases). Context dimensions propagate through mutation and crossover. The system’s modular driver interface allows protocol/language portability through JSON-over-TCP.

Empirical evaluation over three large-scale APIs (two open-source, one industrial; 2k–10k LOC) yielded 20–40 % statement coverage and detected 38 server-side 5xx faults, with coverage growth saturating after rapid initial increments.

SUT            LOC    Coverage (%)   Faults found
OpenApi1       2345   35             12
OpenApi2       9812   22             9
IndustrialSvc  4765   39             17

Planned extensions include database-aware heuristics, support for non-JVM SUTs, richer session/context modeling, grammar-based payloads, and hybridization with symbolic execution.

2. LLM-Based Multi-Agent System for RESTful API Testing

MASTEST also denotes a web-based, multi-agent system employing LLMs and coded agents for automated RESTful API testing (Han et al., 22 Nov 2025). The toolchain covers the entire workflow from OpenAPI specification parsing through scenario generation, script synthesis, execution, and coverage analysis, integrating human-in-the-loop quality control.

Agents and Workflow:

  • API Parser (coded): parses Swagger/OpenAPI JSON, enumerates operations, parameters, schemas, and commits op-metadata to a MySQL datastore.
  • Unit/System Test Scenario Generators (LLM-based): generate natural-language test scenarios (positive, negative, edge-case, workflow) from API specs.
  • Test Script Generator (LLM-based): produces Pytest scripts in Python, with proper request logic and assertion structures for status codes and response bodies.
  • Data Type and Syntax Checkers (LLM/coded): verify parameter-type consistency (LLM) and syntactic validity (ast.parse).
  • Test Script Executor and Result Correctness Checker (coded): execute scripts via Pytest, parse logs, and annotate failures.
  • Status Code Coverage Checker (LLM-based): assesses static and dynamic status code coverage (operation–spec conformance).
  • Human review gates require scenario and script inspection before progressing.
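The syntactic-validity check in this pipeline is stated to rely on `ast.parse`; a minimal checker in that spirit (a sketch, not the system's actual agent code) could look like:

```python
import ast


def check_syntax(script_source: str) -> bool:
    """Return True if a generated test script parses as valid Python.

    This mirrors the Syntax Checker agent's role: a script that fails
    ast.parse is rejected before it ever reaches the Pytest executor.
    """
    try:
        ast.parse(script_source)
        return True
    except SyntaxError:
        return False


# A well-formed generated test passes; a truncated LLM output does not.
valid_script = "def test_status():\n    assert 200 == 200\n"
broken_script = "def test_status(:\n    assert"
```

Gating on `ast.parse` is cheap and catches the most common LLM failure mode (truncated or malformed output) without executing untrusted code.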

Metric Definitions:

\begin{align*}
\text{Syntax correctness:}\quad & Cor_{\text{Syn}}(api) = |\{ t \mid \text{Valid}_{\text{Syn}}(t) \}| \,/\, |T_{LLM}(api)| \\
\text{Data type correctness:}\quad & Cor_{DT}(api) = |\{ t \mid \text{Valid}_{DT}(t) \}| \,/\, |T_{LLM}(api)| \\
\text{Unit scenario coverage:}\quad & Cov_{US}(api) = \Bigl|\bigcup_{op} \bigl( S_{LLM}(op) \cap S_{Fin}(op) \bigr)\Bigr| \,/\, \Bigl|\bigcup_{op} S_{Fin}(op)\Bigr| \\
\text{Operation coverage:}\quad & Cov_{Ops}(api) = |\text{Ops}(T_{Fin}(api))| \,/\, |\text{Ops}(api)|
\end{align*}
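Two of these ratio metrics can be sketched directly as set arithmetic (helper names here are illustrative, not from the paper; `is_valid` stands in for whichever validity predicate is being measured):

```python
def syntax_correctness(generated_tests, is_valid):
    """Cor_Syn: fraction of LLM-generated tests passing the validity check."""
    return sum(1 for t in generated_tests if is_valid(t)) / len(generated_tests)


def operation_coverage(exercised_ops, spec_ops):
    """Cov_Ops: fraction of API operations in the spec that the final
    test suite exercises (set intersection over the spec's operations)."""
    spec = set(spec_ops)
    return len(set(exercised_ops) & spec) / len(spec)
```

Both metrics are simple ratios over sets, which makes them cheap to recompute after every human-review gate in the workflow.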

Empirical results for five public APIs (Car, Petstore3, Bills, Canada Holidays, Cat Fact), with tests run on GPT-4o and DeepSeek V3.1 Reasoner:

Metric                          GPT-4o        DeepSeek
Syntax correctness              100 %         100 %
Data type correctness           74.6 %        87.8 %
Unit scenario coverage (avg)    94 %          98 %
System scenario coverage (avg)  79 %          78 %
Operation coverage              100 %         98 %
Bugs detected (total)           158           162
Usability (avg edit distance)   30.3 chars    24.6 chars
Static status code coverage     54.6 %        83.8 %

Key findings indicate robust feasibility of end-to-end LLM-driven API test automation, with DeepSeek leading in data type correctness, status code detection, and script editability; GPT-4o was consistently highest in operation coverage. Manual reviewer edits were minimal, and full syntax correctness was always achieved.

Planned extensions involve CI/CD pipeline integration, more complex adequacy metrics (e.g., EvoMaster/Jacoco coverage), hierarchical scenario decomposition for token management, and enhanced automated artifact validation agents.

3. Adaptive Mastery Test Systems in Computerized Testing

In psychometric and educational measurement, MASTEST references adaptive mastery test architectures using sequential GLR-based test statistics (Bartroff et al., 2011). The underlying psychometric model is the Three-Parameter Logistic (3PL):

p_j(\theta) = c_j + (1 - c_j) \,/\, \bigl(1 + e^{-a_j(\theta - b_j)}\bigr)

Errors are controlled using dual-boundary sequential tests with maximum test length N and indifference region (θ_-, θ_+):

  • Null hypothesis H_0: mastery (θ ≥ θ_+); alternative H_1: non-mastery (θ ≤ θ_-).
  • Update the log-likelihood ℓ_n(θ) over observed responses and select the next item maximizing Fisher information at the current ability estimate θ̂_n.
  • GLR statistic:

\Lambda_n = \log \frac{\sup_{\theta \leq \theta_-} L_n(\theta)}{\sup_{\theta \geq \theta_+} L_n(\theta)}

Testing stops if Λ_n ≥ A (declare non-mastery) or Λ_n ≤ −B (declare mastery). If the maximum length N is reached, a final threshold C applies. The parameters A, B, C are calibrated for target Type I/II error rates (α, β), typically via Monte Carlo simulation.
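The 3PL likelihood, the GLR statistic, and the stopping rule above can be sketched together (a grid-search illustration under assumed item parameters, not the paper's calibrated procedure):

```python
import math


def p_3pl(theta, a, b, c):
    """3PL response probability: c + (1 - c) / (1 + exp(-a (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of 0/1 responses; items are (a, b, c) triples."""
    ll = 0.0
    for (a, b, c), x in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll


def glr_statistic(items, responses, theta_lo, theta_hi, step=0.01):
    """Lambda_n: log-ratio of suprema of L_n over theta <= theta_- vs
    theta >= theta_+, approximated on a grid over [-4, 4]."""
    grid = [k * step for k in range(-400, 401)]
    sup_lo = max(log_likelihood(t, items, responses) for t in grid if t <= theta_lo)
    sup_hi = max(log_likelihood(t, items, responses) for t in grid if t >= theta_hi)
    return sup_lo - sup_hi


def decide(lam, A, B):
    """Dual-boundary stopping rule: >= A non-mastery, <= -B mastery."""
    if lam >= A:
        return "non-mastery"
    if lam <= -B:
        return "mastery"
    return "continue"
```

With ten identical items (a = 1, b = 0, c = 0.2) all answered correctly, the statistic is strongly negative and the rule declares mastery well before a fixed-length test of 50 items would end.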

Simulation results on a real 3PL pool (ETS/Chauncey, 1136 items) with (α, β) = (0.05, 0.05), N = 50, and offset 0.25:

Method                          Avg length   Type I error   Power (1−β)
Fixed-length (N = 50)           50.0         5.0 %          95.0 %
TSPRT (Wald approximation)      44.2         16.1 %         —
Modified TSPRT (retuned at N)   44.2         5.0 %          —
Modified GLR (modHP)            24.5         5.0 %          —

GLR-based tests outperform fixed-length and SPRT in reducing average test size while maintaining strict error control, and are asymptotically first-order optimal for expected length.

Practical deployment involves large-scale item pool calibration, adaptive item selection via information maximization, and content/exposure control through balanced sampling or stratification.

4. Continuous Integration for Scientific Codes: MESA Stellar Astrophysics Testing

The MASTEST infrastructure for the MESA ("Modules for Experiments in Stellar Astrophysics") project enables robust, heterogeneous continuous integration across diverse computational environments (Wolf et al., 2023).

Architecture:

  • Test Harness (MESA scripts): modular shell/Fortran code for running test cases per module, collecting run metrics, and initializing metadata.
  • Local Orchestration (mesa_test Ruby gem): manages git mirror and worktrees per commit, launches harness scripts, generates JSON payloads, attaches machine metadata, and posts results to a cloud API.
  • Scheduler Layer: supports both serial and parallel execution models, integrating with cluster queueing systems (SLURM, PBS, LSF). One job per test case enables efficient resource utilization.
  • Result Collector (TestHub): Rails application ingests JSON metrics and logs, aggregates statistics, and provides rich web-based visualization.
  • Database (PostgreSQL): normalized schema tracks branches, commits, machines, test cases, metadata, run metrics, enabling historical trend analyses.
  • Visualization/Dashboard: web front-end offers commit, test case, and historical views, plus automated daily failure regression emails and Slack notifications.
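The per-commit result flow through this architecture can be illustrated with a small payload builder (the field names here are hypothetical illustrations, not the actual mesa_test/TestHub schema):

```python
import json


def build_result_payload(commit, machine, test_case, passed, runtime_s, metrics):
    """Assemble a JSON result payload in the spirit of the mesa_test flow:
    run metrics plus machine metadata, keyed by commit and test case.

    All field names are illustrative; the real gem defines its own schema.
    """
    return json.dumps({
        "commit": commit,          # git SHA the worktree was checked out at
        "machine": machine,        # submitting host, for per-machine trends
        "test_case": test_case,    # harness test case identifier
        "passed": passed,
        "runtime_s": runtime_s,
        "metrics": metrics,        # arbitrary run metrics for trend tracking
    })
```

Keeping the payload as plain JSON keyed by commit, machine, and test case is what lets the PostgreSQL schema normalize it into the historical tables used for trend analyses.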

Regression Detection:

Statistical rules flag a regression if the new runtime satisfies t_new > μ_t + k·σ_t, typically with k = 3. Historical tables permit tracking of both performance drift and physics-output drift.
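The k-sigma rule above amounts to a one-line check over the runtime history (a minimal sketch, not the TestHub implementation):

```python
from statistics import mean, stdev


def flag_regression(history, t_new, k=3.0):
    """Flag t_new as a runtime regression if it exceeds the historical
    mean by more than k sample standard deviations (default k = 3)."""
    mu = mean(history)
    sigma = stdev(history)
    return t_new > mu + k * sigma
```

The same rule applies unchanged to physics outputs tracked in the historical tables, simply by feeding it a different metric column.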

Integration with version control (GitHub webhooks) automates branch/commit synchronization. Commit-message flags allow customized test behaviors (e.g., skipping full suite, running optional inlists).

5. MARST: Multi-channel Antarctic Solar Telescope Software Control System

Within the MARST solar telescope project, MASTEST identifies the hierarchical control and test architecture for coordinating device operation and observation (Chen et al., 2018).

Layered Architecture:

  • Device Control Layer: EPICS IOCs front-end each device (mount, focuser, filter wheels, dome, Andor/PI CCDs) through TCP/serial/V4L2 drivers.
  • Observation Operation Layer: RTS2 core interprets XML device definitions, instantiates plan classes (dual-tube workflows), manages plan prioritization and resource locks (e.g., mount control).
  • User Interface Layer: PyQt5/QML GUI interacts via HTTP/JSON (rts2-httpd) and EPICS CA (rts2-proxy). GUIs provide manual device control, plan management, FITS image display, live telemetry, and log viewing.

Key Algorithms:

  • Self-Guiding Actor: computes solar centroid from image intensity, drives corrective mount slews when drift exceeds threshold, maintaining accurate sun tracking over multi-hour runs.
  • Flat-Field Exposure Plan: executes exposures at 12 azimuth positions, calibrating median ADU per image, and ensures mechanical/seeing stabilization.
  • Resource Arbitration: plan queue maintains mount locks; higher-priority plans preempt running lower-priority plans, ensuring robust multi-tube coordination.
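The mount-lock arbitration described above can be sketched as a priority queue with preemption (an illustrative model, not the RTS2 plan-queue code):

```python
import heapq


class PlanQueue:
    """Observation plans contend for the mount lock; a higher-priority
    arrival preempts the running plan, which is requeued for later."""

    def __init__(self):
        self._heap = []      # (negated priority, plan name): max-heap via heapq
        self.running = None  # (priority, name) currently holding the mount lock

    def submit(self, priority, name):
        if self.running is None:
            self.running = (priority, name)
            return "started"
        if priority > self.running[0]:
            # Preempt: requeue the running plan, hand the lock to the arrival.
            heapq.heappush(self._heap, (-self.running[0], self.running[1]))
            self.running = (priority, name)
            return "preempted"
        heapq.heappush(self._heap, (-priority, name))
        return "queued"

    def finish(self):
        """Release the lock and resume the highest-priority queued plan."""
        if self._heap:
            neg_p, name = heapq.heappop(self._heap)
            self.running = (-neg_p, name)
        else:
            self.running = None
```

Requeuing rather than discarding the preempted plan is what keeps a low-priority flat-field sequence from being lost when a tracking correction takes over the mount.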

Performance Metrics:

  • IOC command–response latency below 50 ms, plan-switch latency of roughly 2 s, self-guiding drift below 2″ over 2 h, flat-field uniformity within 2 %, GUI refresh rates of 1–2 Hz (update intervals of 1 s to 500 ms), sustained control CPU load below 30 %, and uninterrupted device-server operation beyond one week.

6. Comparative Paradigms and Domain Significance

The name MASTEST thus subsumes divergent research systems across software engineering, psychometrics, distributed scientific workflow, and instrument control.

  • In test generation, the distinguishing feature is multi-context gene encoding coupled with evolutionary optimization, supporting high fault detection and adaptable context modeling.
  • In LLM-driven agent orchestration, modular decomposition and human-in-the-loop oversight prove tractable for full-stack RESTful API test automation, with coverage, correctness, and usability metrics for quantitative assessment.
  • In adaptive mastery testing, advanced sequential GLR designs optimize test brevity and error precision with strong theoretical guarantees.
  • In continuous integration for computational science, MASTEST enables cross-platform automated validation, regression alerting, and historical performance tracking.
  • In telescope automation, the layered device-operation architecture generalizes resource arbitration, priority scheduling, and behavioral feedback algorithms.

MASTEST implementations are generally open source and parameterized for extension, with future work emphasizing cross-language support, advanced context modeling, integration with symbolic analysis, richer data generation, and broader workflow orchestration.


Principal references: Arcuri (2019); Bartroff et al. (2011); Chen et al. (2018); Wolf et al. (2023); Han et al. (22 Nov 2025).
