TrainTicket Benchmark: Evaluating Microservices

Updated 3 February 2026

TrainTicket Benchmark is a comprehensive open-source evaluation suite simulating real-world train ticketing systems through 45 microservices and realistic workflows.
It offers empirical insights into fault diagnosis, workflow coordination through both choreography and Temporal-based orchestration (which roughly halved debugging time in one controlled study), and meta-learning for trace anomaly classification.
The benchmark also extends to public transport routing, applying multi-criteria optimization algorithms to fare and journey planning.

The TrainTicket Benchmark is a widely adopted open-source evaluation suite and application system for research on microservice-based systems (MSS), model-based fault diagnosis, public transport optimization, and observability-driven AIOps. Developed to model the complexity of a real-world train-ticketing system, it provides a standardized, reproducible platform supporting the empirical evaluation of system orchestration approaches, debugging methods, meta-learning frameworks for anomaly diagnosis, and price-optimal routing algorithms. Its architecture, datasets, and workflows realistically replicate workflow-driven enterprise systems and public transport scenarios, enabling robust, large-scale experimental studies in cloud-native environments (Wang et al., 2024, Nadeem et al., 2022, Euler et al., 2022).

1. System Architecture and Scope

TrainTicket is a medium-scale, reference microservice system simulating an end-to-end ticket-booking workflow. It comprises 45 independently deployable microservices implemented in Java, NodeJS, Python, and Go. These services encapsulate functional areas such as user management, seat availability, payment, refunds, order processing, and notification delivery. User requests flow through an API gateway and trigger fan-out operations across the microservices, supporting detailed workflows for search, booking, reservation, payment, refund, cancellation, and schedule changes.

Inter-service communication employs both synchronous (HTTP/REST) and asynchronous messaging (event streams and queues), forming a realistic substrate for investigating faults, concurrency, timing anomalies, and emergent behavior. OpenTelemetry instrumentation is applied system-wide, ensuring span-level traceability and consistent semantic log collection for all operations. Every request is captured as a trace—a rooted tree of spans annotated with precise timestamps and event logs—forming the basis of subsequent trace-driven ML and debugging analyses (Wang et al., 2024).
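The trace structure described above can be sketched as a small data model. This is a minimal illustration, not the OpenTelemetry SDK or the benchmark's own schema: the `Span` fields, the `build_trace` helper, and the service names in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified field set)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_us: int             # start timestamp in microseconds
    end_us: int               # end timestamp in microseconds
    logs: List[str] = field(default_factory=list)
    children: List["Span"] = field(default_factory=list)


def build_trace(spans: List[Span]) -> Span:
    """Link spans through parent_id and return the single root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    roots: List[Span] = []
    for s in spans:
        if s.parent_id is None:
            roots.append(s)
        else:
            by_id[s.parent_id].children.append(s)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


def depth(span: Span) -> int:
    """Number of spans on the longest root-to-leaf call path."""
    return 1 + max((depth(c) for c in span.children), default=0)


def duration_us(span: Span) -> int:
    """Wall-clock duration of a single span."""
    return span.end_us - span.start_us
```

For instance, a gateway span with an order-service child and a payment-service grandchild yields a tree of depth 3, and per-span durations and log lines can then be fed to whatever embedding or diagnosis step follows.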

2. Fault Datasets and Abnormal Trace Benchmarking

The TrainTicket Benchmark incorporates diverse, multi-modal datasets containing both normal and abnormal execution traces. Two primary datasets underpin its trace anomaly and fault diagnosis features:

DeepTraLog: Provides traces from 14 fault branches modeled on real industrial faults (e.g., misconfigured services, cascading call failures, race conditions, monolithic component faults). Each fault is injected as a separate Git branch, and traces are collected under representative load.
Nezha: Delivers traces from synthetic injection of four low-level fault types—including CPU-intensive, CPU-consumption, service exceptions, and incorrect message returns—across every service pod. Traces are labeled by fault type, with consistent OpenTelemetry-based capture.

Fault categories total 30, grouped into asynchronous-interaction faults (dangling callback, lost response), multi-instance issues (stale cache), configuration errors (wrong timeout, misrouted endpoint), monolithic component faults, resource-level injections, and messaging faults (service exception, message return mismatch). This taxonomy supports systematic construction of N-way K-shot meta-learning classification tasks: typically, 5-way tasks with support and query sets drawn from labeled instances in the abnormal trace corpus (Wang et al., 2024).
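Constructing an N-way K-shot task from a labeled trace corpus can be sketched as follows. This is an illustrative sampler under stated assumptions, not the procedure from Wang et al.: the function name `sample_episode`, the `n_query` parameter, and the corpus layout (a dict from fault category to trace identifiers) are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Labeled = Tuple[str, int]  # (trace identifier, episode-local class label)


def sample_episode(
    traces_by_fault: Dict[str, List[str]],
    n_way: int = 5,
    k_shot: int = 5,
    n_query: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Labeled], List[Labeled]]:
    """Draw one N-way K-shot task: disjoint support and query sets."""
    rng = rng or random.Random()
    needed = k_shot + n_query
    # Only fault categories with enough labeled traces can take part.
    eligible = [f for f, traces in traces_by_fault.items() if len(traces) >= needed]
    if len(eligible) < n_way:
        raise ValueError("not enough fault categories with sufficient traces")
    classes = rng.sample(eligible, n_way)
    support: List[Labeled] = []
    query: List[Labeled] = []
    for label, fault in enumerate(classes):
        # Sampling without replacement keeps support and query disjoint.
        picked = rng.sample(traces_by_fault[fault], needed)
        support += [(t, label) for t in picked[:k_shot]]
        query += [(t, label) for t in picked[k_shot:]]
    return support, query
```

Labels are re-indexed from 0 to N-1 inside each episode, so the learner cannot memorize a global fault identifier and has to rely on the support examples.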

3. Workflows: Choreography, Orchestration, and Workflow Engines

TrainTicket natively implements a choreography-based event-driven architecture: microservices independently publish and subscribe to events, generating business workflows organically via event patterns. The emergent workflows, such as ticket cancellation or refund, traverse numerous services through asynchronous domain events (e.g., TicketReserved, PaymentProcessed). Complexity arises from non-deterministic event ordering (due to network delays or broker reordering), concurrent request interleaving, and distributed fault propagation, all exacerbated by the absence of a centralized controller or explicit global view (Nadeem et al., 2022).

To investigate the benefits and challenges of orchestration, the TrainTicket Benchmark has been ported to the Temporal stateful workflow engine. In this orchestrated variant, business processes are formalized as deterministic Temporal Workflows, which schedule Activities (microservice calls) and manage sequencing, retries, and timeouts within centrally controlled, checkpointed execution logic. The refactoring replaces approximately 1000 lines of choreography “glue code” with 800 lines of workflow and activity definitions, and introduces Temporal server infrastructure for global queue and worker management. This setup enables deterministic replay, rollback, and unified global traceability for debugging (Nadeem et al., 2022).
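The idea behind centrally sequenced, replayable workflows can be illustrated with a toy orchestrator. This sketch deliberately does not use the Temporal SDK: the `Orchestrator` class, the `cancel_ticket_workflow` function, and the activity names are all hypothetical, and the in-memory history list only mimics the role that a durable event history plays in a real workflow engine.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class Orchestrator:
    """Toy central controller: runs activities in order, retries failures,
    and records every result so that a run can be replayed exactly."""

    def __init__(self, history: Optional[List[Tuple[str, Any]]] = None):
        self.history: List[Tuple[str, Any]] = list(history or [])
        self._cursor = 0

    def run_activity(self, name: str, fn: Callable[..., Any], *args: Any,
                     max_attempts: int = 3) -> Any:
        # Replay path: reuse the recorded result instead of calling the service.
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-deterministic workflow: expected {recorded_name}, got {name}")
            self._cursor += 1
            return result
        # Live path: call the activity, retrying on any exception.
        last_error: Optional[Exception] = None
        for _ in range(max_attempts):
            try:
                result = fn(*args)
                break
            except Exception as exc:
                last_error = exc
        else:
            raise RuntimeError(
                f"activity {name} failed after {max_attempts} attempts") from last_error
        self.history.append((name, result))
        self._cursor += 1
        return result


def cancel_ticket_workflow(orch: Orchestrator, order_id: str,
                           activities: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Explicit, ordered business process (activity names are made up)."""
    order = orch.run_activity("lookup_order", activities["lookup_order"], order_id)
    refunded = orch.run_activity("refund_payment", activities["refund_payment"], order["price"])
    orch.run_activity("release_seat", activities["release_seat"], order["seat"])
    return {"order_id": order_id, "refunded": refunded}
```

Re-running the workflow function against a saved history reproduces the original execution step by step without touching any service, which is the property that makes deterministic replay useful for debugging. It also shows why workflow code must itself be deterministic: a different sequence of activity names on replay is reported as an error.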

4. Applications in Trace-Based Fault Classification and Meta-Learning

TrainTicket’s richly labeled abnormal trace dataset underpins meta-learning approaches for automatic anomaly diagnosis. TraFaultDia exemplifies this by treating abnormal trace classification as a few-shot multi-class classification problem within and across MSS. A typical experimental paradigm leverages a 5-way K-shot task—each task is a classification among five fault types, with only K labeled support examples per type.

The core meta-learning architecture consists of a multi-modal trace embedding model (MultiHAttenAE) fusing span and log features, followed by a lightweight Transformer Encoder (TE) meta-learner trained via a MAML-style algorithm. During meta-training, inner-loop gradient steps on support sets adapt the TE; meta-parameters are refined in the outer loop across tasks.
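The inner-loop and outer-loop structure of MAML-style training can be made concrete with a deliberately tiny stand-in. In this sketch the Transformer Encoder is replaced by a one-parameter linear model and traces by synthetic regression tasks, so that the exact second-order meta-gradient can be written by hand; every function name, task distribution, and hyperparameter here is an assumption for illustration, not the TraFaultDia configuration.

```python
import random
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs for the model y = w * x


def mse(w: float, batch: Batch) -> float:
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)


def mse_grad(w: float, batch: Batch) -> float:
    return 2.0 * sum(x * (w * x - y) for x, y in batch) / len(batch)


def mse_hessian(batch: Batch) -> float:
    # Second derivative of the MSE with respect to w (independent of w).
    return 2.0 * sum(x * x for x, _ in batch) / len(batch)


def adapt(w: float, support: Batch, inner_lr: float) -> float:
    """Inner loop: one gradient step on the task's support set."""
    return w - inner_lr * mse_grad(w, support)


def maml_meta_gradient(w: float, support: Batch, query: Batch, inner_lr: float) -> float:
    """Gradient of the post-adaptation query loss with respect to the initial w."""
    w_adapted = adapt(w, support, inner_lr)
    d_adapted_d_w = 1.0 - inner_lr * mse_hessian(support)  # chain rule through the inner step
    return mse_grad(w_adapted, query) * d_adapted_d_w


def sample_task(rng: random.Random, k_shot: int) -> Tuple[Batch, Batch]:
    """Synthetic task: a line through the origin with a random slope in [1, 3]."""
    slope = rng.uniform(1.0, 3.0)

    def draw(n: int) -> Batch:
        return [(x, slope * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]

    return draw(k_shot), draw(k_shot)


def meta_train(steps: int = 500, tasks_per_step: int = 4, k_shot: int = 5,
               inner_lr: float = 0.1, meta_lr: float = 0.05, seed: int = 0) -> float:
    """Outer loop: average meta-gradients over a batch of tasks, then update w."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grads = []
        for _ in range(tasks_per_step):
            support, query = sample_task(rng, k_shot)
            grads.append(maml_meta_gradient(w, support, query, inner_lr))
        w -= meta_lr * sum(grads) / len(grads)
    return w
```

With a real network the hand-derived chain-rule factor is replaced by automatic differentiation through the inner update, but the control flow stays the same: adapt on the support set, evaluate on the query set, and update the shared initialization across tasks.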

Within-system evaluation (TrainTicket→TrainTicket) yields:

5-shot accuracy: 92.91% ± 2.10
10-shot accuracy: 93.26% ± 1.40

Ablation studies demonstrate the benefit of cross-modal fusion (≈10% gain over spans-only input) and the competitiveness of simple prototypical baselines (e.g., Nearest Neighbor at 92.56% on 10-shot tasks). These results attest to the benchmark’s capacity for evaluating both expressive meta-learning architectures and traditional statistical baselines under low-labeling conditions (Wang et al., 2024).
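A nearest-neighbor baseline of the kind mentioned above, together with the mean ± standard deviation reporting over episodes, can be sketched in a few lines. The embeddings are assumed to be plain float vectors produced by some upstream encoder; the function names and the Euclidean distance choice are illustrative assumptions rather than details taken from the cited study.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (trace embedding, fault label)


def nearest_neighbor_predict(support: List[Example], x: Sequence[float]) -> int:
    """Return the label of the support embedding closest to x (Euclidean)."""
    _, label = min(support, key=lambda ex: math.dist(ex[0], x))
    return label


def episode_accuracy(support: List[Example], query: List[Example]) -> float:
    """Fraction of query traces whose predicted label matches the true one."""
    hits = sum(nearest_neighbor_predict(support, x) == y for x, y in query)
    return hits / len(query)


def summarize(accuracies: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over episodes (needs at least two)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Because this baseline has no trainable parameters, its accuracy mostly reflects the quality of the embedding space, which is one reason it is a useful reference point next to a meta-learned classifier.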

5. Debugging and Diagnostic Evaluation

TrainTicket is specifically engineered for controlled fault injection and systematic debugging studies. Twenty-two industrially inspired faults (F1–F22) are categorized as interaction (13 faults), environment (5 faults), and internal (4 faults). Faults are induced through network delays, thread-pool exhaustion, SQL errors, or internal logic bugs.

A controlled debugging experiment with a professional developer (2+ years’ experience, no prior knowledge of the system) compares time-to-fix per fault:

Metric	Choreography	Orchestration (Temporal)
Avg. time per fault (20 fc)	7.91 h	3.96 h
Speedup	–	2.0×
Absolute time saved	–	3.95 h
Percent improvement	–	50%

Orchestration enables unified, end-to-end traces with deterministic replay and state checkpointing—drastically reducing fault localization overhead and yielding a near 50% reduction in debugging time. Temporal’s Web UI and global trace graph are cited as workflow-level diagnostic advantages (Nadeem et al., 2022).
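The derived rows of the table follow directly from the two average times. A short check, using only the 7.91 h and 3.96 h figures reported above (the function name is hypothetical):

```python
from typing import Dict


def debugging_summary(choreography_h: float, orchestration_h: float) -> Dict[str, float]:
    """Derive speedup, hours saved, and percent improvement from two averages."""
    saved = choreography_h - orchestration_h
    return {
        "speedup": choreography_h / orchestration_h,
        "hours_saved": saved,
        "percent_improvement": 100.0 * saved / choreography_h,
    }


summary = debugging_summary(7.91, 3.96)
```

The unrounded values are a speedup of about 1.997 and an improvement of about 49.9%, which is why the reduction is described as near 50% rather than exactly half.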

6. Extensions to Public Transport Routing and Fare Optimization

In the context of fare optimization and journey planning, TrainTicket supports the modeling and benchmarking of price-optimal earliest arrival problems (POEAP) using advanced graph-based representations. The benchmark formalizes conditional fare networks (CFNs) as six-tuples capturing transport network topology, ticket graphs, fare transition functions, partially ordered monoid weights, state/event sets, and price maps.

Journey optimization is based on adaptation of multi-criteria RAPTOR (McRAP) algorithms for CFNs, maintaining per-stop, per-round “labels” tracking arrival times and fare states. Labels are propagated by event-driven transitions and pruned via Pareto-dominance and state-dominance criteria; dominance partitions are parametric with respect to ticket comparability. Computational hardness is established (NP-hard), but for finite monoids and realistic fare graphs, query times remain practical (≤400 ms for unrestricted McRAP, ≤10 ms for Pareto-pruned variants with tight slack and state-based pruning) (Euler et al., 2022).
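Pareto-dominance pruning of a per-stop label bag can be sketched as follows. This is a simplification under stated assumptions: each label carries only an arrival time and a scalar price, whereas the conditional fare network formulation compares fare states and partially ordered monoid weights; the `Label` type and the function names are hypothetical.

```python
from typing import List, NamedTuple


class Label(NamedTuple):
    arrival: int   # arrival time at the stop, minutes since midnight
    price: float   # accumulated fare (scalar here, for illustration only)


def dominates(a: Label, b: Label) -> bool:
    """a dominates b if it is no later and no more expensive."""
    return a.arrival <= b.arrival and a.price <= b.price


def merge_into_bag(bag: List[Label], candidate: Label) -> List[Label]:
    """Return the bag after offering it a new label.

    A dominated candidate is discarded; an accepted candidate evicts
    every label it dominates, so the bag stays a Pareto front.
    """
    if any(dominates(kept, candidate) for kept in bag):
        return bag
    pruned = [kept for kept in bag if not dominates(candidate, kept)]
    pruned.append(candidate)
    return pruned
```

Starting from an empty bag and offering (600, 10.0), (590, 12.0), and (610, 9.0) keeps all three labels, since each is better in one criterion; offering (605, 11.0) changes nothing because (600, 10.0) dominates it. Keeping bags small in this way is what makes per-round label propagation affordable.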

7. Limitations and Future Directions

The TrainTicket Benchmark, while comprehensive for both microservice debugging and public transport routing, has certain constraints:

Current abnormal trace benchmarks omit integration of trace-level resource metrics (CPU, disk), complicating discrimination of performance anomalies versus resource faults.
Trace classification experiments are limited to 5-way tasks and two MSS datasets; real-world deployments may involve higher-dimensional or evolving taxonomies.
The orchestrated Temporal variant, although superior for debugging, introduces additional operational dependencies and determinism requirements.
In POEAP benchmarks, practical performance is closely tied to preprocessing (route duplication, footpath closure), per-ticket state-space pruning, and label-bag efficiency strategies.

Proposed extensions include incorporation of additional observability signals (metrics, structured events), scaling to more complex workflows and fault taxonomies, and empirical validation on production-scale MSS beyond TrainTicket (Wang et al., 2024, Euler et al., 2022, Nadeem et al., 2022).


References (3)

Wang et al. (2024). Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning.

Nadeem et al. (2022). A Case for Microservices Orchestration Using Workflow Engines.

Euler et al. (2022). Price Optimal Routing in Public Transportation.
