TransitBench: Multi-Domain Benchmarking
- TransitBench is a versatile benchmark suite covering distinct domains, including transit forecasting, video transition generation, service simulation, and route network design.
- It aggregates artifacts from MBTA day-ahead forecasting, endpoint-conditioned video transitions, simulation sandboxes, and TRNDP frameworks, each with tailored evaluation protocols.
- The benchmark standardizes evaluation metrics and data schemas to enable controlled, reproducible comparisons and actionable insights across heterogeneous transit research applications.
TransitBench is a reused benchmark name across several research programs in transportation, simulation, and generative modeling. In the current literature, it denotes at least four distinct artifacts: a reusable next-day forecasting benchmark for Massachusetts Bay Transportation Authority (MBTA) usage and delay prediction (Nalamalpu et al., 2 Dec 2025), a public benchmark for endpoint-conditioned video transition generation (Yang et al., 3 Aug 2025), an open simulation sandbox for comparing fixed-route, semi-flexible, and on-demand transit services (Yoon et al., 2021), and, in suite-level discussions, a container for city-scale transit route network design, transit analytics, and schedule-based assignment modules (Poudel et al., 27 May 2026). This suggests that the term functions less as a single canonical dataset than as a label for benchmark-oriented infrastructures whose common aim is controlled, reproducible comparison under shared inputs, metrics, and evaluation rules.
1. Scope, nomenclature, and benchmark families
A central feature of the term is its polysemy. In transportation forecasting, TransitBench refers to a day-ahead benchmark over MBTA system-wide aggregates. In video generation, it refers to a dataset of endpoint image pairs for transition synthesis. In service-design simulation, it denotes an open sandbox for side-by-side comparison of transit operating paradigms. In broader benchmark-suite proposals, it is also used as an umbrella under which route design, analytics, and assignment modules can be standardized (Nalamalpu et al., 2 Dec 2025, Yang et al., 3 Aug 2025, Yoon et al., 2021, Poudel et al., 27 May 2026).
| Variant | Domain | Central object of evaluation |
|---|---|---|
| MBTA forecasting TransitBench | Public-transit forecasting | Next-day system-wide totals of gated station entries and delay counts |
| VTG TransitBench | Image-to-video generation | Intermediate frames between first and last images for concept blending and scene transition |
| Transit service simulation sandbox | Service design evaluation | Fixed-route, semi-flexible transit, and on-demand microtransit under shared geography and demand |
| Proposed suite-level TransitBench components | TRNDP, analytics, assignment | Route tuples, aggregate-query workloads, and schedule-based user-equilibrium instances |
This terminological divergence matters methodologically. The MBTA benchmark is a supervised forecasting benchmark with RMSE-centered evaluation. The VTG benchmark is an endpoint-conditioned generation benchmark whose evidence is primarily qualitative and preference-based. The simulation sandbox is a discrete-time comparative operations platform. The suite-level proposals are benchmark architectures intended to standardize schemas, constraints, and solver comparisons rather than to define a single fixed dataset. Any technical discussion of TransitBench therefore requires explicit disambiguation by field and task.
2. MBTA day-ahead forecasting benchmark
TransitBench in "Forecasting MBTA Transit Dynamics: A Performance Benchmarking of Statistical and Machine Learning Models" is a reusable, next-day forecasting benchmark distilled from MBTA operational and weather data (Nalamalpu et al., 2 Dec 2025). It defines two system-level, day-ahead prediction targets: next-day subway usage, measured as the total number of gated station entries across all MBTA subway stations, and next-day delay counts, measured as the total number of recorded service delay events across the MBTA system. Both tasks use daily temporal granularity, a one-day prediction horizon, and system-wide aggregates rather than station-level outputs.
The benchmark blends three data sources. MBTA Service Alerts cover 2019-01-01 to 2023-07-29 and yield 1,671 daily records after wrangling. MBTA Ridership: Gated Station Entries spans 2014-01-01 to 2025-06-30 and yields 4,199 daily records. Meteostat weather for Boston Logan International Airport supplies pressure, wind speed, average temperature, and precipitation as a city-level environmental proxy. Preprocessing produces a contiguous daily panel and a derived 36-column feature matrix per task. The core representation is a five-day sliding window in which lagged target values and lagged covariates from the previous five days are concatenated into a fixed-length feature vector. Calendar covariates comprise day of week and season, evaluated in both raw and one-hot forms; numeric features are tested as raw, standardized via StandardScaler, or in "scaled + one-hot" hybrids.
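The five-day sliding window described above can be sketched as follows. This is a minimal illustration rather than the paper's code: the function name `make_window_features`, the oldest-lag-first column ordering, and the toy data are assumptions.

```python
def make_window_features(target, covariates, window=5):
    """Build (X, y) for day-ahead forecasting.

    Each row of X concatenates the previous `window` days of target values
    and covariates (oldest lag first); y is the target on the following day.
    """
    if len(target) != len(covariates):
        raise ValueError("target and covariates must have the same length")
    X, y = [], []
    for t in range(window, len(target)):
        row = []
        for lag in range(window, 0, -1):
            row.append(target[t - lag])      # lagged target value
            row.extend(covariates[t - lag])  # lagged covariates for that day
        X.append(row)
        y.append(target[t])
    return X, y


# Toy data: seven days of totals with a single day-of-week covariate.
target = [10, 12, 11, 13, 15, 14, 16]
covariates = [[d % 7] for d in range(len(target))]
X, y = make_window_features(target, covariates)
```

The first `window` days produce no rows because they lack a full history, so a series of length `n` yields `n - window` examples.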
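TransitBench evaluates ten models for usage and eleven for delays. The shared baselines are Moving Average, Linear Regression, Ridge Regression, Lasso Regression, Poisson Regression, K-Nearest Neighbor Regression, Support Vector Regression, Multilayer Perceptron, Random Forest Regression, and Gradient Boosting Regression. For delay modeling, a self-exciting (Hawkes) point process is added. In the standard exponential-kernel form, with baseline rate $\mu$, excitation weight $\alpha$, and decay rate $\beta$, its conditional intensity is
$$\lambda(t) = \mu + \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)}.$$
The corresponding log-likelihood over an observation window $[0, T]$ is
$$\ell(\mu, \alpha, \beta) = \sum_{i} \log \lambda(t_i) - \int_0^T \lambda(t)\, dt.$$
The branching ratio and half-life are reported as interpretable summaries of clustering strength and decay; under this parametrization they are $n = \alpha/\beta$ and $t_{1/2} = \ln 2 / \beta$.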
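As a rough illustration of the self-exciting model, the sketch below evaluates an exponential-kernel Hawkes intensity and its log-likelihood and derives the two summary quantities. The parametrization (intensity `mu + sum(alpha * exp(-beta * dt))`, so that the branching ratio is `alpha / beta`) is an assumption, as are all function names; the paper's exact form may differ.

```python
import math


def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t given strictly earlier event times."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)


def hawkes_loglik(events, T, mu, alpha, beta):
    """Log-likelihood on [0, T]: sum of log-intensities minus the compensator."""
    log_term = sum(
        math.log(hawkes_intensity(ti, events, mu, alpha, beta)) for ti in events
    )
    compensator = mu * T + sum(
        (alpha / beta) * (1.0 - math.exp(-beta * (T - ti))) for ti in events
    )
    return log_term - compensator


def branching_ratio(alpha, beta):
    """Expected number of direct follow-on events per event (assumed alpha / beta)."""
    return alpha / beta


def half_life(beta):
    """Time for the excitation from one event to decay by half."""
    return math.log(2) / beta
```

With a decay rate of 2.087 per hour, `half_life` gives about 0.33 hours.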
Evaluation uses 100 bootstrap cycles. In each cycle, daily records are resampled with replacement while preserving length, then split chronologically into the first 80% of dates for training and the last 20% for testing. This produces 1,333 train and 333 test days for delays, and 3,355 train and 839 test days for usage. Each model is trained across eight covariate blends and up to four data representations; for a given architecture and bootstrap cycle, both the best-performing combination ("Any-Data RMSE") and the lag-only baseline ("No-Additional-Data RMSE") are recorded. The primary metric is
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2},$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted daily totals over the $N$ test days.
Uncertainty is summarized by the RMSE distribution over the 100 bootstraps, with 95% confidence intervals.
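The evaluation loop can be sketched as below. This is a simplified stand-in, not the benchmark's code: sorting the resampled indices to restore date order, the percentile-style interval, the trivial mean predictor, and the synthetic series are all assumptions made for illustration.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def bootstrap_chrono_split(n, rng, train_frac=0.8):
    """Resample n day indices with replacement, then split by date order.

    Sorting the resampled indices to restore chronology is an assumption.
    """
    idx = sorted(rng.randrange(n) for _ in range(n))
    cut = round(train_frac * n)
    return idx[:cut], idx[cut:]


def percentile_interval(values, lo=0.025, hi=0.975):
    """Empirical interval read off the sorted bootstrap scores."""
    s = sorted(values)
    return s[int(lo * (len(s) - 1))], s[int(hi * (len(s) - 1))]


# Toy weekly-seasonal series standing in for daily totals (not MBTA data).
rng = random.Random(0)
y = [100 + 10 * math.sin(2 * math.pi * d / 7) for d in range(200)]

scores = []
for _ in range(100):  # 100 bootstrap cycles, as in the text
    train, test = bootstrap_chrono_split(len(y), rng)
    mean_train = sum(y[i] for i in train) / len(train)  # trivial mean predictor
    scores.append(rmse([y[i] for i in test], [mean_train] * len(test)))

ci_low, ci_high = percentile_interval(scores)
```

Under this split rule, 1,666 and 4,194 rows reproduce the reported 1,333/333 and 3,355/839 counts. Those row totals are inferred here (daily records minus the five-day window) rather than stated in the text.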
The reported findings are unusually consistent. Ensemble tree methods and MLPs rank at the top across both tasks, with Random Forest and Gradient Boosting the most robust. Poisson and linear baselines underperform on daily aggregates. For the Hawkes model on delays, daily RMSE is 137.43 counts/day and next-event RMSE is 0.670 hours; the learned parameters comprise a baseline rate in events/hour, an excitation weight, and a decay rate $\hat{\beta}=2.087$, implying a branching ratio $\hat{n}\approx 0.90$ and a half-life $\hat{t}_{1/2}\approx 0.33$ hours. The moving-average baseline, however, achieves RMSE < 60 counts/day on daily aggregates, indicating that simple temporal smoothing can outperform a more expressive event model when the target is next-day totals rather than event-time risk.
Feature ablation is one of the benchmark’s main contributions. Day-of-week is the single most valuable non-target covariate for both tasks. Season helps, but less than day-of-week. Weather generally worsens performance when added alone or on top of calendar features, and SHAP analyses place lag-1 target first, day-of-week as the most influential non-target covariate, and weather lags last. The paper attributes the weak weather effect to overfitting, collinearity with calendar patterns, spatial heterogeneity not captured by single-station observations, and mismatch between meteorological variability and aggregate daily behavior. The benchmark’s practitioner guidance follows directly: prioritize short lag histories and calendar signals for day-ahead forecasting, and treat daily, system-wide weather covariates with caution.
3. Endpoint-conditioned benchmark for transition generation
In "Versatile Transition Generation with Image-to-Video Diffusion," TransitBench is a public benchmark for unified transition generation, specifically for concept blending and scene transition (Yang et al., 3 Aug 2025). Here, a video transition means synthesizing a temporally coherent sequence of intermediate frames between a given first frame and a given last frame, guided by descriptive text for the endpoints. The benchmark is introduced to address a gap left by adjacent problem settings such as image morphing, frame interpolation, and ad hoc self-collected transition clips.
TransitBench contains 200 pairs of pictures obtained from web resources, evenly split into 100 concept-blending cases and 100 scene-transition cases. Concept blending involves conceptually different endpoints such as "a lion" and "a truck," with the objective of generating gradual hybrids that merge semantics, shape, appearance, and attributes. Scene transition involves conceptually related endpoint scenes, such as a wooden house in a forest and a wooden house in snow, with the goal of generating natural scene-level transitions while preserving coherence and identity of salient elements. The benchmark is publicly hosted at https://huggingface.co/datasets/mwxely/TransitBench, and the corresponding project page is https://mwxely.github.io/projects/yang2025vtg/index.
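The paper defines a metric suite recommended for TransitBench, even though it does not report a dedicated automatic-metric table for the benchmark itself. Fréchet Inception Distance (FID) is used for visual fidelity:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),$$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated frames.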
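Perceptual Path Length (PPL) measures fluctuation in LPIPS along the interpolation path. CLIP-based temporal compositionality is assessed with TCR and TC-Score, using cosine similarity
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$$
between CLIP embeddings $u$ and $v$.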
Smoothness is adapted from AID by computing LPIPS on adjacent frames, taking the Gini coefficient of those distances, and reporting its inverse so that higher values indicate smoother change.
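The smoothness score and the cosine similarity can be sketched in a few lines. The code assumes the adjacent-frame LPIPS distances have already been computed elsewhere, and it reads "inverse" as the reciprocal of the Gini coefficient with a small stabilizing constant; both points, and all function names, are assumptions rather than the paper's specification.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def smoothness(adjacent_distances, eps=1e-8):
    """Reciprocal Gini of adjacent-frame distances; higher means smoother."""
    return 1.0 / (gini(adjacent_distances) + eps)


# Hypothetical per-step distances: an even transition and an abrupt one.
even_steps = [1.0, 1.1, 0.9, 1.0]
abrupt_steps = [0.1, 0.1, 3.0, 0.1]
```

Evenly spread distances give a Gini coefficient near zero and therefore a high score, while a single abrupt jump concentrates the change in one step and lowers it.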
The paper’s empirical use of TransitBench is primarily qualitative and preference-based rather than tabular. Human evaluation is conducted through Amazon Mechanical Turk across the four VTG tasks, including TransitBench’s concept blending and scene transition. Workers are shown the two endpoint images and five candidate videos, then asked to choose the best video according to semantic coherence, temporal coherence, and fidelity. Each example is rated by 10 workers, and 150 valid responses are collected for each transition task after filtering random clicking. The qualitative findings reported for TransitBench are that VTG yields smoother and more semantically plausible concept blends, and more natural scene-level transitions with higher similarity and coherence than DiffMorpher, TVG, SEINE, DynamiCrafter, and Generative Inbetweening.
Several benchmark limitations are explicit. The dataset size is small at 200 pairs. The paper does not report train/validation/test splits, frame-rate or resolution distributions, category taxonomies, or inter-rater reliability. It does not specify directory structure, filenames, or whether canonical captions are distributed with the dataset, even though endpoint captions are used in experiments. Licensing and usage restrictions are also not stated. These omissions make TransitBench useful as a public testbed for endpoint-conditioned transition synthesis, but less standardized than mature generative benchmarks with official splits and metric code.
4. Open simulation sandbox for transit service design
In "A simulation sandbox to compare fixed-route, semi-flexible-transit, and on-demand microtransit system designs," TransitBench is an open, configurable simulation sandbox for side-by-side benchmarking of three families of transit service design: fixed-route service, semi-flexible transit via MAST, and on-demand microtransit (Yoon et al., 2021). Its purpose is not prediction or data retrieval, but controlled comparative simulation under common geography, demand, vehicle assumptions, and performance measures.
The sandbox standardizes a discrete-time event simulation with configurable simulation length and time step. Vehicles and passengers transition through operational states each second, including walking, waiting, boarding, onboard movement, alighting, and egress. Warm-up logic positions vehicles along the line before passengers arrive, and a scenario manager orchestrates runs across service designs and demand scenarios.
The platform accepts either random demand via Poisson arrivals or exogenous origin-destination lists and maps demand onto a rectangular service region representing a route alignment and catchment. Inputs include route geometry, arrival rate, values of time, walking speed, maximum walking distance, vehicle capacity, vehicle speed, fleet size, dwell time, and design parameters such as stop counts, checkpoint counts, headway or frequency, slack time, backtracking limit, maximum wait, maximum detour rate, and depot set.
The fixed-route component models a single line with stop service, headway operation, dwell times, stopping delay, and capacity constraints. Under Poisson arrivals and a regular headway $h$, expected wait is
$$\mathbb{E}[W] = \frac{h}{2}.$$
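Under the usual assumption of evenly spaced departures and random (Poisson) passenger arrivals, the expected wait is half the headway. The Monte Carlo sketch below checks this numerically. It is a standalone illustration, not the sandbox's code: it ignores capacity limits and denied boardings, and the function name and parameters are hypothetical.

```python
import random


def mean_wait(headway, horizon, rate, seed=0):
    """Average wait until the next departure for Poisson passenger arrivals.

    Departures occur at every multiple of `headway`; arrivals follow a
    Poisson process with intensity `rate` on [0, horizon).
    """
    rng = random.Random(seed)
    t, waits = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival gap
        if t >= horizon:
            break
        next_departure = (t // headway + 1) * headway
        waits.append(next_departure - t)
    return sum(waits) / len(waits)


# Ten-minute headway, roughly one passenger per minute over 10,000 minutes.
estimate = mean_wait(headway=10.0, horizon=10_000.0, rate=1.0)
```

With a 10-minute headway the estimate should land close to 5 minutes; adding capacity constraints would push it higher.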
A Tirachini-style decomposition of total hourly cost can be used externally for line-level stop-spacing and frequency selection, but the sandbox itself remains a simulation environment rather than an optimizer.
The semi-flexible component extends Mobility Allowance Shuttle Transit (MAST). Vehicles follow a trunk route between mandatory checkpoints but may deviate to serve virtual stops, provided the travel time plus inserted deviation time fits within slack time between checkpoints. The model includes walking to meeting points along deviated paths and constrains capacity, maximum wait, maximum detour rate, backtracking, and deviation distance. Candidate insertion positions are counted as a function of the number of stops on the route.
Preliminary tests reported in the paper state that the walking extension raises acceptance rates by 23.6–88.0% relative to original MAST, albeit with higher weighted travel times and VMT.
The on-demand microtransit component uses a pickup-and-delivery insertion heuristic subject to capacity, pickup time windows, precedence, and detour constraints. For each request and each vehicle, feasible pickup and drop-off insertion pairs are evaluated and the minimum-incremental-travel-time option is selected; if none satisfies the constraints, the request is rejected. The paper frames the underlying structure through a canonical DARP formulation, but the sandbox solves the problem myopically rather than through full MILP optimization.
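A stripped-down version of this myopic insertion logic might look as follows. It is a sketch under simplifying assumptions, not the sandbox's implementation: straight-line distance stands in for travel time, pickup time windows are omitted, and the detour constraint is reduced to a cap on added route length (`max_extra`). All names are hypothetical.

```python
import math


def route_length(points):
    """Total straight-line length of a sequence of points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def max_load(stops):
    """Peak on-board load along a route of (point, load_delta) stops."""
    load = peak = 0
    for _, delta in stops:
        load += delta
        peak = max(peak, load)
    return peak


def best_insertion(route, pickup, dropoff, capacity, max_extra):
    """Cheapest feasible (extra_length, new_route), or None if infeasible.

    route[0] is the vehicle's current position and is never displaced.
    """
    base = route_length([p for p, _ in route])
    best = None
    for i in range(1, len(route) + 1):          # pickup position
        for j in range(i + 1, len(route) + 2):  # drop-off strictly after pickup
            cand = route[:i] + [(pickup, +1)] + route[i:]
            cand = cand[:j] + [(dropoff, -1)] + cand[j:]
            if max_load(cand) > capacity:
                continue
            extra = route_length([p for p, _ in cand]) - base
            if extra <= max_extra and (best is None or extra < best[0]):
                best = (extra, cand)
    return best


def dispatch(vehicles, pickup, dropoff, capacity=4, max_extra=5.0):
    """Assign the request to the cheapest vehicle, or return None to reject."""
    options = []
    for vid, route in vehicles.items():
        result = best_insertion(route, pickup, dropoff, capacity, max_extra)
        if result is not None:
            options.append((result[0], vid, result[1]))
    if not options:
        return None
    _, vid, new_route = min(options, key=lambda o: (o[0], o[1]))
    vehicles[vid] = new_route
    return vid


vehicles = {"A": [((0, 0), 0)], "B": [((10, 0), 0)]}
first = dispatch(vehicles, pickup=(1, 0), dropoff=(3, 0))       # near vehicle A
second = dispatch(vehicles, pickup=(50, 50), dropoff=(60, 60))  # too far for both
```

Because each request is committed as soon as it arrives, the result depends on request order, which is the sense in which the heuristic is myopic compared with a full MILP solution.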
The B63 case study in Brooklyn illustrates the benchmark. The modeled corridor has length 13.12 km, width 1.6 km, 57 current stops, average bus speed approximately 11.41 km/h, and dwell time 20 seconds per stop. Demand scenarios are defined as hourly passenger arrival rates. Five system designs are compared: existing fixed-route, externally optimized fixed-route, semi-flexible MAST with checkpoint variants, and on-demand microtransit. The main trade-offs are clear. Fixed-route consistently serves the most passengers. On-demand microtransit yields the lowest average weighted travel time, with examples around 47.6–54.4 minutes at demand levels up to 200 passengers/hour and about 53.5 minutes in a further scenario, but it has the highest VMT as demand rises. Fixed-route optimized reduces VMT substantially relative to the existing design, with examples around 137.7–228.8 miles versus about 335.7 miles, but can increase user times at low demand. Semi-flexible service occupies an intermediate position, with higher weighted travel times than fixed in many settings because deviations and slack inflate wait and in-vehicle time.
Within the transportation literature, this TransitBench variant is significant because it operationalizes mode-comparison under aligned assumptions rather than evaluating fixed-route, flexible-route, and on-demand systems in separate methodological silos. Its decision guidance is correspondingly conditional: dense corridors favor fixed-route service; low-demand and spatially dispersed settings favor microtransit if the objective prioritizes access and user experience; intermediate cases may justify semi-flexible service if checkpoint density and deviation budgets are tuned carefully.
5. TRNDP benchmarking and AlphaTransit
AlphaTransit contributes a city-scale, simulation-based benchmark for the Transit Route Network Design Problem (TRNDP) and explicitly frames its benchmark, metrics, and evaluation protocol as a TransitBench component (Poudel et al., 27 May 2026). The primary evaluation instance is Bloomington, Indiana, with a cross-city generalization check on Laval, Quebec. The Bloomington network covers approximately 152.3 km² and is represented as a planar, undirected graph processed from local street infrastructure and used as a directed edge list for learning and simulation. It contains 143 nodes and 243 bidirectional edges, excludes interstate highway I-69, and assumes a uniform free-flow speed.
Demand is derived from U.S. Census LODES (2022) and scaled to approximate peak-hour mixed purposes. The OD matrix is obtained by scaling commuting flows by 150% to include non-commuting trips and converting to hourly demand using an 11% peak-hour share. Transit demand is then defined by a modal split parameter that sets the share of each OD flow assigned to transit.
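The demand scaling can be illustrated with a few lines of arithmetic. This sketch reads "scaling by 150%" as multiplying by 1.5, which is an interpretation; the function name, the toy flows, and the 0.3 modal split used for the mixed case are illustrative assumptions, not values from the paper.

```python
def hourly_transit_demand(commute_od, modal_split, trip_scale=1.5, peak_share=0.11):
    """Scale commuting OD flows to peak-hour transit demand.

    commute_od maps (origin, destination) pairs to commuting flows;
    modal_split is the fraction of each flow assigned to transit.
    """
    if not 0.0 <= modal_split <= 1.0:
        raise ValueError("modal_split must lie in [0, 1]")
    factor = trip_scale * peak_share * modal_split
    return {pair: flow * factor for pair, flow in commute_od.items()}


# Toy commuting flows between three hypothetical zones.
commute_od = {("A", "B"): 1000, ("B", "C"): 400}
full = hourly_transit_demand(commute_od, modal_split=1.0)   # all trips on transit
mixed = hourly_transit_demand(commute_od, modal_split=0.3)  # 0.3 is illustrative only
```

With these inputs a flow of 1,000 commuters becomes 165 peak-hour transit trips under full transit demand.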
The benchmark uses two scenarios: mixed demand, in which only part of the OD demand is assigned to transit, and full transit demand. Sequential decisions construct a fixed number of routes, each a simple path with a bounded number of nodes, and all routes start at a specified transit-center hub. The simulator is UXsim, a mesoscopic traffic simulator with Newell’s car-following model, advanced at a fixed time step for a fixed number of steps. Buses have 40-passenger capacity and 60 s dwell time per stop, and vehicles are simulated in platoons.
Evaluation is standardized around service quality, operational burden, and network structure. The formal objective maximizes expected simulation-derived performance over the route design tuple, with a reduced, route-only reward. Frequencies are assigned deterministically through a capacity-based max-load projection.
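One common way to realize a capacity-based max-load rule is to size each route's frequency so that its busiest segment fits within bus capacity. The sketch below shows that textbook rule only. The paper's exact projection is not reproduced here, and the function name, the minimum-frequency floor, and the sample loads are assumptions.

```python
import math


def max_load_frequency(segment_loads, bus_capacity=40, min_frequency=1):
    """Buses per hour so that the peak hourly segment load fits within capacity."""
    peak = max(segment_loads, default=0)
    return max(min_frequency, math.ceil(peak / bus_capacity))


# Hypothetical hourly passenger loads on the segments of one route.
frequency = max_load_frequency([30, 95, 60])
```

Because the rule is deterministic, the search only needs to choose routes; the frequency follows from the loads each route attracts.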
TransitBench reporting adopts seven metrics: service rate, average wait time, average journey time of served riders, transfer rate, route efficiency in passengers per kilometer, fleet size, and bus utilization.
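AlphaTransit couples Monte Carlo Tree Search with a neural policy-value network. At each state $s$, the admissible action set $\mathcal{A}(s)$ is enforced through invalid-action masking, and actions are chosen by PUCT-based selection. In its standard AlphaZero-style form, with value estimate $Q$, prior $P$, visit count $N$, and exploration constant $c_{\text{puct}}$, the rule is
$$a^{*} = \arg\max_{a \in \mathcal{A}(s)} \left[ Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \right].$$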
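The interaction between masking and PUCT selection can be sketched as below. This follows the generic AlphaZero-style rule rather than the paper's code: the admissible set here is just "unvisited neighbors of the route's last node, subject to a length cap", priors are renormalized over valid actions, and every name and constant is an assumption.

```python
import math


def admissible_actions(adjacency, route, max_nodes):
    """Neighbors of the route's last node that keep the route a simple path."""
    if len(route) >= max_nodes:
        return []
    visited = set(route)
    return [v for v in adjacency[route[-1]] if v not in visited]


def puct_select(Q, P, N, valid, c_puct=1.5):
    """Return the valid action maximizing Q + U under the PUCT rule.

    Invalid actions are masked by restricting the argmax to `valid` and
    renormalizing the priors over that set.
    """
    if not valid:
        raise ValueError("no admissible action")
    prior_mass = sum(P[a] for a in valid)
    sqrt_total = math.sqrt(sum(N[a] for a in valid))

    def score(a):
        prior = P[a] / prior_mass if prior_mass > 0 else 1.0 / len(valid)
        return Q[a] + c_puct * prior * sqrt_total / (1 + N[a])

    return max(valid, key=score)


# Toy graph: the route has grown from hub 0 to node 2.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
valid = admissible_actions(adjacency, route=[0, 2], max_nodes=4)
choice = puct_select(
    Q={1: 0.2, 3: 0.5}, P={1: 0.7, 3: 0.3}, N={1: 10, 3: 2}, valid=valid
)
```

In this example node 3 wins despite its lower prior, because it has both a higher value estimate and far fewer visits than node 1.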
The policy-value network uses a shared GATv2 backbone with Jumping Knowledge aggregation, a node-wise actor MLP, and a graph-level critic MLP. Training uses a fixed budget of environment steps per scenario, and the final comparisons use a fixed number of MCTS simulations per decision. Evaluation reports mean and standard deviation over 10 seeds.
On the Bloomington benchmark, AlphaTransit attains the highest service rate in both demand settings. Under mixed demand, it operates with fleet size 80, and the paper additionally reports bus utilization, journey time, transfer rate, route efficiency in pax/km, and wait time; the service rate corresponds to a 9.9% gain over End-to-End RL and a 2.5% gain over Pure MCTS. Under full transit demand, AlphaTransit also records the best wait time, best route efficiency, and highest bus utilization, with fleet size 267, giving 11.4% and 11.2% service-rate gains over End-to-End RL and Pure MCTS respectively. In Laval, it is evaluated without fine-tuning.
The benchmark proposal is as important as the algorithm. It specifies canonical schemas for graph, demand, hub, route tuple output, and simulator configuration; fixes search budgets and seed reporting; and recommends baseline wrappers spanning heuristics, metaheuristics, search-only, RL-only, and real-world GTFS references. The paper also makes explicit the benchmark’s present assumptions: single-hub starts, static peak-hour OD, deterministic frequency projection, and UXsim-based mesoscopic fidelity.
6. Transit analytics workloads: TTCTR and AcumM
"New structures to solve aggregated queries for trips over public transportation networks" maps a public-transport analytics workload into a TransitBench-style benchmark suite by formalizing two compact data structures, TTCTR and AcumM, for complementary query classes (Brisaboa et al., 2019). The work begins from a conceptual model tailored to transit regularities: each vehicle journey follows the fixed stop sequence of its line, and all passengers on the same journey share the same arrival time at each stop. This enables temporal information to be attached to journeys rather than to individual passengers.
The common offer representation stores, for each line, a stop array, accumulated per-stop times, a schedule array of journey start times, and an inverted index from stops to lines. On top of this, TTCTR encodes user trips as concatenated sequences of stage-start symbols and a final-stop symbol followed by a terminator. The resulting sequence is indexed with a compressed suffix array, and a Wavelet Matrix over journey identifiers supports time filtering.
TTCTR is optimized for user-centric pattern queries. It supports exact OD counts, line-specific OD restrictions, time filtering through journey-id range counts, and broader switch-pattern analysis. The cost of OD counting depends on the pattern length and the spatial alphabet size; time-filtered OD adds a Wavelet Matrix range count. On the Madrid synthetic workload, with 10 million trips, the compressed suffix array occupies 20.58–27.13 MB, the Wavelet Matrix 37.98–42.35 MB, and common offer structures 7,616 KiB, for a total of roughly 66–77 MB. Measured OD query latencies over 10,000 random queries are 6.9–22.9 $\mu$s for basic OD counts, and about 29–196 $\mu$s when end-line partitioning and day filters are applied.
AcumM serves the complementary aggregate workload. For each line, it stores two integer matrices: one for boardings on each journey at each stop, and one for alightings. Two-dimensional prefix sums then enable constant-time rectangle queries. Writing $P$ for the accumulated matrix indexed by journey $j$ and stop $s$ (with $P$ taken as zero when either index is zero), the count over journeys $j_1,\dots,j_2$ and stops $s_1,\dots,s_2$ is
$$P[j_2][s_2] - P[j_1 - 1][s_2] - P[j_2][s_1 - 1] + P[j_1 - 1][s_1 - 1].$$
Occupancy on a journey between a stop and the next one is computed as cumulative boardings minus cumulative alightings up to that stop. Differential encoding around the middle column reduces storage by about half in the reported experiments. For the Madrid workload, the get-on accumulated matrix occupies 11,189 KB and the get-off matrix another 11,189 KB; differential versions reduce each to 5,596 KB. Query times over 20,000 tests are 76–221 ns depending on query type, which is more than an order of magnitude faster than TTCTR for analogous aggregate counts.
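The prefix-sum idea behind AcumM can be shown on a toy line. The sketch below is a plain-Python illustration with zero-padded accumulated matrices and 0-based, inclusive query ranges. It does not model the differential encoding, and the function names and sample counts are assumptions.

```python
def prefix_sums(matrix):
    """Zero-padded 2D prefix sums: P[j+1][s+1] sums matrix[0..j][0..s]."""
    rows, cols = len(matrix), len(matrix[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(rows):
        for s in range(cols):
            P[j + 1][s + 1] = matrix[j][s] + P[j][s + 1] + P[j + 1][s] - P[j][s]
    return P


def rect_sum(P, j1, j2, s1, s2):
    """Sum over journeys j1..j2 and stops s1..s2 (inclusive) in O(1)."""
    return P[j2 + 1][s2 + 1] - P[j1][s2 + 1] - P[j2 + 1][s1] + P[j1][s1]


def occupancy(P_on, P_off, j, s):
    """Passengers on board journey j when it leaves stop s."""
    return rect_sum(P_on, j, j, 0, s) - rect_sum(P_off, j, j, 0, s)


# Toy line with two journeys (rows) and three stops (columns).
boardings = [[5, 3, 0], [2, 4, 1]]
alightings = [[0, 2, 6], [0, 1, 6]]
P_on, P_off = prefix_sums(boardings), prefix_sums(alightings)

boarded_first_two_stops = rect_sum(P_on, 0, 1, 0, 1)  # both journeys, stops 0..1
load_after_stop_1 = occupancy(P_on, P_off, j=0, s=1)  # journey 0 leaving stop 1
```

Each query touches four cells regardless of the rectangle size, which is why aggregate counts run in constant time once the matrices are built.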
The benchmark significance of TTCTR and AcumM lies in the clean separation of workloads. TTCTR is the benchmark baseline when exact OD and trip-pattern retrieval matter. AcumM is the benchmark baseline when the workload is dominated by counts, occupancies, sliding windows, and top-$k$ stop or line usage. The paper’s experimental crossover point is explicit: AcumM is 10×–50× faster than TTCTR on simple boarding aggregates, while TTCTR uniquely supports exact user-level OD and switch-pattern queries.
7. Schedule-based user-equilibrium modules and recurrent limitations
"Enforcing Priority in Schedule-based User Equilibrium Transit Assignment" contributes a rigorous formulation family for benchmarked schedule-based assignment with denied boarding, and explicitly positions these formulations as TransitBench modules (Feng et al., 12 Jan 2026). The modeling objective is to enforce two behavioral rules without explicit dynamic loading: continuance priority, under which onboard passengers retain seats, and first-come-first-served boarding among waiting passengers. On an event-activity graph, boarding priority is encoded through the available capacity on inbound arcs to departure events. When that capacity is insufficient, passengers on lower-priority arcs cannot reliably board and must queue for later runs.
The paper provides two complementarity formulations. The route-level NCP reproduces the implicit-priority concept associated with Nguyen et al. (2001), while the refined arc-level NCP enforces priority directly at each event. Generalized route costs are then augmented by arc penalties, and equilibrium is solved either through an MPEC based on the Fischer–Burmeister function, which in its standard form is
$$\phi(a, b) = \sqrt{a^2 + b^2} - a - b$$
and vanishes exactly when $a \ge 0$, $b \ge 0$, and $ab = 0$, or by a projected-gradient implicit method with Armijo line search. The paper proves equilibrium existence under positivity and viability assumptions, shows that multiple equilibria may arise in the route-level model, and argues that the refined arc-level model removes behaviorally questionable early-departure artifacts by pricing only the relevant non-common bottleneck arcs.
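The Fischer–Burmeister function turns a complementarity condition into an equation, which is what makes the MPEC reformulation possible. The sketch below uses the standard form with one common sign convention. The residual helper and its pairing of a flow-like variable with a cost-like slack are illustrative assumptions, not the paper's formulation.

```python
import math


def fischer_burmeister(a, b):
    """Zero exactly when a >= 0, b >= 0 and a * b == 0."""
    return math.sqrt(a * a + b * b) - a - b


def complementarity_residual(xs, slacks):
    """Largest violation of x >= 0, slack >= 0, x * slack == 0 over all pairs."""
    return max(abs(fischer_burmeister(x, f)) for x, f in zip(xs, slacks))


# Hypothetical pairs: flow 0 with positive slack, and positive flow with zero slack.
satisfied = complementarity_residual([0.0, 2.0], [3.0, 0.0])
violated = complementarity_residual([1.0], [1.0])  # both positive: not complementary
```

A solver can drive this residual toward zero instead of handling the inequality and product conditions separately.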
Benchmark instances include a Nguyen toy network, a transit version of Sioux Falls, and a Hong Kong university commute case with bus, MTR, and elevator bottlenecks. The Hong Kong results are especially benchmark-relevant: elevator queues of 6–14 minutes emerge at total demand 600–700, bus users shift departures earlier by 2 minutes at demand 600 and 5 minutes at demand 700, and late arrivals rise from 24 to 48 as demand increases from 400 to 700. The comparison to an explicit dynamic-loading baseline is framed behaviorally rather than only numerically: the implicit-priority arc-level formulation avoids grouped-loading artifacts and converges more reliably at high demand.
Across the various TransitBench usages, several recurring limitations are explicit rather than incidental. The MBTA forecasting benchmark aggregates to daily, system-wide totals, omits holiday and special-event covariates, and relies on weather from a single airport station; it also uses a delay dataset labeled "deprecated" at source, though the paper reports empirical fidelity after cleaning (Nalamalpu et al., 2 Dec 2025). The VTG benchmark is small, lacks official splits, does not standardize captions in the paper text, and does not specify licensing (Yang et al., 3 Aug 2025). The AlphaTransit benchmark adopts single-hub starts, static peak-hour OD, and deterministic frequency projection, which simplify the TRNDP but narrow the operational design space (Poudel et al., 27 May 2026). TTCTR is static and therefore rebuild-oriented, while AcumM is a batch-built warehouse structure unless incremental prefix-sum maintenance is added (Brisaboa et al., 2019).
Taken together, these works define TransitBench less as a single benchmark than as a benchmark style. Its recurring attributes are explicit task definitions, shared preprocessing and simulation logic, transparent evaluation metrics, reproducibility goals, and methodological comparability across heterogeneous baselines. Its recurring open problems are equally clear: standardizing splits and licenses, extending from aggregated to higher-resolution targets, incorporating richer exogenous covariates and operational constraints, and reconciling benchmark openness with the domain-specific structure that makes each variant technically meaningful.