TacBench: Dual-Domain Benchmarking
- TacBench is a benchmark suite used in two domains, offering standardized evaluation protocols for both touch-based robotics and soccer tactical analysis.
- The tactile benchmark evaluates self-supervised touch representations across diverse sensors and tasks using frozen encoder probing and detailed performance metrics.
- The soccer benchmark quantifies multi-player trajectory forecasting and tactical event recognition by measuring geometric, structural, and semantic fidelity with advanced metrics.
TacBench is a benchmark name currently used in two distinct arXiv contexts. In robotics, TacBench denotes a standardized benchmarking suite for vision-based tactile sensing introduced with Sparsh, a family of self-supervised touch representation models (Higuera et al., 2024). In sports analytics, TacBench denotes a benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer, used to evaluate the generative framework GenTac (Rao et al., 13 Apr 2026). The two benchmarks share a title but differ in modality, task structure, and evaluation protocol: the tactile benchmark emphasizes frozen-backbone probing across perception and manipulation, whereas the soccer benchmark evaluates stochastic generation in continuous trajectories and discrete semantic events.
1. Name, domain, and disambiguation
“TacBench” does not refer to a single benchmark in the literature covered here. One usage is tactile and robot-centric; the other is tactic-centric and sport-analytic. The name therefore needs a domain qualifier in citation and discussion.
| TacBench usage | Domain | Core components |
|---|---|---|
| Sparsh TacBench | Vision-based tactile sensing | Six tasks, frozen encoder probing, multi-sensor evaluation |
| GenTac TacBench | Open-play soccer tactics | Trajectory forecasting, event recognition, conditional generation |
In the tactile setting, TacBench is designed to evaluate any representation, whether pre-trained or learned from scratch, on a common set of tasks and metrics. In the soccer setting, TacBench is designed to provide a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. The shared title therefore masks two different benchmark philosophies: one centered on reusable tactile backbones, the other on stochastic tactical modeling (Higuera et al., 2024).
2. TacBench for vision-based tactile sensing
In “Sparsh: Self-supervised touch representations for vision-based tactile sensing,” TacBench is motivated by fragmentation in tactile perception models. The stated problem is that each new task or sensor typically drives its own end-to-end model, requiring expensive labelled data such as forces, slip, and poses, together with extensive per-sensor tuning. TacBench is consequently designed as a standardized, sensor-agnostic evaluation suite spanning “touch-centric” problems from raw physical quantities through higher-level perception to full manipulation policies (Higuera et al., 2024).
The tactile benchmark is coupled to a large unlabeled tactile corpus for self-supervised pre-training. The SSL pool comprises approximately total tactile images, of which are used for SSL. Its sources are Touch-Slide, described as new and containing DIGIT slides, YCB-Slide with DIGIT images, Touch-and-Go with GelSight discrete images, and ObjectFolder 2.0 with images. The covered sensor families are DIGIT, GelSight 2017, and GelSight Mini. The data explicitly spans different lighting rigs, gel color and texture, and camera intrinsics across sensor instances. SSL splits are train and validation for probe monitoring.
The benchmark’s labeled subsets are task-specific and sensor-specific. Typical train, validation, and test splits are . The stated goal is to promote broad generalization across sensor form factors, including lighting and gel markings, and across robot platforms. A notable design choice is that the benchmark is intended to compare both custom end-to-end models and transferable representations under the same probing regime.
3. Task structure and probing protocol in the tactile benchmark
For tasks T1 through T5, TacBench uses a frozen encoder front-end and trains a lightweight decoder or “probe.” The probe comprises a single cross-attention layer with dimension 1, 2 heads, and 3 layer, followed by a 2-layer MLP head with task-dependent output size. Dense tasks use a DPT decoder on intermediate tokens, and the policy task reuses the diffusion-policy training loop while substituting its visual encoder with a frozen Sparsh encoder (Higuera et al., 2024).
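The probing protocol can be illustrated with a minimal forward-pass sketch in plain Python. It is not the Sparsh implementation: it uses a single attention head, a ReLU between the two MLP layers, random weights, and hypothetical sizes (`dim`, `n_tokens`, `n_outputs`), all of which are assumptions made for illustration. The `tokens` list stands in for the output of a frozen encoder, which is never updated.

```python
import math
import random

def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(tokens, query, w_key, w_value):
    """Single-head cross-attention: one learnable query attends over the
    frozen encoder's tokens and returns their attention-weighted average."""
    keys = [matvec(w_key, t) for t in tokens]
    values = [matvec(w_value, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP (ReLU assumed); the output size depends on the task."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w1, x), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]

def random_matrix(rows, cols, rng):
    """Small random weights, used here only so the sketch runs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, n_tokens, n_outputs = 8, 5, 3             # hypothetical sizes
tokens = random_matrix(n_tokens, dim, rng)      # stand-in for frozen encoder output
query = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
w_key, w_value = random_matrix(dim, dim, rng), random_matrix(dim, dim, rng)
w1, b1 = random_matrix(dim, dim, rng), [0.0] * dim
w2, b2 = random_matrix(n_outputs, dim, rng), [0.0] * n_outputs

pooled = attentive_pool(tokens, query, w_key, w_value)
prediction = mlp_head(pooled, w1, b1, w2, b2)
print(len(pooled), len(prediction))
```

Only `query`, `w_key`, `w_value`, and the MLP weights would be trained in such a setup, which is what lets the probe isolate representation quality from decoder capacity.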
| Task | Objective | Evaluation metric |
|---|---|---|
| T1 | Predict instantaneous normal and shear forces | RMSE |
| T2 | Classify slip vs no-slip | F1 score |
| T3 | Estimate relative pose 5 | Accuracy |
| T4 | Predict two-finger grasp success or failure | Accuracy |
| T5 | Recognize 20 textile types | Accuracy |
| T6 | Predict 6 joint commands in bead-maze imitation learning | Demonstration-trajectory MSE; real rollout distance |
The force-estimation task (T1) takes two tactile images concatenated in the channel dimension, 8, corresponding to approximately 9 history, and predicts 0. Training uses 1 regression, and evaluation uses root-mean-squared error:
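$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{\mathbf{f}}_i - \mathbf{f}_i \right\rVert_2^2}$$

Here, in the standard form of the metric, $N$ is the number of evaluated samples and $\hat{\mathbf{f}}_i$ and $\mathbf{f}_i$ are the predicted and ground-truth force vectors.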
Typical dataset size is 3 samples per sensor for DIGIT and GelSight Mini, split 4.
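As a small illustration of the force-estimation metric, the following sketch computes RMSE over a batch of force vectors. The function name `rmse` and the sample values are hypothetical, and averaging the squared Euclidean norm per sample is an assumption; a per-axis RMSE would be an equally plausible reporting convention.

```python
import math

def rmse(predictions, targets):
    """Root-mean-squared error over a batch of force vectors.

    Each element is a sequence such as (fx, fy, fz). The error of one
    sample is the squared Euclidean norm of the difference.
    """
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty batches of equal length")
    total = 0.0
    for pred, true in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return math.sqrt(total / len(predictions))

# Hypothetical predictions and labels, in newtons.
preds = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
labels = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(rmse(preds, labels))
```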
The dense variant 5, Force-Field Visualization, produces per-pixel normal force and a shear vector field over the elastomer from single-frame pairs 6. Its outputs are 7 and 8. The loss is unsupervised reprojection for depth together with photometric or SSIM and smoothness terms for flow, and no ground truth is required.
Slip detection (T2) also uses the 0 history 1, predicts 2, and trains with cross-entropy for slip plus concurrent regression of 3 forces with MAE to improve feature learning. The benchmark reports
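$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$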
The dataset has 5 samples, with 6 slip, and is split into 7 train and 8 test.
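Because slip samples are only a fraction of the dataset, the F1 score is more informative than plain accuracy. A minimal sketch of the computation follows; the function name `f1_score` and the toy labels are hypothetical, and slip is assumed to be the positive class.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for a binary task, with `positive` marking the slip class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 = slip, 0 = no slip.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score(y_true, y_pred))
```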
Relative pose estimation (T3) predicts three discrete probability distributions over binned 0, 1, and 2 in the sensor frame. The binning is specified as 3 into 4 log-spaced bins and 5 into 6 bins. The loss is the sum of three cross-entropies, and the evaluation metric is classification accuracy. The dataset contains 7 samples, with approximately 8 train, 9 validation, and 0 test examples.
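The classification formulation of pose estimation can be sketched as follows. The bin ranges and bin counts are hypothetical placeholders rather than the benchmark's values, and the handling of signed offsets is omitted; the sketch only shows log-spaced binning of a magnitude and a loss formed as the sum of three cross-entropies.

```python
import math

def log_spaced_edges(lo, hi, n_bins):
    """Return n_bins + 1 edges spaced uniformly in log space (requires lo > 0)."""
    ratio = (hi / lo) ** (1.0 / n_bins)
    return [lo * ratio ** k for k in range(n_bins + 1)]

def bin_index(value, edges):
    """Index of the bin containing value, clamped to the first or last bin."""
    for k in range(len(edges) - 1):
        if value < edges[k + 1]:
            return k
    return len(edges) - 2

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]

def pose_loss(logits_per_axis, targets_per_axis):
    """Sum of one cross-entropy term per pose component."""
    return sum(cross_entropy(l, t) for l, t in zip(logits_per_axis, targets_per_axis))

# Hypothetical range and bin count for one translation component.
edges = log_spaced_edges(0.1, 10.0, 4)
target_bin = bin_index(0.5, edges)

# Uniform logits give log(n_bins) per component.
loss = pose_loss([[0.0] * 4, [0.0] * 4, [0.0] * 4], [target_bin, 0, 2])
print(target_bin, loss)
```

Treating pose as classification makes accuracy a natural metric, at the cost of resolution limited by the bin width.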
Grasp-stability prediction (T4) predicts whether a two-finger grasp will succeed or fail from “before” and “during” GelSight 2017 marker images. Textile classification (T5) predicts one of 20 textile types from a 4- to 5-frame GelSight 2017 video clip, with chance level stated as 6. The bead-maze policy task (T6) takes tactile history and proprioception and predicts a horizon of joint commands 8. Its metrics are held-out trajectory-matching error over 9 steps and real-robot rollout distance before bead drop or failure.
4. Results and design principles in the tactile benchmark
The central reported benchmarking result is that, under limited labels with 0 to 1 label budgets, Sparsh SSL improves over end-to-end models by a mean relative gain of 2 across 3 through 4. The best SSL variants are reported as Sparsh (DINO) and Sparsh (IJEPA), with DINO excelling at physics-based tasks and IJEPA excelling at semantic tasks. On average, DINO beats IJEPA by 5 (Higuera et al., 2024).
Representative numbers make the task dependence explicit. For force estimation on DIGIT, the reported RMSE progresses from E2E 6 to DINO 7 to DINOv2 8. For force estimation on GelSight, E2E 9 improves to DINO 0. For slip detection, the F1 score improves from E2E 2 to VJEPA 3 with full data, and VJEPA reaches 4 at 5 labels. For relative pose estimation, accuracy improves from E2E 6 to DINO 7 at full data, with 8 at 9 labels. For grasp stability, E2E 0 becomes IJEPA 1 at full data and 2 at 3 labels. For textile classification, E2E 4, MAE 5, and DINO 6 are reported. For the bead-maze task, rollout distance improves from E2E 7 to DINO 8, a reported 9, while IJEPA reaches 0.
The accompanying design principles are explicit. Learning in latent feature space through self-distillation and JEPA is reported to filter out gel-marker and lighting noise better than pixel reconstruction with MAE. A small temporal history of approximately 1 via 2 is stated to match human slip-detection timescales and to suffice for force and slip cues. Background subtraction of the static elastomer image is reported to improve cross-sensor robustness. Frozen-encoder probing with cross-attention and MLP is used to isolate representation quality from decoder capacity. Covering three major sensor families and six diverse tasks is described as encouraging reusable, general tactile backbones rather than bespoke models.
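Two of these input-side choices, background subtraction and a short temporal history stacked along the channel axis, can be sketched with nested lists standing in for C x H x W images. The helper names, the tiny image size, and the ordering of current and past frames are assumptions for illustration only.

```python
def subtract_background(image, background):
    """Pixel-wise difference between a C x H x W image and the static
    no-contact image of the same shape."""
    return [
        [[p - b for p, b in zip(row, bg_row)] for row, bg_row in zip(chan, bg_chan)]
        for chan, bg_chan in zip(image, background)
    ]

def stack_history(current, past):
    """Concatenate two C x H x W images along the channel axis (2C x H x W)."""
    return current + past

def constant_image(value, channels=3, height=2, width=2):
    """Hypothetical helper: an image filled with a single value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

background = constant_image(10)   # static elastomer image without contact
frame_now = constant_image(14)    # current tactile frame
frame_past = constant_image(11)   # earlier frame from the short history

model_input = stack_history(
    subtract_background(frame_now, background),
    subtract_background(frame_past, background),
)
print(len(model_input), model_input[0][0][0], model_input[3][0][0])
```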
5. TacBench for open-play soccer tactics
In “GenTac: Generative Modeling and Forecasting of Soccer Tactics,” TacBench is defined as a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. Its scope is derived from public full-match tracking data from three professional leagues together with broadcast-reconstructed trajectories. The benchmark has two components: TacBench-Trajectory, which uses 3 clips with all 4 entities at 5 FPS, and TacBench-Event, which uses 6 clips annotated with one of 7 tactical event subtypes grouped into five macro-types (Rao et al., 13 Apr 2026).
The trajectory-forecasting task is formulated as generation from history 9 with optional conditioning 00. Five conditioning variants are specified: unconditioned, opponent-conditioned, team-conditioned, league-conditioned, and objective-conditioned. Tactical event recognition maps an observed or generated trajectory segment to both macro-type and subtype labels. Two settings are distinguished: Event Grounding from observed history and Event Forecasting from generated futures.
The soccer data sources are Metrica Sports with 01 matches, SkillCorner (A-League) with 02 matches, Sportec DFL with 03 matches, SoccerNet-GSR with 04 broadcast-derived clips of 05 each, and SoccerFactory plus refinement with 06 clips. All trajectories are resampled to 07 FPS. Each frame contains 08 player coordinates and one ball coordinate in meters on a 09 pitch, with origin at the centre and missing values recorded as 10. Spatial normalization is to 11, rounded to 12. Preprocessing includes temporal interpolation to 13 FPS, linear imputation for short gaps, global anomaly detection and piecewise interpolation, and exponential-moving-average smoothing.
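Three of the preprocessing steps named above (spatial normalization, linear imputation of short gaps, and exponential-moving-average smoothing) can be sketched as below. The pitch size of 105 m by 68 m, the `[0, 1]` target range, the `max_gap` threshold, the smoothing factor `alpha`, and the use of `None` for missing values are all assumptions, not values taken from the benchmark.

```python
def normalize_xy(x_m, y_m, pitch_length=105.0, pitch_width=68.0):
    """Map centre-origin metres to [0, 1] x [0, 1] (pitch size is assumed)."""
    return (x_m / pitch_length + 0.5, y_m / pitch_width + 0.5)

def impute_short_gaps(series, max_gap=5):
    """Linearly interpolate runs of None that are at most max_gap long
    and bounded by valid samples on both sides; other gaps are left as-is."""
    out = list(series)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            start = i
            while i < n and out[i] is None:
                i += 1
            gap = i - start
            if start > 0 and i < n and gap <= max_gap:
                left, right = out[start - 1], out[i]
                for k in range(gap):
                    out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
        else:
            i += 1
    return out

def ema_smooth(series, alpha=0.5):
    """Exponential moving average over a gap-free, non-empty series."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(normalize_xy(0.0, 0.0))
print(impute_short_gaps([0.0, None, None, 3.0]))
print(ema_smooth([0.0, 1.0, 1.0]))
```

Imputing before smoothing keeps the moving average from being reset at every missing sample.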
The event taxonomy is organized as follows.
| Macro-type | Classes |
|---|---|
| Build-Up | Build |
| Transition | Ball Win; Progression |
| Threat | Goal; Shot Off Target; Shot Saved; Clearance; Defended |
| Set Piece | Corner; Free Kick; Penalty; Throw-in; Kick-off; Goal-Kick |
| Interruption | Stoppage/Foul/Substitution |
Events are extracted by buffering official labels with a 14 pre-window and 15 post-window, followed by collision-aware truncation. Minimum clip length is 16 frames or 17. Goals, shots, and free-kicks receive an extra 18 extension. An important nuance is that the macro-type counts sum to 19 subtypes, but only 20 are used in the final benchmark because some extremely rare classes may be merged. This is a definitional detail rather than a contradiction in the benchmark description.
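The buffering step can be sketched as follows. The source does not define collision-aware truncation in detail, so this sketch adopts one possible reading: a clip is cut so that it never extends across a neighbouring event's frame. The window lengths, the minimum length, and the function name `build_event_clips` are hypothetical.

```python
def build_event_clips(event_frames, pre, post, min_len):
    """Return inclusive (start, end) frame ranges, one per retained event.

    Each event is buffered by `pre` frames before and `post` frames after,
    then truncated so it does not cross the previous or next event
    (an assumed interpretation of collision-aware truncation).
    """
    events = sorted(event_frames)
    clips = []
    for k, frame in enumerate(events):
        start = max(frame - pre, 0)
        end = frame + post
        if k > 0:
            start = max(start, events[k - 1] + 1)
        if k + 1 < len(events):
            end = min(end, events[k + 1] - 1)
        if end - start + 1 >= min_len:
            clips.append((start, end))
    return clips

# Hypothetical event frames and window sizes.
clips = build_event_clips([100, 120], pre=50, post=30, min_len=10)
print(clips)
```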
6. Evaluation protocols, splits, and GenTac usage in soccer
All soccer TacBench metrics are evaluated on held-out test splits. For trajectory forecasting, 21 independent rollouts are sampled, and both best-of-22 (“min”) and average (“avg”) performance are reported. Geometric accuracy is measured with ADE and FDE:
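$$\mathrm{ADE} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\lVert \hat{\mathbf{x}}_i^{t} - \mathbf{x}_i^{t} \right\rVert_2, \qquad \mathrm{FDE} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{\mathbf{x}}_i^{T} - \mathbf{x}_i^{T} \right\rVert_2$$

Here, in the standard form of these metrics, $N$ is the number of tracked entities, $T$ is the forecast horizon in frames, and $\hat{\mathbf{x}}_i^{t}$ and $\mathbf{x}_i^{t}$ are the predicted and ground-truth positions of entity $i$ at step $t$.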
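The "min" and "avg" reporting over multiple rollouts can be sketched as below. The data layout (`[T][N]` lists of `(x, y)` tuples) and the function names are assumptions, as is taking the minimum independently for ADE and FDE rather than selecting a single best rollout for both.

```python
import math

def ade_fde(pred, truth):
    """Average and final displacement error for one rollout.

    pred and truth are [T][N] lists of (x, y) positions.
    """
    steps, agents = len(truth), len(truth[0])

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    ade = sum(
        dist(pred[t][i], truth[t][i]) for t in range(steps) for i in range(agents)
    ) / (steps * agents)
    fde = sum(dist(pred[-1][i], truth[-1][i]) for i in range(agents)) / agents
    return ade, fde

def min_avg_over_rollouts(rollouts, truth):
    """Best-of-K and mean ADE/FDE over K sampled rollouts."""
    scores = [ade_fde(r, truth) for r in rollouts]
    ades = [s[0] for s in scores]
    fdes = [s[1] for s in scores]
    return {
        "min_ade": min(ades),
        "avg_ade": sum(ades) / len(ades),
        "min_fde": min(fdes),
        "avg_fde": sum(fdes) / len(fdes),
    }

# Hypothetical example: one agent, two steps, two rollouts.
truth = [[(0.0, 0.0)], [(1.0, 0.0)]]
perfect = [[(0.0, 0.0)], [(1.0, 0.0)]]
shifted = [[(0.0, 1.0)], [(1.0, 1.0)]]
report = min_avg_over_rollouts([perfect, shifted], truth)
print(report)
```

The gap between "min" and "avg" indicates how much of the reported accuracy depends on sampling a lucky rollout.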
Collective structure consistency is assessed using Stretch Index, Surface Area, Team Width, Team Length, Frobenius Norm, Centroid Displacement, and the Kuramoto Order Parameter, with reporting in terms of absolute difference between prediction and ground truth. Semantic event recognition reports type-level top-1 and top-3 accuracy and macro-averaged Recall@1 and Recall@3, and subtype-level top-1, top-3, and top-5 accuracy with macro-averaged Recall@1, Recall@3, and Recall@5. Offense and defense analytical metrics are computed using an EPV grid from FoTD (LaurieOnTracking), including Off-Ball Expected Threat, Depth Threat, Width Threat, Defensive Shape Disruption, and Defensive Dominant Region (Rao et al., 13 Apr 2026).
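Several of the collective-structure metrics have simple geometric forms, sketched below under commonly used definitions: stretch index as the mean player distance to the team centroid, width and length as coordinate ranges, and the Kuramoto order parameter as the magnitude of the mean unit phasor. The axis convention (length along x, width along y) and the use of movement headings as phases are assumptions; Surface Area and the Frobenius Norm are omitted for brevity.

```python
import cmath
import math

def centroid(points):
    """Mean position of a list of (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def stretch_index(points):
    """Mean Euclidean distance of players to the team centroid."""
    cx, cy = centroid(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)

def team_length(points):
    """Extent along the x axis (assumed goal-to-goal direction)."""
    xs = [p[0] for p in points]
    return max(xs) - min(xs)

def team_width(points):
    """Extent along the y axis (assumed touchline-to-touchline direction)."""
    ys = [p[1] for p in points]
    return max(ys) - min(ys)

def kuramoto_order(headings):
    """Magnitude of the mean unit phasor; 1 means fully aligned headings."""
    return abs(sum(cmath.exp(1j * theta) for theta in headings)) / len(headings)

def abs_diff(metric, predicted, observed):
    """Absolute difference of a structure metric between prediction and truth."""
    return abs(metric(predicted) - metric(observed))

# Hypothetical four-player shapes.
observed = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
predicted = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(stretch_index(observed), abs_diff(team_length, predicted, observed))
print(kuramoto_order([0.0, 0.0, 0.0]), kuramoto_order([0.0, math.pi]))
```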
The trajectory splits are specified source by source: Metrica 24 matches for train, validation, and test; SkillCorner 25 matches; DFL 26 matches; SoccerNet-GSR 27 clips; and SoccerFactory 28 clips. The event benchmark uses a 29 ratio with 30 train, 31 validation, and 32 test clips, for 33 total instances across all 34 subtypes. Other sports are also included for cross-domain generalization: basketball with 35 clips, American football with 36 clips, and ice hockey with 37 clips.
GenTac uses TacBench to demonstrate four capabilities. For geometric and structural accuracy in unconditioned forecasting, min-ADE grows from 38 at 39 to 40 at 41, and min-FDE from 42 to 43 with a 44 causal window. Structural deviations are reported as 45 and 46 over 47. For team and league style simulation, team conditioning for Auckland FC reduces surface area error from 48 to 49 at 50, and league conditioning lowers short-horizon ADE by 51 at 52. For controllable counterfactuals, offensive guidance raises off-ball expected threat by 53, depth threat by 54, and width threat by 55; defensive guidance lowers threat metrics, raises disruption by 56, and increases dominant region by 57. For event grounding, type top-1 accuracy is 58, type top-3 is 59, subtype top-1 is 60, subtype top-3 is 61, and subtype top-5 is 62. Event forecasting from generated futures is described as yielding a distribution over events and thereby capturing tactical branching.
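The event-recognition metrics combine top-k accuracy with macro-averaged Recall@k, which weights every class equally regardless of frequency. A minimal sketch with hypothetical class scores is given below; the function names and the toy data are illustrative only.

```python
def topk_hit(scores, label, k):
    """True if the true label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]

def topk_accuracy(all_scores, labels, k):
    """Fraction of samples whose true label is in the top k."""
    hits = sum(topk_hit(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)

def macro_recall_at_k(all_scores, labels, k):
    """Per-class Recall@k, averaged over the classes present in labels."""
    per_class = {}
    for scores, label in zip(all_scores, labels):
        hits, total = per_class.get(label, (0, 0))
        per_class[label] = (hits + int(topk_hit(scores, label, k)), total + 1)
    return sum(h / t for h, t in per_class.values()) / len(per_class)

# Hypothetical scores over three classes: class 0 is frequent, class 1 is rare.
all_scores = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],  # true class is 1, ranked second
]
labels = [0, 0, 0, 1]
print(topk_accuracy(all_scores, labels, 1), macro_recall_at_k(all_scores, labels, 1))
print(topk_accuracy(all_scores, labels, 2), macro_recall_at_k(all_scores, labels, 2))
```

In this toy case top-1 accuracy is high while macro Recall@1 is lower, which is the kind of gap that macro averaging is meant to expose when rare tactical events are misclassified.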
Taken together, the two TacBench benchmarks exemplify a shared benchmarking strategy under a reused name: both define standardized evaluation protocols for models that must generalize across heterogeneous contexts, but they do so in sharply different technical regimes. In tactile sensing, TacBench isolates reusable touch representations across sensors and downstream tasks. In soccer tactics, TacBench quantifies the geometric, structural, and semantic fidelity of stochastic multi-agent forecasts.