TacBench: Dual-Domain Benchmarking

Updated 4 July 2026

TacBench is a benchmark suite used in two domains, offering standardized evaluation protocols for both touch-based robotics and soccer tactical analysis.
The tactile benchmark evaluates self-supervised touch representations across diverse sensors and tasks using frozen encoder probing and detailed performance metrics.
The soccer benchmark quantifies multi-player trajectory forecasting and tactical event recognition by measuring geometric, structural, and semantic fidelity with advanced metrics.

TacBench is a benchmark name currently used in two distinct arXiv contexts. In robotics, TacBench denotes a standardized benchmarking suite for vision-based tactile sensing introduced with Sparsh, a family of self-supervised touch representation models (Higuera et al., 2024). In sports analytics, TacBench denotes a benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer, used to evaluate the generative framework GenTac (Rao et al., 13 Apr 2026). The two benchmarks share a title but differ in modality, task structure, and evaluation protocol: the tactile benchmark emphasizes frozen-backbone probing across perception and manipulation, whereas the soccer benchmark evaluates stochastic generation in continuous trajectories and discrete semantic events.

1. Name, domain, and disambiguation

“TacBench” does not refer to a single benchmark in the literature covered here. One usage is tactile and robot-centric; the other is tactic-centric and sport-analytic. The name therefore needs a domain qualifier in citation and discussion.

TacBench usage	Domain	Core components
Sparsh TacBench	Vision-based tactile sensing	Six tasks, frozen encoder probing, multi-sensor evaluation
GenTac TacBench	Open-play soccer tactics	Trajectory forecasting, event recognition, conditional generation

In the tactile setting, TacBench is designed to evaluate any representation, whether pre-trained or learned from scratch, on a common set of tasks and metrics. In the soccer setting, TacBench is designed to provide a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. The shared title therefore masks two different benchmark philosophies: one centered on reusable tactile backbones, the other on stochastic tactical modeling (Higuera et al., 2024).

2. TacBench for vision-based tactile sensing

In “Sparsh: Self-supervised touch representations for vision-based tactile sensing,” TacBench is motivated by fragmentation in tactile perception models. The stated problem is that each new task or sensor typically drives its own end-to-end model, requiring expensive labelled data such as forces, slip, and poses, together with extensive per-sensor tuning. TacBench is consequently designed as a standardized, sensor-agnostic evaluation suite spanning “touch-centric” problems from raw physical quantities through higher-level perception to full manipulation policies (Higuera et al., 2024).

The tactile benchmark is coupled to a large unlabeled tactile corpus for self-supervised pre-training. The SSL pool comprises approximately $661\text{ k}$ total tactile images, of which $462\text{ k}$ are used for SSL. Its sources are Touch-Slide, described as new and containing $180\text{ k}$ DIGIT slides, YCB-Slide with $180\text{ k}$ DIGIT images, Touch-and-Go with $220\text{ k}$ GelSight discrete images, and ObjectFolder 2.0 with $81\text{ k}$ images. The covered sensor families are DIGIT, GelSight 2017, and GelSight Mini. The data explicitly spans different lighting rigs, gel color and texture, and camera intrinsics across sensor instances. SSL splits are $70\%$ train and $30\%$ validation for probe monitoring.

The benchmark’s labeled subsets are task-specific and sensor-specific. Typical train, validation, and test splits are $66/17/17\%$. The stated goal is to promote broad generalization across sensor form factors, including lighting and gel markings, and across robot platforms. A notable design choice is that the benchmark is intended to compare both custom end-to-end models and transferable representations under the same probing regime.

3. Task structure and probing protocol in the tactile benchmark

For tasks $T1$ through $T5$, TacBench uses a frozen encoder front-end and trains a lightweight decoder or “probe.” The probe comprises a single cross-attention layer, with embedding dimension and head count fixed by the benchmark, followed by a 2-layer MLP head with task-dependent output size. Dense tasks use a DPT decoder on intermediate tokens, and the policy task reuses the diffusion-policy training loop while substituting its visual encoder with a frozen Sparsh encoder (Higuera et al., 2024).

Task	Objective	Evaluation metric
T1	Predict instantaneous normal and shear forces	RMSE
T2	Classify slip vs no-slip	F1-score
T3	Estimate relative pose	Accuracy
T4	Predict two-finger grasp success or failure	Accuracy
T5	Recognize 20 textile types	Accuracy
T6	Predict joint commands in bead-maze imitation learning	Demonstration-trajectory MSE; real rollout distance

The force-estimation task $T1$ takes two tactile images concatenated in the channel dimension, corresponding to a short temporal history, and predicts the normal and shear force components. Training is posed as regression, and evaluation uses root-mean-squared error, which in its standard form over $N$ test samples with predicted force $\hat{\mathbf{f}}_i$ and measured force $\mathbf{f}_i$ reads:

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\lVert \hat{\mathbf{f}}_i - \mathbf{f}_i \rVert_2^2}$

Labeled force data are provided per sensor for DIGIT and GelSight Mini and are divided into train, validation, and test splits.
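As a minimal sketch of the force-estimation metric, the snippet below computes RMSE over a batch of force vectors. It assumes three components per sample (one normal, two shear) and a Euclidean error averaged over samples; the function name `force_rmse` and the toy values are hypothetical, and the benchmark's own per-axis convention may differ.

```python
import math

def force_rmse(pred, target):
    """RMSE over N force vectors; each vector is a sequence of equal length."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for p, t in zip(pred, target):
        # Squared Euclidean error of one sample.
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return math.sqrt(total / len(pred))

# Toy data (assumed layout: shear_x, shear_y, normal).
pred = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
target = [(0.0, 0.0, 0.0), (0.0, 3.0, 2.0)]
rmse = force_rmse(pred, target)
print(rmse)
```

Because the squared errors are summed per sample before averaging, a single large shear error dominates the score, which is the usual behaviour of RMSE.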

The dense variant $T1A$, Force-Field Visualization, produces per-pixel normal force and a shear vector field over the elastomer from a pair of tactile frames. The loss is unsupervised reprojection for depth together with photometric or SSIM and smoothness terms for flow, and no ground truth is required.

Slip detection $T2$ also uses a short temporal history of tactile frames, predicts a binary slip label, and trains with cross-entropy for slip plus concurrent regression of forces with MAE to improve feature learning. The benchmark reports the F1-score, defined in the usual way from precision $P$ and recall $R$:

$F_1 = \frac{2\,P\,R}{P + R}$

The dataset, in which only part of the samples are labeled as slip, is split into train and test portions.
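A short sketch of the slip-detection metric follows. It computes the F1-score for binary labels, assuming slip is encoded as `1` and no-slip as `0`; the function name and the toy labels are illustrative only.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class, computed from raw label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(score)
```

F1 is a natural choice when slip events are a minority of the samples, since plain accuracy would reward always predicting no-slip.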

Relative pose estimation $T3$ predicts three discrete probability distributions, one for each binned component of the relative pose in the sensor frame. The components are discretized into a fixed number of bins, some of them log-spaced. The loss is the sum of three cross-entropies, and the evaluation metric is classification accuracy. The dataset is divided into train, validation, and test examples.
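The loss and metric for the binned pose task can be sketched as below. This is an illustration under assumptions, not the benchmark's code: each pose component is taken to have its own categorical distribution over bins, and accuracy is counted per component (the source does not say whether all three components must be correct at once). All names and bin counts are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true bin under one categorical distribution."""
    return -math.log(probs[label])

def pose_loss(dists, labels):
    """Sum of one cross-entropy per pose component."""
    return sum(cross_entropy(p, y) for p, y in zip(dists, labels))

def pose_accuracy(dists_batch, labels_batch):
    """Fraction of (sample, component) pairs whose argmax bin is correct."""
    correct = total = 0
    for dists, labels in zip(dists_batch, labels_batch):
        for p, y in zip(dists, labels):
            predicted_bin = max(range(len(p)), key=p.__getitem__)
            correct += int(predicted_bin == y)
            total += 1
    return correct / total

# Three components, three bins each (toy sizes).
dists = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
loss = pose_loss(dists, [0, 1, 2])
acc = pose_accuracy([dists, dists], [[0, 1, 2], [0, 1, 0]])
print(loss, acc)
```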

Grasp-stability prediction $T4$ predicts whether a two-finger grasp will succeed or fail from “before” and “during” GelSight 2017 marker images. Textile classification $T5$ predicts one of $20$ textile types from a short GelSight 2017 video clip, for which uniform guessing over balanced classes gives a chance level of $5\%$. The bead-maze policy task $T6$ takes tactile history and proprioception and predicts a horizon of joint commands. Its metrics are held-out trajectory-matching error and real-robot rollout distance before bead drop or failure.

4. Results and design principles in the tactile benchmark

The central reported benchmarking result is that, under limited label budgets, Sparsh SSL improves over end-to-end models on average across the benchmark tasks. The best SSL variants are reported as Sparsh (DINO) and Sparsh (IJEPA), with DINO excelling at physics-based tasks and IJEPA excelling at semantic tasks. On average, DINO outperforms IJEPA (Higuera et al., 2024).

Representative comparisons make the task dependence explicit. For force estimation on DIGIT, the reported RMSE progresses from E2E to DINO to DINOv2. For force estimation on GelSight, DINO improves on E2E. For slip detection, the F1-score improves from E2E to VJEPA with full data, and VJEPA is also reported with a reduced label budget. For relative pose estimation, accuracy improves from E2E to DINO at full data, with a further figure reported for reduced labels. For grasp stability, E2E is compared with IJEPA at full data and at reduced labels. For textile classification, results are reported for E2E, MAE, and DINO. For the bead-maze task, rollout distance improves from E2E to DINO, and a figure for IJEPA is reported as well.

The accompanying design principles are explicit. Learning in latent feature space through self-distillation and JEPA is reported to filter out gel-marker and lighting noise better than pixel reconstruction with MAE. A short temporal history of tactile frames is stated to match human slip-detection timescales and to suffice for force and slip cues. Background subtraction of the static elastomer image is reported to improve cross-sensor robustness. Frozen-encoder probing with cross-attention and MLP is used to isolate representation quality from decoder capacity. Covering three major sensor families and six diverse tasks is described as encouraging reusable, general tactile backbones rather than bespoke models.

5. TacBench for open-play soccer tactics

In “GenTac: Generative Modeling and Forecasting of Soccer Tactics,” TacBench is defined as a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. Its scope is derived from public full-match tracking data from three professional leagues together with broadcast-reconstructed trajectories. The benchmark has two components: TacBench-Trajectory, which uses clips containing all tracked entities at a fixed frame rate, and TacBench-Event, which uses clips annotated with one tactical event subtype each, with subtypes grouped into five macro-types (Rao et al., 13 Apr 2026).

The trajectory-forecasting task is formulated as generation of future trajectories from an observed history, with optional conditioning. Five conditioning variants are specified: unconditioned, opponent-conditioned, team-conditioned, league-conditioned, and objective-conditioned. Tactical event recognition maps an observed or generated trajectory segment to both macro-type and subtype labels. Two settings are distinguished: Event Grounding from observed history and Event Forecasting from generated futures.

The soccer data sources are full-match tracking data from Metrica Sports, SkillCorner (A-League), and Sportec DFL, together with broadcast-derived clips from SoccerNet-GSR and clips from SoccerFactory plus refinement. All trajectories are resampled to a common frame rate. Each frame contains the player coordinates and one ball coordinate in meters on a fixed-size pitch, with origin at the centre and missing values marked explicitly. Coordinates are spatially normalized and rounded to a fixed precision. Preprocessing includes temporal interpolation to the common frame rate, linear imputation for short gaps, global anomaly detection and piecewise interpolation, and exponential-moving-average smoothing.
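Two of the preprocessing steps, linear imputation of short gaps and exponential-moving-average smoothing, can be sketched on a single coordinate track as follows. This is an illustrative reading only: missing values are assumed to be `NaN`, and `max_gap` and `alpha` are hypothetical parameters, not values taken from the benchmark.

```python
import math

def impute_short_gaps(values, max_gap=5):
    """Linearly fill NaN runs of at most max_gap frames that have valid neighbours."""
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if not math.isnan(out[i]):
            i += 1
            continue
        start = i
        while i < n and math.isnan(out[i]):
            i += 1
        end = i  # first valid index after the gap, or n if the track ends in a gap
        gap = end - start
        if start > 0 and end < n and gap <= max_gap:
            left, right = out[start - 1], out[end]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out

def ema_smooth(values, alpha=0.5):
    """Exponential moving average; frames that are still NaN are passed through."""
    out, prev = [], None
    for v in values:
        if math.isnan(v):
            out.append(v)
            continue
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out

nan = float("nan")
track = [0.0, nan, nan, 3.0, 4.0]
filled = impute_short_gaps(track, max_gap=2)
smoothed = ema_smooth(filled, alpha=0.5)
print(filled, smoothed)
```

Long gaps and gaps at the start or end of a track are deliberately left unfilled here, which mirrors the idea that only short gaps are safe to interpolate linearly.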

The event taxonomy is organized as follows.

Macro-type	Classes
Build-Up	Build
Transition	Ball Win; Progression
Threat	Goal; Shot Off Target; Shot Saved; Clearance; Defended
Set Piece	Corner; Free Kick; Penalty; Throw-in; Kick-off; Goal-Kick
Interruption	Stoppage/Foul/Substitution

Events are extracted by buffering official labels with a pre-window and a post-window, followed by collision-aware truncation. A minimum clip length is enforced, and goals, shots, and free-kicks receive an extra extension. An important nuance is that the taxonomy lists more subtypes than are used in the final benchmark, because some extremely rare classes may be merged. This is a definitional detail rather than a contradiction in the benchmark description.
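The windowing step can be sketched as below. The interpretation of "collision-aware truncation" is an assumption here: a window is cut so that it neither reaches the next labeled event nor overlaps the previously kept clip. The function name and every parameter value (`pre`, `post`, `min_len`) are hypothetical.

```python
def extract_clips(event_frames, n_frames, pre, post, min_len):
    """Return (start, end) frame windows, end exclusive, one per kept event."""
    events = sorted(event_frames)
    clips = []
    prev_end = 0
    for i, t in enumerate(events):
        # Buffer the label with a pre-window and a post-window.
        start = max(0, t - pre)
        end = min(n_frames, t + post + 1)
        # Assumed truncation rule: stay clear of the previous clip and the next event.
        start = max(start, prev_end)
        if i + 1 < len(events):
            end = min(end, events[i + 1])
        # Enforce the minimum clip length.
        if end - start >= min_len:
            clips.append((start, end))
            prev_end = end
    return clips

clips = extract_clips([10, 14, 40], n_frames=50, pre=5, post=5, min_len=6)
print(clips)
```

With these toy values the first two events are close together, so their windows are shortened to avoid overlap, while the isolated third event keeps its full buffer.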

6. Evaluation protocols, splits, and GenTac usage in soccer

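All soccer TacBench metrics are evaluated on held-out test splits. For trajectory forecasting, $K$ independent rollouts are sampled, and both best-of-$K$ (“min”) and average (“avg”) performance are reported. Geometric accuracy is measured with ADE and FDE, which in their standard form over $N$ agents and a horizon of $T$ steps, with predicted position $\hat{\mathbf{x}}_{n,t}$ and ground-truth position $\mathbf{x}_{n,t}$, read:

$\mathrm{ADE} = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\lVert \hat{\mathbf{x}}_{n,t} - \mathbf{x}_{n,t} \rVert_2, \qquad \mathrm{FDE} = \frac{1}{N}\sum_{n=1}^{N}\lVert \hat{\mathbf{x}}_{n,T} - \mathbf{x}_{n,T} \rVert_2$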
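A minimal sketch of the geometric metrics and of the "min" and "avg" reporting over several rollouts is given below. It assumes 2-D positions indexed as `[agent][time]` and takes the minimum of ADE and of FDE independently across rollouts; some evaluations instead report the FDE of the best-ADE rollout, and the source does not say which convention applies. All names are hypothetical.

```python
import math

def ade_fde(pred, truth):
    """pred and truth are indexed [agent][t] -> (x, y). Returns (ADE, FDE)."""
    dists = [[math.dist(p, g) for p, g in zip(pa, ga)]
             for pa, ga in zip(pred, truth)]
    n_agents = len(dists)
    horizon = len(dists[0])
    ade = sum(sum(row) for row in dists) / (n_agents * horizon)
    fde = sum(row[-1] for row in dists) / n_agents
    return ade, fde

def summarize_rollouts(rollouts, truth):
    """Best-of-K ("min") and mean ("avg") ADE and FDE over sampled rollouts."""
    scores = [ade_fde(r, truth) for r in rollouts]
    ades = [s[0] for s in scores]
    fdes = [s[1] for s in scores]
    return {
        "min_ade": min(ades), "avg_ade": sum(ades) / len(ades),
        "min_fde": min(fdes), "avg_fde": sum(fdes) / len(fdes),
    }

truth = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
rollout_a = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 2.0), (1.0, 2.0)]]  # shifted by 1 m
rollout_b = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 4.0)]]  # one late miss
summary = summarize_rollouts([rollout_a, rollout_b], truth)
print(summary)
```

The toy rollouts show why both views matter: the second rollout has the lower ADE but the worse final-step error, so the best rollout differs between the two metrics.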

Collective structure consistency is assessed using Stretch Index, Surface Area, Team Width, Team Length, Frobenius Norm, Centroid Displacement, and the Kuramoto Order Parameter, with reporting in terms of absolute difference between prediction and ground truth. Semantic event recognition reports type-level top-1 and top-3 accuracy and macro-averaged Recall@1 and Recall@3, and subtype-level top-1, top-3, and top-5 accuracy with macro-averaged Recall@1, Recall@3, and Recall@5. Offense and defense analytical metrics are computed using an EPV grid from FoTD (LaurieOnTracking), including Off-Ball Expected Threat, Depth Threat, Width Threat, Defensive Shape Disruption, and Defensive Dominant Region (Rao et al., 13 Apr 2026).
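Several of the collective-structure measures have common textbook definitions, sketched here under stated assumptions: the x-axis runs along the pitch length, the stretch index is the mean player distance to the team centroid, and the Kuramoto order parameter is computed from player heading angles in radians. The benchmark's exact definitions may differ, Surface Area (a convex-hull area) is omitted for brevity, and all function names are hypothetical.

```python
import cmath
import math

def centroid(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def stretch_index(points):
    """Mean distance of players from the team centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def team_length(points):
    """Extent along x (assumed to be the pitch-length axis)."""
    xs = [p[0] for p in points]
    return max(xs) - min(xs)

def team_width(points):
    """Extent along y (assumed to be the pitch-width axis)."""
    ys = [p[1] for p in points]
    return max(ys) - min(ys)

def kuramoto_order(headings):
    """|mean of exp(i*theta)|: 1 for aligned headings, near 0 for opposed ones."""
    return abs(sum(cmath.exp(1j * h) for h in headings) / len(headings))

def abs_deviation(metric, predicted, observed):
    """Absolute difference of a structure metric between prediction and ground truth."""
    return abs(metric(predicted) - metric(observed))

observed = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
predicted = [(0.0, 0.0), (6.0, 0.0), (6.0, 3.0), (0.0, 3.0)]
print(stretch_index(observed), abs_deviation(team_length, predicted, observed))
```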

The trajectory splits are specified source by source: by match for Metrica, SkillCorner, and DFL, and by clip for SoccerNet-GSR and SoccerFactory, each divided into train, validation, and test. The event benchmark uses a fixed train, validation, and test ratio over all instances and subtypes. Other sports are also included for cross-domain generalization: basketball, American football, and ice hockey.

GenTac uses TacBench to demonstrate four capabilities. For geometric and structural accuracy in unconditioned forecasting, min-ADE and min-FDE grow with the forecast horizon under a fixed causal window, and structural deviations are reported over the same horizon. For team and league style simulation, team conditioning for Auckland FC reduces surface area error, and league conditioning lowers short-horizon ADE. For controllable counterfactuals, offensive guidance raises off-ball expected threat, depth threat, and width threat; defensive guidance lowers threat metrics, raises disruption, and increases dominant region. For event grounding, type-level top-1 and top-3 accuracy and subtype-level top-1, top-3, and top-5 accuracy are reported. Event forecasting from generated futures is described as yielding a distribution over events and thereby capturing tactical branching.
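The top-k accuracy and macro-averaged Recall@k used for event recognition can be sketched as follows, assuming each clip comes with one score per class and a single true label. The function names and the toy scores are hypothetical.

```python
from collections import defaultdict

def topk_hits(scores, labels, k):
    """For each sample, whether the true class is among the k highest-scoring classes."""
    hits = []
    for s, y in zip(scores, labels):
        ranked = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        hits.append(y in ranked)
    return hits

def topk_accuracy(scores, labels, k):
    hits = topk_hits(scores, labels, k)
    return sum(hits) / len(hits)

def macro_recall_at_k(scores, labels, k):
    """Recall@k computed per class, then averaged with equal class weight."""
    per_class = defaultdict(list)
    for hit, y in zip(topk_hits(scores, labels, k), labels):
        per_class[y].append(hit)
    return sum(sum(h) / len(h) for h in per_class.values()) / len(per_class)

scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
labels = [0, 0, 1, 2]
top1 = topk_accuracy(scores, labels, 1)
macro1 = macro_recall_at_k(scores, labels, 1)
top2 = topk_accuracy(scores, labels, 2)
print(top1, macro1, top2)
```

In the toy data the frequent class is always recognized while the two rare classes are missed at k = 1, so macro-averaged recall falls below plain accuracy. That gap is the reason to report both when event classes are imbalanced.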

Taken together, the two TacBench benchmarks exemplify a shared benchmarking strategy under a reused name: both define standardized evaluation protocols for models that must generalize across heterogeneous contexts, but they do so in sharply different technical regimes. In tactile sensing, TacBench isolates reusable touch representations across sensors and downstream tasks. In soccer tactics, TacBench quantifies the geometric, structural, and semantic fidelity of stochastic multi-agent forecasts.


References (2)

Sparsh: Self-supervised touch representations for vision-based tactile sensing (2024)

GenTac: Generative Modeling and Forecasting of Soccer Tactics (2026)
