
TacBench: Dual-Domain Benchmarking

Updated 4 July 2026
  • TacBench is a benchmark suite used in two domains, offering standardized evaluation protocols for both touch-based robotics and soccer tactical analysis.
  • The tactile benchmark evaluates self-supervised touch representations across diverse sensors and tasks using frozen encoder probing and detailed performance metrics.
  • The soccer benchmark quantifies multi-player trajectory forecasting and tactical event recognition by measuring geometric, structural, and semantic fidelity with advanced metrics.

TacBench is a benchmark name currently used in two distinct arXiv contexts. In robotics, TacBench denotes a standardized benchmarking suite for vision-based tactile sensing introduced with Sparsh, a family of self-supervised touch representation models (Higuera et al., 2024). In sports analytics, TacBench denotes a benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer, used to evaluate the generative framework GenTac (Rao et al., 13 Apr 2026). The two benchmarks share a title but differ in modality, task structure, and evaluation protocol: the tactile benchmark emphasizes frozen-backbone probing across perception and manipulation, whereas the soccer benchmark evaluates stochastic generation in continuous trajectories and discrete semantic events.

1. Name, domain, and disambiguation

A common source of ambiguity is that “TacBench” does not refer to a single benchmark across the literature represented here. One usage is tactile and robot-centric; the other is tactic-centric and sport-analytic. This suggests that the benchmark name requires domain qualification in citation and discussion.

| TacBench usage | Domain | Core components |
|---|---|---|
| Sparsh TacBench | Vision-based tactile sensing | Six tasks, frozen encoder probing, multi-sensor evaluation |
| GenTac TacBench | Open-play soccer tactics | Trajectory forecasting, event recognition, conditional generation |

In the tactile setting, TacBench is designed to evaluate any representation, whether pre-trained or learned from scratch, on a common set of tasks and metrics. In the soccer setting, TacBench is designed to provide a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. The shared title therefore masks two different benchmark philosophies: one centered on reusable tactile backbones, the other on stochastic tactical modeling (Higuera et al., 2024).

2. TacBench for vision-based tactile sensing

In “Sparsh: Self-supervised touch representations for vision-based tactile sensing,” TacBench is motivated by fragmentation in tactile perception models. The stated problem is that each new task or sensor typically drives its own end-to-end model, requiring expensive labelled data such as forces, slip, and poses, together with extensive per-sensor tuning. TacBench is consequently designed as a standardized, sensor-agnostic evaluation suite spanning “touch-centric” problems from raw physical quantities through higher-level perception to full manipulation policies (Higuera et al., 2024).

The tactile benchmark is coupled to a large unlabeled tactile corpus for self-supervised pre-training. The SSL pool comprises approximately 661k total tactile images, of which 462k are used for SSL. Its sources are Touch-Slide, described as new and containing 180k DIGIT slides, YCB-Slide with 180k DIGIT images, Touch-and-Go with 220k GelSight discrete images, and ObjectFolder 2.0 with 81k images. The covered sensor families are DIGIT, GelSight 2017, and GelSight Mini. The data explicitly spans different lighting rigs, gel color and texture, and camera intrinsics across sensor instances. SSL splits are 70% train and 30% validation for probe monitoring.

The benchmark’s labeled subsets are task-specific and sensor-specific. Typical train, validation, and test splits are 66/17/17%. The stated goal is to promote broad generalization across sensor form factors, including lighting and gel markings, and across robot platforms. A notable design choice is that the benchmark is intended to compare both custom end-to-end models and transferable representations under the same probing regime.

3. Task structure and probing protocol in the tactile benchmark

For tasks T1 through T5, TacBench uses a frozen encoder front-end and trains a lightweight decoder or “probe.” The probe comprises a single cross-attention layer with dimension […], […] heads, and […] layer, followed by a 2-layer MLP head with task-dependent output size. Dense tasks use a DPT decoder on intermediate tokens, and the policy task reuses the diffusion-policy training loop while substituting its visual encoder with a frozen Sparsh encoder (Higuera et al., 2024).
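The frozen-encoder probing idea can be illustrated with a minimal sketch. It assumes a simplified single-head attentive pooling in which one learned query attends over the frozen encoder's output tokens, with no learned key or value projections, followed by a two-layer MLP head. All names and sizes (`attentive_pool`, `mlp_head`, `dim`, `hidden_dim`, and so on) are hypothetical and chosen for illustration; they are not the benchmark's actual implementation or hyperparameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def attentive_pool(tokens, query):
    """Pool frozen encoder tokens with one learned query (single head).

    Simplification (assumption): keys and values are the tokens themselves,
    with no learned key/value projections.
    """
    dim = len(query)
    scores = [sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
              for token in tokens]
    weights = softmax(scores)
    pooled = [sum(w * token[d] for w, token in zip(weights, tokens))
              for d in range(dim)]
    return pooled, weights


def mlp_head(features, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU between the layers."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w1, features), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


# Hypothetical sizes for illustration only; not the benchmark's values.
rng = random.Random(0)
dim, hidden_dim, out_dim, num_tokens = 8, 16, 3, 5

# Stand-in for the output of a frozen encoder: these values are never updated.
frozen_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

# Only the probe parameters below would be trained.
query = [rng.gauss(0, 1) for _ in range(dim)]
w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
b2 = [0.0] * out_dim

pooled, attention = attentive_pool(frozen_tokens, query)
prediction = mlp_head(pooled, w1, b1, w2, b2)  # task-dependent output size
```

Because the encoder tokens are treated as fixed inputs, any difference in downstream performance between two encoders under this kind of probe is attributable mainly to the representations rather than to decoder capacity.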

| Task | Objective | Evaluation metric |
|---|---|---|
| T1 | Predict instantaneous normal and shear forces | RMSE |
| T2 | Classify slip vs no-slip | F1-score |
| T3 | Estimate relative pose […] | Accuracy |
| T4 | Predict two-finger grasp success or failure | Accuracy |
| T5 | Recognize 20 textile types | Accuracy |
| T6 | Predict […] joint commands in bead-maze imitation learning | Demonstration-trajectory MSE; real rollout distance |

The force-estimation task T1 takes two tactile images concatenated in the channel dimension, […], corresponding to approximately […] history, and predicts […]. Training uses […] regression, and evaluation uses root-mean-squared error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{\mathbf{f}}_i - \mathbf{f}_i \right\rVert_2^2}$$

Typical dataset size is […] samples per sensor for DIGIT and GelSight Mini, split […].
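As a rough illustration of the force-estimation metric, the sketch below computes a root-mean-squared error over a set of force predictions. It assumes that each sample is a force vector and that the per-sample error is the squared Euclidean norm of the residual, averaged over samples before the square root; the function name `force_rmse` is hypothetical, and the benchmark may aggregate per axis instead.

```python
import math


def force_rmse(predicted, target):
    """RMSE over samples, using the squared Euclidean norm of each residual.

    predicted, target: equal-length sequences of force vectors (tuples).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and equal length")
    total = 0.0
    for p, t in zip(predicted, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return math.sqrt(total / len(predicted))


# Two samples: one exact prediction and one that is off by a (3, 4, 0) residual.
example = force_rmse(
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
)
```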

The dense variant […], Force-Field Visualization, produces per-pixel normal force and a shear vector field over the elastomer from single-frame pairs […]. Its outputs are […] and […]. The loss is unsupervised reprojection for depth together with photometric or SSIM and smoothness terms for flow, and no ground truth is required.

Slip detection T2 also uses the […] history […], predicts […], and trains with cross-entropy for slip plus concurrent regression of […] forces with MAE to improve feature learning. The benchmark reports

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The dataset has […] samples, with […] slip, and is split into […] train and […] test.
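For a binary slip versus no-slip classifier, the score can be computed from label lists as in the following sketch. It assumes the reported metric is the standard F1 score (the harmonic mean of precision and recall) with slip as the positive class; the function name `f1_score` and the label encoding are illustrative choices, not taken from the benchmark code.

```python
def f1_score(y_true, y_pred, positive=1):
    """Standard F1 for one positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 1 = slip, 0 = no-slip (hypothetical encoding).
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
score = f1_score(labels, predictions)  # tp=2, fp=1, fn=1
```

F1 is a natural choice when the positive class is a minority, because plain accuracy can look high for a classifier that rarely predicts slip.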

Relative pose estimation T3 predicts three discrete probability distributions over binned […], […], and […] in the sensor frame. The binning is specified as […] into […] log-spaced bins and […] into […] bins. The loss is the sum of three cross-entropies, and the evaluation metric is classification accuracy. The dataset contains […] samples, with approximately […] train, […] validation, and […] test examples.
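The "sum of three cross-entropies" objective can be sketched as follows. The example assumes the probe emits one vector of logits per pose component and that each ground-truth component has already been mapped to a bin index. The bin counts used here (4, 4, and 8) and the names `cross_entropy` and `pose_loss` are hypothetical placeholders, not the benchmark's settings.

```python
import math


def cross_entropy(logits, target_bin):
    """Cross-entropy of a softmax over logits against one target bin index."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_bin]


def pose_loss(logits_per_component, target_bins):
    """Sum of one cross-entropy term per binned pose component."""
    return sum(cross_entropy(logits, target)
               for logits, target in zip(logits_per_component, target_bins))


# Uniform logits give a loss of log(number of bins) for each component.
uniform_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 8]  # hypothetical bin counts
loss = pose_loss(uniform_logits, [1, 2, 3])
```

Treating each pose component as a classification problem over bins lets the probe express multi-modal uncertainty, which a single regressed value cannot.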

Grasp-stability prediction T4 predicts whether a two-finger grasp will succeed or fail from “before” and “during” GelSight 2017 marker images. Textile classification T5 predicts one of 20 textile types from a […]- to […]-frame GelSight 2017 video clip, with chance level stated as […]. The bead-maze policy task T6 takes tactile history and proprioception and predicts a horizon of joint commands […]. Its metrics are held-out trajectory-matching error over […] steps and real-robot rollout distance before bead drop or failure.

4. Results and design principles in the tactile benchmark

The central reported benchmarking result is that, under limited labels with […] to […] label budgets, Sparsh SSL improves over end-to-end models by a mean relative gain of […] across […] through […]. The best SSL variants are reported as Sparsh (DINO) and Sparsh (IJEPA), with DINO excelling at physics-based tasks and IJEPA excelling at semantic tasks. On average, DINO beats IJEPA by […] (Higuera et al., 2024).

Representative numbers make the task dependence explicit. For force estimation on DIGIT, the reported RMSE progresses from E2E […] to DINO […] to DINOv2 […]. For force estimation on GelSight, E2E […] improves to DINO […]. For slip detection, the F1-score improves from E2E […] to VJEPA […] with full data, and VJEPA reaches […] at […] labels. For relative pose estimation, accuracy improves from E2E […] to DINO […] at full data, with […] at […] labels. For grasp stability, E2E […] becomes IJEPA […] at full data and […] at […] labels. For textile classification, E2E […], MAE […], and DINO […] are reported. For the bead-maze task, rollout distance improves from E2E […] to DINO […], a reported […], while IJEPA reaches […].

The accompanying design principles are explicit. Learning in latent feature space through self-distillation and JEPA is reported to filter out gel-marker and lighting noise better than pixel reconstruction with MAE. A small temporal history of approximately […] via […] is stated to match human slip-detection timescales and to suffice for force and slip cues. Background subtraction of the static elastomer image is reported to improve cross-sensor robustness. Frozen-encoder probing with cross-attention and MLP is used to isolate representation quality from decoder capacity. Covering three major sensor families and six diverse tasks is described as encouraging reusable, general tactile backbones rather than bespoke models.

5. TacBench for open-play soccer tactics

In “GenTac: Generative Modeling and Forecasting of Soccer Tactics,” TacBench is defined as a unified, quantitative benchmark for multi-player trajectory forecasting and tactical event recognition in open-play soccer. Its scope is derived from public full-match tracking data from three professional leagues together with broadcast-reconstructed trajectories. The benchmark has two components: TacBench-Trajectory, which uses […] clips with all […] entities at […] FPS, and TacBench-Event, which uses […] clips annotated with one of […] tactical event subtypes grouped into 5 macro-types (Rao et al., 13 Apr 2026).

The trajectory-forecasting task is formulated as generation from history […] with optional conditioning […]. Five conditioning variants are specified: unconditioned, opponent-conditioned, team-conditioned, league-conditioned, and objective-conditioned. Tactical event recognition maps an observed or generated trajectory segment to both macro-type and subtype labels. Two settings are distinguished: Event Grounding from observed history and Event Forecasting from generated futures.

The soccer data sources are Metrica Sports with […] matches, SkillCorner (A-League) with […] matches, Sportec DFL with […] matches, SoccerNet-GSR with […] broadcast-derived clips of […] each, and SoccerFactory plus refinement with […] clips. All trajectories are resampled to […] FPS. Each frame contains […] player coordinates and one ball coordinate in meters on a […] pitch, with origin at the centre and missing values recorded as […]. Spatial normalization is to […], rounded to […]. Preprocessing includes temporal interpolation to […] FPS, linear imputation for short gaps, global anomaly detection and piecewise interpolation, and exponential-moving-average smoothing.
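Two of the preprocessing steps named above, linear imputation of short gaps and exponential-moving-average smoothing, can be sketched for a single coordinate series as follows. This is a minimal illustration under stated assumptions: missing samples are represented here as `None`, and the `max_gap` and `alpha` parameters are hypothetical, since the benchmark's actual missing-value marker, gap threshold, and smoothing factor are not given in this text.

```python
def fill_short_gaps(values, max_gap):
    """Linearly interpolate interior runs of None no longer than max_gap.

    Longer runs, and runs touching either end of the series, are left as None.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i
            while i < n and filled[i] is None:
                i += 1
            end = i  # index of the first valid sample after the gap
            gap = end - start
            if start > 0 and end < n and gap <= max_gap:
                left, right = filled[start - 1], filled[end]
                for k in range(start, end):
                    frac = (k - start + 1) / (gap + 1)
                    filled[k] = left + frac * (right - left)
        else:
            i += 1
    return filled


def ema_smooth(values, alpha):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    Assumes the series contains no missing samples.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


x_raw = [0.0, None, None, 3.0, 3.0]
x_filled = fill_short_gaps(x_raw, max_gap=2)
x_smooth = ema_smooth(x_filled, alpha=0.5)
```

Restricting interpolation to short gaps avoids inventing long stretches of motion; longer gaps would need a different treatment, such as the anomaly detection and piecewise interpolation step the text mentions.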

The event taxonomy is organized as follows.

| Macro-type | Classes |
|---|---|
| Build-Up | Build |
| Transition | Ball Win; Progression |
| Threat | Goal; Shot Off Target; Shot Saved; Clearance; Defended |
| Set Piece | Corner; Free Kick; Penalty; Throw-in; Kick-off; Goal-Kick |
| Interruption | Stoppage/Foul/Substitution |

Events are extracted by buffering official labels with a […] pre-window and […] post-window, followed by collision-aware truncation. Minimum clip length is […] frames or […]. Goals, shots, and free-kicks receive an extra […] extension. An important nuance is that the macro-type counts sum to […] subtypes, but only […] are used in the final benchmark because some extremely rare classes may be merged. This is a definitional detail rather than a contradiction in the benchmark description.

6. Evaluation protocols, splits, and GenTac usage in soccer

All soccer TacBench metrics are evaluated on held-out test splits. For trajectory forecasting, […] independent rollouts are sampled, and both best-of-[…] (“min”) and average (“avg”) performance are reported. Geometric accuracy is measured with ADE and FDE:

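$$\mathrm{ADE} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\lVert \hat{\mathbf{x}}_{i}^{t} - \mathbf{x}_{i}^{t} \right\rVert_2, \qquad \mathrm{FDE} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{\mathbf{x}}_{i}^{T} - \mathbf{x}_{i}^{T} \right\rVert_2$$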
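The "min" and "avg" reporting over several sampled rollouts can be sketched as below. The code assumes the common definitions of average displacement error (mean Euclidean distance over all agents and frames) and final displacement error (mean Euclidean distance over agents at the last frame), and it takes the minimum of each metric independently across rollouts. That last choice is an assumption: some evaluations instead pick one best rollout and report both metrics for it. The function names are hypothetical.

```python
import math


def ade_fde(predicted, truth):
    """Return (ADE, FDE) for one rollout.

    predicted, truth: lists of frames, each frame a list of (x, y) per agent.
    """
    distances = [[math.dist(p, g) for p, g in zip(pred_frame, true_frame)]
                 for pred_frame, true_frame in zip(predicted, truth)]
    num_frames, num_agents = len(distances), len(distances[0])
    ade = sum(sum(frame) for frame in distances) / (num_frames * num_agents)
    fde = sum(distances[-1]) / num_agents
    return ade, fde


def summarize_rollouts(rollouts, truth):
    """Best-of-K ("min") and mean ("avg") ADE/FDE over K sampled rollouts."""
    scores = [ade_fde(rollout, truth) for rollout in rollouts]
    ades = [s[0] for s in scores]
    fdes = [s[1] for s in scores]
    return {
        "min_ade": min(ades),
        "avg_ade": sum(ades) / len(ades),
        "min_fde": min(fdes),
        "avg_fde": sum(fdes) / len(fdes),
    }


# One agent, two frames, two sampled futures.
truth = [[(0.0, 0.0)], [(0.0, 0.0)]]
rollout_a = [[(0.0, 0.0)], [(3.0, 4.0)]]  # ADE 2.5, FDE 5.0
rollout_b = [[(1.0, 0.0)], [(1.0, 0.0)]]  # ADE 1.0, FDE 1.0
summary = summarize_rollouts([rollout_a, rollout_b], truth)
```

Reporting both statistics is informative for stochastic generators: the "min" value rewards covering the true future with at least one sample, while the "avg" value penalizes samples that stray far from it.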

Collective structure consistency is assessed using Stretch Index, Surface Area, Team Width, Team Length, Frobenius Norm, Centroid Displacement, and the Kuramoto Order Parameter, with reporting in terms of absolute difference between prediction and ground truth. Semantic event recognition reports type-level top-1 and top-3 accuracy and macro-averaged Recall@1 and Recall@3, and subtype-level top-1, top-3, and top-5 accuracy with macro-averaged Recall@1, Recall@3, and Recall@5. Offense and defense analytical metrics are computed using an EPV grid from FoTD (LaurieOnTracking), including Off-Ball Expected Threat, Depth Threat, Width Threat, Defensive Shape Disruption, and Defensive Dominant Region (Rao et al., 13 Apr 2026).
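The semantic event metrics can be illustrated with a short sketch of top-k accuracy and macro-averaged Recall@k. It assumes that each clip has one ground-truth label and a vector of per-class scores, and that macro averaging is taken over the classes that actually appear in the labels. The function names are hypothetical, and the benchmark's exact averaging convention is not specified in this text.

```python
from collections import defaultdict


def topk_hit(scores, label, k):
    """True if the label is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def topk_accuracy(all_scores, labels, k):
    """Fraction of samples whose label is in the top-k predictions."""
    hits = [topk_hit(s, y, k) for s, y in zip(all_scores, labels)]
    return sum(hits) / len(hits)


def macro_recall_at_k(all_scores, labels, k):
    """Mean over classes of the per-class top-k hit rate."""
    per_class = defaultdict(list)
    for s, y in zip(all_scores, labels):
        per_class[y].append(topk_hit(s, y, k))
    return sum(sum(h) / len(h) for h in per_class.values()) / len(per_class)


# Four clips, three classes; class 1 is never the top prediction.
clip_scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]
clip_labels = [0, 0, 1, 2]
acc1 = topk_accuracy(clip_scores, clip_labels, k=1)
rec1 = macro_recall_at_k(clip_scores, clip_labels, k=1)
```

In this toy case top-1 accuracy is 0.75 while macro Recall@1 is 2/3, because the macro average weights the single missed class 1 clip as heavily as the two correct class 0 clips. This is why macro-averaged recall is a useful companion to accuracy when event classes are imbalanced.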

The trajectory splits are specified source by source: Metrica […] matches for train, validation, and test; SkillCorner […] matches; DFL […] matches; SoccerNet-GSR […] clips; and SoccerFactory […] clips. The event benchmark uses a […] ratio with […] train, […] validation, and […] test clips, for […] total instances across all […] subtypes. Other sports are also included for cross-domain generalization: basketball with […] clips, American football with […] clips, and ice hockey with […] clips.

GenTac uses TacBench to demonstrate four capabilities. For geometric and structural accuracy in unconditioned forecasting, min-ADE grows from […] at […] to […] at […], and min-FDE from […] to […] with a […] causal window. Structural deviations are reported as […] and […] over […]. For team and league style simulation, team conditioning for Auckland FC reduces surface area error from […] to […] at […], and league conditioning lowers short-horizon ADE by […] at […]. For controllable counterfactuals, offensive guidance raises off-ball expected threat by […], depth threat by […], and width threat by […]; defensive guidance lowers threat metrics, raises disruption by […], and increases dominant region by […]. For event grounding, type top-1 accuracy is […], type top-3 is […], subtype top-1 is […], subtype top-3 is […], and subtype top-5 is […]. Event forecasting from generated futures is described as yielding a distribution over events and thereby capturing tactical branching.

Taken together, the two TacBench benchmarks exemplify a shared benchmarking strategy under a reused name: both define standardized evaluation protocols for models that must generalize across heterogeneous contexts, but they do so in sharply different technical regimes. In tactile sensing, TacBench isolates reusable touch representations across sensors and downstream tasks. In soccer tactics, TacBench quantifies the geometric, structural, and semantic fidelity of stochastic multi-agent forecasts.

References (2)
