TAC-V: Audio-Visual & Tactile Benchmark

Updated 4 July 2026

The paper on Timestamped Audio Captioning introduces TAC-V as an audio-visual pipeline that fuses temporal audio cues with visual grounding, achieving state-of-the-art scores (e.g., 77.9 on Daily-Omni).
TacVerse (TAC-V) is a vision-based tactile benchmark with 106,800 labeled tactile images supporting shape, grating, and force tasks across within-sensor and cross-sensor settings.
TAC-V also functions as an informal label in tactile-visual robotics, encompassing diverse approaches like tactile servoing, contact-aware gating, and multi-modal representation learning.

TAC-V is not a single standardized designation across the arXiv literature. In the supplied corpus it appears in two explicit senses: as TAC-V, the audio-visual extension of Timestamped Audio Captioning, and as TacVerse (TAC-V), a benchmark for cross-sensor vision-based tactile perception. In adjacent literatures, the same string is also a plausible informal or mistaken shortening of names such as Tac-VGNN, or a near-match to terms such as TacVLA, TactV, and TaC, which belong to distinct research programs in tactile robotics, hybrid vehicles, and superconducting materials, respectively (Kumar et al., 17 Feb 2026, Wei et al., 24 Jun 2026).

1. TAC-V as audio-visual timestamped captioning

In "Timestamped Audio Captioning" (Kumar et al., 17 Feb 2026), TAC-V is introduced as an audio-visual pipeline that fuses TAC’s high temporal-precision audio outputs with a visual LLM for temporally dense audio-visual captions. It is not presented as a single jointly trained end-to-end architecture, but as a cascade composed of TAC, speech transcription over TAC-detected speech spans, FLAM-based event confidence scoring, sampled video frames, visual shot markers, and a VLM, specifically Qwen3-VL-32B.

The pipeline begins by extracting audio and sampling frames at 2 fps, alternating 360p and 240p resolution. Audio is divided into 20s non-overlapping chunks and processed by TAC to produce timestamped event descriptions with tags such as [speech], [sfx], and [music]. Speech events are transcribed, and each event receives a confidence score $c \in [0,1]$ from FLAM. These outputs are then assembled into a time-ordered shot-list and augmented with visual shot boundaries before being passed to the VLM. The VLM is prompted to perform hallucination correction and visual grounding, yielding fused timestamped descriptions that can include explicitly visual events and visible sound sources (Kumar et al., 17 Feb 2026).
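
The assembly steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Event` record, the function names, and the `"shot"` tag for visual markers are hypothetical, and only the 20 s chunk length and the 2 fps sampling rate come from the description above.

```python
from dataclasses import dataclass

CHUNK_SECONDS = 20.0   # chunk length stated in the text
FRAME_RATE_FPS = 2.0   # frame sampling rate stated in the text


@dataclass
class Event:
    """Hypothetical record for one timestamped entry in the shot-list."""
    start: float               # seconds from the start of the clip
    end: float
    tag: str                   # e.g. "speech", "sfx", "music", or "shot"
    text: str
    confidence: float = 1.0    # stand-in for a FLAM-style score in [0, 1]


def chunk_bounds(duration, chunk=CHUNK_SECONDS):
    """Return (start, end) pairs of non-overlapping chunks covering the clip."""
    bounds = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + chunk, duration)))
        start += chunk
    return bounds


def frame_times(duration, fps=FRAME_RATE_FPS):
    """Timestamps (in seconds) of frames sampled at a fixed rate."""
    return [i / fps for i in range(int(duration * fps))]


def to_global(chunk_start, local_events):
    """Shift chunk-local event times onto the clip-level timeline."""
    return [
        Event(e.start + chunk_start, e.end + chunk_start, e.tag, e.text, e.confidence)
        for e in local_events
    ]


def build_shot_list(audio_events, shot_boundaries):
    """Merge audio events and visual shot markers into one time-ordered list."""
    markers = [Event(t, t, "shot", "visual shot boundary") for t in shot_boundaries]
    return sorted(audio_events + markers, key=lambda e: (e.start, e.end))
```

In this sketch, events detected inside a chunk carry chunk-local times, so `to_global` is applied before `build_shot_list`; whether TAC itself emits local or global timestamps is not specified here.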

A defining feature of TAC-V is its role as a semantic bridge. The downstream text-only reasoner never sees the original audio or video; it receives only TAC-V’s textual timeline. In that sense, TAC-V converts raw multimodal signals into a temporally grounded intermediate representation suitable for question answering and reasoning. The paper reports strong results for the TAC-V $\rightarrow$ LLM cascade on audio-visual reasoning benchmarks, including 77.9 on Daily-Omni and 59.2 on Video-Holmes with Gemini 3, both described as state-of-the-art in the paper’s comparison table. On AVHBench, TAC-V $\rightarrow$ LLM also markedly improves over a simpler VLM $\rightarrow$ LLM baseline on several subbenchmarks, including 79.8 vs 70.8 for AVH and 76.1 vs 51.8 for VAH when Qwen3 is used as the reasoner (Kumar et al., 17 Feb 2026).
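
The semantic-bridge idea can be illustrated with a short sketch of how a textual timeline might be handed to a text-only reasoner. The line format, the prompt wording, and the function names are assumptions made for illustration; the paper's actual prompt is not reproduced here.

```python
def format_timeline(entries):
    """Render (start, end, tag, text) tuples as one line per event, sorted by time."""
    lines = []
    for start, end, tag, text in sorted(entries):
        lines.append(f"[{start:.1f}s-{end:.1f}s] [{tag}] {text}")
    return "\n".join(lines)


def build_prompt(entries, question):
    """Compose a text-only prompt: the reasoner sees the timeline, never raw media."""
    return (
        "Timeline of the clip:\n"
        + format_timeline(entries)
        + "\n\nQuestion: " + question + "\nAnswer:"
    )


# Toy entries, invented for illustration only.
example_entries = [
    (4.0, 6.5, "speech", "A man says hello."),
    (0.0, 2.0, "sfx", "A door slams."),
]
example_prompt = build_prompt(example_entries, "What happens first?")
```

The point of the design is that any text-only LLM can be swapped in as the reasoner, since all audio and visual evidence has already been reduced to timestamped text.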

2. TacVerse (TAC-V) as a cross-sensor tactile benchmark

In "TacVerse: A Multi-Sensor Dataset and Benchmark for Cross-Sensor Vision-Based Tactile Perception" (Wei et al., 24 Jun 2026), the dataset and benchmark are explicitly identified as TacVerse (TAC-V). Here TAC-V belongs to the literature on vision-based tactile sensors (VBTSs) and refers to a controlled benchmark for measuring how well models trained on one tactile sensor transfer to another.

TacVerse contains 106,800 labelled tactile images from seven VBTSs and supports three downstream tasks: shape classification, grating classification, and force regression. The benchmark defines three settings: within-sensor training, zero-shot cross-sensor transfer, and few-shot adaptation. Shape classification is evaluated through an exhaustive $7 \times 7$ source-target matrix, while grating and force use fixed-source transfer protocols. For few-shot adaptation in force regression, the labelled target fractions are 0.5%, 1%, 2.5%, 5%, and 10% (Wei et al., 24 Jun 2026).
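
The evaluation protocol can be made concrete with a small enumeration sketch. The sensor names below are placeholders (only some of the seven sensors are named in this article), and the target-set size passed to `few_shot_counts` is a hypothetical value; only the count of seven sensors and the five few-shot fractions come from the description above.

```python
from itertools import product

NUM_SENSORS = 7
SENSORS = [f"sensor_{i}" for i in range(NUM_SENSORS)]   # placeholder names
FEW_SHOT_FRACTIONS = [0.005, 0.01, 0.025, 0.05, 0.10]   # fractions stated in the text


def source_target_pairs(sensors):
    """All ordered (source, target) pairs, including the within-sensor diagonal."""
    return list(product(sensors, repeat=2))


def split_pairs(pairs):
    """Separate within-sensor pairs from cross-sensor transfer pairs."""
    within = [(s, t) for s, t in pairs if s == t]
    cross = [(s, t) for s, t in pairs if s != t]
    return within, cross


def few_shot_counts(num_target_samples, fractions=FEW_SHOT_FRACTIONS):
    """Number of labelled target samples per fraction (at least one)."""
    return {f: max(1, round(num_target_samples * f)) for f in fractions}


pairs = source_target_pairs(SENSORS)
within, cross = split_pairs(pairs)
```

With seven sensors, the exhaustive matrix has 49 cells: 7 within-sensor cells on the diagonal and 42 cross-sensor transfer cells off it.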

The main empirical conclusion is that within-sensor learning is strong, but zero-shot cross-sensor transfer degrades substantially. Shape classification is comparatively robust, whereas grating classification and force regression are more sensitive to sensor shift. For example, within-sensor grating classification from GelSightMarker to GelSightMarker reaches 0.903 accuracy, while zero-shot transfer from GelSightMarker to ViTacTip falls to 0.041. In force regression, within-sensor GelSightNoMarker $\rightarrow$ GelSightNoMarker yields RMSE 0.186 and $R^2 = 0.590$, whereas transfer to ViTacTip yields RMSE 1.277 and $R^2 = -9.058$ (Wei et al., 24 Jun 2026).
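
A negative $R^2$ can look surprising, so the following sketch shows how it arises. It assumes the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$, which the paper is presumed but not confirmed to use, and the numbers are toy values rather than TacVerse data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the paper.
forces = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]       # small errors: R^2 close to 1
shifted = [4.0, 5.0, 6.0, 7.0]    # constant +3 offset: R^2 below 0

good_r2 = r2_score(forces, good)
shifted_r2 = r2_score(forces, shifted)
```

Under this definition, $R^2 < 0$ means the model's squared error exceeds that of always predicting the target mean; a value of $-9.058$ would correspond to a residual sum of squares about 10 times the total sum of squares.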

TacVerse also includes a representation study showing that MAE pretraining provides the most consistent gains across tasks and sensors. The paper therefore uses TAC-V not as a policy or control framework, but as a dataset-and-benchmark infrastructure for studying sensor shift, data-efficient adaptation, and self-supervised tactile representation learning (Wei et al., 24 Jun 2026).

3. TAC-V as an informal label within tactile-visual robotics

Outside those two explicit usages, the supplied literature shows that “TAC-V” is often best interpreted as an informal domain label for tactile-visual systems rather than a formal acronym. Several papers are directly relevant to such a reading, even when they do not standardize the exact string.

"Tac-VGNN: A Voronoi Graph Neural Network for Pose-Based Tactile Servoing" states explicitly that the paper does not define a method named “TAC-V”; the closest relevant string is the project URL ending in tac-vgnn, and the paper notes that “TAC-V” is likely an informal truncation or mistaken shortening of Tac-VGNN (Fan et al., 2023). Tac-VGNN itself is a tactile servoing method based on a 5-layer GCN with Delaunay-graph construction and Voronoi-area node features, improving vertical-depth pose estimation by 28.57% over a vanilla GNN and yielding smoother surface-following behavior (Fan et al., 2023).

Other tactile-visual manipulation papers reinforce the same ambiguity. Vi-TacMan defines a staged tactile-visual articulated-manipulation system in which vision proposes a grasp and a coarse interaction direction, and touch refines execution through contact regulation, without explicit articulated kinematic models (Cui et al., 7 Oct 2025). TacVLA incorporates tactile tokens into a VLA model with a contact-aware gating mechanism and reports improvements of 20 percentage points in disassembly average success, 60 percentage points in in-box picking, and approximately 2.1× under visual occlusion relative to a finetuned Pi0.5 baseline (Zhang et al., 13 Mar 2026). TacMamba separates 100 Hz tactile reflex processing from a roughly 1 Hz VLA planner through a Mamba-based tactile history compressor with 0.45 ms inference latency, reaching 100% button-pressing success and 95% blind fry-packing success in its reported tasks (Wang et al., 2 Mar 2026). UniTacVLA adds a unified tactile latent space, tactile chain-of-thought supervision, future tactile prediction, and an action-tactile mixed controller for contact-rich manipulation (Zhang et al., 30 Jun 2026).
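
The rate separation described for TacMamba can be sketched as a generic dual-rate loop. This is a scheduling illustration only, under the assumption of an integer ratio between the two rates: the function names and the toy planner and reflex are hypothetical, and the sketch does not model the Mamba-based history compressor.

```python
TACTILE_HZ = 100   # tactile rate stated in the text
PLANNER_HZ = 1     # approximate planner rate stated in the text


def run_dual_rate(readings, plan_fn, reflex_fn,
                  tactile_hz=TACTILE_HZ, planner_hz=PLANNER_HZ):
    """Run a fast reflex on every tactile tick and refresh the plan at a slower rate."""
    ticks_per_plan = tactile_hz // planner_hz
    plan = None
    history = []
    commands = []
    plan_updates = 0
    for tick, reading in enumerate(readings):
        history.append(reading)
        if tick % ticks_per_plan == 0:
            # Slow path: the planner sees the tactile history so far.
            plan = plan_fn(history)
            plan_updates += 1
        # Fast path: every tick reacts to the latest reading under the current plan.
        commands.append(reflex_fn(plan, reading))
    return commands, plan_updates


def plan_target_force(history):
    """Toy planner: always request a constant target force."""
    return 1.0


def reflex_command(plan, reading):
    """Toy reflex: proportional correction toward the planned force."""
    return 0.5 * (plan - reading)


commands, plan_updates = run_dual_rate([0.0] * 250, plan_target_force, reflex_command)
```

Over 250 tactile ticks (2.5 s at 100 Hz), the toy planner is called three times while the reflex runs on every tick, which is the asymmetry the paragraph describes.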

Taken literally, none of these papers standardizes “TAC-V” as a shared acronym. A plausible implication is that in robotics contexts the string often functions as a shorthand query for tactile-aware visual or vision-language-action systems, rather than as a uniquely defined term (Cui et al., 7 Oct 2025, Zhang et al., 13 Mar 2026, Wang et al., 2 Mar 2026, Zhang et al., 30 Jun 2026).

4. Adjacent tactile-visual research

Several additional papers broaden the tactile-visual cluster around TAC-V without naming it directly. They are relevant because they define adjacent research objects—sensor hardware, representation learning, and tactile augmentation—that a TAC-V query may seek.

TransTac introduces a transparent ultraviolet-encoded visuo-tactile sensor that combines visible-spectrum visual observation with UV-marker tactile reconstruction in one binocular device. It reports approximately 21% better correspondence robustness than Hungarian matching, 83.3% zero-shot recognition accuracy on tactile images, class-center similarity rising from around 0.2 to over 0.77, and near-contact alignment error of about $2.44\,\mathrm{mm}$ (Yang et al., 3 Jun 2026). Tac-DINO argues that tactile learning should align touch with local visual patches rather than whole images, introduces the Touch3D dataset with 505 objects and 20,025 tactile contacts, and reports large gains for patch-level alignment over whole-image alignment in local-to-global retrieval (Li et al., 10 Jun 2026). TacGen makes the stronger representational claim that touch is a necessary physical evidence channel for contact-dependent properties, reporting a gain over matched vision-only baselines of $\Delta R^2 = +0.5699$ for mass, along with further gains for density, hardness, and force labels, with all confidence intervals excluding zero (Ye et al., 28 Jun 2026).
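
The contrast between whole-image and patch-level alignment can be illustrated with toy embeddings. This is a sketch of the general idea, not Tac-DINO's model: mean pooling stands in for a whole-image embedding, cosine similarity stands in for the learned alignment score, and all vectors and function names are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean, used here as a stand-in for a pooled image embedding."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def whole_image_score(touch, patches):
    """Compare the touch embedding with one pooled embedding of the image."""
    return cosine(touch, mean_vector(patches))


def patch_level_score(touch, patches):
    """Compare the touch embedding with each patch and keep the best match."""
    scores = [cosine(touch, p) for p in patches]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Toy embeddings: only patch 2 resembles the touched region.
touch = [1.0, 0.0]
patches = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

pooled = whole_image_score(touch, patches)
local, local_index = patch_level_score(touch, patches)
```

In this toy case the pooled score is diluted by the three unrelated patches, while the patch-level score recovers the matching region and its index.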

In control-oriented VLA work, TacCoRL post-trains a pretrained VLA with tactile conditioning, sim-real co-training, and simulation RL, increasing average real-world success from 50.0% to 72.5% across four bimanual contact-rich tasks (Ma et al., 10 Jun 2026). In evaluation infrastructure, TacEva provides a quantitative benchmarking framework for VBTSs across intrinsic metrics, standard performance, and robustness, while TacVerse provides a multi-sensor benchmark for cross-sensor transfer (Cong et al., 23 Sep 2025, Wei et al., 24 Jun 2026).
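
Because this article quotes some gains in percentage points and others as multipliers, a short arithmetic sketch may help when reading the TacCoRL figures. The function names are illustrative; the two success rates are the ones quoted above.

```python
def percentage_point_gain(before_pct, after_pct):
    """Absolute difference between two rates, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct, after_pct):
    """Change relative to the starting rate, as a fraction."""
    return (after_pct - before_pct) / before_pct


# Average real-world success rates quoted in the text.
pp = percentage_point_gain(50.0, 72.5)
rel = relative_gain(50.0, 72.5)
```

For these figures, the gain is 22.5 percentage points, which is a 45% relative improvement over the 50.0% starting rate.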

Taken together, these works suggest that TAC-V in the tactile-robotics sense is less a single method than a research cluster centered on tactile-visual grounding, contact-aware policy adaptation, tactile representation learning, and sensor-level benchmarking.

5. Non-tactile and orthographic collisions

The ambiguity of TAC-V is compounded by unrelated near-matches in other domains. In condensed-matter physics, TaC refers to tantalum carbide, not TAC-V. "Superconductivity in centrosymmetric topological superconductor candidate TaC" reports single-crystal TaC as a bulk, strongly coupled, low-$\kappa$ type-II superconductor with single-gap $s$-wave behavior and an anomalously linear upper critical field $H_{c2}(T)$ (Yan et al., 2021). A separate paper on graphite-coated TaC nanocapsules is relevant only if a TAC-V-like query is read as TaC-related voltage behavior: it reports an intermittent Josephson effect with low-temperature voltage and temperature oscillations in TaC/C/TaC junction networks (Geng et al., 2011). In this usage, the “V” is not part of a formal acronym but of a thematic reading such as TaC voltage.

In robotics, TactV is another near-match. "TactV: A Class of Hybrid Terrestrial/Aerial Coaxial Tilt-Rotor Vehicles" defines TactV as a compact hybrid terrestrial/aerial vehicle with a coaxial tilt-rotor system, a spherical cage, and a tiltable center of gravity for energy-saving and high-mobility ground modes (Dong et al., 2024). It is orthographically close to TAC-V, but conceptually unrelated to tactile-visual systems.

These collisions matter because short string queries on arXiv often recover them together. A TAC-V search therefore requires domain disambiguation: audio-visual captioning, tactile-visual robotics, superconducting TaC, and hybrid vehicle design are all represented in the supplied literature.

6. Disambiguation and current usage

The most precise way to use TAC-V is to reserve it for the two cases where the supplied papers explicitly formalize the term: TAC-V in timestamped audio-visual captioning and TacVerse (TAC-V) in cross-sensor tactile benchmarking. Other uses are best treated as contextual or informal.

Usage	Domain	Status in supplied literature
TAC-V	Audio-visual timestamped captioning	Explicit pipeline name (Kumar et al., 17 Feb 2026)
TacVerse (TAC-V)	Vision-based tactile benchmarking	Explicit dataset/benchmark name (Wei et al., 24 Jun 2026)
Tac-VGNN	Tactile servoing	Formal name is Tac-VGNN; “TAC-V” is not standardized (Fan et al., 2023)
TacVLA / TacMamba / UniTacVLA / Vi-TacMan	Tactile-visual manipulation and VLA	Related research cluster, not a shared TAC-V acronym (Zhang et al., 13 Mar 2026, Wang et al., 2 Mar 2026, Zhang et al., 30 Jun 2026, Cui et al., 7 Oct 2025)
TaC / TaC voltage	Superconducting materials	Orthographic and thematic near-match, not TAC-V (Yan et al., 2021, Geng et al., 2011)
TactV	Hybrid terrestrial/aerial vehicle	Orthographic near-match, unrelated expansion (Dong et al., 2024)

A practical implication is that TAC-V should usually be accompanied by a domain qualifier. In audio-language work it denotes a temporally grounded audio-visual captioning cascade. In tactile sensing it denotes a cross-sensor VBTS benchmark. In robotics more broadly it often serves only as a search-string proxy for tactile-visual or touch-and-vision methods, including selective fusion, contact-aware gating, tactile prediction, and asynchronous tactile memory. In materials or vehicle contexts, the same characters may refer to entirely different objects.
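
The role of the domain qualifier can be sketched as a small lookup. The domain keys, the normalization rule, and the returned strings are all assumptions chosen for illustration; they paraphrase the table above and are not a standard vocabulary.

```python
import re

# Formal senses of "TAC-V", keyed by a hypothetical domain qualifier.
FORMAL_SENSES = {
    "audio-visual captioning": "TAC-V: audio-visual extension of Timestamped Audio Captioning",
    "tactile benchmarking": "TacVerse (TAC-V): cross-sensor VBTS benchmark",
}

# Near-matches that are distinct names, keyed by normalized string.
NEAR_MATCHES = {
    "tacvgnn": "Tac-VGNN: tactile servoing method",
    "tacvla": "TacVLA: tactile VLA model with contact-aware gating",
    "tactv": "TactV: hybrid terrestrial/aerial vehicle",
}

AMBIGUOUS = "ambiguous: add a domain qualifier"
UNKNOWN = "unknown term"


def normalize(term):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def resolve(term, domain=None):
    """Map a query string, plus an optional domain, to one reading."""
    key = normalize(term)
    if key == "tacv":
        return FORMAL_SENSES.get(domain, AMBIGUOUS)
    return NEAR_MATCHES.get(key, UNKNOWN)
```

Note that this case-insensitive normalization would also collapse TaC (tantalum carbide) and TAC (Timestamped Audio Captioning) into the same key, which is why neither is listed in `NEAR_MATCHES`; it is a small instance of the orthographic collision described earlier.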

As of the supplied corpus, TAC-V is therefore best understood not as a universally stable term, but as a domain-dependent label with two formal definitions and several recurrent informal readings.