Timestamped Audio Captioner (TAC)

Updated 4 July 2026

Timestamped Audio Captioner (TAC) is a class of audio-language systems that generate temporally grounded descriptions with explicit timestamps for detailed event localization.
It employs methods like fine-tuning with LoRA and serialized timestamp tokens to effectively overcome issues such as overlap confusion and temporal inconsistency.
TAC advances audio processing by coupling caption generation with temporal parsing, enhancing tasks like audio event detection and multimodal reasoning.

Timestamped Audio Captioner (TAC) denotes a class of audio-language systems that generate temporally grounded descriptions with explicit timestamps rather than a single clip-level summary. In the narrower, model-specific sense, TAC is also the name of the Qwen2-Audio-based system introduced in "Timestamped Audio Captioning," which was designed to address overlap confusion, temporal inconsistency, and hallucinations in complex acoustic scenes by emitting structured, parseable, time-stamped event descriptions at controllable levels of detail (Kumar et al., 17 Feb 2026). In adjacent literature, timestamped captioning is closely related to text-to-audio grounding, frame-wise language-audio pretraining, time-marker-based generation, and long-audio reasoning with temporally grounded intermediate steps (Xu et al., 2021, Primus et al., 12 May 2025, Yang et al., 1 Jun 2026, Ghosh et al., 13 Apr 2026).

1. Task definition and scope

Timestamped captioning extends automated audio captioning from clip-level summarization to explicit temporal localization. A contemporary survey of audio captioning distinguishes standard AAC, which outputs a single sentence for a clip, from timestamped or dense captioning, which outputs a sequence of phrases with temporal spans such as $[0.0\text{–}1.5\,s]$ and $[1.5\text{–}5.0\,s]$ (Xu et al., 2022). The practical motivation is that non-speech audio, speech, music, and background ambience are rarely stationary; they overlap, recur, and change semantic salience over time.

A closely related formalization is text-to-audio grounding (TAG), in which a caption is given and the system localizes the sound-event phrases referred to by that caption with onset and offset timestamps (Xu et al., 2021). TAG is therefore a grounding task, whereas TAC couples description generation with temporal localization. The distinction matters operationally: TAG asks where a known phrase occurs; TAC asks what occurs and when.

The TAC model introduced in 2026 explicitly frames the problem around failures of large audio-LLMs trained with sparse global supervision. The reported failure modes are overlap confusion, temporal inconsistency, and hallucinations. TAC addresses these by producing time-stamped lines ordered by start time, with coarse type labels such as [music], [sfx], [speech], and [background], followed by natural-language descriptions and precise start-end timestamps (Kumar et al., 17 Feb 2026). This output format makes the model both a captioner and a temporal parser of acoustic scenes.

2. Temporal grounding mechanisms and architectural patterns

The 2026 TAC model is built by fine-tuning Qwen2-Audio with the base model frozen and LoRA applied to linear layers. Its timestamps are serialized as atomic special tokens of the form <|t|>, for example <|1.23|>, so temporal references are handled directly in the language channel rather than via a separate alignment head (Kumar et al., 17 Feb 2026). This design couples caption generation and temporal localization in a single autoregressive output stream.

Recent systems implement temporal grounding through several distinct mechanisms. AF-Next introduces Temporal Audio Chain-of-Thought, trained on AF-Think-Time, and replaces standard RoPE with Rotary Time Embeddings, using $\theta \leftarrow -\tau_i \cdot 2\pi$ rather than token-index-based rotation. Audio is processed in non-overlapping 30-second chunks, the waveform is resampled to 16 kHz mono, converted to a 128-channel log mel-spectrogram, encoded at 50 Hz, and then handled by an LLM whose context is extended from 32k to 128k tokens (Ghosh et al., 13 Apr 2026). The aim is not only captioning, but timestamp-grounded reasoning over long audio.

MOSS-Audio uses a different temporal representation. Its dedicated audio encoder produces 12.5 Hz temporal representations, and explicit time markers are inserted every 25 audio frames, corresponding to 2-second elapsed-time anchors. Two architectural choices are central: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues inside the audio-token stream (Yang et al., 1 Jun 2026). This yields a decoder that can emit timestamped text while remaining conditioned on temporally structured audio features.

Speech-centric timestamping has also been developed through direct acoustic-to-word timing. "Timestamped Embedding-Matching Acoustic-to-Word CTC ASR" trains a word-level CTC recognizer that directly outputs word start times and durations, avoiding a secondary forced-alignment stage at test time (Jeon, 2023). Although this is an ASR system rather than a general audio captioner, it demonstrates that timestamp production can be integrated into primary sequence decoding rather than appended as post-processing. A plausible implication is that speech-heavy TAC systems can inherit timestamp robustness from ASR-style timing formulations while extending the text output beyond verbatim transcripts.

3. Data construction and training supervision

TAC’s central methodological contribution is its synthetic data pipeline. A dynamic acoustic mixer constructs realistic polyphonic mixtures from licensed single-source audio, using scene templates with role bindings for speech, music, sound effects, and backgrounds. For each event, the system computes a continuous activity map from RMS energy, applies a per-example activity threshold $d_{act}$ , merges adjacent active segments with a per-example merge threshold $d_{merge}$ , rounds timestamps to a target resolution $d_{res}$ , and then formats style-controlled captions under a prompt that exposes these timing parameters (Kumar et al., 17 Feb 2026). The model therefore sees not only timestamps, but a distribution over timestamp granularities and descriptive styles.

The target supervision is hierarchical. TAC samples KEYWORDS, BRIEF, or DETAILED style, constructs a reasoning header from composition metadata, and then emits time-stamped event lines. Training uses a timestamp-weighted objective,

$L_{total} = L_{LM} + \alpha_{time} \sum_{t \in \mathcal{T}_{time}} CE(y_t, \hat{y}_t),$

so timestamp token errors are explicitly upweighted relative to ordinary language-model tokens (Kumar et al., 17 Feb 2026). This differs from standard AAC training, where time is usually implicit or absent.

Other timestamped-captioning pipelines use stronger direct temporal supervision from real data. TACOS curates approximately 12,000 Freesound recordings with single-sentence free-text descriptions aligned to specific temporal regions, yielding 12,358 clips and 47,748 annotated regions. Its training strategy is frame-wise contrastive: audio frame embeddings are aligned with caption embeddings only within the annotated segment, rather than after global pooling (Primus et al., 12 May 2025). This directly targets frame-level text-audio correspondence.

AF-Next scales data volume rather than relying on a single supervision source. It expands AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat into a corpus totaling approximately 108M samples and approximately 1M hours, including 27k hours and 290K pairs for Long Audio Captioning and 3954 hours with 43K thinking-chain triplets for AF-Think-Time (Ghosh et al., 13 Apr 2026). MOSS-Audio, by contrast, builds an event-preserving annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation for speech, music, and general audio, and merges the branches into unified captions while retaining the intermediate branch outputs for task-oriented SFT (Yang et al., 1 Jun 2026). These alternatives show that timestamp supervision can come from synthetic mixtures, human-aligned temporal regions, or tool-assisted branch annotation.

4. Inference, controllability, and output structure

At inference time, TAC is instruction-controlled. The model accepts a prompt header specifying style and timing settings; the reported recommended configuration is [style=brief, activity=0.05, resolution=0.10s, merge=0.25s], which is described as balancing precision, recall, and hallucination (Kumar et al., 17 Feb 2026). This means temporal resolution and descriptive density are runtime controls rather than fixed properties of the checkpoint.

The output is line-oriented and overlap-aware. TAC emits separate lines for distinct events, so concurrent events remain disentangled instead of being collapsed into a single narrative sentence. Speech spans can be post-processed with Whisper to insert a <speech>...</speech> transcript inside the same time span, enriching the event line with lexical content while preserving the original temporal bracket (Kumar et al., 17 Feb 2026). This is especially relevant in mixed speech-and-sound scenes, where verbatim speech and non-speech event description coexist.

Other systems expose different inference constraints. AF-Next-Captioner is the recommended checkpoint for TAC-style timestamped captioning within the AF-Next family, but the paper does not specify a canonical timestamp format, explicit decoding parameters, or structured timestamp tokens, so formatting must be defined externally at the application layer (Ghosh et al., 13 Apr 2026). MOSS-Audio instead advocates explicit time-token serialization and constrained decoding with non-decreasing time tokens, including bracketed [start]text[end] styles and post-decoding monotonicity enforcement (Yang et al., 1 Jun 2026). This suggests two broad operating modes for TAC systems: timestamp-as-language-token and timestamp-as-postprocessed structure.

5. Evaluation protocols and empirical results

Dense timestamped captioning remains evaluated by a heterogeneous mix of temporal, semantic, and hallucination metrics. On TACOS, the TAC model reports event-based F1, segment-based F1, hallucination rate, confidence, and specificity, and outperforms the compared captioning baselines on all five reported dimensions (Kumar et al., 17 Feb 2026).

System	Benchmark	Reported result
TAC	TACOS	EvtF1 0.50, SegF1 0.71, Hal 4.9%, conf 0.89, spec 0.74
Gemini 3 Pro	TACOS	EvtF1 0.42, SegF1 0.64, Hal 6.1%, conf 0.84, spec 0.66
Qwen3-Omni	TACOS	EvtF1 0.37, SegF1 0.66, Hal 7.3%, conf 0.84, spec 0.62
Audio Flamingo 3	TACOS	EvtF1 0.27, SegF1 0.55, Hal 11.6%, conf 0.73, spec 0.59

The same work reports that multitask prompts are critical, that removing real-world dense annotations reduces EvtF1 to 0.42, that LoRA rank $r=128$ is optimal among the tested settings, and that $\alpha_{time}=5$ gives the best balance between temporal grounding and hallucination (Kumar et al., 17 Feb 2026). These ablations indicate that timestamp conditioning, synthetic overlap structure, and real dense references all contribute materially to performance.

Earlier grounding-oriented baselines are much weaker. TAG reports an event-F1 score of 28.3% and a PSDS score of 14.7% on AudioGrounding, with a qualitative finding that the baseline is often phrase-insensitive (Xu et al., 2021). TACOS shows that stronger temporal supervision improves text-based sound event detection substantially: strong fine-tuning reaches PSDS1 $= 17.99 \pm 0.10$ and pAUROC $[1.5\text{–}5.0\,s]$ 0, compared with PSDS1 $[1.5\text{–}5.0\,s]$ 1 and pAUROC $[1.5\text{–}5.0\,s]$ 2 under weak fine-tuning (Primus et al., 12 May 2025).

A persistent evaluation issue is that not all long-audio models report dedicated timestamp metrics. AF-Next provides captioning results such as CIDEr 0.52 on Clotho-v2 and 0.74 on AudioCaps for AF-Next-Instruct, but explicitly states that timestamp alignment metrics and a dedicated TAC benchmark are left to future work (Ghosh et al., 13 Apr 2026). This makes cross-paper comparison difficult: some systems optimize event-level timing directly, while others demonstrate temporal capability primarily through downstream reasoning or qualitative examples.

6. Extensions, adjacent systems, and unresolved issues

TAC has already been extended into multimodal and reasoning-heavy pipelines. TAC-V combines TAC outputs over 20-second audio chunks with frame sampling at 2 fps, uses FLAM confidence scores for audio events, and feeds the resulting shot-list into Qwen3-VL-32B with a chain-of-thought prompt for visual grounding and hallucination correction. In this describe-then-reason regime, TAC-V→Gemini reaches 77.9% on Daily-Omni, while TAC→Gemini 3 Pro reaches 71.9% on MMAR, 72.4% on MMSU, and 62.9% on MMAU-Pro (Kumar et al., 17 Feb 2026). The central claim is that dense, temporally grounded text acts as a semantic bridge for stronger downstream reasoners.

Audio-visual dense captioning has also adopted script-like structural outputs. TimeChat-Captioner introduces Omni Dense Captioning with a six-dimensional schema—Events, Background, Camera, ShotEdit, Dialogue, and Acoustic—plus the SodaM metric, which aligns predictions and references through dynamic programming on temporal IoU before scoring caption coverage. On OmniDCBench, TimeChat-Captioner-7B-GRPO reports F1 61.2, mIoU 69.6, and SodaM 35.0, surpassing Gemini-2.5-Pro on SodaM (Yao et al., 9 Feb 2026). TCA-Captioner focuses on temporal and cross-modal alignment for audiovisual video captioning, using the Observer-Checker-Corrector framework and TCA-Bench; it reports AV Binding total 73.2 and AV Temporal total 76.9, rising to 75.3 and 79.2 after SFT+DPO (Zhao et al., 2 Jul 2026).

Specialized speech and tool-augmented variants push TAC into neighboring tasks. Speaker-Reasoner addresses timestamped speaker-attributed ASR with agentic multi-turn temporal reasoning and a speaker-aware cache, reaching DER 5.26%, CER 13.83%, and cpCER 14.73% on AISHELL4-Eval in its Stage 3 configuration (Lin et al., 3 Apr 2026). Audio-Maestro instead delegates low-level analysis to external tools such as ASR, diarization, VAD/SNR, sound-duration analysis, and chord recognition, then injects structured timestamped JSON into a large audio-LLM; on MMAU-Test, average accuracy rises from 67.4% to 72.1% for Gemini-2.5-flash, from 58.3% to 62.8% for DeSTA-2.5, and from 60.8% to 63.9% for GPT-4o (Lee et al., 13 Oct 2025).

Several limitations remain recurrent across the literature. AF-Next notes that long-audio reasoning remains challenging when evidence is temporally distant, sparse, or distributed across multiple segments, and that low-resource languages and rare events are underrepresented (Ghosh et al., 13 Apr 2026). TCA-Captioner reports persistent binding errors in highly overlapping ambient scenes and temporal mistakes when speech overlaps fast actions (Zhao et al., 2 Jul 2026). MOSS-Audio’s 12.5 Hz encoder implies an 80 ms temporal grid, so sub-80 ms events cannot be localized exactly (Yang et al., 1 Jun 2026). TAC itself identifies sim-to-real gap, domain coverage limits, and privacy risks arising from fine-grained event detection (Kumar et al., 17 Feb 2026). Taken together, these results suggest that future TAC research will continue to revolve around stronger temporal evaluation, richer long-context memory, overlap handling, and explicit control over how timestamps are represented and decoded.