
JEDIS-LLM: Streamable Diarization & ASR

Updated 27 November 2025
  • The paper introduces a unified streaming model for joint ASR and speaker diarization using a Speaker Prompt Cache that ensures consistent speaker identities chunk-by-chunk.
  • The methodology employs an end-to-end Speech-LLM trained on short audio segments with dynamic SPC updates to achieve improvements in cpWER and WDER metrics.
  • The approach effectively resolves the speaker permutation problem in real-time, enabling zero-shot inference on lengthy recordings and outperforming traditional offline pipelines.

Joint Streamable Diarization and ASR (JEDIS-LLM) refers to a unified approach for performing automatic speech recognition (ASR) and speaker diarization in a streaming, low-latency fashion over long audio, leveraging end-to-end speech LLMs (Speech-LLMs) trained on short audio. The key enabling mechanism is the Speaker Prompt Cache (SPC), which allows the model to resolve speaker identities chunk-by-chunk, maintain consistency, and enable zero-shot operation on long recordings. This architecture fundamentally departs from cascaded, offline diarization+ASR pipelines by retaining full online streamability without explicit speaker-permutation alignment steps (Shi et al., 20 Nov 2025).

1. Problem Definition and Motivation

The task of joint ASR and diarization is to answer "who spoke what," which is especially critical in multi-speaker settings such as meetings or conversational analysis. In a streaming context, audio must be processed incrementally as it arrives, precluding traditional approaches that segment speakers globally after the entire recording is available. A core challenge is the speaker permutation problem: when chunks are processed independently, speaker indices may switch arbitrarily between chunks, leading to identity ambiguity (Shi et al., 20 Nov 2025). Legacy solutions perform global clustering or require complex offline reconciliation; both are incompatible with low-latency, online inference.

JEDIS-LLM addresses this by persistently tracking and conditioning on each known speaker through a Speaker Prompt Cache. This mechanism preserves the assignment and ordering of speaker identities even as the system operates over arbitrarily long streaming inputs, despite having been trained only on short segments.

2. Speaker Prompt Cache: Formalism and Operational Mechanics

The Speaker Prompt Cache (SPC) is a persistent, per-session store that records, for every active speaker s, a triplet:

  • C[s].audio: short utterance audio clip (≤ ℓ seconds)
  • C[s].text: corresponding speaker-attributed transcript snippet
  • C[s].dvec: speaker embedding (d-vector), from a speaker-verification model

These entries are indexed by order of appearance ("first-seen order") and are updated on-the-fly during streaming inference (Shi et al., 20 Nov 2025). When processing a new audio chunk, the model input is constructed by concatenating all cached audio and text for speakers s=1,…,S (preserving their established order), followed by the new chunk. As a result, the autoregressive Speech-LLM produces consistent speaker-index assignments across chunks, preventing permutation errors.
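
A minimal sketch of this structure, assuming a plain in-memory Python representation; the class names, the <spkN> tag format, and the numpy d-vector type are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class CacheEntry:
    """Per-speaker triplet held in the Speaker Prompt Cache."""
    audio: np.ndarray  # short utterance clip (<= l seconds of samples)
    text: str          # speaker-attributed transcript snippet
    dvec: np.ndarray   # d-vector from a speaker-verification model


@dataclass
class SpeakerPromptCache:
    """Entries keyed by speaker index; dict insertion order preserves first-seen order."""
    entries: Dict[int, CacheEntry] = field(default_factory=dict)

    def speakers_in_order(self) -> List[int]:
        return list(self.entries.keys())

    def build_prompt(self, base_prompt: str, chunk_audio: np.ndarray):
        """Concatenate cached audio and text for all known speakers, then append the new chunk."""
        order = self.speakers_in_order()
        audio_in = np.concatenate([self.entries[s].audio for s in order] + [chunk_audio])
        text_prompt = base_prompt + " " + " ".join(
            f"<spk{s}> {self.entries[s].text}" for s in order
        )
        return audio_in, text_prompt
```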

Update mechanism: At each chunk, for all speakers detected in the output, the cache may update an entry if:

  • The previously cached text is too short or lacks punctuation, and the new candidate is longer and more complete
  • The d-vector similarity σ = CosineSimilarity(v_new, v_old) exceeds a threshold θ (typically θ=0.7)

Ablation experiments confirm the value of updating: refreshing the cache during inference yields a ∼0.4% absolute improvement in concatenated minimum-permutation word error rate (cpWER) over a static SPC (Shi et al., 20 Nov 2025).
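
A sketch of this update decision, assuming the two criteria above must hold jointly (the paper lists them as separate conditions); the five-word threshold and punctuation heuristic are illustrative assumptions:

```python
import numpy as np


def cosine_similarity(v_new: np.ndarray, v_old: np.ndarray) -> float:
    """sigma = CosineSimilarity(v_new, v_old)."""
    return float(np.dot(v_new, v_old) / (np.linalg.norm(v_new) * np.linalg.norm(v_old)))


def should_update_entry(old_text: str, new_text: str,
                        v_old: np.ndarray, v_new: np.ndarray,
                        theta: float = 0.7) -> bool:
    """Replace a cached snippet only if the candidate is more complete and the
    d-vector check confirms it comes from the same speaker."""
    # Content criterion: cached text is short or unpunctuated, candidate is longer.
    old_incomplete = len(old_text.split()) < 5 or not old_text.rstrip().endswith((".", "?", "!"))
    more_complete = len(new_text.split()) > len(old_text.split())
    # Identity criterion: similarity must exceed the threshold theta (0.7 by default).
    same_speaker = cosine_similarity(v_new, v_old) > theta
    return old_incomplete and more_complete and same_speaker
```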

Initialization and Profiles: SPC can be initialized by pre-enrolling speaker profiles (manually selected utterances and transcripts), supporting scenarios where speakers are known ahead of time. In this case, cache entries remain fixed without update, greatly reducing speaker ID mismatches (cpWER–SA-WER gap decreases to 2.07% with profiles vs. 7.78% without) (Shi et al., 20 Nov 2025).

3. Inference Pipeline and Data Structures

The chunkwise streamable inference pipeline with SPC is as follows:

  1. Chunking: Long-form audio A is segmented into sequential chunks.
  2. Preparation: For chunk aₖ, construct inputs:
    • Audio: concatenation of all C[s].audio (for s in established speaker order) and aₖ
    • Text prompt: concatenation of a base prompt P and all C[s].text (same order)
  3. Recognition: The Speech-LLM ingests (audio, text prompt) and autoregressively outputs the speaker-attributed transcript for aₖ.
  4. Cache Update: For every speaker s appearing in the transcript:
    • Compute word-level alignment to select a representative, non-overlapping audio snippet (≤ ℓ seconds), and its transcript.
    • If s is not in cache, add entry. Else, consider updating as per content length and d-vector similarity.
  5. Loop: Continue for each chunk.

No explicit vector search or clustering is performed at inference; the strict ordering and inclusion mechanism ensures downstream speaker-index consistency. If pre-enrolled profiles are used, the cache remains fixed (Shi et al., 20 Nov 2025).
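
A condensed sketch of this loop, reusing the CacheEntry, build_prompt, and should_update_entry helpers from the Section 2 sketches; speech_llm, embed_speaker, and select_snippet are placeholder callables, not the paper's API:

```python
def stream_infer(chunks, speech_llm, embed_speaker, select_snippet,
                 base_prompt, cache, frozen=False, max_snippet_sec=5.0, theta=0.7):
    """Chunkwise inference conditioned on the Speaker Prompt Cache, with on-the-fly updates.

    speech_llm(audio, text_prompt) is assumed to return a list of
    (speaker_id, words, word_times) segments; select_snippet performs the
    word-level alignment that extracts a representative clip and its transcript.
    """
    transcript = []
    for chunk_audio in chunks:                                    # step 1: chunks arrive sequentially
        audio_in, text_prompt = cache.build_prompt(base_prompt, chunk_audio)  # step 2: preparation
        segments = speech_llm(audio_in, text_prompt)              # step 3: speaker-attributed decoding
        transcript.extend(segments)
        if frozen:                                                # pre-enrolled profiles: no updates
            continue
        for spk_id, words, word_times in segments:                # step 4: cache update
            snip_audio, snip_text = select_snippet(chunk_audio, words, word_times, max_snippet_sec)
            dvec = embed_speaker(snip_audio)
            if spk_id not in cache.entries:
                cache.entries[spk_id] = CacheEntry(snip_audio, snip_text, dvec)
            elif should_update_entry(cache.entries[spk_id].text, snip_text,
                                     cache.entries[spk_id].dvec, dvec, theta):
                cache.entries[spk_id] = CacheEntry(snip_audio, snip_text, dvec)
    return transcript                                             # step 5: loop over the whole stream
```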

4. Training Methodology and Loss Functions

JEDIS-LLM is trained end-to-end on short (<20s) audio segments with two explicit objectives to improve ASR/diarization capability:

  • LLM Token Prediction Loss: Standard cross-entropy over speaker-attributed transcription tokens, conditioned on the speech encoder output, the text prompt embedding, and segment-level speaker labels.
  • Word-Level Speaker Supervision: An auxiliary Spk-Decoder predicts the speaker identity at the word level. Its inputs are the speech encoder's hidden output and a tokenized sequence in which each word is replaced by the corresponding speaker ID; the loss is cross-entropy over speaker IDs.

The total loss is a weighted sum:

L = \mu \cdot L_{\mathrm{LLM}} + (1 - \mu) \cdot L_{\mathrm{Spk}}, \quad \mu = 0.5

Training only on short-form audio while achieving zero-shot streaming inference on long audio is made possible by the SPC mechanism (Shi et al., 20 Nov 2025).
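
A minimal PyTorch-style sketch of this objective; the tensor names and shapes are illustrative assumptions:

```python
import torch.nn.functional as F


def joint_loss(llm_logits, token_targets, spk_logits, spk_targets, mu: float = 0.5):
    """L = mu * L_LLM + (1 - mu) * L_Spk, with mu = 0.5."""
    # Cross-entropy over speaker-attributed transcription tokens.
    l_llm = F.cross_entropy(llm_logits.reshape(-1, llm_logits.size(-1)),
                            token_targets.reshape(-1))
    # Cross-entropy over word-level speaker IDs from the auxiliary Spk-Decoder.
    l_spk = F.cross_entropy(spk_logits.reshape(-1, spk_logits.size(-1)),
                            spk_targets.reshape(-1))
    return mu * l_llm + (1 - mu) * l_spk
```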

5. Empirical Evaluation and Gains from SPC

Key results from (Shi et al., 20 Nov 2025):

| Configuration | WDER (%) | cpWER (%) | SA-WER (%) | cpWER–SA-WER (%) |
|---|---|---|---|---|
| Baseline: offline chunk + clustering | 2.48 | 19.03 | – | – |
| Streaming + SPC (no update) | 2.09 | 18.58 | – | – |
| Streaming + SPC, updates (θ=0.7, ℓ=5s) | 1.73 | 18.20 | 25.98 | 7.78 |
| + Pre-enrolled profiles (no update) | – | 17.91 | 19.98 | 2.07 |

SPC with updates achieves the lowest cpWER and WDER, and pre-enrolled profiles yield a further improvement in speaker-ID alignment. Shorter chunks (≤5 s) hurt cpWER (23.38%), indicating that update granularity is critical. The results support that a properly designed SPC is essential for high-accuracy, low-latency, fully streamable joint ASR and diarization.

6. Relation to Other Speaker Prompt Cache Architectures

SPC is a unifying abstraction appearing in varied speech systems:

  • Streaming Sortformer (Medennikov et al., 24 Jul 2025) introduces the Arrival-Order Speaker Cache (AOSC), dynamically caching high-quality acoustic embeddings per speaker, sorted by arrival order. The mechanism obviates the need for permutation resolution in diarization by prompting the model with arrival-ordered embeddings, closely paralleling the SPC usage in JEDIS-LLM for speaker-index consistency.
  • Visual Speech Recognition (VSR) Prompt Tuning (Kim et al., 2023) uses SPC to store per-speaker adaptation prompts (addition, padding, concatenation) for fast, memory-efficient adaptation, showing strong performance gains over full fine-tuning with <1% memory overhead.
  • Speech Understanding on Tiny Devices (Benazir et al., 2023) implements a two-level SPC (raw sound units, phoneme sequences) for on-device inference, maximizing local resolution rates and minimizing cloud latency.
  • SPC in LLM-based Dialogue and Serving (Gim et al., 2023, Srivatsa et al., 8 May 2024) applies the concept to modular caching and scheduling of prompt segments, including speaker turns, in LLMs and distributed serving backends.

A common pattern is the cache's central role in maintaining identity (acoustic, phonetic, or symbolic) across streaming or session-based interaction, benefiting both accuracy and computational efficiency.

7. Limitations and Future Directions

  • Correlation with Human Judgments: Existing evaluation models for cache utility and coherence (such as UniEval, G-Eval) demonstrate only moderate alignment (∼0.6 Spearman) with human judgments, motivating improved filtering and selection strategies (Atkins et al., 26 Jun 2024).
  • Cache Growth and Management: Unbounded SPC growth necessitates future work on memory-bound eviction (LRU, LFU, frequency-based) and tailored embedding pruning; a minimal eviction sketch follows this list. Hierarchical or radix-tree organization, as in large-scale distributed serving (Preble (Srivatsa et al., 8 May 2024)), suggests a path for scalable SPC management.
  • Granularity and Update Frequency: The effectiveness of SPC depends critically on the segment and cache update schedule. Short chunks impair the cache's representational power due to lower content, while overly long cache entries raise memory and relevance trade-offs (Shi et al., 20 Nov 2025).
  • Extension to Other Modalities: The SPC framework is extensible beyond ASR/diarization to modalities such as visual speech (VSR (Kim et al., 2023)) and text-based dialogue (Prompt Cache (Gim et al., 2023)), and can generalize to multi-party analytics, live translation, or meeting summarization (Medennikov et al., 24 Jul 2025).
  • Distributed Scheduling: Industrial-scale systems benefit from distributed prompt caching and E² scheduling, where each “speaker context” segment (e.g., per-user or per-session prompt) is tracked and mapped efficiently across resources (Srivatsa et al., 8 May 2024).
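
As a sketch of the memory-bound eviction mentioned in the cache-growth item above, assuming a simple LRU policy; the BoundedSpeakerCache class and its capacity are illustrative, not part of JEDIS-LLM:

```python
from collections import OrderedDict


class BoundedSpeakerCache:
    """Speaker Prompt Cache entries under a fixed capacity with LRU eviction."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries: "OrderedDict[int, object]" = OrderedDict()  # speaker id -> cached triplet

    def touch(self, spk_id: int) -> None:
        """Mark a speaker as recently used when it appears in a chunk."""
        if spk_id in self.entries:
            self.entries.move_to_end(spk_id)

    def put(self, spk_id: int, entry: object) -> None:
        """Insert or refresh an entry, evicting the least-recently-used speaker if full."""
        if spk_id in self.entries:
            self.entries.move_to_end(spk_id)
        self.entries[spk_id] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
```

Because LRU bookkeeping reorders entries, a practical variant would track recency separately from the first-seen order that prompt construction relies on.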

A plausible implication is that as Speech-LLMs and LLM-based dialogue systems continue to scale, robust SPC architectures—covering both low-level representations and high-level prompt segments—will become central to efficient, real-time multi-speaker AI.

