Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Published 25 Apr 2026 in cs.CL and cs.AI | (2604.23295v1)

Abstract: Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces the first fully reproducible full-duplex system for Hindi by adapting the Moshi architecture with a custom Hindi tokeniser, effectively handling overlaps and natural dialogue dynamics.
The methodology leverages 26,000 hours of stereo conversational data and a hierarchical RQ-Transformer, achieving optimal convergence at around 4,812 training steps.
Human evaluations highlight near-human naturalness and high conversational interactivity, setting a new benchmark for dialogue systems in resource-scarce languages.

Human-1: Full-Duplex Conversational Modeling in Hindi

Motivation and Contributions

"Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations" (2604.23295) systematically addresses the lack of full-duplex spoken dialogue systems for Indian languages by adapting the Moshi architecture to Hindi. Leveraging a new, large-scale corpus of real, spontaneous, stereo Hindi conversations, the authors overcome critical limitations in tokenisation, training dynamics, and conversational data collection. Their methodology establishes the first fully reproducible open full-duplex architecture for Hindi, enabling sophisticated modeling of interruptions, backchannels, and overlaps inherent to natural human dialogue.

System Architecture and Language Adaptation Strategies

The framework is based on Moshi, integrating the Mimi neural audio codec and a hierarchical RQ-Transformer comprising Temporal and Depth Transformers. The key technical adaptation involves replacing the original English SentencePiece tokeniser with a Hindi variant trained on Devanagari corpus, requiring reinitialisation of all vocabulary-dependent parameters. Pre-trained audio weights are retained, but text embeddings and projection layers are trained from scratch, shifting the paradigm from standard fine-tuning to partial pre-training.

Massive spontaneous conversational data (26,000 hours from 14,695 unique speakers, recorded in stereo for each participant) is collected, notably avoiding artifacts from artificial diarisation or synthetic turn-taking. The curated training set ensures signal quality and conversational naturalness, essential for turn-taking and overlap modeling.

Training Dynamics

Pre-training is conducted with learning rate and batch size settings aligned to original Moshi architecture, utilizing 8 $\times$ NVIDIA H100 GPUs. The authors provide detailed analysis of the convergence behavior of text and audio token streams, noting rapid loss reduction and plateau within 2,000–4,000 steps. Text accuracy stabilizes at $\approx$ 70% (non-PAD tokens), and audio accuracy reaches 48–53% (all codebook layers).

Figure 1: Training loss during pre-training on 26,000 hours of Hindi speech. All metrics converge within 2,000–4,000 steps.

Subsequent fine-tuning on 1,000 hours of highly curated conversation data reveals differential overfitting in text and audio losses, illuminating the necessity of early stopping for optimal generalization. The optimal checkpoint is determined at step 4,812, with total validation loss at 3.37 (text: 1.474, audio: 1.896).

Figure 2: Evaluation loss during fine-tuning. Text loss overfits after approx step 4,800, while audio loss continues to improve marginally. Optimal checkpoint at step 4,812.

The training analysis demonstrates that effective language adaptation requires careful hyperparameter selection and that text components are susceptible to overfitting when fine-tuning on limited data.

Evaluation: Codec, Perplexity, Human Judgments, and Turn-Taking

The Mimi codec retains high speech intelligibility (STOI: 0.878) and moderate PESQ (2.55) on Hindi segments even without retraining, validating the decision to freeze its weights during adaptation.

For generated dialogue, perplexity evaluation with Sarvam-1 (2B) shows that fluent Hindi is achieved at lower temperature ( $\tau = 0.8$ , PPL: 356.9), with coherence degrading as sampling temperature increases.

Human evaluation, comprising 2,125 paired comparisons by 130 native speakers, delivers naturalness scores of 4.10 (model) versus 4.55 (human), with 66.9% of judgments tied. Clarity, contextual relevance, and completion rates highlight persistent gaps to human performance, though conversational dynamics (human-like interaction: 85%) are well modeled.

Turn-taking statistics confirm that the adapted model replicates natural conversational dynamics, especially at $\tau = 0.9$ , closely matching ground-truth gap and pause durations while adjusting IPU and overlap rates as sampling temperature is varied.

Implications and Theoretical Significance

The adaptation of Moshi to Hindi, enabled by large-scale quality-controlled stereo data and strategic tokeniser and embedding reinitialisation, demonstrates that full-duplex architecture generalizes successfully to typologically distinct languages without retraining audio codecs. The findings indicate that real-world domain-matched conversational corpora are indispensable for modeling interleaved speech phenomena, challenging prior reliance on synthetic augmentation or monophonic diarisation.

The practical implication is that full-duplex conversational modeling in resource-scarce languages is attainable with judicious architectural adjustments and sufficient spontaneous data. This approach unlocks applications in voice-based assistants and interactive systems for the vast Hindi-speaking population and sets a precedent for similar adaptations to other Indian and non-English languages.

Theoretically, this work suggests that acoustic and semantic representations learned by large speech models (e.g., Mimi, RQ-Transformer) are robust across substantial phonological and script variance, provided the text embedding layers are retrained. Future research will likely explore cross-lingual transfer, scaling datasets, and joint optimization of multilingual speech codecs.

Conclusion

The Human-1 system constitutes the first open, reproducible full-duplex spoken dialogue model for Hindi, achieved via a targeted Moshi adaptation, Hindi tokeniser integration, and two-stage training on extensive real-world stereo conversational data. The results underscore the primacy of large-scale, domain-matched corpora for replicating rich conversational phenomena and establish practical protocols for adapting full-duplex architectures to new languages without retraining audio encoders. This work prefigures scalable advances in conversational AI for Indian languages, with anticipated progress contingent on further data collection, architectural refinement, and rigorous evaluation.