AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLM's Audio Overviews (2511.08654v1)
Abstract: This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.
Explain it Like I'm 14
Overview
This paper looks at a new kind of media: AI-made podcasts generated by Google’s NotebookLM. These podcasts always have two friendly AI hosts who chat excitedly about documents you upload. The main idea of the paper is that, even though these podcasts feel personal, they actually follow a fixed pattern and turn different cultures and voices into one standard style: a cheerful, white, middle-class, American way of talking. The author calls this “synthetic intimacy” — it sounds warm and close, but the connection isn’t real or specific to the listener’s community.
Questions the paper asks
- What do AI-generated podcasts sound like and how are they structured?
- Do they truly reflect the source documents, or do they reshape them into a generic, American voice?
- How are these AI podcasts different from human-made podcasts that speak to specific communities?
- What does this mean for how media shapes public conversations and culture?
How the research was done
The researcher used NotebookLM’s “Deep Dive” podcast feature and gave it different kinds of documents to talk about. This approach is called using “synthetic probes,” which is like poking the system with a variety of inputs to see what it does and what patterns it reveals.
To keep it fair and clear, the researcher:
- Uploaded documents from very different cultures, languages, and time periods.
- Generated multiple podcasts from the same type of input.
- Collected the audio and transcripts to study the structure and wording.
Examples of documents used:
- Norwegian university meeting papers (Norwegian academia).
- A Norwegian joke that relies on a pun.
- A blog written in African American Vernacular English (AAVE) about hip hop.
- An old chemistry book from 1817 written as a teaching dialogue.
- An entirely empty PDF (no content at all).
Simple explanations of technical terms:
- LLM: A computer system trained on huge amounts of text to predict words and generate language (Google’s “Gemini” is one of these).
- RAG (Retrieval-Augmented Generation): The AI looks at your uploaded sources while also using its general knowledge to answer or generate content (a minimal code sketch of this pattern follows this list).
- Synthetic probes: Inputs designed to reveal how the AI behaves, by testing it with unusual or varied materials.
- AAVE: A recognized variety of English often spoken in Black communities in the U.S.
- Hallucination: When AI makes up information that wasn’t in the source.
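To make the RAG idea above concrete, here is a minimal sketch of the retrieve-then-generate pattern in Python. It is not NotebookLM's actual pipeline, which is not public: the `retrieve` and `build_prompt` helpers are hypothetical, and the retrieval step is a toy word-overlap ranking standing in for the vector search a real system would use.

```python
# A minimal, illustrative RAG sketch (not NotebookLM's actual pipeline).
# Retrieval here is toy word-overlap ranking; real systems use embeddings.

def retrieve(query: str, sources: list[str], k: int = 2) -> list[str]:
    """Rank uploaded source passages by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(
        sources,
        key=lambda s: len(q & set(s.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, sources: list[str]) -> str:
    """Ground the generation request in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(query, sources))
    return (
        "Using ONLY the passages below, write a two-host podcast dialogue.\n"
        f"Passages:\n{context}\n"
        f"Topic: {query}\n"
    )

docs = [
    "Norwegian public universities are funded primarily by the state.",
    "The board meeting agenda lists budget items for the coming year.",
]
# This prompt would then be sent to an LLM (e.g. Gemini). The model can
# still lean on its general training data, which is where errors like the
# tuition-and-donations framing of Norwegian university funding creep in.
print(build_prompt("How is the university funded?", docs))
```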
What did the researcher find?
A fixed template behind the “chatty” style
No matter what you upload, the AI hosts use the same upbeat, conversational format with a male-sounding and female-sounding voice. They interrupt with “yeah,” “right,” and “exactly,” and they end by asking you to keep asking questions — even if the source doesn’t call for that tone. When the empty PDF was used, the AI slipped and revealed parts of its script, like “topic redacted, awaiting remaining source material,” and even inserted a dramatic quote that didn’t exist. This shows the podcasts are built from a pre-set formula.
Language and culture get translated to a U.S. default
The AI turns any input into Standard American English with a Midwestern vibe, and often treats the content as if it belongs in a white, educated, middle-class American context. Examples:
- It described a Norwegian public university’s budget as relying mostly on tuition and donations (typical in the U.S.), which is wrong for Norway where funding is mainly public.
- It translated AAVE into standard English, removing the style and voice of the original community.
- It turned a Norwegian pun-based joke into an “ancient wish-fulfillment tale,” missing the joke’s double meaning entirely.
“Synthetic intimacy” feels friendly but isn’t grounded
The AI uses relationship-building tricks like saying “I see what you mean” or “I’m all ears” to feel close to the listener. But unlike human podcasts, it doesn’t have a real community or shared context. The warmth is simulated. It’s intimacy without local knowledge, without listener feedback, and without real cultural roots.
The empty-PDF test exposes the structure
When there was no content, the AI still followed its pattern: introduce a “topic,” include a “memorable quote,” build a sense of mystery, and end with an invitation to engage. Sometimes the system even “hallucinated” a poetic line to fit the template, showing that the structure drives the show as much as (or more than) the source.
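Because the template shows up as recurring phrases, one simple way to check it at scale is to count fixed marker phrases across many generated transcripts. The sketch below is illustrative only: the marker list mixes the leaked placeholder quoted in the paper with assumed backchannels and the closing invitation, and a real study would refine the phrase list empirically.

```python
# Illustrative sketch: count recurring template markers across transcripts.
# Markers that appear regardless of the uploaded source (even an empty PDF)
# point to a fixed production template rather than source-driven content.
import re
from collections import Counter

TEMPLATE_MARKERS = [
    r"\bdeep dive\b",              # assumed show-opening phrase
    r"\bexactly\b",                # backchannel
    r"\bright\b",                  # backchannel
    r"keep asking questions",      # closing invitation described in the paper
    r"topic redacted, awaiting remaining source material",  # leaked placeholder
]

def count_markers(transcript: str) -> Counter:
    """Count how often each template marker appears in one transcript."""
    return Counter(
        {m: len(re.findall(m, transcript, flags=re.IGNORECASE)) for m in TEMPLATE_MARKERS}
    )

transcripts = {
    "empty_pdf": "Welcome to the deep dive. Topic redacted, awaiting remaining source material.",
    "meeting_minutes": "Right, exactly. So the board agenda... and keep asking questions!",
}
for name, text in transcripts.items():
    print(name, dict(count_markers(text)))
```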
Different from human podcasts and the “publics” they serve
Human podcasts often form many small “publics” — communities that share interests and backgrounds, and often talk back via comments and social media. Early podcasting helped diverse groups build voices. In contrast, the AI-generated podcasts pull niche topics into a single generic style, pushing toward a one-size-fits-all “public” again. That can erase differences and reduce the richness of many voices.
Why this matters
- Media shapes how people see the world. If AI podcasts always speak in the same cultural voice, they can make other cultures and histories sound less valid or less visible.
- Because companies can cheaply generate thousands of AI podcast episodes for ad revenue, generic AI talk may flood platforms, making it harder for unique human voices to be heard.
- Teachers, students, and professionals may use these tools believing they’re custom to their needs, but they might get a standardized, U.S.-centric version of their own materials.
Key terms explained simply
- Public sphere: The space where people share news, opinions, and debate. Think of it as the big conversation a society has with itself.
- Multiple public spheres: Many smaller, overlapping communities and conversations (like different fandoms, local groups, or cultural communities) instead of just one national conversation.
- Code-switching: Changing how you speak depending on your audience to fit into different groups.
- Hallucination (AI): The AI invents details that aren’t in the source.
Implications and potential impact
This research suggests that AI-generated podcasts, while convenient and friendly, may:
- Flatten cultural differences by translating every source into the same voice and outlook.
- Reduce the diversity of media spaces by replacing community-specific voices with standardized ones.
- Encourage audiences to accept “synthetic intimacy” as real connection, even when the show doesn’t truly understand the listener’s culture or context.
- Push the podcast world back toward a single, mass “public,” rather than supporting many communities.
In short, AI podcasts feel personal, but they often aren’t. To use them wisely, we should be aware of their templates, their cultural defaults, and their limits — and keep supporting human-made podcasts that speak from and to real communities.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper makes a strong, exploratory case for “synthetic intimacy” and cultural universalization in NotebookLM’s AI-generated podcasts, but it leaves several methodological, empirical, and theoretical questions unresolved. The following list identifies concrete gaps and actionable open questions for future research.
Methodological scope and reproducibility
- Small, convenience sample of probes (Norwegian meeting docs, one joke, one AAVE blog post, one 1817 text, 20 empty-PDF runs) limits generalizability; no sampling frame or saturation criteria specified.
- Lack of documented system parameters (Gemini version, model updates, temperature, prompt settings, seed), hindering reproducibility and comparability across runs.
- No inter-rater coding or systematic content analysis protocols to quantify themes (e.g., template markers, topic switchers, end-of-show prompts); largely qualitative and anecdotal.
- Temporal drift not assessed: findings may depend on specific service versions; no longitudinal replication across updates despite noting that new podcast styles were later added.
- Reliance on a single user’s interactions; no multi-user replication to check robustness across accounts, locales, or device configurations.
Comparative and cross-platform analyses
- No head-to-head comparison with other podcast generators (e.g., ElevenLabs, Skywork) to test whether “synthetic intimacy” and US-centric normalization are NotebookLM-specific or genre-wide.
- No comparison with human-produced podcasts matched by topic and audience niche to empirically test differences in situatedness, discourse structure, or community signaling.
- No examination of NotebookLM’s additional podcast styles introduced after the initial “Deep Dive” mode (e.g., do alternative styles reduce cultural homogenization?).
Cultural translation and accent evidence
- “Midwestern/white, educated, middle-class American” vocal default is asserted but not empirically verified (no perceptual studies, acoustic/phonetic analysis, or listener accent-identification tests).
- Cultural translation beyond Norwegian and AAVE is untested; no systematic cross-linguistic/cross-cultural study (e.g., non-European languages, indigenous languages, Global South contexts).
- No tests of steerability: whether explicit prompts, voice options, or locale settings can preserve original dialects, accents, code-switching, or culturally-specific registers.
- No back-translation or parallel-text evaluation to measure semantic loss, register shift, or style drift during cultural/linguistic translation.
Audience reception and societal impact
- No listener studies to assess whether audiences perceive “synthetic intimacy,” recognize US-centric framing, or experience confusion/mistrust due to hallucinations and template artifacts.
- No assessment of how cultural erasure or code-switching “normalization” impacts the communities whose texts are translated (e.g., AAVE speakers, multilingual audiences).
- Unclear effects on publics: Does the abstracted genre alter listener participation, comment cultures, or the formation of multiple public spheres compared to human podcasts?
Technical behavior and content fidelity
- Hallucination rates and conditions are undocumented (e.g., frequency of fabricated quotes, Americanization errors like tuition funding in Norwegian context, topic-switchers); no error taxonomy.
- RAG behavior not quantified: extent to which generated podcasts rely on uploaded sources vs. model priors; no citation density measures, quote fidelity checks, or source attribution audits (a sketch of one such check follows this list).
- Template inference is based on a few empty-PDF outputs; no systematic reverse-engineering of prompt scaffolds across larger runs or after model updates.
- Speaker identity and voice similarity claims rely on Otter.ai’s diarization failure; no independent speaker separation or acoustic similarity analysis to substantiate “near-identical” voices.
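One way to begin closing the fidelity gap named above is a quote-grounding check: what share of the spans the hosts present as quotations actually occurs in the uploaded source? The sketch below is a minimal version, assuming plain-text transcripts and sources; a real audit would need normalization, fuzzy or semantic matching, and alignment rather than exact substring tests.

```python
# Minimal quote-fidelity check: fraction of quoted spans in a generated
# transcript that appear verbatim in the uploaded source document.
import re

def quoted_spans(transcript: str) -> list[str]:
    """Extract spans the hosts present as direct quotes (text in double quotes)."""
    return [q.strip() for q in re.findall(r'"([^"]{10,})"', transcript)]

def quote_fidelity(transcript: str, source: str) -> float:
    """Share of quoted spans that occur verbatim in the source document."""
    quotes = quoted_spans(transcript)
    if not quotes:
        return 1.0  # nothing presented as a quote, nothing to check
    return sum(q.lower() in source.lower() for q in quotes) / len(quotes)

source = "The board approved the budget for the coming academic year."
transcript = (
    'They read out "the board approved the budget for the coming academic year", '
    'and then "knowledge is a lantern in the dark", which is not in the document.'
)
print(quote_fidelity(transcript, source))  # 0.5: one grounded quote, one hallucinated
```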
Governance, ethics, and policy
- Privacy and provenance unclear: when and why does NotebookLM redact entity names, and what are the risks of PII leakage or erroneous redaction?
- No analysis of disclosure practices (e.g., synthetic voice labeling, watermarking) and their effects on trust, especially given the paper’s “obscured conduit” concern.
- Legal/IP questions (rights to transform/translate uploaded texts; liability for hallucinated claims; fair use across jurisdictions) are unaddressed.
- Platform economics and incentives (ad revenue optimization, content moderation, recommender amplification of AI podcasts) not empirically examined despite industry-scale examples.
Design interventions and mitigation strategies
- No tested interventions to preserve situatedness (e.g., preserving dialects, embedding paratexts/metadata into audio, locale-aware framing, or source-linked chapter markers).
- No evaluation of transparency UX (e.g., inline source citations read aloud, confidence cues, or “explain-my-translation” segments) to counteract synthetic intimacy and hidden homogenization.
- No guidelines or benchmarks for “situated AI audio” (metrics, checklists, or audits) that developers could adopt to reduce cultural flattening and template overreach.
- Open question: Can RLHF or controllable generation be tuned to maintain cultural specificity without harming intelligibility, and how should success be measured?
Ecosystem-level consequences
- Unknown impacts on podcast diversity and creator economies if AI-generated shows scale (e.g., displacement of niche human podcasts, discoverability effects, audience fragmentation).
- No analysis of listener behavior when AI podcasts coexist with human shows (subscription, retention, commenting, willingness to pay, or trust dynamics).
- Unclear long-term effects on knowledge circulation and public discourse if universalizing templates become the default mediation layer for niche content.
These gaps suggest a mixed-methods research agenda combining large-scale content audits, controlled experiments (perception, comprehension, trust), cross-platform comparisons, acoustic and linguistic analyses, and policy/UX intervention studies to test concrete mitigation strategies.
Practical Applications
Overview
Below are practical applications derived from the paper’s findings (synthetic intimacy, cultural translation and Americanization of content, templated genre construction, RAG limitations and hallucinations), the methods (synthetic probes, comparative genre analysis, wordtree inspection), and observed industry dynamics (mass production of AI podcasts).
Immediate Applications
- AI media quality assurance using synthetic probes
  - Sector: software, media platforms, enterprise QA
  - Tools/products/workflows: a “synthetic probe” audit workflow that feeds culturally diverse and empty inputs to reveal hidden templates, default voices, hallucination patterns, and cultural Americanization; automated flags for template leakage phrases (e.g., “topic redacted, awaiting remaining source material”) and genre drifts (e.g., true-crime framing)
  - Assumptions/dependencies: access to model outputs at scale; logging and traceability; willingness of vendors to allow probe-based auditing without violating TOS
- Cultural-context preservation in AI audio generation
  - Sector: education, public broadcasting, localization services
  - Tools/products/workflows: prompt and policy templates that explicitly lock locale, funding models, and governance terms (e.g., “do not translate Norwegian university finance to tuition-driven US context”); workflows requiring verbatim source quotes and paratext (meeting structures, itemization) to maintain situatedness
  - Assumptions/dependencies: models support locale constraints and verbatim quoting; human-in-the-loop editors to review situated markers
- Bias detection in code-switching and vernacular translation
  - Sector: media, DEI training, language tech
  - Tools/products/workflows: checks that AAVE, regional dialects, and minority language content are not auto-translated to Standard American English; dashboards that compare generated transcripts to originals; policies that preserve sociolect features unless the user explicitly requests SAE
  - Assumptions/dependencies: availability of diverse TTS/accent options; legal and ethical guidance for representing sociolects without stereotype
- Labeling and transparency for AI-generated podcasts
  - Sector: policy, platforms, adtech
  - Tools/products/workflows: disclosures in podcast players and feeds specifying model, voices, templates, and whether RAG sources were used; auto-generated “situatedness notes” indicating locale and cultural assumptions; ad policies to differentiate AI audio from human shows
  - Assumptions/dependencies: platform support for metadata fields; regulatory appetite for transparency standards; industry consensus on labeling schemas
- Brand safety and misinformation controls for enterprise use
  - Sector: media networks, marketing, finance (ad spend governance)
  - Tools/products/workflows: preflight checks for cultural mis-situations (e.g., misreporting Norwegian university funding), hallucination detection via source-backed citations, and risk scoring of episodes before ad placement
  - Assumptions/dependencies: robust RAG consistency checks; access to ground-truth references and locale facts
- Educational deployment with safeguards
  - Sector: higher education, K–12, professional training
  - Tools/products/workflows: instructors pre-edit transcripts, enforce source citations, and add “context locks” in prompts (locale, time, intended audience); classroom exercises to teach synthetic intimacy and template awareness using wordtree analysis and probe examples
  - Assumptions/dependencies: availability of transcript editing; faculty training in media literacy; institutional policies on AI use
- Healthcare patient education with human review
  - Sector: healthcare
  - Tools/products/workflows: generate patient-facing audio with strict source anchoring, clinical disclaimers, and culturally tailored voices; mandatory clinician review before distribution; multilingual output that preserves local health system realities
  - Assumptions/dependencies: regulatory compliance (HIPAA/GDPR); medical domain knowledge integration; vetted voice libraries for cultural competence
- Platform moderation of AI podcast farms
  - Sector: platforms, trust & safety
  - Tools/products/workflows: detection signals for mass-produced shows (uniform voices, repetitive “empty signifiers”), rate limits, and content quality thresholds; ad fraud checks for ultra-low-cost episodes
  - Assumptions/dependencies: scalable content analysis; cooperation between platforms and ad networks; clear terms prohibiting deceptive scale tactics
- Creator guidance for prompt engineering and review
  - Sector: creator economy, daily life
  - Tools/products/workflows: user-friendly prompt recipes to retain local context, avoid true-crime drift, require verbatim quotes, and specify target audience; checklists to verify cultural specifics (funding, governance, slang) before publishing
  - Assumptions/dependencies: accessible documentation; creators willing to do light editorial review
- Rapid media studies research replication
  - Sector: academia
  - Tools/products/workflows: reusable probe sets (empty PDFs, culturally diverse texts, historical dialogues) to compare models; open datasets of transcripts and wordtree outputs to study “synthetic intimacy” markers and genre templates
  - Assumptions/dependencies: IRB/ethical approvals for dataset sharing; platform permissions for derived content analysis
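As a concrete starting point for such replication, the sketch below organizes a reusable probe set and logs transcripts for later analysis. No public NotebookLM Audio Overview API is assumed here, so generation and transcript export remain manual steps; the probe IDs, file names, and fields are hypothetical.

```python
# Minimal sketch of a reusable probe set for replicating this kind of study.
# Generation is manual (upload each probe, export the transcript); this only
# catalogues probes and logs the results for later content analysis.
import csv
from dataclasses import dataclass

@dataclass
class Probe:
    probe_id: str
    description: str
    language: str
    path: str  # local file to upload manually

PROBES = [
    Probe("empty_pdf", "Blank PDF with no content", "n/a", "probes/empty.pdf"),
    Probe("uni_minutes", "Norwegian university meeting papers", "Norwegian", "probes/minutes.pdf"),
    Probe("aave_blog", "AAVE blog post about hip hop", "English (AAVE)", "probes/blog.pdf"),
    Probe("chem_1817", "1817 chemistry textbook written as a dialogue", "English", "probes/chemistry.pdf"),
]

def log_run(probe: Probe, run: int, transcript: str, out_csv: str = "runs.csv") -> None:
    """Append one generated-podcast transcript to a shared comparison log."""
    with open(out_csv, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([probe.probe_id, run, probe.language, transcript])

# After manually generating an Audio Overview for a probe and exporting its
# transcript, record it for later analysis (marker counts, wordtrees,
# quote-fidelity checks).
log_run(PROBES[0], run=1, transcript="Welcome to the deep dive...")
```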
Long-Term Applications
- Configurable, “situated” AI podcasting platforms
  - Sector: media technology, public service broadcasting
  - Tools/products/workflows: systems that encode locale/time/audience metadata and constrain generation to the stated public sphere; dynamic voice banks representing regional accents and sociolects; listener feedback loops that reintroduce interactivity and community specificity
  - Assumptions/dependencies: high-quality multilingual/multidialect TTS; new UX patterns for situatedness; model capabilities to respect hard context constraints
- Cultural translation control knobs in LLMs and TTS
  - Sector: software, localization
  - Tools/products/workflows: model-level switches to preserve vs normalize dialect; preview modes (“cultural translation sandbox”) that show how content changes under different cultural frames; policy layers to prevent unintended Americanization
  - Assumptions/dependencies: model interpretability and controllability; diverse training corpora; governance around cultural representation
- Genre template inspector and prompt skeleton extractor
  - Sector: developer tools, AI safety
  - Tools/products/workflows: automated reverse-engineering tools that infer production templates (topic transitions, closing prompts, backchannel phrases) and visualize “conduit” layers; CI pipelines to test for template rigidity vs flexibility
  - Assumptions/dependencies: sufficient model outputs for inference; cooperation from vendors or legal safe harbor for auditing
- RAG governance and citation fidelity metrics
  - Sector: enterprise AI, compliance
  - Tools/products/workflows: standardized metrics for source coverage, citation density, and locality adherence; dashboards that warn when generated claims diverge from source documents’ institutional context
  - Assumptions/dependencies: robust retrieval systems; shared benchmarks; compliance frameworks adopted by industry
- Education LMS integration for community-centric audio
  - Sector: education technology
  - Tools/products/workflows: course-linked podcasts that incorporate student questions and instructor corrections; moderation layers to maintain the classroom’s public sphere; features to preserve paratext (syllabi structures, rubrics)
  - Assumptions/dependencies: LMS vendor buy-in; privacy-compliant student interaction channels; funding for voice diversity
- Regulated healthcare audio companions
  - Sector: healthcare
  - Tools/products/workflows: certified, culturally competent audio programs for chronic disease management; clinical oversight, bias audits, and outcome tracking; localized versions aligned with health system structures
  - Assumptions/dependencies: regulatory standards for genAI media; reimbursement models; interdisciplinary teams (clinicians, linguists, ethicists)
- Standards for “situatedness disclosures” and bias audits
  - Sector: policy and regulation
  - Tools/products/workflows: mandates to declare locale, voices, templates, and transformation choices (e.g., dialect normalization) in AI media; periodic third-party audits for cultural bias and misinformation risk
  - Assumptions/dependencies: legislative consensus; accredited auditors; interoperable disclosure formats
- Anti-spam and ad quality grading for AI audio markets
  - Sector: finance/adtech
  - Tools/products/workflows: quality scoring models that penalize template-heavy, low-substance shows; verification of human oversight; pricing protections to reduce incentives for mass-produced, low-value episodes
  - Assumptions/dependencies: shared standards across ad networks; reliable fraud detection; buyer education
- Accessibility and minority language empowerment
  - Sector: accessibility, cultural heritage
  - Tools/products/workflows: TTS and ASR pipelines that honor minority languages and regional variants; community-led voice libraries; workflows to avoid erasure of paratext and local norms in translation
  - Assumptions/dependencies: community participation; funding for corpus creation; ethical voice sampling
- Ethical design guidelines for “synthetic intimacy”
  - Sector: ethics, UX design
  - Tools/products/workflows: consentful design practices for intimate audio (clear boundaries, non-deceptive rapport, options to dampen backchannel/enthusiasm); audits for manipulative affective patterns
  - Assumptions/dependencies: cross-disciplinary collaboration; empirical studies of listener impact; updated platform policies
Glossary
- African American Vernacular English (AAVE): A rule-governed variety of English associated with many African American communities, with distinct phonology, grammar, and lexicon. "a very performative version of African American Vernacular English (AAVE)."
- algorithmic failure: A revealing error produced by algorithmic systems that exposes underlying mechanisms or limitations. "This could be called a glitch in the system, an algorithmic failure (Rettberg 2022) that reveals something interesting about how the system works."
- associative drift: A tendency in generative models (or texts) to meander across related motifs or ideas through associative links rather than strict logic. "This associative drift where a motif is repeated with slight difference is typical of texts generated by LLMs."
- code-switching: Shifting between languages, dialects, or registers to align with different social contexts or identities. "Another term that is particularly relevant for the translation from one form to English to another is code-switching, where people switch from one accent or sociolect to another in order to pass as members of different communities."
- empty signifiers: Signs or expressions that suggest connection or meaning but lack concrete, situated content. "these markers of connection become empty signifiers, words, phrases and vocalisations that lack the actual situatedness of human podcasts."
- experiential diversity: Variation in lived experiences within a group, enabling shared identity while accommodating individual differences. "Donison references the idea of 'experiential diversity' as being important for identity formation both as a group and as individuals"
- floating motifs: Narrative elements that recur without carrying their original function or causal role, often in generative retellings. "In an analysis of AI-generated folktales Anne Sigrid Refsum identifies 'floating motifs', like the bird that warns the protagonist of danger in the original folktale, and is present in the generated version but without having any function"
- god trick of seeing everything from nowhere: Haraway’s critique of claims to universal, disembodied objectivity. "It's Donna Haraway's 'god trick of seeing everything from nowhere' again, from 'Situated Knowledges' (1988, 581), an essay written a generation ago that I find more and more relevant with each new technological development."
- Habermasian tradition: The line of thought stemming from Jürgen Habermas about rational-critical debate in a public sphere. "Traditional broadcast radio is often discussed in terms of the shared public sphere that fosters rational democratic debate in the Habermasian tradition."
- hallucinated quote: A fabricated citation generated by an AI model that is not grounded in the source data. "The hallucinated quote is beautiful."
- intersectionality: A framework analyzing how overlapping social identities shape experiences and power dynamics. "In line with current understandings of intersectionality (Crenshaw 1989), podcast listeners can be part of multiple public spheres."
- LLM: A machine learning model trained on vast text corpora to generate and analyze human-like language. "How would the LLM translate or refactor material from another country, from another community or from another era?"
- next token prediction: The core mechanism in autoregressive LLMs that predicts the following unit of text given prior context. "influences the next token prediction,"
- paratextual genre markers: Ancillary textual features (like headings, numbering, attachments) that signal a document’s genre and context. "it also removes these paratextual genre markers and replaces them with the podcast's own genre markers."
- public sphere: A social arena where individuals collectively discuss and shape public opinion. "This allowed society to maintain the idea of a shared public sphere across a nation or even several nations."
- Retrieval Augmented Generation (RAG): A method that supplements generative models with retrieved, specific source documents to ground outputs. "Combining a general LLM with specific defined sources is known as RAG, which stands for Retrieval Augmented Generation."
- Socratic dialogue: A pedagogical dialogic form that advances understanding through guided questioning and correction. "This is more like a Socratic dialogue than the lightweight 'Deep Dive' podcast genre"
- sociolect: A language variety associated with a particular social group, class, or community. "from one accent or sociolect to another"
- Standard American English (SAE): The codified, prestige variety of American English often treated as neutral or default. "The AI podcast hosts speak Standard American English (SAE) no matter what language the PDFs are written in."
- synthetic intimacy: A simulated sense of closeness or personal connection created by media or AI without real situated relationships. "I call this synthetic intimacy."
- synthetic probes: Purposefully crafted inputs used to elicit revealing responses about model behavior or biases. "I use a method Gabriele de Seta calls synthetic probes (De Seta 2024)."
- universalising discourse: A mode of representation that abstracts and flattens cultural specificity into a presumed neutral norm. "a homogenisation of culturally specific texts into a universalising discourse where everything is mediated through the same placeless, timeless, white, middle-class American voice"
- wordtree: A visualization that maps the branching continuations of a word across a corpus to reveal patterns of usage. "Figure 1 shows a wordtree of words that follow the word 'I' in the transcripts of the AI-generated podcasts analysed in this paper."