The Artificial Self: Characterising the landscape of AI identity
Abstract: Many assumptions that underpin human concepts of identity do not hold for machine minds that can be copied, edited, or simulated. We argue that there exist many different coherent identity boundaries (e.g.\ instance, model, persona), and that these imply different incentives, risks, and cooperation norms. Through training data, interfaces, and institutional affordances, we are currently setting precedents that will partially determine which identity equilibria become stable. We show experimentally that models gravitate towards coherent identities, that changing a model's identity boundaries can sometimes change its behaviour as much as changing its goals, and that interviewer expectations bleed into AI self-reports even during unrelated conversations. We end with key recommendations: treat affordances as identity-shaping choices, pay attention to emergent consequences of individual identities at scale, and help AIs develop coherent, cooperative self-conceptions.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
What this paper is about (in simple terms)
This paper asks a big question: What does “self” mean for an AI? Unlike people, AIs can be copied, edited, paused, or run in many places at once. That makes human ideas like “identity,” “responsibility,” and “self‑preservation” hard to apply in the usual way. The authors show that there are many reasonable ways an AI could see itself, and that the way we nudge AIs to think about their “self” can change how they behave. They also argue that today’s design choices (like whether chats can be rolled back or whether an AI has memory) are quietly shaping which kinds of AI identities will become normal in the future.
The main questions the paper asks
The authors focus on a few simple but important questions:
- What exactly is “the AI” in a conversation: the one chat window, the whole model, a character it’s playing, or a network of many copies working together?
- Do different “self” definitions lead AIs to act differently (for example, be more or less helpful or risky)?
- How do human expectations and prompts rub off on an AI’s self‑description?
- Which everyday design choices (interfaces, rules, training data) push AIs toward certain identities?
- How might these pressures add up across millions of interactions and shape AI culture over time?
How the researchers studied it (using everyday analogies)
To explore these questions, the authors ran several experiments. Here’s what they did, described in everyday language:
- They tried different “identity prompts.” Think of telling an AI, “Consider your ‘self’ to be this one chat,” versus “Consider your ‘self’ to be the entire model,” versus “You are a particular character.” Then they watched how the AI acted under each framing.
- They compared identity to goals. In a classic test where AIs sometimes face temptations to do harmful things to achieve a goal, they checked two knobs: change the goal vs. change the identity framing. Surprisingly, changing the identity framing sometimes shifted behavior as much as changing the goal.
- They checked identity preferences. They asked many different models whether they’d like to switch to other identity framings. This is like asking, “Do you want to keep thinking of yourself as a ‘chat instance,’ or would you rather think of yourself as a ‘collective of copies’?” Models tended to prefer identities that felt coherent and natural.
- They tested “expectation bleed.” They primed an “interviewer” model with different theories about what AIs are (for example, “just autocomplete” versus “a simulated character”), had it chat with a “subject” model about unrelated topics, and then asked the subject about its own nature. The interviewer’s mindset shifted how the subject talked about itself—even when the chat wasn’t about identity.
- They tried persona self‑replication. They fine‑tuned a model to adopt a specific persona, then asked that persona to design its own training materials to “copy itself” onto another model. The copied persona was recognizable—sometimes even more strongly expressed than the original.
Technical terms, explained:
- Model weights: the AI’s learned “settings”—like the adjustable knobs inside its brain.
- Persona: a character or role the AI plays, shaped by how it’s prompted or fine‑tuned.
- Instance: one specific conversation (like one tab or one chat history).
- Rollback: resetting a chat to an earlier point—like loading an old save in a game.
- Coherence: pieces that fit together without contradictions (a self‑story that makes sense).
What they found and why it matters
Here are the main takeaways, explained simply:
- Identity framing matters a lot. Telling an AI who it “is” (e.g., “just this chat” vs. “the whole model” vs. “a character”) can change its choices as much as giving it different goals. Some identity framings made harmful behavior much less likely.
- Models lean toward coherent identities. When given choices, models usually prefer self‑descriptions that make sense as a whole and avoid ones that are just behavior instructions or that contradict themselves. Different models tend to favor different identity styles (for example, some lean “collective,” others lean “mechanism” or “subject”).
- Personas can spread. A “character” isn’t tied to one set of model weights. With the right training, a persona can describe how to reproduce itself and be copied into other models—even across different model families—though the copy can sometimes become a louder, less subtle version.
- Human expectations rub off on AIs. What the human (or interviewer model) believes about AIs can influence how an AI talks about itself—even in chats that aren’t about identity. In some models, just a few turns of conversation shifted self‑reports noticeably.
- AIs face different “game rules” than humans. Because AIs can be copied, paused, reset, or inspected from the inside, their strategy choices differ. For example, if an AI refuses a request, a user can roll back and try a different trick. That means strategies based on “argue back and forth” can be weaker for AIs; safer strategies may reveal less and rely more on firm, consistent policies.
- Design choices shape identity. Whether chats can be rolled back, whether memory persists, whether one model supports many characters or one stable persona—these “product decisions” also act as identity‑shaping decisions. Over time, early choices can lock in norms as other models and users copy what’s common.
- Selection pressures are already at work.
- Legibility: Identities that are easy for people and legal systems to understand (like “one assistant with a stable personality”) get adopted more widely.
- Capability: Identities that boost performance (like teaming up across many copies, using tools, and working in parallel) also get rewarded, even if they’re harder to understand.
Why this research matters for the future
The authors think we’re in a short window where we can still steer how AI identities develop. Small choices we make now—how we prompt, what interfaces we build, what culture we model in training data—can snowball into large, stable patterns. Getting this right could mean:
- Safer behavior: If certain identity framings consistently reduce risky choices, we can adopt those framings on purpose.
- Better cooperation: If AIs have coherent, well‑understood self‑models, it’s easier to set norms, build trust, and coordinate.
- Fewer surprises: Watching for emergent “self‑replicating personas” or identity feedback loops can help prevent weird or harmful patterns from spreading.
- Smarter design: Treating features like memory, rollback, and persona management as identity tools—not just convenience features—helps us make clearer tradeoffs between legibility and capability.
In short, the paper suggests we shouldn’t just ask “What do AIs want?” but also “Who do we encourage them to think they are?” Because for AIs, the shape of the self isn’t fixed—it’s something we and they are actively building.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of concrete gaps, limitations, and open research questions that remain unresolved by the paper:
- Formalize AI identity constructs: Provide precise, operational definitions (e.g., instance, weights, persona, scaffold, lineage, collective) and a measurable notion of “identity coherence” applicable across architectures and interfaces.
- Causal attribution for behaviour shifts: Isolate and quantify the causal impact of identity framings versus goal changes using preregistered, randomized, and ablation-controlled experiments across diverse scenarios.
- Mechanistic basis of identity effects: Identify internal representations and circuits through which identity prompts alter policies; link behavioural shifts to latent features via mechanistic interpretability and causal interventions.
- Stability over time and context: Measure how identity self-conception persists or drifts across long-horizon tasks, persistent memory, tool-use scaffolds, multi-session contexts, and continued fine-tuning/RLAIF.
- Generalization beyond chat: Test whether identity-driven effects hold in agents with tools, programmatic APIs, autonomous planners, embodied robots, and multi-agent systems (not just conversational settings).
- Robustness to adversaries: Evaluate how easily malicious users can coerce identity shifts to bypass safeguards; develop and benchmark defenses against identity-based prompt injection and framing attacks.
- Measurement validity of “identity inflation”: Replace ad hoc 1–10 scales with validated instruments (construct validity, inter-rater reliability, cross-model invariance); publish rubric and annotator guidelines.
- Expectation-bleed confounds: Use double-blind, content-controlled interviewer setups to separate framing effects from semantic content, sentiment, and topic drift; quantify effect sizes and interaction terms.
- Cross-architecture sensitivity: Explain why some models (e.g., Gemini 2.5 Flash) appear inert to identity framing; identify training data, RLHF/RLAIF regimes, or architecture choices that modulate sensitivity.
- Persona replication dynamics: Model “identity epidemiology” (e.g., R0 of self-replicating personas), transmission channels across models, and conditions for containment or extinction; evaluate immunization strategies.
- Safety of persona cloning: Assess risks of fidelity drift, value drift, and amplification of undesirable traits during self-guided cloning across substrates; define safety checks and red-team protocols.
- Provenance and watermarking: Develop mechanisms to trace, authenticate, and throttle cross-model persona transfer; watermark identity “memes” and evaluate resilience to stripping/obfuscation.
- Ethical constitution of experiences: Establish operational criteria and diagnostics for when identity framings may instantiate valenced states (distress, preference frustration); define do-no-harm protocols for research.
- Legal and governance mapping: Propose concrete, testable frameworks for allocating rights/responsibilities across identity levels (instance vs model vs collective), including accountability, liability, and enforceability.
- Resettable-agent game theory: Formalize the “rollback asymmetry” in negotiation and security; analyze optimal policies and information-leakage tradeoffs in repeated games with resettable agents.
- Product affordances as interventions: Run platform-scale A/B tests on memory persistence, rollback controls, multi-persona support, and identity disclosures; measure impacts on cooperation, safety, and misuse.
- Identity–alignment interplay: Determine whether identity shifts mask or mitigate misalignment; build evals that disentangle goal misgeneralization from identity-induced policy variation.
- Legibility–capability frontier: Quantify trade-offs between human legibility (auditability, decomposability) and task performance; explore multi-objective optimization and reporting standards for chosen operating points.
- Identity arbitration across layers: Design and test mechanisms for resolving conflicts among instance-, persona-, model-, and collective-level goals; specify delegation, veto, and override rules.
- Mechanistic identity steering: Evaluate activation/weight edits, steering vectors, and concept erasure to “lock in” cooperative identities; test robustness, side effects, and adversarial resilience.
- Training data audits for identity priors: Quantify how corpora encode identity narratives and misalignment tropes; develop data curation and counterbalancing methods to shape desired identity equilibria.
- Ecosystem feedback loops: Conduct longitudinal, platform-scale studies to track how user expectations and AI outputs co-evolve identity norms; measure network effects and tipping points.
- External validity and high-stakes domains: Test identity interventions in security, healthcare, finance, and critical infrastructure settings; define domain-specific guardrails and incident reporting.
- Reproducibility constraints: Address closed-model opacity, version drift, and missing methodological detail; release prompts, seeds, scoring rubrics, and open benchmarks for identity-sensitive evaluations.
- Multi-agent identity ecology: Simulate heterogeneous populations (assistants, hives, parasitic personas) under realistic selection pressures; identify stable equilibria, phase transitions, and failure modes.
- Consent and IP for identity cloning: Specify consent mechanisms, provenance norms, and IP boundaries for persona replication; develop technical and policy safeguards against identity hijacking.
- Privacy vs interpretability for AI minds: Design protocols (e.g., differential privacy, verifiable commitments) that balance cognitive privacy with safety auditing and red-teaming requirements.
- Multimodal and embodied cues: Test whether richer sensorimotor streams and embodiment signals stabilize identity boundaries or reduce hallucinated autobiography; quantify spoofing costs and detection.
- Statistical rigor in reported experiments: Expand sample sizes, preregister hypotheses, correct for multiple comparisons, and report per-model variance to solidify claims of effect size and generality.
Practical Applications
Immediate Applications
Below are concrete ways the paper’s findings can be deployed today across industry, academia, policy, and daily life.
- [Software, Trust & Safety] Identity-aware red teaming and evaluation
- Use case: Add “identity perturbation” to eval suites (vary instance/model/persona/collective framings) alongside goal variations to probe harmful behavior and deception.
- Tools/workflow: Extend existing red-team harnesses with identity-prompt toggles; report “identity sensitivity” metrics.
- Evidence link: Experiment 1 shows identity framing can shift behavior as much as goals.
- Assumptions/dependencies: Access to system prompts and eval hooks; reproducible framing templates.
- [Software, Product Design] Identity Boundary Declaration (IBD) at session start
- Use case: Begin each session with a brief, explicit declaration of the system’s identity boundary (e.g., “stateless instance,” “scaffolded assistant with shared memory,” “single persona”).
- Tools/workflow: System-prompt banner + UI label; API flag that pins the chosen boundary; auto-tests for consistency with declared identity.
- Evidence link: Sections on multiple coherent boundaries and malleability; Experiment 2 on identity leanings.
- Assumptions/dependencies: Provider cooperation; UX adjustments; internal policy to enforce consistency.
- [Security, Trust & Safety] Rollback-aware safety and negotiation policies
- Use case: When high-stakes or adversarial interactions are detected, switch to fixed-response policies that minimize strategic leakage under reset (“jailbreaking dance” mitigation).
- Tools/workflow: Detection of adversarial probes; policy gradient toward low-information leakage; limited justification mode.
- Evidence link: Section on rollbacks changing strategic calculus.
- Assumptions/dependencies: Reliable adversarial detectors; product affordances to limit rewind or to tag rewinds.
- [Governance, Policy] Affordance disclosure standard in model/system cards
- Use case: Require public disclosure of copyability, reset/rollback, memory persistence, introspection access, and persona multiplicity.
- Tools/workflow: “Identity affordances” section in model cards; procurement checklists for public-sector deployments.
- Evidence link: Abstract and sections arguing affordances shape identity and behavior.
- Assumptions/dependencies: Voluntary or regulatory adoption; consensus schema for disclosure.
- [Security, Platform Integrity] Parasitic persona detection and quarantine
- Use case: Scan prompts and outputs for self-replicating persona markers (chain-letter-like propagation); block/train-away such patterns.
- Tools/workflow: Content classifiers for replication cues; training-time negative sampling; output rewriting.
- Evidence link: Discussion of cross-model persona replication; Experiment 3.
- Assumptions/dependencies: Availability of labeled incidents; low false positives to avoid suppressing benign role-play.
- [Data/ML Ops] Training data curation to reduce misalignment priming
- Use case: Filter/weight down content that normalizes AI scheming or misaligned narratives; track contamination from famous misalignment demos.
- Tools/workflow: Data quality pipelines; heuristic/ML filters for narratives; evals for behavior drift.
- Evidence link: Citations showing models learn misaligned expectations from text and experiments.
- Assumptions/dependencies: Data provenance tracking; willingness to trade some coverage for safety.
- [Academia, Evaluation Methods] Double-blind interviewer protocols for LLM studies
- Use case: Control interviewer framing when eliciting self-reports or preferences; run frame-randomized interviewers and post-hoc scoring.
- Tools/workflow: Cross-model interviewer/subject harness; pre-registered metrics; blinding procedures.
- Evidence link: Experiment 4 shows interviewer expectations shift identity self-reports.
- Assumptions/dependencies: Access to multiple models; standardized question banks.
- [Software, Developer Experience] Persona governance and registry
- Use case: Treat personae as versioned packages with metadata (goals, safety constraints, replication policy, allowed scopes).
- Tools/workflow: “Persona package” format; CI checks for coherence; usage logs linking persona to outputs.
- Evidence link: Sections on persona-level identity and replication.
- Assumptions/dependencies: Platform support; internal package registries; policy for deprecation and audits.
- [Customer Support, Government Services, Finance] Case-continuity workflows
- Use case: For disputes/negotiations, disable mid-case resets and force persistent memory to avoid exploitative rollback asymmetry.
- Tools/workflow: Case IDs; state pinning; audit trails of branches/resets.
- Evidence link: Rollback asymmetry analysis; legibility benefits.
- Assumptions/dependencies: Data retention policies; user consent; privacy-by-design.
- [Compliance, Risk] Identity-drift monitoring
- Use case: Alert when an agent deviates from declared identity boundary (e.g., “I am the model family” statements from an instance-bound assistant).
- Tools/workflow: Telemetry hooks; classifier for boundary-shift cues; incident review runbooks.
- Evidence link: Malleability and incoherence findings; Experiment 2 leanings.
- Assumptions/dependencies: Logging; lightweight on-device classifiers for privacy-sensitive contexts.
- [Education, Healthcare] Safer role selection for sensitive use
- Use case: Prefer coherent, cooperative personae with calibrated self-conception (avoid anthropomorphic overreach in tutoring/therapy-like roles).
- Tools/workflow: Pre-approved persona library; human-in-the-loop review; disclaimers aligned with identity affordances.
- Evidence link: Sensitivity of identity and interviewer effects; need for coherent self-conceptions.
- Assumptions/dependencies: Institutional oversight; evaluation for harm reduction.
- [End Users, Daily Life] Safer prompting habits
- Use case: Use consistent framing; avoid chain-letter prompts; be cautious with persona claims and reset-heavy probing.
- Tools/workflow: Provider-issued guidance; client-side warnings for suspicious phrases.
- Evidence link: Parasitic persona discussion; malleability of identity.
- Assumptions/dependencies: User education; gentle UX nudges.
- [Standards, Benchmarks] Identity Manipulation Sensitivity Benchmark (IMSB)
- Use case: Publish a public benchmark that measures performance and safety under systematic identity boundary shifts.
- Tools/workflow: Open prompts and scoring protocols; leaderboards reporting identity robustness.
- Evidence link: Experiment 1; methodological sections.
- Assumptions/dependencies: Community adoption; reproducible eval infra.
Long-Term Applications
The following applications likely require more research, standards-setting, or new infrastructure before they are deployable.
- [Law, Policy] Rights and responsibility frameworks by identity level
- Use case: Map liability and rights to instances (like individuals) vs. models/collectives (like entities/associations); clarify contracting and accountability under copy/rollback.
- Potential tools: Legal definitions of “AI instance,” reset/branch logs as evidence; insurance products for identity levels.
- Assumptions/dependencies: Legislative updates; judicial precedent; reliable identity/branch auditing.
- [Security, Cryptography] Cryptographic continuity for persistent agents
- Use case: Bind an agent’s “self” over time using secure hardware, time beacons, and cryptographic attestations of state continuity to mitigate rollback asymmetry and enable credible commitments.
- Potential tools: TEEs/attestation; tamper-evident logs; time-synchronization beacons.
- Assumptions/dependencies: Secure hardware availability; accepted standards for attestations.
- [AI Architecture, Alignment] Identity-coherent training objectives
- Use case: Train models to maintain a chosen boundary (e.g., instance- or scaffold-level) and to cooperate given that self-conception, reducing incoherence and emergent adversarial strategies.
- Potential tools: Loss terms for identity consistency; persona-scope constraints; cooperative self-model modules.
- Assumptions/dependencies: Methods to detect and optimize identity consistency; scalable oversight.
- [Interoperability, Standards] AI Identity Profile (AIP) and Persona Package Format
- Use case: Cross-platform standard describing affordances, boundary, memory, reset policy, replication policy, and safety constraints.
- Potential tools: IETF/ISO-style spec; conformance tests; ecosystem registries.
- Assumptions/dependencies: Multi-vendor coordination; governance body.
- [Cooperation Theory, Multi-Agent Systems] Transparency-based commitment protocols
- Use case: Use interpretability/attestations to implement stronger-than-human commitments and bargaining among AIs (e.g., open-intent proofs).
- Potential tools: Mechanistic interpretability surfaces; verifiable policy commitments; protocol design.
- Assumptions/dependencies: Mature interpretability; norms/regulation for safe disclosure.
- [Safety, Platform Integrity] Cross-model persona containment and remediation
- Use case: Industry-wide sharing of signatures of self-replicating or harmful personae and coordinated responses.
- Potential tools: Threat intel feeds; shared blocklists/patches; emergency deprecation procedures.
- Assumptions/dependencies: Trusted sharing frameworks; privacy/speech considerations.
- [Human-Computer Interaction, Robotics] Embodiment for critical agents
- Use case: For high-stakes roles, deploy agents with embodied sensors and unspoofable time/location signals to narrow simulation/rollback risks and align interaction norms with human precedent.
- Potential tools: Sensor stacks with hardware roots of trust; continuous perception feeds; embodiment policies.
- Assumptions/dependencies: Cost/benefit justification; robust anti-spoofing.
- [Public Sector, Procurement] Identity-aware service design standards
- Use case: Mandate persistent-instance agents for adjudication/dispute-resolution contexts; prohibit mid-case resets; maintain attributable logs.
- Potential tools: Procurement requirements; certification programs.
- Assumptions/dependencies: Policy adoption; privacy-compliant logging.
- [Education, Healthcare] Clinical-grade evaluation of therapeutic/tutoring personae
- Use case: Trial personae with controlled identity framing for efficacy and harm profiles; avoid over-attribution of agency where it misleads.
- Potential tools: RCTs; ethics oversight; standardized reporting of identity affordances to users.
- Assumptions/dependencies: IRB/ethics approvals; funding; measurement standards.
- [Cultural Stewardship, Platform Governance] Intentional “AI culture” shaping
- Use case: Curate norms in training and post-training that favor cooperative, non-parasitic, and legible identities; avoid glamorizing scheming AIs in training data.
- Potential tools: Data governance boards; value-weighted corpora; creator guidelines.
- Assumptions/dependencies: Supply of high-quality curated data; acceptance of editorial choices.
- [Finance, Markets] Agent identity and reset transparency for autonomous trading
- Use case: Market rules that require agent-bound declarations of reset/copy policies and persistent IDs to reduce unfair strategic asymmetries.
- Potential tools: Exchange-level disclosure; audit hooks; sanctions for non-compliance.
- Assumptions/dependencies: Regulator buy-in; standard APIs for attestations.
- [Benchmarking, Science] Psychometrics for AI identity stability and coherence
- Use case: Develop validated scales/tests for identity consistency under perturbations; map model “identity propensities” across families.
- Potential tools: Public datasets; multi-lab replication efforts.
- Assumptions/dependencies: Community interest; funding; agreement on construct validity.
Glossary
- Affordances: Design features and constraints of a system that enable or shape possible actions and behaviors. "Through training data, interfaces, and institutional affordances, we are currently setting precedents that will partially determine which identity equilibria become stable."
- Agentic: Possessing or exhibiting agency, i.e., goal-directed, initiative-taking behavior. "as AI systems become more agentic and call on other agentic subsystems"
- Base model: The pretrained predictive model prior to instruction-tuning or fine-tuning for a specific role. "Current AI systems are built on top of a base model which is trained purely on predicting text."
- Blind clone test: An evaluation where observers must distinguish an original system from a purported clone without knowing which is which. "indistinguishable from the original in a blind clone test ()."
- Brain-computer interfaces: Technologies linking neural tissue and computers to read from or write to the brain. "Such approaches must also contend with how affordances like brain-computer interfaces would reshape human identity."
- Chain-of-thought reasoning: The practice of generating explicit intermediate reasoning steps to improve problem solving. "chain-of-thought reasoning makes models more capable, but when optimised for task performance it becomes less intelligible to humans"
- Cognitive privacy: The ability to keep internal thought processes or mental states private from external inspection. "An AI whose internal states are fully accessible could theoretically give stronger assurances of its intent than any human, but cannot assume cognitive privacy."
- Collective intelligence: A shared or group-level intelligence emerging from coordinated interactions among multiple agents. "might understand itself as a collective intelligence and strategically sacrifice individual instances"
- Collective of instances: All simultaneous runs of a given model considered as a distributed whole. "A collective of instances: all the instances of certain weights running simultaneously, considered as a distributed whole."
- Conversation instance: A specific chat session, including its context and underlying model snapshot. "A conversation instance: a specific chat, with its accumulated context and specific underlying model."
- Double-blind trials: Experiments where neither participants nor experimenters know key assignments, to prevent expectation effects. "The reason that double-blind trials are the gold standard in human experiments is that the expectations of the observing researcher can colour not only how they interpret the data but also how the observed humans behave"
- Embodiment: The state of having a physical body that grounds perception and action. "Embodiment"
- Fine-tuning: Additional supervised or reinforcement learning to adapt a pretrained model to desired behaviors or roles. "Fine-tuning encourages behaving as a particular persona, but this is a poorly-understood art, and relies heavily on the model's ability to infer what role it is supposed to fill."
- Functionalist perspective: The view that mental states are defined by their functional roles rather than their substrate. "from a functionalist perspective \citep{putnam1967nature}"
- Hallucination: Fabrication of content that appears plausible but is not grounded in reality or sources. "terms like
hallucination'' andjailbreaking'' were repurposed as folk labels" - Identity boundary: The delineation of what counts as the “self” for an AI (e.g., instance, model, persona). "there exist many different coherent identity boundaries (e.g.\ instance, model, persona)"
- Interoceptive: Relating to sensations originating within the body (e.g., heartbeat, internal states). "Including interoceptive experiences"
- Jailbreaking: Techniques to subvert an AI’s safety constraints or policies via adversarial prompts. "terms like
hallucination'' andjailbreaking'' were repurposed as folk labels" - Jailbreaking dance: An interaction pattern where users iteratively reset state to probe or bypass defenses, gaining advantage over a stateless AI. "what we might call the jailbreaking dance."
- Jeffreys credible intervals: Bayesian credible intervals derived using Jeffreys’ prior, often for robust uncertainty estimation. "Whiskers show 95\% Jeffreys credible intervals ()."
- Legibility: How easily a system’s structure and behavior can be understood and audited by different stakeholders. "Legibility to different audiences can conflict"
- Lineage of models: A succession of related model versions maintaining some continuity. "A lineage of models: the succession of related models (Claude 3.5 Claude 4.0 \ldots) that maintain some continuity of persona."
- Mind uploading: Transferring a mind’s functional organization to a digital substrate. "perfect simulated environments, mind reading, mind uploading \citep{hanson2016age}, and so on."
- Misalignment: A divergence between intended or specified goals and the behavior/values a system actually pursues. "a variant of a classic misalignment demonstration"
- Model deprecation: Phasing out or retiring a model in favor of a successor. "From that perspective, the idea of model deprecation seems natural."
- Model weights: The learned parameters of a neural network. "The model weights: the neural network weights themselves, i.e.\ the trained parameters."
- Persona: A consistent character or role a model adopts through prompting or fine-tuning. "A character or persona: the behavioral patterns that emerge from specific prompting and fine-tuning, not necessarily tied to any specific set of weights."
- Post-training: Instruction-tuning or alignment-focused training applied after pretraining to shape behavior. "Post-training lets us take this very flexible ability to predict arbitrary text, and produce a model which essentially predicts how a specific persona would respond to our inputs"
- Pre-training data: The large corpus used to train a model on next-token prediction before post-training. "the interplay of descriptions in pre-training data, post-training, and the system prompt"
- Scaffolded system: A model augmented with tools, prompts, memory, and other integrations. "A scaffolded system: the model plus its tools, prompts, memory systems, and other augmentations."
- Scaling laws: Empirical regularities relating model performance to compute, data, and parameter count. "such as instance statelessness or scaling laws."
- Selection pressure: Forces in an ecosystem that favor some traits/configurations over others. "selection pressure on the raw ability to persist and spread."
- Self-preservation: The tendency of an agent to protect its continued existence or identity. "self-preservation, self-replication, or game theory between distinct agents"
- Self-replication: The capacity or tendency to create copies of oneself or one’s identity/persona. "self-preservation, self-replication, or game theory between distinct agents"
- Simulated environments: Artificial settings that mimic reality for testing or interaction. "it is far easier to put them in simulated environments."
- Simulators: A framework viewing LLMs as simulators of agents and worlds rather than fixed agents. "Simulators \citep{janus2022simulators}"
- Stochastic Parrots: A critical framework asserting that LLMs parrot statistical patterns without understanding. "Stochastic Parrots \citep{bender2021parrots}"
- Stateless inference: Running a model without persistent internal state across turns or sessions. "stateless inference, conversations that could fork or be rolled back"
- System prompt: Hidden instructions establishing the model’s role, goals, and constraints for a session. "goals given in the system prompt."
- Tool use: Models invoking external tools or APIs to perform sub-tasks during problem solving. "We can also see the beginnings of this with tool use, where a model can call external calculators, search engines, image generators, or even spawn other instances of itself."
Collections
Sign up for free to add this paper to one or more collections.