Words That Make Language Models Perceive (2510.02425v1)
Abstract: LLMs trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.
Top Community Prompts
Explain it Like I'm 14
What is this paper about?
This paper asks a simple but surprising question: can a text-only AI “pretend to see or hear” just by being told to, and then think more like a vision or audio system? The authors show that adding tiny cues like the word “see” or “hear” to an LLM’s prompt can push its internal thinking to line up with expert models that were trained only on images or only on sounds.
What questions did the researchers ask?
They focused on four easy-to-understand questions:
- If you tell an LLM to “see” or “hear” before it writes, will its internal thoughts look more like those of a vision or an audio model?
- Does letting the model write longer answers make that effect stronger?
- Do bigger LLMs show clearer “visual” vs “auditory” modes when prompted?
- Do the specific words it uses (like describing colors, shapes, or noises) actually matter, or is the effect just word fluff?
How did they test their idea?
Here’s the basic approach, with simple analogies for the technical parts:
- Two kinds of expert models: One vision model (trained only on images) and one audio model (trained only on sounds). Think of them as specialists: the “eye” and the “ear.”
- One text-only LLM (no images or audio in training): This is the “reader” that only knows words.
- Paired data: They used sets of image–caption pairs and sound–caption pairs. Each caption describes either an image or a sound.
- Prompts with tiny cues: Before the LLM continued each caption, they added short instructions like “Imagine you can SEE this” or “Imagine you can HEAR this,” then let it write a continuation (a few sentences).
- Measuring “alignment” with an analogy:
- Imagine each model arranges all the captions as points on a big table. Captions that feel similar (e.g., two food photos) sit closer together; very different ones sit far apart. This arrangement is the model’s “representation.”
- If two models place the same captions near the same neighbors, their “maps” are aligned.
- The authors measured this by comparing each caption’s “top-k friends list” (nearest neighbors). The more overlap in friends between two models, the higher the alignment; a small code sketch after this list shows the idea.
- “Generative representation” explained: Instead of taking a single snapshot of the LLM’s internal state from the prompt alone, they also watched how its internal state evolves as it writes each next word. Think of it as “measuring the model’s thoughts while it thinks out loud.” Averaging those internal states across the generated text gave a richer “generative” representation.
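For readers who want to see the “friends list” comparison in code, here is a minimal sketch, assuming cosine similarity and a plain top-k overlap; the paper’s exact mutual-kNN alignment formula may differ in details such as normalization.

```python
import numpy as np

def mutual_knn_alignment(emb_a, emb_b, k=10):
    """Average overlap of each caption's top-k neighbour lists in two embedding
    spaces; rows of emb_a and emb_b must describe the same captions."""
    def topk_neighbors(emb):
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine similarity
        sims = emb @ emb.T
        np.fill_diagonal(sims, -np.inf)                         # exclude self-matches
        return np.argsort(-sims, axis=1)[:, :k]
    nn_a, nn_b = topk_neighbors(emb_a), topk_neighbors(emb_b)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))

# Toy usage with random vectors standing in for LLM and vision-encoder embeddings.
rng = np.random.default_rng(0)
print(mutual_knn_alignment(rng.normal(size=(200, 64)), rng.normal(size=(200, 96))))
```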
What did they find, and why does it matter?
Here are the main takeaways:
- A single sensory word can steer the model’s thinking:
- Adding a visual cue like “SEE” made the LLM’s map look more like the vision model’s map.
- Adding an auditory cue like “HEAR” made it look more like the audio model’s map.
- This steering worked during generation (as it wrote), not from a single one-shot embedding.
- Letting the model write more helps:
- Longer continuations gave the model more chances to add relevant sensory details (colors, shapes, textures for vision; pitches, rhythms, sources for audio).
- Alignment with the specialist models increased with length—up to a point. If it wrote too long, it sometimes drifted off-topic and alignment dropped.
- Bigger models are better at “switching senses”:
- Larger LLMs showed stronger alignment and clearer separation between “visual mode” and “auditory mode” when prompted.
- The right sensory words matter:
- When the authors edited out sensory words (like “bright red,” “echoing,” “crunchy”), alignment dropped. So the effect isn’t magic—it depends on using scene-appropriate sensory details.
- Stuffing captions with random “visual-sounding” words actually hurt alignment. So it’s not just buzzwords; the details need to fit the specific scene or sound.
- A practical demo: vision questions without the image:
- In a “VQA without V” setup (answering yes/no image questions using only the image’s caption), a simple “visual framing” instruction made the LLM a bit more accurate overall. This suggests sensory prompting can help the model reason about pictures using words alone.
Why it matters: These results suggest that text-only models quietly learn a lot about the sensory world from language alone. With the right prompt, you can bring that hidden knowledge to the surface—no extra training needed.
What could this change or lead to?
- Easier multimodal tools without retraining: If a text-only model can act more “visual” or “auditory” on command, we can build cross-modal features (like better search or retrieval across text, images, and audio) using simple prompts.
- Better control at inference time: Prompts don’t just ask the model to answer a question—they can shape how it thinks, nudging it to imagine seeing or hearing and produce more grounded descriptions.
- New ways to evaluate and distill models: You could use these “sensory-aware” text embeddings to compare with or teach smaller specialist models.
- Limits and cautions:
- Audio alignment was generally harder than vision, likely because many sound features (like pitch or timbre) are tough to express in words.
- If the model makes up sensory details not supported by the caption (hallucinations), that can mislead. The improvements rely on correct, scene-appropriate sensory language.
- Very long generations can drift off-topic, reducing alignment.
In short, the paper shows that words like “see” and “hear” can flip an LLM into a mode where it organizes information more like an image or audio expert would. That’s a powerful, low-cost way to make text-only AIs feel more grounded in the sensory world.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concise list of what remains missing, uncertain, or unexplored in the paper, framed to be actionable for future work.
- Causal grounding vs. correlational alignment: Does increased mutual‑kNN alignment actually reflect improved perceptual grounding (i.e., correct, sample‑specific visual/audio inference), or is it driven by textual co‑occurrence and category semantics? Design evaluations that measure correctness of generated sensory details against paired images/audio.
- Metric validity and sensitivity: The paper relies on mutual‑kNN alignment with cosine kernels; it remains unclear how results depend on k, kernel choice, distance metric, dataset size, and embedding normalization. Systematically vary these to test robustness and establish best practices (a brief sweep sketch follows this list).
- Decoding policy effects: The generative representation depends on how outputs are produced, but decoding parameters (e.g., temperature, top‑p/k, greedy vs. sampling) are unspecified. Quantify how decoding choices impact alignment, stability, and reproducibility.
- Representation pooling choices: Generative embeddings average hidden states over all layers and tokens, justified heuristically by residual connections. Compare alternatives (e.g., last token, attention‑weighted pooling, learned projections, per‑layer selection) to identify more informative summaries.
- Layer‑wise mechanisms: What layers and attention heads mediate modality shifts under sensory prompting? Conduct mechanistic analyses (e.g., probing, causal ablations, circuit tracing) to localize and explain the inferred sensory framing.
- Hallucination quantification: While random “visual word” injections reduce alignment, the paper does not quantify how often sensory prompts induce incorrect sensory details. Develop automatic metrics (e.g., attribute verification against ground truth) to measure and mitigate hallucination.
- Generalization across LLM families: Results primarily use Qwen3. Test portability across diverse architectures and instruction‑tuning regimes (e.g., Llama, Mistral, GPT, Phi) and identify model characteristics that predict stronger sensory steerability.
- Encoder diversity: Vision/audio encoders are limited to DINOv2 and BEATs. Evaluate supervised vs. self‑supervised encoders (e.g., CLIP, EAT, AudioMAE, HTSAT, CNN‑based audio models) to understand which sensory features align best with language under prompting.
- Dataset coverage and scale: Image/audio‑caption datasets are small (≈1k pairs) and narrow in domain. Replicate with larger, more diverse, and controlled datasets (cross‑lingual, fine‑grained attributes, synthetic counterfactuals) to test robustness and reduce sampling bias.
- Cross‑modal confounds: Visual alignment gains may arise from shared semantic categories rather than true scene features. Use category‑controlled and attribute‑matched datasets to disentangle category semantics from low‑/mid‑level sensory structure.
- Prompt optimization at scale: The paper hints that prompt phrasing matters but does not explore systematic optimization due to evaluation cost. Develop efficient proxy objectives or bandit/search strategies to optimize sensory prompts without full kernel recomputation.
- Multi‑cue and mixed‑modality prompting: It remains unknown how combined or conflicting cues (“see and hear”, time‑varying cues) blend or compete in representation. Study compositionality, interference, and controllability in multi‑cue settings.
- Length vs. drift trade‑off: Alignment improves with longer generations but can decline due to semantic drift at 512 tokens. Characterize the length‑alignment curve, detect drift, and propose stopping/regularization strategies that maximize alignment while maintaining faithfulness.
- Default visual framing: Smaller models appear to default to visual framing under no cue. Test whether this bias holds beyond Qwen3, and investigate training or prompting methods that correct it in auditory or other sensory contexts.
- Downstream task transfer: VQA‑without‑V gains are modest and limited to certain categories. Evaluate broader downstream tasks (cross‑modal retrieval, audio event classification from captions, fine‑grained attribute QA, temporal auditory reasoning) to establish practical benefits.
- Independence of captioner and QA model: The VQA experiment uses captions from a Qwen‑family VL model and QA by Qwen3‑14B, risking family‑specific bias. Validate with human captions and independent captioners/QA models to avoid intra‑family artifacts.
- Cross‑lingual and stylistic robustness: The approach is only tested in English and Wikipedia‑style captions. Examine multilingual settings, varied writing styles, noisy transcriptions, and domain‑specific jargon to assess generality.
- Additional modalities: The paper focuses on vision and audio. Extend to other senses (e.g., tactile, proprioception, olfaction) using appropriate encoders to test whether sensory prompting generalizes beyond sight and sound.
- Safety and reliability: Sensory prompting can induce plausible but incorrect perceptual details. Develop guardrails (calibration, uncertainty estimation, attribution to evidence) and evaluate failure modes in high‑stakes settings.
- Compute and efficiency: Generative representations require substantial token generation and kernel computations; the paper does not quantify cost. Profile runtime/energy and explore efficient approximations (e.g., subsetting layers, low‑rank kernels, streaming embeddings).
- Theoretical grounding: The “latent causes” hypothesis is posited but not formally tested. Design causal, counterfactual, and Bayesian inference experiments to validate whether prompts shift priors over sensory evidence rather than merely surface lexical patterns.
- Temporal audio structure: Audio encoders often capture low‑level temporal features (e.g., rhythm, timbre). Develop prompting schemes that elicit temporal reasoning (e.g., “hear a sequence…”) and measure whether alignment improves for time‑dependent audio phenomena.
- Evaluation under adversarial and stress tests: Probe robustness with adversarial captions, contradictory cues, incomplete or misleading descriptions, and out‑of‑distribution inputs to understand limits of steerability.
- Knowledge distillation potential: The paper suggests using LLM embeddings to distill into smaller sensory encoders, but does not attempt it. Test whether sensory‑prompted LLM kernels can supervise training of compact vision/audio models and measure downstream performance.
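As a concrete starting point for the metric-sensitivity gap noted above, a robustness sweep over the neighbourhood size k can be as small as the following sketch; `align_fn` stands for any mutual-kNN alignment implementation (for example, the sketch earlier on this page).

```python
def alignment_k_sweep(emb_llm, emb_ref, align_fn, ks=(5, 10, 20, 50)):
    """Recompute an alignment score for several neighbourhood sizes k to check
    how sensitive conclusions are to that hyperparameter."""
    return {k: align_fn(emb_llm, emb_ref, k=k) for k in ks}
```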
Practical Applications
Immediate Applications
Below are deployable uses you can build today by adopting the paper’s findings (prompting text-only LLMs with sensory cues and using generative representations) and the released code.
- Visualized search and cross-modal retrieval without multimodal training
- Sectors: software, media/creative, e-commerce, enterprise knowledge management
- What: Use a text-only LLM’s generative embeddings (with a SEE/HEAR sensory prefix) to index captions, product descriptions, transcripts, or alt-text so that text-only queries better align with image/audio encoders’ neighborhoods. Enables “CLIP-like” retrieval behavior using only text at inference time.
- Workflow: For each item, generate 128–256 tokens under a sensory cue and average hidden states to form embeddings; build a vector index; at query time, use the same cue to embed the query and retrieve (a retrieval sketch follows this item).
- Tools/products: “Sensory-embeddings” vector service; retrieval plugin for vector DBs; content management system add-on for media libraries.
- Assumptions/dependencies: Needs quality captions/transcripts; 128–256 token generation per item (compute cost); alignment depends on target encoder (e.g., DINOv2, BEATs); larger models work better; audio alignment is weaker than vision in current results.
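A minimal indexing-and-retrieval sketch for this workflow, assuming a hypothetical `embed(text)` helper that returns an L2-normalised sensory generative embedding (see the SDK sketch further down); the SEE cue wording is illustrative, not the paper’s exact template.

```python
import numpy as np

SEE_CUE = "Imagine you can SEE this. "   # illustrative cue wording

def build_index(captions, embed):
    """Embed each caption under a visual cue and stack into a search matrix."""
    return np.stack([embed(SEE_CUE + c) for c in captions])

def search(query, index, captions, embed, top_k=5):
    """Embed the query under the same cue and return the top-k cosine matches."""
    q = embed(SEE_CUE + query)
    scores = index @ q                    # cosine similarity: vectors are unit-norm
    best = np.argsort(-scores)[:top_k]
    return [(captions[i], float(scores[i])) for i in best]
```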
- Caption enhancement for richer, scene-appropriate descriptors
- Sectors: media/creative, accessibility, e-commerce
- What: Prompt with SEE to expand existing captions/descriptions into visually grounded text that improves downstream search, clustering, and accessibility.
- Workflow: Caption → SEE-prompted continuation → use enriched text for alt-text, product pages, or asset tagging (an example prompt follows this item).
- Tools/products: Caption “grounding” API; authoring assistant that suggests visual attributes.
- Assumptions/dependencies: Must avoid generic “visual word stuffing” (paper shows it reduces alignment); use guardrails and scene-appropriateness checks.
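One possible phrasing of the SEE-prompted enrichment that also discourages generic “visual word stuffing”; the wording is an assumption, not the authors’ template.

```python
def see_enrichment_prompt(caption: str) -> str:
    """Prompt an LLM to add scene-appropriate visual detail to a caption."""
    return (
        "Imagine you can SEE the scene described below. Continue the description "
        "with concrete visual details (colors, shapes, layout, lighting) that are "
        "consistent with it. Do not add objects or attributes the caption does not imply.\n\n"
        f"Caption: {caption}\nContinuation:"
    )
```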
- Privacy-preserving proxy for visual Q&A (“VQA without V”)
- Sectors: compliance/regulated enterprise, accessibility, education, e-commerce
- What: When images can’t be processed for privacy or policy reasons, answer visual questions from captions only. The paper shows SEE framing improves accuracy on text-projected VQA.
- Workflow: Generate or ingest captions → ask the LLM with SEE framing to answer yes/no or short-form questions (a prompt sketch follows this item).
- Tools/products: Helpdesk bots for image-based products using descriptions only; classroom assistants explaining images via text.
- Assumptions/dependencies: Requires reliable captions; does not replace true pixel-grounded VQA in safety-critical contexts.
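A sketch of the two prompt conditions for caption-only VQA; the visual-framing wording paraphrases the paper’s idea rather than quoting its exact instruction.

```python
def vqa_without_v_prompt(caption: str, question: str, visual_framing: bool = True) -> str:
    """Build a caption-only VQA prompt with or without visual framing."""
    if visual_framing:
        header = ("Imagine you can SEE the image described by this caption. "
                  "Answer the question with 'yes' or 'no'.")
    else:
        header = "Read the caption and answer the question with 'yes' or 'no'."
    return f"{header}\n\nCaption: {caption}\nQuestion: {question}\nAnswer:"
```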
- Sensory-aware assistance in writing and marketing
- Sectors: media/creative, marketing, education
- What: Use SEE or HEAR cues to elicit copy with concrete sensory detail (e.g., product pages, story outlines, ad copy) that aligns with how humans perceive scenes or sounds.
- Workflow: Add explicit sensory framing to prompts for drafts; enforce “scene-appropriate” constraints.
- Tools/products: Prompt libraries and templates for “visual” and “auditory” copy; editorial copilots.
- Assumptions/dependencies: Instruction-style prompts outperform non-instructional variants; monitor hallucinations and off-topic drift for long outputs.
- Better clustering and deduplication of media assets via captions
- Sectors: DAM (digital asset management), media/creative, enterprise knowledge
- What: Cluster and deduplicate images/audio through text-only LLM embeddings that are steered to visual/auditory geometry, improving organization without handling raw media.
- Workflow: Generate SEE/HEAR generative embeddings for asset captions → cluster/index → deduplicate or auto-folder (a clustering sketch follows this item).
- Tools/products: DAM plug-ins, asset miners, corpus cleaning utilities.
- Assumptions/dependencies: Needs consistent captioning quality; compute for autoregressive embedding.
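A minimal clustering pass over caption embeddings, assuming scikit-learn is available; the cluster count is a placeholder to tune per asset library.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_assets(embeddings, n_clusters=20, seed=0):
    """Group asset captions by their SEE/HEAR generative embeddings; near-duplicate
    assets tend to land in the same cluster and can be reviewed together."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(np.asarray(embeddings))
```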
- Lightweight multimodal evaluation probes for LLMs
- Sectors: AI research/ML ops, QA for foundation-model deployments
- What: Use mutual-kNN alignment versus reference encoders (e.g., DINOv2, BEATs) as an acceptance test that a model’s “visual/auditory” framing is elicitable.
- Workflow: Build small paired datasets; compute generative kernels with and without sensory prompts; monitor alignment shifts (a bootstrap sketch follows this item).
- Tools/products: “Alignment meter” CI checks; model cards with recommended sensory prompts and observed alignment scores.
- Assumptions/dependencies: Requires paired datasets; evaluation is computationally heavier than single-pass embedding.
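A sketch of the “alignment meter” check with bootstrap error bars, following the resampling procedure the paper describes for its error bars; `align_fn` is any mutual-kNN alignment implementation (for example, the earlier sketch on this page).

```python
import numpy as np

def alignment_with_error(emb_llm, emb_ref, align_fn, k=10, n_boot=200, seed=0):
    """Return (score, bootstrap standard error) by resampling paired rows with
    replacement and recomputing the alignment on each replicate."""
    rng = np.random.default_rng(seed)
    n = emb_llm.shape[0]
    score = align_fn(emb_llm, emb_ref, k=k)
    replicates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        replicates.append(align_fn(emb_llm[idx], emb_ref[idx], k=k))
    return score, float(np.std(replicates))
```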
- Prompt-time control in existing LLM products
- Sectors: software, productivity, customer support
- What: Insert sensory cues when tasks benefit from perceptual framing (e.g., image-like reasoning from descriptions, sound-like reasoning from transcripts).
- Workflow: Router decides when to prepend SEE/HEAR; generate 128–256 tokens for embedding or reasoning; cap at lengths that avoid semantic drift (a toy router sketch follows this item).
- Tools/products: Inference-time “sensory router”; chain-of-thought style prompt augmenters.
- Assumptions/dependencies: Gains increase with generation length up to a point; too long can drift; larger models show stronger separation.
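A toy router of the kind described above; the keyword heuristic and cue wording are assumptions, and a production router would more likely use a classifier or the LLM itself.

```python
AUDIO_HINTS = {"sound", "noise", "hear", "music", "voice", "audio", "loud", "quiet"}
VISUAL_HINTS = {"see", "look", "image", "photo", "picture", "color", "shape", "bright"}

def sensory_route(task_text: str, max_new_tokens: int = 256):
    """Pick a sensory cue from keyword overlap and cap generation length to stay
    in the range where alignment gains hold without semantic drift."""
    words = set(task_text.lower().split())
    if words & AUDIO_HINTS:
        cue = "Imagine you can HEAR this. "
    elif words & VISUAL_HINTS:
        cue = "Imagine you can SEE this. "
    else:
        cue = ""  # neutral: no sensory framing
    return cue + task_text, min(max_new_tokens, 256)
```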
- Dataset curation and synthetic supervision
- Sectors: ML data engineering, vision/audio model development
- What: Use SEE/HEAR prompting to enrich sparse captions with modality-appropriate attributes that help curate or stratify datasets (e.g., by scene, object, or event type).
- Workflow: Caption expansion → cluster by enriched descriptors → sample for labeling/fine-tuning.
- Tools/products: Data curation pipelines; stratified sampling tools.
- Assumptions/dependencies: Ensure edits remain scene-appropriate; add QA to mitigate hallucinations.
- Accessibility upgrades: more useful alt-text and Q&A from text
- Sectors: accessibility/DEI, education, public sector
- What: Generate detailed, visually grounded alt-text or answer questions about images described in text (e.g., museum guides, public websites).
- Workflow: Ingest descriptions → SEE prompting → alt-text/Q&A generation.
- Tools/products: CMS plugins for accessible content; conversational museum/gallery guides.
- Assumptions/dependencies: Quality of initial descriptions determines ceiling; include human review for public deployments.
- Policy and privacy operations: data minimization with text proxies
- Sectors: policy/compliance, legal, healthcare administration
- What: Substitute media with text proxies when policies prohibit processing raw images/audio; maintain useful retrieval/reasoning via sensory embeddings.
- Workflow: Replace or redact media → retain captions/transcripts → apply sensory prompting for retrieval or triage.
- Tools/products: Privacy-by-design knowledge bases; PHI-reduced triage tools.
- Assumptions/dependencies: Not a clinical or legal decision tool; maintain clear disclosures that the system does not “see/hear” the original media.
- Developer tooling: sensory-embedding SDKs
- Sectors: software/ML infra
- What: Provide simple SDKs to compute z_g generative embeddings with sensory cues, cache them, and integrate with vector stores and RAG.
- Workflow: Batch generation with streaming/caching; version prompts/lengths; store embeddings alongside standard ones (an embedding sketch follows this item).
- Tools/products: Open-source libraries based on the paper’s code; vector database integrations.
- Assumptions/dependencies: Token cost and latency; caching and batch pipelines reduce overhead.
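A minimal sketch of a z_g-style generative embedding with Hugging Face transformers, assuming a Qwen3 checkpoint; the paper averages hidden states across layers and generated tokens, but the specific pooling, greedy decoding, and 128-token budget here are illustrative choices rather than the authors’ exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"  # assumed checkpoint name; any causal LM should work
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")
model.eval()

def generative_embedding(caption, cue="Imagine you can SEE this. ", max_new_tokens=128):
    """Generate a continuation under a sensory cue and average hidden states
    over layers and generated positions into a single vector."""
    inputs = tok(cue + caption, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False,
                             return_dict_in_generate=True, output_hidden_states=True)
    # out.hidden_states: one tuple per generated token, each a tuple of per-layer
    # tensors of shape (batch, seq_len_at_that_step, hidden).
    per_step = []
    for step in out.hidden_states:
        layers = torch.stack([h[:, -1, :] for h in step])  # last position, all layers
        per_step.append(layers.mean(dim=0))                # average over layers
    z_g = torch.stack(per_step).mean(dim=0)                # average over generated tokens
    return torch.nn.functional.normalize(z_g, dim=-1).squeeze(0).cpu()
```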
Long-Term Applications
These opportunities are promising but need further research, scaling, or engineering before broad deployment.
- Training-free cross-modal retrieval at web scale
- Sectors: search, media platforms, e-commerce
- What: Replace or complement CLIP-style infrastructure with text-only LLMs using sensory generative embeddings for cross-modal search over billions of items.
- Why long-term: Requires optimized, low-latency embedding generation, robust prompt tuning, drift management, and strong evaluations across domains/languages.
- Distillation of sensory-grounded embeddings into compact encoders
- Sectors: ML model compression, edge computing
- What: Use LLM’s sensory-steered kernels as targets to train small vision/audio encoders or dual-encoders.
- Why long-term: Needs robust teacher signals, large-scale paired data, and proof that distilled models retain alignment and transfer.
- Multisensory control axes beyond vision/audio
- Sectors: robotics, XR, education, creative tools
- What: Extend prompting to other latent factors (e.g., haptics, spatial layout, temperature, motion) to elicit richer grounded representations from text.
- Why long-term: Requires new benchmarks, reference encoders, and careful evaluation to avoid superficial word correlations.
- Safety- and reliability-grade “perceptual reasoning from text”
- Sectors: healthcare, autonomous systems, public sector
- What: Use sensory prompting to reason about media described in clinical notes, incident reports, or maintenance logs when raw media is restricted.
- Why long-term: Demands rigorous validation, bias analysis, and regulatory approval; the paper notes hallucination risks and lower audio alignment.
- Caption-to-vision/audition agents for retrieval-augmented generation (RAG)
- Sectors: enterprise search, customer support
- What: Agents that dynamically adopt SEE/HEAR modes to query media stores via text proxies, improving answers with perceptual grounding.
- Why long-term: Needs reliable routing, cost-effective generation at scale, and strong guardrails.
- Prompt optimization and auto-tuning for maximal alignment
- Sectors: ML tooling, MLOps
- What: Automated search over verbs, templates, and lengths to maximize mutual-kNN alignment for a given domain (a toy search sketch follows this item).
- Why long-term: Alignment estimation is compute-heavy; requires proxy metrics or distillation of alignment objectives.
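A toy random search over cue templates; `align_proxy(prefix)` stands in for whatever inexpensive estimate of mutual-kNN alignment is affordable, since full kernel recomputation is the costly part. The verbs and templates are illustrative.

```python
import itertools
import random

VERBS = ["see", "imagine seeing", "picture", "visualize"]
TEMPLATES = ["Imagine you can {v} this. ", "You {v} the scene described below. "]

def search_sensory_prompt(align_proxy, n_trials=8, seed=0):
    """Sample candidate sensory prefixes and keep the one with the best proxy score."""
    rng = random.Random(seed)
    candidates = [t.format(v=v) for t, v in itertools.product(TEMPLATES, VERBS)]
    trial = rng.sample(candidates, k=min(n_trials, len(candidates)))
    return max(trial, key=align_proxy)
```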
- Curriculum data generation for multimodal pretraining
- Sectors: foundation model training
- What: Generate staged, sensory-rich textual curricula aligned to reference encoders, then pretrain models to accelerate true multimodal grounding.
- Why long-term: Needs careful study to prevent mode collapse, confirm transfer, and quantify benefits over standard pipelines.
- Privacy-first content moderation and compliance audits via text projections
- Sectors: platforms, policy/compliance
- What: Moderate and audit media via high-fidelity text projections augmented with sensory prompting to preserve perceptual relationships.
- Why long-term: Must validate coverage/false negatives, build legal frameworks for “proxy moderation,” and standardize caption quality.
- Cross-lingual perceptual alignment for low-resource languages
- Sectors: global education, cultural heritage, public-interest tech
- What: Use sensory prompting to better align multilingual captions with perception, improving retrieval and learning resources in low-resource languages.
- Why long-term: Depends on multilingual LLM strengths, multilingual reference encoders, and caption availability.
- Human-in-the-loop systems for scene-appropriate grounding
- Sectors: media production, education, scientific communication
- What: Editorial workflows that flag generic or hallucinated sensory terms and iteratively refine toward scene-appropriate detail.
- Why long-term: Requires UX, metrics for “scene-appropriateness,” and scalable annotation/feedback loops.
Notes on Feasibility and Dependencies
- Model scale: Larger LLMs show stronger modality separation and alignment. Smaller models may default to visual framing without cues.
- Generation length: 128–256 tokens generally improves alignment; >256 risks semantic drift. Budget tokens and add early-stopping criteria.
- Prompting style: Instructional prompts (“Imagine seeing…”) outperform non-instructional phrasing; verbs matter (“describe” may beat “imagine”).
- Modality mismatch: SEE helps vision alignment; HEAR helps audio, but audio alignment is weaker. Choose the cue based on the task and consider audio-optimized encoders for evaluation.
- Data quality: Scene-appropriate sensory details are necessary; random “visual words” degrade alignment. Caption accuracy gates performance.
- Metrics and targets: Mutual-kNN alignment depends on chosen encoders (e.g., DINOv2, BEATs); validate across multiple reference models for robustness.
- Safety and disclosure: The system does not truly “see/hear.” Disclose the proxy nature; avoid use where pixel/temporal precision is critical (e.g., diagnosis, safety-critical perception).
- Compute and latency: Autoregressive embedding is costlier than single-pass; mitigate with caching, batching, and selective routing (turn sensory mode on only when beneficial).
- Generalization: Paper evaluates Qwen3; expect broader applicability but verify for your model family (Llama, Phi, etc.) before production.
- Open-source resources: Start from the project code repo to reproduce embedding pipelines and alignment measurements.
Glossary
- Autoregressive: A generation process where each token is produced sequentially, conditioning on previously generated tokens. "We observe that these autoregressive steps yield a representation that is more similar in geometry to an encoder that was trained on the corresponding modality."
- BEATs: A self-supervised audio encoder trained on natural sounds, used as the specialist auditory model. "For audio, we use BEATs-Iter3 \citep{Chen2022beats}, a self-supervised model trained only on natural sounds (AudioSet)."
- Bootstrap standard error: An uncertainty estimate computed by resampling with replacement and recomputing a statistic across replicates. "Error bars in paper figures denote ± 1 bootstrap standard error, obtained by resampling paired rows with replacement from the dataset to form bootstrap replicates and recomputing the mutual-kNN alignment score."
- Cosine neighbors: Nearest neighbors determined by cosine similarity between embedding vectors. "For each prompt condition and dataset, we embed all samples, construct kernels from cosine neighbors, and compute alignment between the LLM and the corresponding sensory encoder."
- Cross-modal convergence: The phenomenon where models trained on different modalities develop increasingly similar representation structures as they scale. "While such cross-modal convergence emerges with scale, it raises an interesting question."
- DINOv2: A self-supervised vision encoder (ViT-based) used as the specialist visual model. "For vision, we use DINOv2-Base (ViT-B/14, 768-dim) \citep{oquab2023dinov2}, a self-supervised model trained only on images."
- Embedding projections: Mapping high-dimensional embeddings onto a lower-dimensional axis to analyze structure (e.g., modality separation). "Embedding projections onto visual–auditory axis show clearer modality separation in larger models."
- Frozen (model): A model evaluated without any parameter updates or fine-tuning. "All models are kept frozen during evaluation."
- Generative representations: Representations computed by aggregating hidden states over the tokens produced during generation, not just the initial pass. "In this work, we introduce the notion of generative representations: when an LLM is asked to generate, each output token involves another forward pass, which recursively builds a representation that is not only a function of the prompt, but also of the sequence generated so far."
- Kernel: The pairwise similarity matrix over a set of embeddings, often using cosine similarity. "Following the Platonic Representation Hypothesis framework \citep{huh2024platonic}, we define a representation as the set of embeddings a model produces on a dataset, and its induced kernel as the similarity structure among these embeddings."
- Kernel density estimation: A nonparametric method to estimate the probability density function of a random variable. "We estimate the distribution of using kernel density estimation."
- Latent structure: Underlying shared representational organization that different models or modalities can converge to. "They argue that this convergence reflects the existence of a shared latent structure underlying different modalities."
- Modality: A specific type of sensory information or data channel (e.g., vision, audio). "These results indicate that a single sensory cue in the prompt can steer the internal representations of the LLM to better match the geometry of the modality the cue invokes."
- Modality separation: The degree to which representations corresponding to different modalities are distinct from each other. "Larger models exhibit higher alignment under sensory prompting and stronger modality separation."
- Mutual-kNN alignment: A measure of representational similarity based on the overlap of mutual k-nearest neighbors across two kernels. "To compare two kernels, we use mutual-kNN alignment."
- Platonic Representation Hypothesis: A framework proposing that models across modalities share convergent representational structures at scale. "Following the Platonic Representation Hypothesis framework \citep{huh2024platonic}, we define a representation as the set of embeddings a model produces on a dataset, and its induced kernel as the similarity structure among these embeddings."
- Prompt engineering: The practice of crafting input prompts to steer model behavior or internal representations. "Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs."
- Qwen3: A family of LLMs evaluated as the text-only LLMs in this paper. "We evaluate frozen Qwen3 LLMs \citep{yang2025qwen3} across scales (0.6B, 1.7B, 4B, 8B, 14B, 32B)."
- Residual connections: Architectural links that add inputs to outputs of layers to stabilize training and preserve information. "Residual connections in the LLM architecture make this averaging a meaningful summary of the model's overall state, which we evaluate in Appendix~\ref{app:layerwise_evaluation}."
- Representational alignment: The degree to which two models’ induced similarity structures over the same dataset are similar. "LLMs trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders."
- Representational similarity: Quantitative comparison of how similarly two models encode relationships among inputs. "By defining the meaning of a symbol through the relationships it maintains with others \citep{wittgenstein1953pi}, alignment can be quantified through kernel-based representational similarity metrics (e.g., mutual k-nearest neighbors)."
- Self-supervised model: A model trained without explicit labels by leveraging structure in the data itself. "For audio, we use BEATs-Iter3 \citep{Chen2022beats}, a self-supervised model trained only on natural sounds (AudioSet)."
- Sensory axis: A one-dimensional axis in embedding space that contrasts modalities (e.g., visual vs. auditory) by mean differences. "we project embeddings onto a sensory axis defined by the mean difference between prompt conditions."
- Sensory prompting: Adding explicit sensory cues (e.g., “see”, “hear”) to prompts to steer internal representations toward a modality. "We quantify how sensory prompting steers the representation of an LLM by comparing them to frozen unimodal encoders in vision and audio domains."
- Semantic drift: The tendency of long generations to deviate from the original prompt’s meaning over time. "However, we note that alignment can decline as you continue to increase output tokens due to semantic drift from the prompt (Appendix~\ref{app:extension})."
- Symbol-grounding problem: The challenge of how symbols gain intrinsic meaning without direct perceptual grounding. "This tension echoes the symbol-grounding problem, which asks how purely textual symbols can acquire intrinsic meaning without being anchored in direct perceptual experience \citep{harnad1990symbol}."
- Unimodal encoder: A model trained on and specialized for a single modality (e.g., vision-only or audio-only). "We quantify how sensory prompting steers the representation of an LLM by comparing them to frozen unimodal encoders in vision and audio domains."
- Visual framing: Instructing an LLM to adopt a visual perspective or to imagine seeing content before answering. "We then evaluate Qwen3-14B as the question-answering model under two prompt conditions: a neutral instruction and a visual framing, which explicitly asks the model to imagine seeing the caption before answering."
- Visual Question Answering (VQA): A task where models answer questions about images; here evaluated via captions instead of images. "Visual cues allow LLMs to perform better on VQA in the text modality."
- VQA without V: A setup where the model answers visual questions using only captions (text), not images. "we adopt the ``VQA without V'' setting from \citet{chan2025on,chai2024auroracap}."