
Neural Computers

Published 7 Apr 2026 in cs.LG and cs.AI | (2604.06425v1)

Abstract: We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural Computer (CNC): the mature, general-purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether early NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. These implementations show that learned runtimes can acquire early interface primitives, especially I/O alignment and short-horizon control, while routine reuse, controlled updates, and symbolic stability remain open. We outline a roadmap toward CNCs around these challenges. If overcome, CNCs could establish a new computing paradigm beyond today's agents, world models, and conventional computers.

Summary

  • The paper introduces Neural Computers (NCs) as a unified trainable runtime that integrates computation, memory, and I/O into a single persistent neural state.
  • The authors demonstrate prototype implementations on CLI (CLIGen) and GUI (GUIWorld) platforms, achieving high text fidelity (PSNR 40.77, SSIM 0.989) and precise action conditioning.
  • The study highlights challenges in achieving complete neural computation, including limited native symbolic reasoning, long-horizon state persistence, and robust programmatic governance.

Neural Computers: Toward a Unified Learned Runtime

Introduction and Motivation

The paper "Neural Computers" (2604.06425) introduces Neural Computers (NCs) as a distinct machine abstraction in which a trainable neural system unifies computation, memory, and I/O in a single learned runtime state. This conceptualization extends beyond previous differentiable memory architectures (e.g., NTM, DNC) by positing that the model itself acts as the running computer, rather than as an agent interfacing with an external environment or simulator. The motivation is to bridge the gap between conventional stored-program computers—where compute, memory, and interfaces are modular and external to the model—and large neural systems, which so far function only as informatic overlays atop existing OS/application stacks or as environment/world predictors.

The core contribution is the proposal and instantiation of NCs as high-capacity, video-based generative architectures directly modeling terminal and desktop user interfaces, along with a technical and philosophical roadmap toward mature "Completely Neural Computers" (CNCs) that could replace conventional execution substrates for digital interaction, memory, and control.

Figure 1: A schematic and visual overview of neural computers: a unified system that rolls out interface frames (top: terminal, bottom: desktop) conditioned on user prompts or actions, modeling the temporal-dynamic behavior of the computer as a learned latent state.

Neural Computer Abstraction

An NC is defined as a pair of neural functions (F_θ, G_θ) parameterized by θ, operating over a persistent latent runtime state h_t that accumulates the full executable interface context (memory, buffer, etc.). At each step and for each interface (e.g., terminal or GUI):

h_t = F_θ(h_{t-1}, x_t, u_t),    x_{t+1} ~ G_θ(h_t)

where x_t is the observable (frame/image/buffer), u_t is the action or conditioning stream, and h_t serves as the working memory and receiver of updates from all I/O. This setup is realized in the paper as a video-based model where h_t corresponds to video latents (e.g., those produced by a VAE or diffusion video model).
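The step equation can be sketched as a recurrent rollout loop. The linear maps, tanh nonlinearity, and state width below are illustrative placeholders standing in for the learned F_θ and G_θ, not the paper's actual video-latent model:

```python
import numpy as np

# Minimal sketch of the NC abstraction: F folds the previous state, the
# current observable x_t, and the action stream u_t into the persistent
# runtime state h_t; G decodes the next observable (screen frame) from h_t.
rng = np.random.default_rng(0)
D = 16                                             # state width (hypothetical)
W_h, W_x, W_u = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
W_g = rng.standard_normal((D, D)) * 0.1

def F(h_prev, x_t, u_t):
    """h_t = F_theta(h_{t-1}, x_t, u_t): one execution step of the runtime."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + W_u @ u_t)

def G(h_t):
    """x_{t+1} ~ G_theta(h_t): emit the next frame (deterministic stand-in)."""
    return W_g @ h_t

def rollout(x0, actions):
    """Open-loop rollout: the model itself carries the 'running computer' state."""
    h, x, frames = np.zeros(D), x0, []
    for u in actions:
        h = F(h, x, u)        # state transition
        x = G(h)              # next observable
        frames.append(x)
    return frames

frames = rollout(rng.standard_normal(D), [rng.standard_normal(D) for _ in range(5)])
```

The point of the sketch is structural: there is no external RAM or program counter, only the recurrent state h threading through every step.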

NCs fundamentally diverge from Neural Turing Machines, Differentiable Neural Computers, and world models by not merely differentiating across external RAMs but instead learning and encapsulating the execution and state transition of the computer itself as a single, persistent, trainable state.

Prototypes: Terminal (CLIGen) and Desktop (GUIWorld) NCs

The implementation leverages SOTA video generative models (notably Wan2.1 [wan2025wan]), augmented to synchronize text, action streams, and video for two major human-computer interaction surfaces:

  1. CLIGen (Terminal NC):
    • Models command-line interface sessions using rendered terminal videos synchronized with prompt/action metadata.
    • Two datasets: "General" (diverse, public terminal traces; open world) and "Clean" (deterministic, script-driven, Dockerized, for precise control and ablation).
  2. GUIWorld (GUI NC):
    • Models desktop environments, tracking mouse/keyboard actions and rendering future RGB frames.
    • Data collection encompasses both random and goal-directed traces, with precise action conditioning and cursor location supervision.

      Figure 2: Future interface frames rolled out by NCs for both CLI and GUI settings, demonstrating learned terminal and GUI dynamics over action streams.
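Both pipelines hinge on synchronizing frames with prompt/action metadata. A hypothetical record schema for one trace step is sketched below; the field names are illustrative, as the paper does not publish an exact record format:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for one synchronized step of an I/O trace, covering both
# CLIGen (prompt/terminal text) and GUIWorld (mouse/keyboard events).
@dataclass
class TraceStep:
    t_ms: int                                         # timestamp of this frame
    frame: bytes                                      # encoded RGB screenshot
    prompt: Optional[str] = None                      # CLIGen: caption/instruction
    keys: list = field(default_factory=list)          # key events since last frame
    mouse_xy: Optional[tuple] = None                  # GUIWorld: cursor position (px)
    mouse_buttons: list = field(default_factory=list) # click events

@dataclass
class Trace:
    source: str        # e.g. "cligen-general" | "cligen-clean" | "guiworld"
    steps: list

# A two-step CLI trace: the user types "ls", and the next frame shows output.
trace = Trace("cligen-clean", [
    TraceStep(t_ms=0,  frame=b"...", prompt="empty shell prompt", keys=["l", "s", "Enter"]),
    TraceStep(t_ms=33, frame=b"...", prompt="directory listing visible"),
])
```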

Key Experimental Findings

CLI: CLIGen

  • Text Rendering Fidelity: At practical font sizes (e.g., 13 px), NCs maintain high-fidelity, readable interface state (PSNR 40.77, SSIM 0.989).
  • Prompt Conditioning: The level of specificity in conditioning text (from semantic to detailed captions) directly modulates text-to-pixel alignment, with detailed prompt captions yielding a PSNR gain of nearly 5 dB.
  • Character-Level Accuracy: After 40–60k steps of training, video model outputs achieve character accuracy up to 0.54 and exact-line accuracy up to 0.31 (as measured by OCR on generated frames)—implying that NCs can produce text-consistent terminal rollouts.
  • Symbolic Computation Probes: Baseline arithmetic accuracy is poor (4%), indicating weak native symbolic reasoning. However, "reprompting" (richer conditioning, which can include the solution verbatim) boosts symbolic probe scores to 83%, evidence that current NCs act as powerful, steerable renderers and thin interfaces, but not as native reasoners.
  • Learning Trajectory: Global perceptual reconstruction metrics plateau early, with additional fine-tuning yielding diminishing returns.

    Figure 3: CLIGen VAE reconstructions at varying font sizes, demonstrating NCs' robustness in rendering visually accurate terminal states, tightly dependent on font size.
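The character- and line-level text metrics above can be sketched as follows. The paper's exact matching rules are not specified, so difflib's similarity ratio stands in for character accuracy, and line accuracy is an order-aligned exact match:

```python
import difflib

# Sketch of the two text-fidelity metrics for CLIGen: character-level accuracy
# and exact-line accuracy, computed on OCR transcripts of generated frames
# versus ground-truth terminal text.
def char_accuracy(pred: str, ref: str) -> float:
    """Similarity of OCR'd text to the reference, in [0, 1]."""
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def exact_line_accuracy(pred: str, ref: str) -> float:
    """Fraction of reference lines reproduced exactly (order-aligned)."""
    p, r = pred.splitlines(), ref.splitlines()
    hits = sum(1 for a, b in zip(p, r) if a == b)
    return hits / max(len(r), 1)

ref  = "$ ls\nfile1.txt  file2.txt\n$"
pred = "$ ls\nfile1.txt  fi1e2.txt\n$"   # one OCR confusion: l -> 1
print(round(char_accuracy(pred, ref), 3), exact_line_accuracy(pred, ref))
```

Here a single character confusion keeps character accuracy near 1.0 while dropping exact-line accuracy to 2/3, which is why the paper's two numbers (0.54 characters vs 0.31 exact lines) can diverge.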

GUI: GUIWorld

  • Supervised Action Traces: A small set of high-quality, goal-directed interactions outperforms much larger random datasets for learning fine action-response structure.
  • Precise Cursor Control: Explicit pixel-level cursor supervision (via SVG overlays) is necessary to achieve near-perfect alignment (98.7% accuracy); coordinate-based supervision alone is inadequate.
  • Action Conditioning Schemes: The depth and method of action signal injection (external, contextual, residual, or internal) in the diffusion stack strongly affects post-action visual fidelity. Internal conditioning (action cross-attention within transformer blocks) yields best action-to-frame correspondence and lowest frame distortion (FVD 14.5, SSIM 0.863).
  • Action Representation: Structured API-like meta-actions outperform raw event streams in action-driven fidelity, but the injection locus is a larger determinant of performance than encoding granularity.

    Figure 4: Explicit cursor reference conditioning in GUIWorld: original desktop frames, binary cursor masks, and cursor-only references, highlighting mechanisms for achieving precise pointer control.
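A minimal sketch of the "internal" conditioning scheme (action cross-attention inside a transformer block): frame tokens query the encoded action tokens, so every spatial position can read the user's action stream. Shapes and projections are illustrative, not the actual Wan2.1 stack:

```python
import numpy as np

# Frame tokens attend to action tokens; the result is added back residually,
# mirroring cross-attention layers inserted into a diffusion transformer block.
rng = np.random.default_rng(0)
d = 32                                                    # model width (hypothetical)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def action_cross_attention(frame_tokens, action_tokens):
    q = frame_tokens @ Wq                 # (F, d) queries from frames
    k = action_tokens @ Wk                # (A, d) keys from actions
    v = action_tokens @ Wv                # (A, d) values from actions
    attn = softmax(q @ k.T / np.sqrt(d))  # (F, A) frame -> action weights
    return frame_tokens + attn @ v        # residual update of frame tokens

frames  = rng.standard_normal((64, d))    # 64 latent frame tokens
actions = rng.standard_normal((4, d))     # e.g. move, hover, click, release
out = action_cross_attention(frames, actions)
```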

Roadmap to Completely Neural Computers (CNCs)

The paper rigorously defines CNCs as the completion of the NC abstraction with the following properties:

  • Turing Completeness: Universal computational expressiveness.
  • Universal Programmability: Routines can be installed and reused via input sequences themselves, not merely triggered as one-shot behaviors.
  • Behavior Consistency: Model behaviors are reproducible unless explicitly reprogrammed; persistent functions are invoked without silent drift.
  • Machine-Native Semantics: The underlying runtime leverages and exposes the inherent representational and compositional strengths of neural architectures—not merely mimicking symbolic APIs.

A strict operational reading is that CNCs should support persistent, composable function installation/invocation, be annotatable/auditable for behavioral change, and demonstrate robust long-horizon consistency and governance.

The roadmap highlights the need for architectural advances to support unbounded effective memory, conditional update mechanisms, and compositional semantic installation. The evidence so far is that NCs can already realize some shell primitives (I/O alignment, cursor control) but fall short on intrinsic symbolic reasoning, stable long-horizon behavior, and routine re-use.

Figure 5: A comparative schematic showing the evolution from direct human use of conventional computers, to agent-mediated interfaces, to unifying all runtime roles within a learned NC.

Contrasts with Conventional Computers, World Models, and Agents

  • Conventional Computers: Explicit separation of program, memory, and I/O; semantics grounded in discrete, local symbolic manipulation.
  • World Models: Capture only the predictive interface, not the executable state.
  • AI Agents: Mediate explicit computers, do not subsume the runtime; rely on external execution substrates and usually cannot persist or re-use capability within the model.
  • Neural Computers: Collapse all boundaries; all running state, memory, and interface behaviors are embodied in a persistent, updateable neural latent state.

The implication is both architectural (the emergence of a new candidate for general-purpose, user-programmable computers) and practical (future software stacks may be built as configuration or programming over an NC substrate, using learned programming-language semantics).

Limitations and Open Challenges

  • Symbolic Generalization: NCs remain unreliable at symbolic manipulation and arithmetic unless answers are "leaked" through conditioning (prompt-driven recapitulation rather than computation).
  • Long-Horizon State: Sustaining persistent routines and governing them (behavior audit/logging, defect rollback) remain unaddressed at present.
  • World Model Capacity: While recent video models (Sora2, Veo3.1) show qualitative jumps, architectural advances (e.g., parameter growth, dynamic memory, function disentanglement) will be required to achieve CNC characteristics.
  • Separation of Execution and Update: Robust mechanisms to separate dynamic execution from explicit update/reprogramming must be developed, possibly via gating (e.g., LSTM-like or hybrid mechanisms).
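One speculative way to realize such gating, with a scalar "program mode" signal controlling writes to a persistent routine memory while ordinary execution touches only the transient state. This is an illustration of the LSTM-like idea named above, not a mechanism from the paper's prototypes:

```python
import numpy as np

# Transient state h changes every step; persistent routine memory m changes
# only when an explicit reprogramming signal opens the write gate.
def step(h, m, x, program_mode: float):
    """program_mode in [0, 1]: 0 = execute only, 1 = allow memory writes."""
    h_new = np.tanh(h + x + m)                       # transient execution state
    write = np.tanh(x)                               # candidate routine update
    m_new = (1 - program_mode) * m + program_mode * write
    return h_new, m_new

h = np.zeros(4)
m = np.ones(4) * 0.5
x = np.array([1.0, -1.0, 0.5, 0.0])

h1, m1 = step(h, m, x, program_mode=0.0)   # execution: m unchanged
h2, m2 = step(h, m, x, program_mode=1.0)   # reprogramming: m rewritten
```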

Outlook and Theoretical Implications

NCs propose an alternate foundation for general computation where the installation, composition, and evolution of capabilities are realized directly via modification of continuous neural state and input-driven specifications, overhauling both OS-level management and conventional programming interfaces. As all observable I/O, memory, and even updates become endogenous to the neural runtime, software development shifts from code editing to trace curation, prompt programming, and interaction demonstration—a new paradigm strongly favored by the abundance of interaction data over hand-written code.

The roadmap toward CNCs involves surmounting formidable stability, symbolic, and governance gaps. If solvable, this would mark a shift from neural computation as an adjunct to digital computers to the central runtime substrate of future digital interfaces.

Conclusion

The "Neural Computers" framework (2604.06425) demonstrates the technical feasibility of instantiating neural architectures as learned runtimes capable of unifying compute, memory, and I/O. The presented open-loop video-based prototypes (CLIGen, GUIWorld) establish initial reference points for interface fidelity and action-conditioned control in both CLI and GUI domains. However, symbolic reliability, persistent function installation, and programmatic governance remain open challenges for advancing to the full CNC abstraction. The work serves as both a technical milestone and a conceptual platform for re-examining the possibilities of general-purpose computation in the neural era, highlighting new axes for architectural, theoretical, and practical research at the foundations of AI and human-computer interaction.

Whiteboard

Explain it Like I'm 14

Neural Computers — A simple explanation

1) What is this paper about?

This paper asks a bold question: Can one neural network act like a whole computer by itself? The authors call this idea a Neural Computer (NC). Instead of having separate parts for thinking (compute), remembering (memory), and showing/receiving things (input/output), an NC tries to learn all of these inside one “brain” made of neural network weights. The long-term vision is a Completely Neural Computer (CNC) that works like a general-purpose computer, but is learned end-to-end.

2) What questions were the researchers trying to answer?

In plain terms, they explored:

  • Can a neural network learn to “be” the running computer, not just a tool that uses one?
  • Can it keep track of what’s on the screen, react to what the user types or clicks, and predict what the screen will look like next?
  • Are early “computer-like” skills—such as aligning with what’s on the screen and responding correctly to short sequences of actions—learnable from real interface recordings?
  • How far can it go with reasoning (like math) without extra tricks?

3) How did they test the idea? (Methods in everyday language)

They turned the computer interface into a kind of “video world” the model could learn.

  • Think of a computer screen like a movie. Each frame is a screenshot. When a user types or clicks, the screen changes. The model learns to predict the next screen based on the current screen and the user’s action—like a smart video game engine that knows what happens when you press keys or move the mouse.
  • The model keeps an internal memory (they call it a “latent runtime state,” like a notebook in its head) so it remembers what’s going on across frames.

They built two prototypes:

  • CLIGen (for terminals/command lines): It watches terminal videos and learns what happens when you type commands like “python” or “ls.”
  • GUIWorld (for desktops/graphical apps): It watches desktop recordings with mouse and keyboard logs and learns how the interface responds to clicks, hovers, and menu actions.

How they trained it (simplified):

  • They collected lots of recorded sessions:
    • For CLIGen (General): public terminal recordings (asciinema) with text, timing, and frames.
    • For CLIGen (Clean): scripted, repeatable terminal demos (so there’s less randomness).
    • For GUIWorld: desktop screen videos aligned with mouse and keyboard actions.
  • They used a strong video generator as the base (a diffusion video model; you can think of it as a careful painter that builds each frame step-by-step).
  • A “VAE” compressed each frame into a short code (like zipping images), which the model used as its internal memory.
  • “Conditioning” means giving the model extra guidance, like a caption describing the session, or the actual list of user actions (keys clicked). This helps it know what to draw next.
  • They judged results with:
    • Image similarity scores (PSNR/SSIM) for “does it look like the real thing?”
    • OCR (text reading) to check if on-screen text is correct, letter by letter and line by line.
    • Simple math tasks in the terminal to test basic reasoning.

4) What did they find, and why does it matter?

Key findings:

  • Early “computer-like” skills are learnable:
    • The model stayed aligned with what was on the screen and handled short action sequences well.
    • In the terminal, it learned realistic details: scrolling, prompts, wrapping lines, resizing windows, and rendering readable text at normal font sizes.
    • In the desktop GUI, it learned short-horizon control, like moving the pointer, hover and click feedback, and opening menus.
  • Clearer instructions help a lot:
    • Detailed, literal captions (prompts that say exactly what’s on the screen) improved how accurately the model rendered terminal content.
  • Training improves fast at first, then levels off:
    • Visual quality metrics rose quickly and then plateaued, suggesting that beyond a point, better or more informative data might matter more than just longer training.
  • It can produce correct on-screen text fairly well:
    • Character-level accuracy (checked by OCR) climbed to roughly half of characters correct and about a third of lines exactly correct in controlled tests—good signs for usable, readable terminals.
  • Reasoning (like math) is still weak without help:
    • On math tasks shown through the terminal, the model did poorly (around 4% correct), unless it was given stronger hints or more specific prompts. With better prompts (reprompting), accuracy jumped to 83%—showing it’s very steerable by instructions, but not yet a reliable calculator on its own.

Why this matters:

  • These are the first steps toward a computer that’s fully “in” a neural network—one that can hold state, react to actions, and render the next screen, all inside learned weights. That’s a new way of thinking about computers and could eventually simplify how systems are built and controlled (e.g., steer them with plain language and demonstrations).

5) What’s the big picture? (Implications and impact)

This work hints at a future where computers could be learned, not just engineered piece by piece. If the vision of a Completely Neural Computer (CNC) comes true, you might:

  • Program and control a system through examples and language, rather than writing traditional code for every part.
  • Have a unified “brain” that handles memory, computation, and the user interface seamlessly.

But there are big challenges ahead:

  • Long, complex tasks (long-horizon reasoning) and reliable symbolic work (like math and logic) are still hard.
  • Reusing capabilities safely and keeping behavior consistent over time needs careful design (runtime governance).
  • New neural architectures may be needed to go beyond video-based prototypes.

In short, the paper shows that simple “runtime” skills—like keeping the screen consistent and responding to clicks/keys—can be learned today. That’s encouraging progress toward the bigger goal of fully neural computers, but solid reasoning and reliability will require more advances.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of what remains missing, uncertain, or unexplored based on the paper’s current scope and results:

  • No closed-loop evaluation: all rollouts are open-loop from logged conditioning streams; there is no environment-in-the-loop test of interactive stability, error recovery, or task completion over long horizons.
  • Unverified “executable state” semantics: the latent state h_t (realized as video latents z_t) is not shown to carry structured, addressable, or interpretable program state beyond correlational dynamics.
  • Absent evidence of Turing completeness or universal programmability: the prototypes do not demonstrate execution of arbitrary programs, conditional branching with reliable state, or general-purpose control flow.
  • Determinism and behavior consistency are unproven: diffusion sampling introduces stochasticity; there is no method to guarantee repeatable behavior “unless explicitly reprogrammed.”
  • No runtime governance mechanism: the paper highlights governance as a CNC criterion but does not propose concrete controls for inspection, auditing, interruption, sandboxing, or policy enforcement.
  • Long-horizon reasoning and control remain untested: experiments focus on short-horizon I/O and local control (e.g., pointer hover/click); there is no quantitative study of multi-step planning, multi-window workflows, or extended sessions.
  • Native symbolic computation is weak: arithmetic probes show near-zero to low accuracy without reprompting; it remains unclear how to endow stable, native symbolic operations within the NC backbone.
  • Conditioning confounds “reasoning”: large gains on arithmetic (4%→83%) come from reprompting/recaptioning; evaluation does not disentangle answer-conditioned rendering from internal computation.
  • RL vs. conditioning vs. architecture: the paper does not resolve whether reliable symbolic computation requires reinforcement learning, stronger conditioning, architectural changes (e.g., discrete memory), or their combination.
  • Metrics are misaligned with interactivity: reliance on PSNR/SSIM and OCR accuracy does not capture control competence, task success, or state consistency; no standardized NC benchmarks are proposed.
  • Early training plateaus suggest objective/data limits: reconstruction metrics saturate quickly; it is unknown which objectives (e.g., control-aware, consistency, or state-tracking losses) would push beyond early plateaus.
  • GUIWorld lacks quantitative control metrics: while pointer dynamics look coherent qualitatively, there are no measurements of click accuracy, target acquisition latency, hit rates, or robustness to UI variability.
  • Generalization across apps, themes, and resolutions is unproven: models are sensitive to font size and fixed rendering setups; robustness to unseen applications, DPI changes, color themes, and window geometries is not evaluated.
  • Closed-loop safety and error handling are unaddressed: the NC’s response to unexpected system states, latency spikes, partial renders, or asynchronous events (e.g., network errors) is not studied.
  • Memory capacity and retention properties are unknown: no experiments quantify working-memory length, catastrophic forgetting, state drift over long rollouts, or the benefits of explicit memory mechanisms.
  • No explicit representation for structured buffers: the NC does not read/write terminal buffers or OS state as privileged inputs; the trade-off between pixel-only modeling and structured state access is unexplored.
  • Scalability of the video-based substrate is unclear: costs, latency, and throughput for high-resolution desktops, multi-monitor setups, and low-latency interaction are not profiled or optimized.
  • Ambiguity in action modeling: the paper explores action injection/encoding but does not isolate how action representations affect controllability, credit assignment, or generalization across devices.
  • Lack of causal ablations: it is unclear which components (VAE choice, CLIP features, caption styles, action encoders) are necessary/sufficient for I/O alignment and control versus mere visual fidelity.
  • Dataset biases and leakage risks: CLIGen (General) uses LLM-generated captions; CLIGen (Clean) includes correct answers in ~50% of arithmetic captions; these may bias results and blur the line between reasoning and conditioning.
  • No comparison to agent baselines: there is no head-to-head evaluation against LLM-based OS agents or programmatic simulators on identical tasks, leaving unclear where NCs help or lag.
  • Inadequate treatment of error accumulation: the stability of latent state through rollouts, compounding rendering errors, and cursor/geometry drift are not quantitatively analyzed.
  • Programming model is unspecified: how users “program” an NC (languages, APIs, constraints), how to modularize skills, and how to perform upgrades without catastrophic behavior shifts remain open.
  • Reprogramming vs. learning updates: mechanisms to patch behaviors, version models, and ensure backward-compatible “software updates” are not defined.
  • Verification and debugging are unsolved: there is no pathway for formal verification, unit testing, or introspection of latent state to trace and fix failures.
  • Security and privacy are unaddressed: risks of latent-state retention of secrets, prompt injection through UI pixels, or unsafe actions in a closed-loop desktop are not scoped or mitigated.
  • No clear path to discrete, addressable memory: the prototypes do not integrate differentiable/external memory; whether latent-only dynamics can support robust algorithmic manipulation is an open question.
  • Action-selection is not learned: the NC renders conditioned on provided actions; it does not generate actions or policies, leaving open how a CNC would decide and execute next steps autonomously.
  • Handling asynchronous and concurrent processes is untested: interrupts, background tasks, and race conditions in real OS environments are not modeled or evaluated.
  • Transfer and reuse of capabilities are unclear: how to compose skills across CLI and GUI domains, share parameters, or amortize learning across tasks is left for future work.
  • Data release, reproducibility, and standardization: it is unclear whether datasets, prompts, and evaluation harnesses will be released to enable reproducible CNC benchmarking.
  • Energy and hardware constraints: the prototypes require substantial GPU time; feasibility for real-time, low-latency “computing” on commodity hardware is not demonstrated.
  • Alternative substrates and architectures: while the paper hints that video models are a stopgap, it does not explore or prototype architectures purpose-built for NC/CNC properties (e.g., hybrid neural-discrete systems).

Practical Applications

Overview

This paper introduces “Neural Computers” (NCs): neural systems that unify computation, memory, and I/O in a single learned runtime state and demonstrates two practical, video-model-based prototypes:

  • CLIGen for command-line interfaces (terminal frames + text conditioning).
  • GUIWorld for desktop GUIs (screen frames + synchronized mouse/keyboard).

The prototypes learn early runtime primitives—screen I/O alignment, coherent short-horizon control, and character-accurate rendering—from raw interface input/output without privileged access to program state. Below are actionable applications and what they depend on, grouped by time horizon.

Immediate Applications

Below are deployable-now use cases based on demonstrated capabilities (high-fidelity interface rendering, short-horizon action response, text-to-pixel alignment, synchronized data pipelines, and conditioning-driven control).

Each entry below names the application and its sector(s), then lists potential tools/workflows and assumptions/dependencies.
  • Prompt-to-terminal screencast generation from natural language: software, education, content creation
    • Potential tools/workflows: “Prompt-to-screencast” doc builder; product/tutorial video auto-generator; developer marketing assets; reproducible runbook visuals from captions.
    • Assumptions/dependencies: Best on well-covered CLI patterns and sensible font sizes (≈13 px); limited long-horizon logic; open-loop generations (no live environment feedback).
  • Synthetic UI/CLI dataset generation for vision/OCR and UI parsers: AI/ML, software QA
    • Potential tools/workflows: Large-scale terminal/GUI video corpora for training OCR, UI element detectors, layout parsers; controlled font/palette ablations to test robustness.
    • Assumptions/dependencies: Data licensing/privacy; fidelity hinges on render consistency and timing; risk of domain shift outside trained themes/fonts.
  • Short-horizon UI automation prototyping (micro-automation): enterprise RPA, productivity software
    • Potential tools/workflows: “Neural pilot” that simulates/validates hover/click menus, focus changes, and window transitions before deploying brittle RPA scripts; macro recording/rehearsal in a sandbox.
    • Assumptions/dependencies: Works in controlled desktops with synchronized action traces; short-horizon only; open-loop (no guaranteed success criteria).
  • Sandboxed training substrates for agent research: AI/embodied agents
    • Potential tools/workflows: Offline imitation pretraining on synchronized (frame, action) logs; action-conditioned rollouts for curriculum learning; ablations on action encodings.
    • Assumptions/dependencies: Current prototypes are not closed-loop environments and have weak native symbolic reasoning; reward signals and evaluators must be externally defined.
  • CLI/GUI regression and rendering QA: software QA, DevOps
    • Potential tools/workflows: Character-level accuracy and OCR-based diffs to detect rendering regressions; terminal “golden video” comparisons; palette/geometry change detectors.
    • Assumptions/dependencies: Sensitive to font/themes; global metrics (PSNR/SSIM) plateau and can be misleading—prefer OCR/line-level metrics.
  • Documentation and runbook validation: DevOps/IT operations
    • Potential tools/workflows: Turn runbooks into expected terminal rollouts and compare with recorded sessions; auto-flag misalignments (cursor position, line breaks, prompt wrapping).
    • Assumptions/dependencies: Deterministic scripts (e.g., vhs/Docker) work best; actual program correctness is not verified—visual alignment only.
  • Safe, simulated REPL practice for learners: education, security
    • Potential tools/workflows: “SafeShell” sandboxes that show likely terminal outputs without executing code; interactive labs for shell fundamentals; visual feedback on edits/cursor mechanics.
    • Assumptions/dependencies: Outputs are approximations; not for teaching deep program semantics; must clearly disclose that results are simulated.
  • Conditioning-enhanced rendering for support workflows: customer support, CX
    • Potential tools/workflows: Step-by-step troubleshooting videos generated from structured prompts; reprompting/LLM helps compute answers which the NC faithfully renders as UI/CLI outputs.
    • Assumptions/dependencies: Heavily reliant on accurate conditioning (reprompts/LLM-planned steps); risk of hallucinated steps; compliance and liability considerations.
  • HCI/accessibility prototyping via pointer/interaction modeling: HCI, accessibility
    • Potential tools/workflows: Assess hover/click responses, focus indicators, pointer dynamics; visualize impact of larger fonts/high-contrast themes on interaction flows.
    • Assumptions/dependencies: Requires datasets with consistent themes and annotated interactions; currently focused on short-horizon transitions.
  • Evaluation methodology for “reasoning vs conditioning” in video models: academia
    • Potential tools/workflows: Arithmetic probes, reprompting protocols, action-encoding ablations, and OCR-based metrics shared as a benchmark suite; reproducible data-engine recipes.
    • Assumptions/dependencies: Compute-heavy (thousands of H100 hours reported); careful interpretation needed—conditioning can mask lack of native symbolic computation.
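The "golden video" comparison under regression QA could be sketched as a PSNR-thresholded frame diff. The 35 dB threshold and synthetic frames below are arbitrary illustrations, not values from the paper:

```python
import numpy as np

# Compare rendered frames against stored reference frames with PSNR and flag
# frames that fall below a fidelity threshold (a simple rendering-regression check).
def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def regression_flags(golden, candidate, threshold_db=35.0):
    """Return indices of frames whose PSNR falls below the threshold."""
    return [i for i, (g, c) in enumerate(zip(golden, candidate))
            if psnr(g, c) < threshold_db]

rng = np.random.default_rng(0)
golden = [rng.integers(0, 256, (48, 64), dtype=np.uint8) for _ in range(3)]
candidate = [g.copy() for g in golden]
candidate[1] = rng.integers(0, 256, (48, 64), dtype=np.uint8)  # corrupt one frame

print(regression_flags(golden, candidate))   # → [1]
```

As the QA entry above notes, global metrics like PSNR can be misleading for text content, so such a check is best paired with OCR/line-level diffs.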

Long-Term Applications

These require further research and engineering—especially long-horizon reliability, native symbolic processing, stable capability reuse, closed-loop interaction, and runtime governance.

  • Completely Neural Computers (CNCs) as learned OS/runtimes: software, hardware
    • Potential tools/workflows: “Neural OS” that unifies compute, memory, and I/O in a single learned substrate; applications as programs over latent runtime state; weight-level “reprogramming.”
    • Assumptions/dependencies: Must achieve Turing completeness, universal programmability, behavior consistency, and clear advantages over conventional stacks; verifiability and debuggability are open challenges.
  • End-to-end desktop/RPA agents that are the interface: enterprise, finance, healthcare admin
    • Potential tools/workflows: Agents that internally simulate and execute workflows across heterogeneous apps (billing, claims, back-office ops) with robust long-horizon control.
    • Assumptions/dependencies: Closed-loop stability, error recovery, and explicit runtime governance; sector-specific compliance, audit, and data isolation.
  • Neural app development and “model-native” app stores: software industry
    • Potential tools/workflows: Apps specified by prompts/programs over NC state; packaging, versioning, testing, and deployment pipelines for latent programs; differential updates to behaviors.
    • Assumptions/dependencies: Toolchains for reproducibility, state introspection, and compatibility; licensing/IP around weights and datasets.
  • Secure, ephemeral “latent-state” sandboxes for malware detonation or safety-critical testing: security, policy
    • Potential tools/workflows: High-fidelity neural simulations to observe potential effects of suspicious binaries/UI workflows without touching real systems.
    • Assumptions/dependencies: Faithful causal modeling of side effects; determinism and observability guarantees; strong isolation and audit trails.
  • Universal UI simulators for training across ecosystems: education, enterprise L&D
    • Potential tools/workflows: Broad coverage of productivity suites, EHRs, ERP/CRM, and bespoke tools for scalable onboarding and recurrent training; scenario-based assessments.
    • Assumptions/dependencies: Massive, diverse, and licensed datasets; realistic timing/latency models; continual updates to match evolving software.
  • Safety-critical UI co-pilots (e.g., EHR, trading terminals): healthcare, finance
    • Potential tools/workflows: Neural co-pilots that preview and validate UI actions, provide “dry-run” visualizations, and enforce guardrails by comparing expected vs actual frames before committing.
    • Assumptions/dependencies: Verified alignment with domain policies; strict privacy; independent safety monitors; certification and post-hoc explainability.
  • Model-in-the-loop UX design and rapid A/B iteration: HCI, product management
    • Potential tools/workflows: Generate plausible user-interface dynamics at scale, stress-test flows, and iterate on micro-interactions without live code; “design-to-simulation” pipelines.
    • Assumptions/dependencies: Requires behavioral realism and user-model integration; correlations with real user metrics must be validated.
  • Device-free “neural desktop streaming”: consumer software, edge/cloud
    • Potential tools/workflows: Deliver applications as streamed neural rollouts where the NC handles rendering and input mapping; thin clients with minimal local compute.
    • Assumptions/dependencies: Low-latency inference, cost-efficient video diffusion or successor architectures, energy constraints, and robust input-to-state alignment.
  • LLM–NC co-processors for agentic systems: AI platforms
    • Potential tools/workflows: LLMs do planning/symbolic reasoning; NCs execute and render UI effects; standardized interfaces for action injection, state introspection, and policy enforcement.
    • Assumptions/dependencies: Reliable bridges between text plans and action-conditioning streams; monitoring for drift and inconsistent behaviors.
  • Regulatory and standards frameworks for neural runtimes: policy, governance
    • Potential tools/workflows: Audit logging for latent-state transitions, conformance tests for behavior consistency, provenance and data lineage requirements, red-team protocols.
    • Assumptions/dependencies: Methods to safely introspect and govern learned runtime state; consensus on safety and accountability for model-driven execution.

Cross-cutting dependencies and risks

  • Compute and cost: Training and inference for high-capacity video models are resource-intensive; deployment hinges on more efficient architectures or hardware acceleration.
  • Data and licensing: High-quality, synchronized frame/action/text logs are essential; privacy and IP constraints must be addressed.
  • Generalization and reliability: Current strength is short-horizon control and rendering; long-horizon consistency, error recovery, and native symbolic reasoning remain open.
  • Evaluation clarity: Conditioning (e.g., reprompting/LLM assistance) can inflate apparent “reasoning.” Benchmarks must distinguish native computation vs conditioning-assisted performance.
  • Environment control: Fidelity depends on fonts, palettes, timing, and deterministic rendering; variability across real-world setups degrades performance.

Glossary

  • AdamW: An optimizer that decouples weight decay from the gradient-based update to improve training stability. "Optimization uses AdamW (learning rate 5×10⁻⁵, weight decay 10⁻²), bfloat16 precision, and gradient clipping at 1.0."
  • Action-conditioned rendering: Generating future frames conditioned on actions so the visuals reflect user inputs. "GUIWorld captures desktop RGB with synchronized mouse/keyboard traces to validate action-conditioned rendering and control on GUIs."
  • Action injection: The design choice of how to feed action signals into a model’s inputs to influence predictions. "we evaluate standard world-model designs across action injection, action encoding, and data quality."
  • ANSI-faithful decoding: Accurate interpretation of terminal control sequences following ANSI standards during replay or rendering. "The asciinema stack records and replays terminal sessions with synchronized timing and ANSI-faithful decoding."
  • Auxiliary heads: Additional model branches used to encode or decode side information (e.g., prompts or actions) alongside the main output. "Auxiliary heads can encode and decode prompts, buffers, or action traces, shifting functionality that would traditionally live in OS queues, device drivers, and UI toolkits into latent-state dynamics."
  • bfloat16: A 16-bit floating-point format with a wider exponent than IEEE FP16, often used to save memory while maintaining dynamic range. "Optimization uses AdamW (learning rate 5×10⁻⁵, weight decay 10⁻²), bfloat16 precision, and gradient clipping at 1.0."
  • CLIP image encoder: A vision encoder from CLIP that maps images to embeddings aligned with text semantics. "a CLIP image encoder (Radford et al., 2021) extracts visual features from the same frame"
  • Completely Neural Computer (CNC): The mature form of a neural computer that is fully learned and meets strong computational and programming criteria. "The long-term target is a Completely Neural Computer (CNC), the mature, general-purpose realization of this machine form"
  • Conditioning stream: A time-indexed sequence of inputs (e.g., actions, prompts) provided to a model to steer predictions. "the input sequence {ut}\{u_t\} is referred to as a conditioning stream."
  • Decoupled cross-attention: A cross-attention mechanism that separately injects conditioning context (e.g., text and image features) into the model. "Decoupled cross-attention injects the joint caption and first-frame context derived from the CLIP and text features."
  • Diffusion noise: The noise injected into the diffusion process during training or sampling in diffusion models. "these conditioning features are concatenated with diffusion noise"
  • Diffusion-style video models: Generative video models based on diffusion processes operating in latent or pixel space. "z for VAE/video latents used in diffusion-style video models (e.g., the GUIWorld implementation)."
  • Diffusion transformer: A transformer architecture used within diffusion models to iteratively denoise latent representations. "the diffusion transformer acts as the state-update map"
  • Differentiable Neural Computer: A neural architecture with differentiable external memory enabling learned algorithmic behavior. "Neural Turing Machine / Differentiable Neural Computer line (Graves et al., 2014; Graves et al., 2016)"
  • Gradient checkpointing: A memory-saving technique that recomputes intermediate activations during backpropagation to train larger models. "Training uses gradient checkpointing and applies dropout 0.1 to the prompt encoder, CLIP, and VAE modules."
  • Gradient clipping: A stabilization method that limits the norm or value of gradients to prevent exploding updates. "Optimization uses AdamW (learning rate 5×10⁻⁵, weight decay 10⁻²), bfloat16 precision, and gradient clipping at 1.0."
  • Image-to-video (I2V): Generating future video frames from a starting image (and possibly text) as conditioning. "Following the Wan2.1 image-to-video (I2V) design, these conditioning features are concatenated with diffusion noise, projected through a zero-initialized linear layer, and processed by a DiT stack."
  • I/O alignment: Maintaining consistency between a model’s internal state and its observable inputs/outputs over time. "most notably I/O alignment and short-horizon control."
  • I2V sampling schedule: The prescribed sequence of denoising steps used when sampling from an image-to-video diffusion model. "under the original Wan2.1 I2V sampling schedule, without additional binary masks or periodic reseeding."
  • Instruction set architecture: The abstract interface (instructions, registers, etc.) between software and hardware for a computer. "they are commonly abstracted as random-access machines with an instruction set architecture"
  • Levenshtein distance: An edit distance metric measuring the minimum number of insertions, deletions, or substitutions to transform one string into another. "Character accuracy uses the Levenshtein distance between concatenated ground-truth and generated texts."
  • Monotonic clock: A time source that only moves forward, used for consistent timing and synchronization. "Frames, text buffers, and keyboard-event logs share a single monotonic clock."
  • Neural Computer (NC): A neural system whose single learned state unifies computation, memory, and I/O as a running computer. "We term this abstraction a Neural Computer (NC): a neural system that unifies computation, memory, and I/O in a learned runtime state."
  • Neural Turing Machine: A neural architecture that augments a controller with differentiable memory addressing to mimic algorithmic behavior. "Neural Turing Machine / Differentiable Neural Computer line (Graves et al., 2014; Graves et al., 2016)"
  • OCR: Optical Character Recognition; extracting text from images. "OCR accuracy versus training."
  • Open-loop evaluation: Assessing a model by feeding pre-recorded inputs without interactive feedback from the environment. "evaluation remains open-loop rather than closed-loop interaction with a live environment."
  • PSNR: Peak Signal-to-Noise Ratio; a pixel-level metric for reconstruction fidelity between images or frames. "PSNR/SSIM plateau around 25k steps (Figure: cligen-long-train)"
  • Random-access machine: An abstract computational model assuming constant-time memory access used in algorithmic analysis. "they are commonly abstracted as random-access machines with an instruction set architecture"
  • REPL: Read–Eval–Print Loop; an interactive programming environment that reads inputs, evaluates them, and prints results. "interactive REPL usage"
  • Reprompting: Modifying or strengthening prompts to steer a model toward better performance on a task. "reprompting improves symbolic probes (4% → 83%; Figure: cligen-exp6)"
  • Short-horizon control: Control over immediate, near-future actions or responses rather than long-range planning. "most notably I/O alignment and short-horizon control."
  • T5: A transformer-based text encoder/decoder model used for embedding or generation tasks. "a text encoder (e.g., T5; Raffel et al., 2020) embeds the caption."
  • Turing complete: Capable of performing any computation given enough time and memory. "(i) Turing complete, (ii) universally programmable, (iii) behavior-consistent unless explicitly reprogrammed,"
  • Universally programmable: Able to be programmed to implement any desired behavior within its computational limits. "(i) Turing complete, (ii) universally programmable, (iii) behavior-consistent unless explicitly reprogrammed,"
  • Update-and-render loop: A pattern where a system updates its internal state from inputs and then renders the next observable output. "folds these roles into an update-and-render loop."
  • Variational Autoencoder (VAE): A probabilistic latent-variable model that encodes inputs into a latent distribution and decodes samples back to data space. "The VAE encodes and decodes terminal frames."
  • World models: Learned models of environment dynamics used for prediction, planning, or imagination. "World models (Ha & Schmidhuber, 2018) show that neural networks can internalize environment dynamics and support predictive imagination"
  • Zero-initialized linear layer: A linear layer whose weights are initialized to zero, often to control early training behavior. "projected through a zero-initialized linear layer"
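Several glossary entries (conditioning stream, update-and-render loop, action-conditioned rendering) describe one abstraction: a state s_t updated by conditioning inputs u_t and rendered into observations o_t. A toy sketch of that loop, with hand-written stand-in maps; in an actual NC the state is a learned latent, the update is the diffusion transformer's state-update map, and rendering produces the next screen frame, so every name here is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ToyNeuralComputer:
    """Toy update-and-render loop: one state unifies compute, memory,
    and I/O; both maps below are hand-written stand-ins for learned ones."""
    state: list = field(default_factory=list)

    def update(self, u: str) -> None:
        # s_{t+1} = f(s_t, u_t): fold the conditioning input into the state.
        self.state.append(u)

    def render(self) -> str:
        # o_t = g(s_t): emit an observation derived from the current state.
        return " ".join(self.state)

nc = ToyNeuralComputer()
for u in ["echo", "hello"]:   # conditioning stream {u_t}
    nc.update(u)
frame = nc.render()           # rendered "frame": "echo hello"
```

The point of the abstraction is that no external OS queue, device driver, or UI toolkit sits between update and render: both are transitions of the same runtime state.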
