CLIGen Terminal NC: Neural Computer Paradigm
- CLIGen (Terminal NC) is a neural computer model that unifies computation, memory, and I/O by learning terminal behaviors directly from raw CLI traces via latent diffusion.
- The system employs a latent video diffusion model with DiT transformer blocks that sequentially apply self-attention, text, and image cross-attention to generate precise terminal frame sequences.
- Empirical results show improved text-to-pixel fidelity and OCR accuracy, while highlighting challenges in arithmetic stability and long-horizon coherence for neural computing.
CLIGen (Terminal NC), as introduced in "Neural Computers" (Zhuge et al., 7 Apr 2026), is an instantiation of the Neural Computer (NC) paradigm—a machine form where computation, memory, and I/O are unified in a learned runtime state. In contrast to conventional computers, which execute explicit, structured programs, or agent/world-model architectures that operate over explicit environment abstractions, Terminal NC aims for the model itself to embody the running computer. The long-term goal is the Completely Neural Computer (CNC): a general-purpose, stably executing, reprogrammable, and durably updateable system that can reuse capabilities. CLIGen provides both the architectural framework and empirical validation for learning foundational computer-like behaviors directly from raw command-line interface (CLI) I/O traces without instrumented program state.
1. Model Architecture: Latent-Diffusion Terminal Runtime
The core of CLIGen is a video generation backbone based on the Wan 2.1 latent video diffusion model. It consists of a variational autoencoder (VAE), whose encoder and decoder map 80×24 terminal frames to and from a 64×24 latent grid, combined with a DiT-style diffusion transformer. Each DiT block applies, in order, self-attention, text-prompt cross-attention, image-prompt cross-attention, and a feedforward network (FFN).
Conditioning channels include a text prompt (written in a semantic, regular, or detailed style and encoded by a T5 model) and the first frame (encoded both by the VAE encoder and by a CLIP image encoder). The generation pipeline encodes the prompt and first frame, diffuses latent variables stepwise through additive noise and DiT denoising, and decodes the latent sequence into pixel frames.
2. Mathematical Formalization and Training Objectives
Given trajectory tuples $(x_0, c, a_{1:T})$, where $x_0$ is the starting terminal frame, $c$ is the text instruction, and $a_{1:T}$ encodes user actions (tokenized), the system defines a latent runtime state $h_t$. The state is iteratively updated by a function $F_\theta$:
$$h_t = F_\theta(h_{t-1}, c, a_t)$$
and rendered to output with $G_\theta$:
$$\hat{x}_t = G_\theta(h_t)$$
Within the diffusion paradigm, $F_\theta$ and $G_\theta$ correspond to the DiT transformer and the VAE decoder respectively.
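The shape of this update/render loop can be illustrated with a toy, hand-written stand-in. This is only a sketch of the interface: in CLIGen the update and render functions are learned networks, whereas the `State` type, the `update`, `render`, and `rollout` functions, and the action tuples below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

COLS, ROWS = 80, 24  # terminal grid size used in the text


@dataclass
class State:
    """Toy runtime state: finished lines plus the line being typed."""
    lines: list = field(default_factory=list)
    current: str = "$ "


def update(state, action):
    """Hand-written stand-in for the learned update function."""
    kind, arg = action
    lines, current = list(state.lines), state.current
    if kind == "Type":
        current += arg
    elif kind == "Enter":
        lines.append(current)
        current = "$ "
    # "Sleep" leaves the state unchanged; only time passes.
    return State(lines=lines, current=current)


def render(state):
    """Stand-in for the render function: state -> ROWS x COLS text frame."""
    visible = (state.lines + [state.current])[-ROWS:]  # scrolling: keep last rows
    visible += [""] * (ROWS - len(visible))
    # Long lines are truncated here; a real terminal would wrap them.
    return [row[:COLS].ljust(COLS) for row in visible]


def rollout(actions):
    """Apply update, then render, once per action and collect the frames."""
    state, frames = State(), []
    for action in actions:
        state = update(state, action)
        frames.append(render(state))
    return frames


frames = rollout([("Type", "echo hi"), ("Sleep", 0.5), ("Enter", None)])
```

The point of the sketch is the separation of concerns: all persistent information lives in the state, and each frame is a pure function of that state. In the neural setting the state is a learned latent rather than an explicit line buffer.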
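The primary loss is the latent diffusion objective:
$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, \tau}\left[\left\lVert \epsilon - \epsilon_\theta(z_\tau, \tau, c, x_0) \right\rVert_2^2\right]$$
where $z_\tau$ is the video latent noised to diffusion step $\tau$, $\epsilon$ is the added Gaussian noise, and $\epsilon_\theta$ is the DiT denoiser conditioned on the text instruction $c$ and the first frame $x_0$.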
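As a rough illustration of a latent diffusion objective, the sketch below noises a toy latent and scores a noise predictor with a mean-squared error. It assumes the common noise-prediction parameterization with a cumulative schedule value `alpha_bar`; the source does not spell out the exact parameterization, conditioning inputs are omitted for brevity, and `add_noise`, `diffusion_loss`, and both predictors are hypothetical names.

```python
import math
import random


def add_noise(z0, eps, alpha_bar):
    """Forward process: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def diffusion_loss(predict_noise, z0, alpha_bar, rng):
    """Single-sample estimate of the noise-prediction mean-squared error."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = add_noise(z0, eps, alpha_bar)
    eps_hat = predict_noise(zt, alpha_bar)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(z0)


z0 = [0.5, -1.0, 2.0, 0.0]  # a flattened toy latent
alpha_bar = 0.7             # assumed schedule value at the sampled step


def zero_predictor(zt, alpha_bar):
    """Untrained baseline that always predicts zero noise."""
    return [0.0 for _ in zt]


def oracle_predictor(zt, alpha_bar):
    """Cheats by knowing z0, so it recovers the noise up to rounding error."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - a * x) / b for z, x in zip(zt, z0)]


# The same seed gives both predictors the same noise sample.
loss_zero = diffusion_loss(zero_predictor, z0, alpha_bar, random.Random(0))
loss_oracle = diffusion_loss(oracle_predictor, z0, alpha_bar, random.Random(0))
```

A trained denoiser sits between these two extremes: the loss falls toward zero as its noise estimate approaches the noise that was actually added.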
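For CLIGen Clean (with explicit user actions), an auxiliary action-alignment loss is optionally used:
$$\mathcal{L}_{\text{act}} = \mathbb{E}_t\left[\ell\big(g_\phi(h_t),\, a_t\big)\right]$$
where $g_\phi$ is a learned action prediction head applied to the latent runtime state $h_t$, $a_t$ is the tokenized user action at step $t$, and $\ell$ is a per-token classification loss such as cross-entropy.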
3. Decoupled Cross-Attention and I/O Alignment
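I/O alignment between prompt tokens and terminal video frames is achieved via "decoupled cross-attention" in DiT blocks. Starting with hidden states $H$, and writing $c_{\text{text}}$ for the T5-encoded prompt tokens and $c_{\text{img}}$ for the CLIP-encoded first-frame tokens, the computation at each layer is:
- $H \leftarrow H + \mathrm{SelfAttn}(H)$
- $H \leftarrow H + \mathrm{CrossAttn}(H, c_{\text{text}})$
- $H \leftarrow H + \mathrm{CrossAttn}(H, c_{\text{img}})$
- $H \leftarrow H + \mathrm{FFN}(H)$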
Empirical evaluation demonstrates that detailed token-by-token captions yield an improvement of nearly 5 dB PSNR in text-to-pixel fidelity compared to less literal styles, substantiating the benefit of close alignment between linguistic and visual representations.
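The block order (self-attention, text cross-attention, image cross-attention, feedforward) can be sketched in plain Python. This is a deliberately minimal illustration, not the Wan 2.1 implementation: it assumes single-head attention with identity projections, residual connections, no normalization layers, and a weight-free ReLU in place of a learned FFN. All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with identity projections (an assumption)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context))
                    for i in range(len(q))])
    return out


def add(xs, ys):
    """Residual connection: elementwise sum of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def ffn(tokens):
    """Toy position-wise feedforward: ReLU only, no learned weights."""
    return [[max(0.0, v) for v in t] for t in tokens]


def dit_block(hidden, text_tokens, image_tokens):
    hidden = add(hidden, attend(hidden, hidden))        # 1. self-attention
    hidden = add(hidden, attend(hidden, text_tokens))   # 2. text cross-attention
    hidden = add(hidden, attend(hidden, image_tokens))  # 3. image cross-attention
    return add(hidden, ffn(hidden))                     # 4. feedforward


hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 video latent tokens, dim 2
text = [[1.0, 1.0], [0.0, 2.0]]                # stand-in for T5 prompt tokens
image = [[2.0, 0.0]]                           # stand-in for CLIP image tokens
out = dit_block(hidden, text, image)
```

Keeping the text and image streams in separate cross-attention steps means each conditioning source has its own attention weights over the video tokens, which is the sense in which the attention is "decoupled".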
4. Training Data, Preprocessing, and Evaluation
CLIGen leverages two purpose-built datasets:
| Dataset | Source/Collection | Scale | Features |
|---|---|---|---|
| CLIGen (General) | Asciinema .cast logs, replayed and segmented | 823k clips (~1,100h) | Three styles of captions (semantic, regular, detailed), public TTYs, annotated at 15 FPS |
| CLIGen (Clean) | Scripted Docker traces, vhs | ~128k traces | Explicit action alignment (Sleep, Type, Enter), deterministic font/color/pacing |
All video clips are temporally aligned to action events and meticulously filtered/redacted for sensitive content.
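To make the replay-and-segment step more concrete, the sketch below parses an asciicast v2 recording (a JSON header line followed by `[time, event_type, data]` lines) and groups output events into 15 FPS frame slots. It is not the authors' pipeline: the function name `bin_output_events` and the sample recording are illustrative assumptions, and a real replay would also need a terminal emulator to turn the byte stream into rendered frames.

```python
import json

FPS = 15  # annotation rate quoted for the CLIGen (General) dataset


def bin_output_events(cast_text, fps=FPS):
    """Group asciicast v2 output events by frame index at the given rate.

    Returns (header, frames), where frames maps a frame index to the
    concatenated terminal output written during that frame's time slot.
    """
    lines = [ln for ln in cast_text.splitlines() if ln.strip()]
    header = json.loads(lines[0])
    frames = {}
    for line in lines[1:]:
        timestamp, event_type, data = json.loads(line)
        if event_type != "o":         # keep output events, skip other types
            continue
        index = int(timestamp * fps)  # frame slot containing this event
        frames[index] = frames.get(index, "") + data
    return header, frames


# Hypothetical recording used only to exercise the function.
SAMPLE_CAST = "\n".join([
    '{"version": 2, "width": 80, "height": 24}',
    '[0.02, "o", "$ "]',
    '[0.50, "o", "echo hi"]',
    '[0.52, "o", "\\r\\n"]',
    '[0.55, "i", "ignored input event"]',
    '[0.65, "o", "hi\\r\\n$ "]',
])

header, frames = bin_output_events(SAMPLE_CAST)
```

Events that land in the same 1/15-second slot are merged, so each frame index carries exactly the output a rendered frame at that time would need to reflect.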
Evaluation metrics include VAE reconstruction quality (40.77 dB PSNR and 0.989 SSIM at a 13px font size), text-prompt-to-video fidelity (26.89 dB with detailed captions), and frame-level OCR accuracy, which rises from 0.03 to 0.54 over 60k optimization steps. Arithmetic execution is severely limited in base CLIGen (4% accuracy), but including direct answer hints in the prompt ("reprompting") raises accuracy to 83% on math tasks.
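For readers who want to reproduce numbers of this kind, the sketch below computes PSNR from its standard definition and a simple character-level accuracy. The PSNR formula is standard; the position-wise `char_accuracy` function is only an assumed stand-in, since the exact OCR scoring protocol is not described here, and the pixel values and strings are made up for illustration.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def char_accuracy(reference_text, ocr_text):
    """Fraction of reference characters matched at the same position."""
    if not reference_text:
        return 1.0
    hits = sum(r == o for r, o in zip(reference_text, ocr_text))
    return hits / len(reference_text)


reference = [0, 128, 255, 64]   # toy ground-truth pixel values
generated = [0, 130, 250, 64]   # toy generated pixel values
score_db = psnr(reference, generated)
accuracy = char_accuracy("$ echo hi", "$ echo h1")
```

Because PSNR is logarithmic, a gain of 5 dB corresponds to roughly a 3.2-fold reduction in mean-squared error, which helps put differences between caption styles in perspective.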
5. Learned Command-Line Execution and Qualitative Behaviors
Terminal NCs can synthesize core computational primitives directly in video space. The per-step latent state encodes the runtime terminal state, maintaining buffer history and rendering behaviors such as buffer scrolling, prompt wrapping, cursor movements, and color dynamics. Rollout examples confirm that generated sequences can reproduce complex UI patterns like progress bars, output formatting, and REPL session behaviors with frame-wise geometrical and textual fidelity.
Representative behaviors include:
- Faithful simulacra of progress bars and parsing feedback during AI image-tool invocations.
- Syntactically and visually matched outputs for Python REPL math expressions, including prompt structure, output rendering, and buffer management.
6. Limitations and Roadmap Toward a Completely Neural Computer
Open challenges remain for CLIGen and the broader CNC ambition:
- Symbolic stability: arithmetic accuracy exceeds 80% only with extensive reprompting.
- Routine installation/reuse: current NCs cannot persistently store subroutines or revisit prior computations without fresh prompt conditioning.
- Long-horizon coherence: rollouts remain stable for approximately 5 seconds, after which drift and trajectory inconsistency occur.
- Lack of behavioral governance: the internal latent state is opaque and not externally inspectable or lockable.
The CNC roadmap identifies the need for:
- Extensions to unbounded context via sliding-window Transformers or modular state memory.
- Gating for run-update separation (e.g., LSTM or Mixture-of-Experts).
- Dedicated neural modules for branching and symbolic operations.
- Explicit update APIs for safe, auditable installation and modification of internal state.
- Acceptance criteria including install–reuse testing, execution consistency, and governance/replay logs.
Achieving full CNC status would require Turing completeness, universal programmability, behavior consistency, and genuine machine-native compositional semantics within the neural substrate.
7. Formulas and Architectural Schematics
Key formal definitions and architecture are as follows:
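- Update/render loop: $h_t = F_\theta(h_{t-1}, c, a_t)$, $\hat{x}_t = G_\theta(h_t)$, with runtime state $h_t$, text instruction $c$, user action $a_t$, and rendered frame $\hat{x}_t$
- Latent diffusion loss: $\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, \tau}\left[\left\lVert \epsilon - \epsilon_\theta(z_\tau, \tau, c, x_0) \right\rVert_2^2\right]$, with noised latent $z_\tau$ and first frame $x_0$
- Decoupled cross-attention for video/prompt fusion:
$H \leftarrow H + \mathrm{CrossAttn}(H, c_{\text{text}})$, then $H \leftarrow H + \mathrm{CrossAttn}(H, c_{\text{img}})$, with hidden states $H$, T5 prompt tokens $c_{\text{text}}$, and CLIP image tokens $c_{\text{img}}$
- Action alignment loss: $\mathcal{L}_{\text{act}} = \mathbb{E}_t\left[\ell\big(g_\phi(h_t),\, a_t\big)\right]$, with action prediction head $g_\phi$ and classification loss $\ell$
- CLIGen system schematic: text prompt → T5 encoder; first frame → VAE encoder and CLIP image encoder; noised video latents → DiT denoising → VAE decoder → terminal frames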
These components define the computational, data, and evaluation foundations for reproducing and extending Terminal NC functionality. The approach represents a significant step toward self-contained, runtime-learned computer models that may eventually operationalize the CNC vision (Zhuge et al., 7 Apr 2026).