
CLIGen Terminal NC: Neural Computer Paradigm

Updated 10 April 2026
  • CLIGen (Terminal NC) is a neural computer model that unifies computation, memory, and I/O by learning terminal behaviors directly from raw CLI traces via latent diffusion.
  • The system employs a latent video diffusion model with DiT transformer blocks that sequentially apply self-attention, text, and image cross-attention to generate precise terminal frame sequences.
  • Empirical results show improved text-to-pixel fidelity and OCR accuracy, while highlighting challenges in arithmetic stability and long-horizon coherence for neural computing.

CLIGen (Terminal NC), as introduced in "Neural Computers" (Zhuge et al., 7 Apr 2026), is an instantiation of the Neural Computer (NC) paradigm—a machine form where computation, memory, and I/O are unified in a learned runtime state. In contrast to conventional computers, which execute explicit, structured programs, or agent/world-model architectures that operate over explicit environment abstractions, Terminal NC aims for the model itself to embody the running computer. The long-term goal is the Completely Neural Computer (CNC): a general-purpose, stably executing, reprogrammable, and durably updateable system that can reuse capabilities. CLIGen provides both the architectural framework and empirical validation for learning foundational computer-like behaviors directly from raw command-line interface (CLI) I/O traces without instrumented program state.

1. Model Architecture: Latent-Diffusion Terminal Runtime

The core of CLIGen is a video generation backbone based on the Wan 2.1 latent video diffusion model. This consists of a variational autoencoder (VAE) encoder $E$ and decoder $D$ mapping 80×24 terminal frames $x_t$ to and from a 64×24 latent grid $z_t$, combined with a DiT-style diffusion transformer. Each DiT block sequentially applies self-attention, text-prompt cross-attention, image-prompt cross-attention, then a feedforward network (FFN).

Conditioning channels include a text prompt $p$ (expressed in semantic, regular, or detailed style and encoded via a T5 model to $C_\text{text}$) and the first frame $x_0$ (encoded both by the VAE encoder to $z_0$ and by a CLIP image encoder to $C_\text{img}$). The generation pipeline encodes the prompt and first frame, diffuses latent variables $z_t$ stepwise with additive noise and DiT denoising, and decodes the latent series into pixel frames $\hat{x}_t = D(z_t)$.

2. Mathematical Formalization and Training Objectives

Given trajectory tuples $(x_0, p, a_{1:T})$, where $x_0$ is the starting terminal frame, $p$ is the text instruction, and $a_{1:T}$ encodes user actions (tokenized), the system defines a latent runtime state $z_t$. The state is iteratively updated by a function $F_\theta$:

$$z_t = F_\theta(z_{t-1}, a_t, p)$$

and rendered to output with $G_\theta$:

$$\hat{x}_t = G_\theta(z_t)$$

Within the diffusion paradigm, $F_\theta$ and $G_\theta$ correspond to the DiT transformer and the VAE decoder, respectively.
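
The update/render loop described above can be illustrated with a deliberately tiny, hand-written stand-in. This is only a sketch of the control flow: in CLIGen the update and render functions are learned networks, whereas here `update`, `render`, `rollout`, the `State` layout, and the 3-row screen are all hypothetical choices made for illustration. The action names (Sleep, Type, Enter) follow the CLIGen (Clean) description.

```python
from typing import List, Tuple

# Toy runtime state: (committed buffer lines, text typed on the current line).
State = Tuple[Tuple[str, ...], str]
Action = Tuple[str, str]  # (kind, argument), kind in {"Sleep", "Type", "Enter"}

ROWS = 3  # hypothetical tiny screen height, so scrolling is easy to see


def update(state: State, action: Action) -> State:
    """Hand-written stand-in for the learned state update."""
    lines, current = state
    kind, arg = action
    if kind == "Type":
        return lines, current + arg
    if kind == "Enter":
        return lines + ("$ " + current,), ""
    return state  # "Sleep" leaves the state unchanged


def render(state: State) -> List[str]:
    """Hand-written stand-in for the learned renderer: show the last ROWS lines."""
    lines, current = state
    screen = list(lines) + ["$ " + current]
    return screen[-ROWS:]


def rollout(state: State, actions: List[Action]) -> List[List[str]]:
    """Alternate update and render, collecting one frame per step."""
    frames = [render(state)]
    for action in actions:
        state = update(state, action)
        frames.append(render(state))
    return frames


actions = [("Type", "ls"), ("Enter", ""), ("Sleep", "0.5"), ("Type", "pwd")]
frames = rollout(((), ""), actions)
for frame in frames:
    print(frame)
```

The point of the sketch is that the only thing carried between steps is the state; every frame is a pure function of it, which is the property the NC formulation asks a learned model to reproduce.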

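The primary loss is the latent diffusion objective, which in its standard noise-prediction form reads:

$$\mathcal{L}_\text{diff} = \mathbb{E}_{z,\ \epsilon \sim \mathcal{N}(0, I),\ \tau}\left[\left\lVert \epsilon - \epsilon_\theta(z_\tau, \tau, C_\text{text}, C_\text{img}) \right\rVert_2^2\right]$$

where $\tau$ is the diffusion timestep (distinct from the frame index $t$), $z_\tau$ is the noised latent sequence, and $\epsilon_\theta$ is the DiT denoiser conditioned on the text and image embeddings.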
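
As a rough illustration of a latent diffusion objective, the following sketch computes a noise-prediction loss on a toy latent vector. It assumes the common epsilon-prediction parameterization with a scalar `alpha_bar`; the source does not state which parameterization or noise schedule CLIGen uses, and `add_noise`, `diffusion_loss`, and `oracle_denoiser` are hypothetical names introduced here.

```python
import math
import random


def add_noise(z0, eps, alpha_bar):
    """Forward process: z_tau = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def diffusion_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def oracle_denoiser(z_tau, z0, alpha_bar):
    """A cheating 'denoiser' that knows z0, used only to sanity-check the loss."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * z) / b for zt, z in zip(z_tau, z0)]


random.seed(0)
z0 = [random.uniform(-1.0, 1.0) for _ in range(8)]   # toy clean latent
eps = [random.gauss(0.0, 1.0) for _ in range(8)]     # sampled noise
z_tau = add_noise(z0, eps, alpha_bar=0.5)

loss_oracle = diffusion_loss(oracle_denoiser(z_tau, z0, 0.5), eps)
loss_zero = diffusion_loss([0.0] * 8, eps)           # predicts "no noise"
print(loss_oracle, loss_zero)
```

A trained denoiser sits between these two extremes: it never sees the clean latent, only the noised one plus the conditioning, and is optimized to push the loss toward the oracle's value.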

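For CLIGen Clean (with explicit user actions), an auxiliary action-alignment loss is optionally used, written here as a cross-entropy over the tokenized actions:

$$\mathcal{L}_\text{act} = -\sum_{t=1}^{T} \log q_\phi(a_t \mid z_t)$$

where $q_\phi$ is a learned action prediction head.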
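
One plausible reading of an action-alignment loss with a learned prediction head is a per-step cross-entropy over the action vocabulary. The sketch below assumes exactly that; the three-way vocabulary mirrors the Sleep/Type/Enter actions of CLIGen (Clean), while the logits, `log_softmax`, and `action_alignment_loss` are hypothetical and not taken from the paper.

```python
import math

ACTIONS = ["Sleep", "Type", "Enter"]  # action kinds named for CLIGen (Clean)


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def action_alignment_loss(logits_per_step, target_actions):
    """Average negative log-likelihood of the true action at each step."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_actions):
        total -= log_softmax(logits)[ACTIONS.index(target)]
    return total / len(target_actions)


targets = ["Type", "Enter"]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # head has learned nothing
confident = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]   # head favors the true action

loss_uniform = action_alignment_loss(uniform, targets)
loss_confident = action_alignment_loss(confident, targets)
print(loss_uniform, loss_confident)
```

With uniform logits the loss equals the log of the vocabulary size, which gives a convenient baseline when judging whether the head has learned anything.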

3. Decoupled Cross-Attention and I/O Alignment

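I/O alignment between prompt tokens and terminal video frames is achieved via "decoupled cross-attention" in DiT blocks. Starting from the hidden state $h$, each layer computes:

  • $h \leftarrow h + \mathrm{SelfAttn}(h)$
  • $h \leftarrow h + \mathrm{CrossAttn}(h, C_\text{text})$
  • $h \leftarrow h + \mathrm{CrossAttn}(h, C_\text{img})$
  • $h \leftarrow h + \mathrm{FFN}(h)$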
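
The four-step layer ordering (self-attention, text cross-attention, image cross-attention, FFN) can be sketched in plain Python. This is a minimal illustration, not the Wan 2.1 implementation: it assumes single-head attention with identity projections, residual connections around each sub-layer, no normalization, and a weight-free ReLU in place of the FFN. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature channels


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, context: Matrix) -> Matrix:
    """Single-head scaled dot-product attention with identity projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ffn(h: Matrix) -> Matrix:
    """Toy feedforward: a ReLU with no learned weights."""
    return [[max(0.0, v) for v in row] for row in h]


def dit_block(h: Matrix, c_text: Matrix, c_img: Matrix) -> Matrix:
    h = add(h, attend(h, h))        # self-attention over video tokens
    h = add(h, attend(h, c_text))   # cross-attention to text-prompt tokens
    h = add(h, attend(h, c_img))    # cross-attention to first-frame image tokens
    h = add(h, ffn(h))              # position-wise feedforward
    return h


h0 = [[0.1, 0.2], [0.3, 0.4]]                 # two video tokens
c_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three text tokens
c_img = [[0.5, 0.5]]                           # one image token
out = dit_block(h0, c_text, c_img)
print(out)
```

Keeping the text and image branches as separate attention calls means each conditioning source has its own keys and values, so changing the prompt alters the output without touching the image pathway.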

Empirical evaluation demonstrates that detailed token-by-token captions yield an improvement of nearly 5 dB PSNR in text-to-pixel fidelity compared to less literal styles, substantiating the benefit of close alignment between linguistic and visual representations.
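
To put a gap of roughly 5 dB in perspective, a PSNR difference maps directly to a ratio of mean squared errors. The relation below uses only the standard definition of PSNR and is not specific to the paper; $\mathrm{MSE}_\text{other}$ and $\mathrm{MSE}_\text{detailed}$ are labels introduced here for the less literal and the detailed caption styles.

```latex
\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}
\quad\Longrightarrow\quad
\Delta\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MSE}_\text{other}}{\mathrm{MSE}_\text{detailed}}

\Delta\mathrm{PSNR} = 5\ \mathrm{dB}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_\text{other}}{\mathrm{MSE}_\text{detailed}} = 10^{5/10} \approx 3.16
```

A 5 dB gain therefore corresponds to about a threefold reduction in mean squared pixel error.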

4. Training Data, Preprocessing, and Evaluation

CLIGen leverages two purpose-built datasets:

Dataset | Source/Collection | Scale | Features
CLIGen (General) | Asciinema .cast logs, replayed and segmented | 823k clips (~1,100h) | Three styles of captions (semantic, regular, detailed), public TTYs, annotated at 15 FPS
CLIGen (Clean) | Scripted Docker traces, vhs | ~128k traces | Explicit action alignment (Sleep, Type, Enter), deterministic font/color/pacing

All video clips are temporally aligned to action events and meticulously filtered/redacted for sensitive content.
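
As an illustration of how recorded terminal logs relate to fixed-rate frames, the sketch below parses a small asciicast v2 recording (a JSON header line followed by `[time, code, data]` event lines) and groups output events into 15 FPS frame slots. It is an assumption-laden sketch rather than the CLIGen preprocessing pipeline: a real pipeline would replay the bytes through a terminal emulator to render pixels, and `parse_cast` and `output_per_frame` are hypothetical helpers.

```python
import json
from typing import Dict, List, Tuple

FPS = 15  # frame rate quoted for CLIGen (General)


def parse_cast(text: str) -> Tuple[dict, List[tuple]]:
    """Split an asciicast v2 recording into its header and its event list."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = json.loads(lines[0])
    events = [tuple(json.loads(ln)) for ln in lines[1:]]
    return header, events


def output_per_frame(events: List[tuple], fps: int = FPS) -> Dict[int, str]:
    """Concatenate the bytes of output ("o") events that fall in each frame slot."""
    buckets: Dict[int, str] = {}
    for t, kind, data in events:
        if kind != "o":
            continue  # skip input and other event types
        idx = int(t * fps)
        buckets[idx] = buckets.get(idx, "") + data
    return buckets


# A tiny synthetic recording, built with json.dumps to keep escaping correct.
SAMPLE_CAST = "\n".join([
    json.dumps({"version": 2, "width": 80, "height": 24}),
    json.dumps([0.10, "o", "$ "]),
    json.dumps([0.50, "o", "ls"]),
    json.dumps([0.52, "o", "\r\n"]),
    json.dumps([1.00, "i", "x"]),  # input event, ignored by output_per_frame
])

header, events = parse_cast(SAMPLE_CAST)
buckets = output_per_frame(events)
print(header["width"], header["height"], buckets)
```

Frame slots with no entry simply repeat the previous screen, which is why idle stretches in a recording produce runs of identical frames.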

Evaluation metrics include VAE reconstruction PSNR (40.77 dB, 0.989 SSIM at font 13px), text-prompt to video fidelity (detailed: 26.89 dB), and frame-level OCR accuracy rising from 0.03 to 0.54 across 60k optimization steps. Arithmetic execution is severely limited in base CLIGen (4%), but inclusion of direct answer hints in the prompt ("reprompting") increases accuracy to 83% on math tasks.
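
The two kinds of metric quoted above can be sketched as follows. PSNR follows its standard definition; the character-level accuracy is only one plausible way to score OCR output against expected text, since the source does not spell out the exact OCR metric. `psnr` and `char_accuracy` are hypothetical helper names.

```python
import math


def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def char_accuracy(expected: str, recognized: str) -> float:
    """Fraction of character positions where the OCR text matches the expected text."""
    n = max(len(expected), len(recognized))
    if n == 0:
        return 1.0
    hits = sum(1 for a, b in zip(expected, recognized) if a == b)
    return hits / n


psnr_shifted = psnr([100, 100], [110, 110])       # every pixel off by 10
acc = char_accuracy("2 + 2 = 4", "2 + 2 = 5")     # one wrong digit out of nine characters
print(psnr_shifted, acc)
```

The second example also shows why pixel or character fidelity and arithmetic correctness are separate questions: a frame can be almost perfectly rendered and still display the wrong answer.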

5. Learned Command-Line Execution and Qualitative Behaviors

Terminal NCs can synthesize core computational primitives directly in video space. The per-step latent $z_t$ encodes the runtime terminal state, maintaining buffer history and rendering behaviors such as buffer scrolling, prompt wrapping, cursor movements, and color dynamics. Rollout examples confirm that generated sequences can reproduce complex UI patterns like progress bars, output formatting, and REPL session behaviors with frame-wise geometrical and textual fidelity.

Representative behaviors include:

  • Faithful simulacra of progress bars and parsing feedback during AI image-tool invocations.
  • Syntactically and visually matched outputs for Python REPL math expressions, including prompt structure, output rendering, and buffer management.

6. Limitations and Roadmap Toward a Completely Neural Computer

Open challenges remain for CLIGen and the broader CNC ambition:

  • Symbolic stability: arithmetic accuracy is 4% in base CLIGen and reaches 83% only when answer hints are supplied in the prompt (reprompting).
  • Routine installation/reuse: current NCs cannot persistently store subroutines or revisit prior computations without fresh prompt conditioning.
  • Long-horizon coherence: rollouts remain stable for approximately 5 seconds, after which drift and trajectory inconsistency occur.
  • Lack of behavioral governance: the internal latent state $z_t$ is opaque and not externally inspectable or lockable.

The CNC roadmap identifies the need for:

  • Extensions to unbounded context via sliding-window Transformers or modular state memory.
  • Gating for run-update separation (e.g., LSTM or Mixture-of-Experts).
  • Dedicated neural modules for branching and symbolic operations.
  • Explicit update APIs for safe, auditable installation and modification of internal state.
  • Acceptance criteria including install–reuse testing, execution consistency, and governance/replay logs.
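
As one way to make an "execution consistency" criterion concrete, the sketch below reruns the same prompt several times and reports how often the rollout matches the first run exactly. The harness, the exact-match rule, and the toy runners are all hypothetical; the roadmap names the criterion but the source does not define how it would be measured.

```python
from typing import Callable, List

Frame = str


def execution_consistency(run: Callable[[str], List[Frame]],
                          prompt: str, trials: int = 3) -> float:
    """Fraction of repeated rollouts whose frames equal the first rollout."""
    reference = run(prompt)
    matches = sum(1 for _ in range(trials) if run(prompt) == reference)
    return matches / trials


def stable_run(prompt: str) -> List[Frame]:
    """Toy runner that always produces the same frames."""
    return [f"$ {prompt}", "4"]


def make_drifting_run() -> Callable[[str], List[Frame]]:
    """Toy runner that prints a wrong answer on its third call."""
    calls = {"n": 0}

    def run(prompt: str) -> List[Frame]:
        calls["n"] += 1
        answer = "5" if calls["n"] == 3 else "4"
        return [f"$ {prompt}", answer]

    return run


score_stable = execution_consistency(stable_run, "echo $((2+2))")
score_drifting = execution_consistency(make_drifting_run(), "echo $((2+2))")
print(score_stable, score_drifting)
```

Exact frame equality is a strict rule; a softer variant could compare OCR text instead of frames, at the cost of missing purely visual drift.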

Achieving full CNC status would require Turing completeness, universal programmability, behavior consistency, and genuine machine-native compositional semantics within the neural substrate.

7. Formulas and Architectural Schematics

Key formal definitions and architecture are as follows:

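  • Update/render loop: $z_t = F_\theta(z_{t-1}, a_t, p)$, $\hat{x}_t = G_\theta(z_t)$
  • Latent diffusion loss: $\mathcal{L}_\text{diff} = \mathbb{E}_{z,\ \epsilon \sim \mathcal{N}(0, I),\ \tau}\left[\lVert \epsilon - \epsilon_\theta(z_\tau, \tau, C_\text{text}, C_\text{img}) \rVert_2^2\right]$
  • Decoupled cross-attention for video/prompt fusion:

$$h \leftarrow h + \mathrm{SelfAttn}(h), \quad h \leftarrow h + \mathrm{CrossAttn}(h, C_\text{text}), \quad h \leftarrow h + \mathrm{CrossAttn}(h, C_\text{img}), \quad h \leftarrow h + \mathrm{FFN}(h)$$

  • Action alignment loss: $\mathcal{L}_\text{act} = -\sum_{t=1}^{T} \log q_\phi(a_t \mid z_t)$
  • CLIGen system schematic: $(p, x_0) \rightarrow (C_\text{text}, C_\text{img}, z_0) \rightarrow \text{DiT denoising} \rightarrow z_{1:T} \rightarrow D \rightarrow \hat{x}_{1:T}$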

These components define the computational, data, and evaluation foundations for reproducing and extending Terminal NC functionality. The approach represents a significant step toward self-contained, runtime-learned computer models that may eventually operationalize the CNC vision (Zhuge et al., 7 Apr 2026).

References
1. Zhuge et al. "Neural Computers." 7 Apr 2026.
