HC Newton (HCN) in mHC-Style Transformers
- HC Newton (HCN) is a structured Newton method for mHC-style residual Transformers that replaces costly Jacobian-vector products with learned surrogate mixing matrices.
- It reformulates the hidden-state trace as a nonlinear residual system and employs small-matrix corrections to enable efficient parallel computation.
- SNLP-aware regularization improves both sequential perplexity and inference speed, as demonstrated by up to 20-30% speedups on 0.5B Nanochat models.
HC Newton (HCN) is the Structured Newton Layer Parallelism (SNLP) instantiation for mHC-style residual Transformers, where each block maintains a small learned residual-stream mixture that serves as a surrogate Jacobian in place of the full hidden-state Jacobian. In this formulation, the layerwise hidden-state trace is recast as a nonlinear residual equation, and inference is performed by Newton-style corrections that exploit the model’s residual mixing matrix rather than exact Jacobian-vector products. Within the broader SNLP framework, HCN is the mHC counterpart of Identity Newton (IDN), which is used for residual Transformers; the paper studies HCN as both an inference procedure and a training target via SNLP-aware regularization (Han et al., 18 May 2026).
1. Architectural setting and conceptual role
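HCN is defined for mHC-style residual Transformers. In this setting, an $L$-layer decoder has hidden states $h_0, h_1, \dots, h_L$, and each block uses a small residual mixture over $n$ streams. The paper writes each block in two stages.
Each $M_\ell$ is an $n \times n$ stream-mixing matrix, with $n$ small relative to the hidden width $d$ (Han et al., 18 May 2026). The symbols $L$, $n$, $d$, and $M_\ell$ are the notation used in this summary.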
Within SNLP, HCN addresses the latency bottleneck created by strictly sequential Transformer-layer execution. The central idea is not to remove the forward dependency exactly, but to relax it through a structured Newton correction whose surrogate dynamics are induced by the architecture itself. The paper places HCN in a broader program that asks whether the hidden-state trace across layers can be treated as the solution of a nonlinear residual equation and then approximated by parallel Newton-style updates. In that sense, HCN is a solver-based layer-parallel approximation specialized to mHC blocks rather than a generic parallelization heuristic.
A common misunderstanding is to treat HCN as exact Newton inference. The paper explicitly distinguishes the two: exact Newton would require expensive Jacobian-vector products, whereas HCN replaces the exact layer Jacobians with a cheap surrogate built from residual mixing matrices. This makes HCN a structured Newton method rather than a full Newton solve.
2. Residual-system formulation
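The mHC forward pass is rewritten as a root-finding problem. Writing $h = (h_1, \dots, h_L)$ for the stacked layer trace, $f_\ell$ for the map computed by block $\ell$, and defining
$$F_\ell(h) = h_\ell - f_\ell(h_{\ell-1}), \qquad \ell = 1, \dots, L,$$
the sequential forward pass satisfies $F(h) = 0$, where $F = (F_1, \dots, F_L)$ is a block-lower-bidiagonal system (Han et al., 18 May 2026).
This formulation is the mathematical basis for HCN. Rather than viewing the model purely as a composition of layers, the method views the full layer trace as the solution of a coupled nonlinear system. In exact Newton form, the Jacobian of $F$ would be
$$J_F(h) = \begin{pmatrix} I & & & \\ -J_2 & I & & \\ & \ddots & \ddots & \\ & & -J_L & I \end{pmatrix},$$
with $J_\ell = \partial f_\ell / \partial h_{\ell-1}$, the exact Jacobian of block $\ell$ with respect to its input.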
The significance of this reformulation is methodological. It turns the serial layer dependency into a structured nonlinear solve, which in turn makes it possible to ask whether a small number of approximate Newton corrections can recover most of the sequential computation. The paper’s broader claim is that this perspective is principled, but that naïve fixed-point iterations are unstable on trained Transformers and exact Newton is too expensive. HCN occupies the middle ground created by that observation.
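The residual view can be checked on a toy model. The sketch below is illustrative only: the block `f(h) = M h + eps * tanh(W h)`, the matrices, and all function names are assumptions made for this example, not the paper's architecture. It shows that the sequential trace is a root of the residual, while an arbitrary guess for the trace is not.

```python
import math

def matvec(M, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def block(M, W, h, eps=0.1):
    """Hypothetical toy block: stream mixing M h plus a small nonlinear branch."""
    return [a + eps * math.tanh(z) for a, z in zip(matvec(M, h), matvec(W, h))]

def sequential_trace(h0, Ms, Ws):
    """Ordinary layer-by-layer forward pass; returns [h_1, ..., h_L]."""
    trace, h = [], h0
    for M, W in zip(Ms, Ws):
        h = block(M, W, h)
        trace.append(h)
    return trace

def residual(trace, h0, Ms, Ws):
    """F_l(h) = h_l - f_l(h_{l-1}) for l = 1..L, with h_0 held fixed."""
    inputs = [h0] + trace[:-1]
    return [[a - b for a, b in zip(h, block(M, W, x))]
            for h, x, M, W in zip(trace, inputs, Ms, Ws)]

def max_abs(rows):
    return max(abs(x) for row in rows for x in row)

Ms = [[[0.7, 0.3], [0.3, 0.7]]] * 3   # one 2x2 mixing matrix per layer
Ws = [[[0.5, -0.2], [0.1, 0.3]]] * 3  # branch weights (illustrative values)
h0 = [1.0, -0.5]

seq = sequential_trace(h0, Ms, Ws)
res_at_seq = max_abs(residual(seq, h0, Ms, Ws))             # root of F
res_at_guess = max_abs(residual([h0, h0, h0], h0, Ms, Ws))  # not a root
```

Each residual row depends only on a layer's own state and its predecessor's, which is the lower-bidiagonal coupling described above.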
3. Surrogate Jacobian and HC Newton update
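The defining approximation in HCN is the replacement of the exact layer Jacobian $J_\ell = \partial f_\ell / \partial h_{\ell-1}$ by the block's small surrogate matrix $M_\ell$, which is only $n \times n$ and acts across the $n$ streams (Han et al., 18 May 2026). The paper motivates this by decomposing
$$J_\ell = M_\ell + (\text{Jacobians of the nonlinear branch terms})$$
and then training the model so that these branch terms become small, making $J_\ell \approx M_\ell$.
Newton’s method for $F(h) = 0$, with $F_\ell(h) = h_\ell - f_\ell(h_{\ell-1})$, would take the form
$$h^{(k+1)} = h^{(k)} - J_F\big(h^{(k)}\big)^{-1} F\big(h^{(k)}\big).$$
HCN replaces $J_F$ with a surrogate block matrix $\tilde{J}$ whose diagonal blocks are $I$ and whose off-diagonal blocks are $-M_\ell$:
$$h^{(k+1)} = h^{(k)} + \delta, \qquad \tilde{J}\,\delta = -F\big(h^{(k)}\big).$$
Because this surrogate system is block-lower-bidiagonal, applying the inverse reduces to a recurrence: $\delta_1 = -F_1\big(h^{(k)}\big)$ and for $\ell = 2, \dots, L$,
$$\delta_\ell = M_\ell\,\delta_{\ell-1} - F_\ell\big(h^{(k)}\big).$$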
This update is the operational core of HCN. It preserves the lower-bidiagonal structure of the original residual system while replacing the expensive full hidden-state Jacobians with the model’s learned stream-mixing matrices. A plausible implication is that HCN is best understood as a structured preconditioned Newton step whose preconditioner is supplied by the architecture.
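As a worked step, the correction can be written directly on the hidden states. This is a sketch in assumed notation rather than a formula quoted from the paper: $f_\ell$ is the map of block $\ell$, $M_\ell$ its stream-mixing matrix (treated as fixed during the step), $F_\ell(h) = h_\ell - f_\ell(h_{\ell-1})$, and $\delta_\ell = h_\ell^{(k+1)} - h_\ell^{(k)}$. Row $\ell$ of the surrogate system reads $\delta_\ell - M_\ell\,\delta_{\ell-1} = -F_\ell\big(h^{(k)}\big)$, so

$$h_\ell^{(k+1)} = h_\ell^{(k)} + M_\ell\,\delta_{\ell-1} - \Big(h_\ell^{(k)} - f_\ell\big(h_{\ell-1}^{(k)}\big)\Big) = f_\ell\big(h_{\ell-1}^{(k)}\big) + M_\ell\big(h_{\ell-1}^{(k+1)} - h_{\ell-1}^{(k)}\big).$$

Two consequences follow under these assumptions. If the branch terms vanished entirely, so that $f_\ell(h) = M_\ell h$, the right-hand side would collapse to $M_\ell\,h_{\ell-1}^{(k+1)}$ and a single step would reproduce the sequential trace exactly. In general, once layer $\ell-1$ is exact and stops changing, the correction term is zero and layer $\ell$ becomes exact on the next step.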
4. Inference procedure and computational profile
The paper describes HCN inference in a prefix–suffix decomposition. A prefix of $P$ layers is run sequentially, and a suffix of $S$ layers is then processed by SNLP. With $L = P + S$, the prefix is computed by the ordinary sequential recurrence $h_\ell = f_\ell(h_{\ell-1})$ for $\ell = 1, \dots, P$, where $f_\ell$ denotes the map of block $\ell$. The suffix is initialized from the prefix state by setting $h_\ell^{(0)} = h_P$ for $\ell = P+1, \dots, L$ (Han et al., 18 May 2026).
Each HCN iteration then has two stages. First, the nonlinear blocks in the suffix are evaluated in parallel: $$y_\ell = f_\ell\big(h_{\ell-1}^{(k)}\big), \qquad \ell = P+1, \dots, L.$$ Second, a sequential small-matrix correction is applied using the stream-mixing matrices $M_\ell$: $$h_\ell^{(k+1)} = y_\ell + M_\ell\big(h_{\ell-1}^{(k+1)} - h_{\ell-1}^{(k)}\big),$$ with $h_P$ held fixed. The paper states that, in practice, a very small number of iterations $K$ is enough on mHC models trained with the SNLP-aware loss.
The computational argument for HCN is tied to this separation between expensive nonlinear block evaluations and cheap stream-mixing corrections. If $C$ denotes the cost of one Transformer block forward, sequential execution has cost approximately $L \cdot C$. HCN with $K$ iterations evaluates the $S$ suffix blocks in parallel $K$ times, with the layers batched into one large fused kernel per iteration, plus a correction loop over the small mixing matrices. Because the stream count is small, the paper characterizes this correction cost as negligible versus the block evaluations. The memory footprint requires two copies of the suffix states, which the paper quantifies as an overhead relative to sequential execution (Han et al., 18 May 2026).
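The prefix–suffix procedure can be sketched on a toy model. Everything below is an assumption made for illustration: the block `f(h) = M h + eps * tanh(W h)`, the helper names, and the parameter values are hypothetical, and the state has one scalar per stream rather than a full feature vector. The sketch runs the prefix sequentially, initializes the suffix from the prefix state, and then alternates independent block evaluations with a cheap sequential correction that uses only the small mixing matrices.

```python
import math

def matvec(M, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def block(M, W, h, eps):
    """Hypothetical toy block f(h) = M h + eps * tanh(W h); eps scales the branch."""
    return [a + eps * math.tanh(z) for a, z in zip(matvec(M, h), matvec(W, h))]

def sequential(h0, Ms, Ws, eps):
    """Ordinary layer-by-layer forward pass; returns [h_1, ..., h_L]."""
    trace, h = [], h0
    for M, W in zip(Ms, Ws):
        h = block(M, W, h, eps)
        trace.append(h)
    return trace

def hcn_infer(h0, Ms, Ws, eps, prefix, iters):
    """Run `prefix` layers sequentially, then `iters` HCN sweeps on the suffix."""
    trace = sequential(h0, Ms[:prefix], Ws[:prefix], eps)
    h_p = trace[-1] if trace else h0
    suffix = [list(h_p) for _ in Ms[prefix:]]  # prefix-state initialization
    for _ in range(iters):
        inputs = [h_p] + suffix[:-1]
        # Stage 1: these evaluations are independent and could run in parallel.
        ys = [block(M, W, x, eps)
              for M, W, x in zip(Ms[prefix:], Ws[prefix:], inputs)]
        # Stage 2: sequential correction using only the small matrices M.
        new, delta_prev = [], [0.0] * len(h_p)
        for M, y, old in zip(Ms[prefix:], ys, suffix):
            h_new = [a + c for a, c in zip(y, matvec(M, delta_prev))]
            delta_prev = [a - b for a, b in zip(h_new, old)]
            new.append(h_new)
        suffix = new
    return trace + suffix

def max_gap(a, b):
    """Largest absolute entrywise difference between two traces."""
    return max(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))

M = [[0.7, 0.3], [0.3, 0.7]]
W = [[0.5, -0.2], [0.1, 0.3]]
Ms, Ws, h0 = [M] * 6, [W] * 6, [1.0, -0.5]
exact = sequential(h0, Ms, Ws, 0.05)
err0 = max_gap(hcn_infer(h0, Ms, Ws, 0.05, prefix=2, iters=0), exact)
err1 = max_gap(hcn_infer(h0, Ms, Ws, 0.05, prefix=2, iters=1), exact)
err4 = max_gap(hcn_infer(h0, Ms, Ws, 0.05, prefix=2, iters=4), exact)
```

In this toy setting a single sweep shrinks the gap to the sequential trace when the branch term is small. Running as many sweeps as there are suffix layers recovers the sequential trace, because each sweep makes at least one more layer exact. The correction loop touches only the 2x2 matrices, which mirrors the cost separation described above.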
5. SNLP-aware regularization
HCN is not presented only as an inference-time approximation. The paper also introduces SNLP-aware regularization to make the surrogate mixing matrices $M_\ell$ better approximations to the true layer Jacobians. The training objective augments the standard language-model cross-entropy with a term that penalizes divergence between SNLP-produced hidden states and sequential hidden states: $$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \sum_{\ell \in \mathcal{I}} D\big(h_\ell^{(K)},\, h_\ell^{\mathrm{seq}}\big),$$ where $D$ stands for the discrepancy measure and $\lambda$ for its weight. In training, the paper uses a single HCN iteration ($K = 1$), while $\mathcal{I}$ selects a possibly strided subset of layers including the final layer (Han et al., 18 May 2026).
The stated purpose of this loss is twofold. First, it forces one HCN iteration to reproduce the sequential trace. Second, it pushes the surrogate mixing matrices toward the true layer Jacobians. This means that HCN is not merely an inference wrapper placed on top of a frozen architecture; rather, it is part of a co-design in which the model is trained to admit accurate structured Newton corrections.
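The shape of such an objective can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the mean-squared discrepancy, the `weight` parameter, and the function names `select_layers` and `snlp_loss` are all assumptions introduced here. The paper's actual discrepancy measure and weighting may differ.

```python
def select_layers(num_layers, stride):
    """Strided subset of 0-based layer indices that always contains the last layer."""
    picked = list(range(stride - 1, num_layers, stride))
    if num_layers - 1 not in picked:
        picked.append(num_layers - 1)
    return picked

def snlp_loss(ce_loss, hcn_trace, seq_trace, layers, weight):
    """Cross-entropy plus weight * mean squared gap on the chosen layers.

    hcn_trace and seq_trace are lists of per-layer state vectors: the first
    from a truncated HCN pass, the second from the sequential pass.
    """
    gap = 0.0
    for l in layers:
        diffs = [(a - b) ** 2 for a, b in zip(hcn_trace[l], seq_trace[l])]
        gap += sum(diffs) / len(diffs)
    return ce_loss + weight * gap / len(layers)

# Matching only the final hidden state corresponds to a stride equal to the depth.
final_only = select_layers(6, 6)
strided = select_layers(6, 4)
example = snlp_loss(2.0, [[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]], [1], 0.5)
```

When the HCN trace already equals the sequential trace, the penalty vanishes and the objective reduces to plain cross-entropy.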
For the reported 0.5B mHC Nanochat experiment, the paper fixes the number of streams, the number of layers, and the layer stride, with the regularizer matching only the final hidden state. The paper also states more broadly that SNLP regularization improves layer-parallel compatibility and can improve standard sequential perplexity on nanochat-scale Transformers.
6. Empirical results, scope, and limitations
The paper reports HCN-specific results on a 0.5B Nanochat-mHC model. The baseline sequential model has validation perplexity 73.24. After SNLP-aware regularization and one HCN iteration with prefix-state initialization, the sequential perplexity becomes 67.23, a reduction of about 6.0 points, or roughly 8% (Han et al., 18 May 2026).
The detailed mHC Nanochat results are summarized below.
| Configuration | PPL | Speed |
|---|---|---|
| Seq PPL (No Reg) | 73.24 | — |
| Seq PPL (HCN Reg) | 67.23 | — |
| 20×F1-h0, 7 | 66.56 | 8 |
| 8×F1-h0, 9 | 65.91 | 0 |
These results show two distinct regimes. A chunkwise layer-fusion configuration labeled 20×F1-h0 yields perplexity 66.56 together with a wall-clock speedup. A more quality-oriented configuration labeled 8×F1-h0 yields perplexity 65.91 with speed near parity. The paper interprets these results as evidence that HCN can both improve sequential perplexity by “cleaning up the layer trace” and unlock wall-clock speedups of roughly 20-30% on mHC models when chunked carefully (Han et al., 18 May 2026).
At the framework level, the paper reports that SNLP combined with layer fusion and chunkwise decomposition reaches a wall-clock speedup on a 0.5B Nanochat model while still improving perplexity. The same paper also characterizes several limitations. Off-the-shelf pretrained models are described as less amenable to the procedure. Exact convergence recovers the sequential computation, so additional iterations do not provide monotonic inference-time scaling. The paper therefore argues that layer-parallel inference should not be viewed solely as a numerical approximation to sequential execution, but can also act as a useful solver-induced inference bias. This suggests that the practical value of HCN lies not only in asymptotic parallelism, but in the joint interaction among architecture, training regularization, and truncated structured Newton inference.