
HC Newton (HCN) in mHC-Style Transformers

Updated 4 July 2026
  • HC Newton (HCN) is a structured Newton method for mHC-style residual Transformers that replaces costly Jacobian-vector products with learned surrogate mixing matrices.
  • It reformulates the hidden-state trace as a nonlinear residual system and employs small-matrix corrections to enable efficient parallel computation.
  • SNLP-aware regularization improves both sequential perplexity and inference speed, as demonstrated by up to 20-30% speedups on 0.5B Nanochat models.

HC Newton (HCN) is the Structured Newton Layer Parallelism (SNLP) instantiation for mHC-style residual Transformers, where each block maintains a small learned residual-stream mixture that serves as a surrogate Jacobian in place of the full hidden-state Jacobian. In this formulation, the layerwise hidden-state trace is recast as a nonlinear residual equation, and inference is performed by Newton-style corrections that exploit the model’s residual mixing matrix rather than exact Jacobian-vector products. Within the broader SNLP framework, HCN is the mHC counterpart of Identity Newton (IDN), which is used for residual Transformers; the paper studies HCN as both an inference procedure and a training target via SNLP-aware regularization (Han et al., 18 May 2026).

1. Architectural setting and conceptual role

HCN is defined for mHC-style residual Transformers. In this setting, an $L$-layer decoder has hidden states $h_0,\dots,h_L \in \mathbb{R}^d$, and each block uses a small residual mixture over $M$ streams. The block is written in two stages:

$$x'_l = H_{l}^{res,attn}\,h_{l-1} + H_{l}^{post,attn}\cdot \mathrm{Attn}_l(H_{l}^{pre,attn}\,h_{l-1}),$$

$$h_l = H_{l}^{res,mlp}\,x'_l + H_{l}^{post,mlp}\cdot \mathrm{MLP}_l(H_{l}^{pre,mlp}\,x'_l).$$

Each $H^{\cdot}$ is an $M\times M$ stream-mixing matrix, with $M \ll d$ (Han et al., 18 May 2026).
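The two-stage block can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: it assumes the hidden state is laid out as $M$ streams of width `c` (so $d = M \cdot c$), applies each $M\times M$ matrix along the stream axis, and uses `np.tanh` as a stand-in for the attention and MLP branches. The names `mix`, `mhc_block`, and the parameter keys are hypothetical.

```python
import numpy as np

def mix(H, h):
    """Apply an (M, M) stream-mixing matrix to a hidden state of shape (M, c)."""
    return H @ h

def mhc_block(h_prev, p, attn, mlp):
    """Two-stage mHC-style block; p maps names to (M, M) mixing matrices."""
    # Attention stage: residual mixture plus mixed branch output.
    x = mix(p["res_attn"], h_prev) + mix(p["post_attn"], attn(mix(p["pre_attn"], h_prev)))
    # MLP stage: same pattern applied to the intermediate state x.
    return mix(p["res_mlp"], x) + mix(p["post_mlp"], mlp(mix(p["pre_mlp"], x)))

rng = np.random.default_rng(0)
M, c = 4, 8  # assumed layout: M streams of width c
names = ["res_attn", "pre_attn", "post_attn", "res_mlp", "pre_mlp", "post_mlp"]
params = {n: np.eye(M) + 0.1 * rng.standard_normal((M, M)) for n in names}

toy_branch = np.tanh  # placeholder for Attn_l and MLP_l, not real attention
h0 = rng.standard_normal((M, c))
h1 = mhc_block(h0, params, toy_branch, toy_branch)
print(h1.shape)
```

If both branches are switched off, the block reduces to the product of the two residual mixing matrices applied to the input, which is the part of the block that a stream-mixing surrogate can capture.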

Within SNLP, HCN addresses the latency bottleneck created by strictly sequential Transformer-layer execution. The central idea is not to remove the forward dependency exactly, but to relax it through a structured Newton correction whose surrogate dynamics are induced by the architecture itself. The paper places HCN in a broader program that asks whether the hidden-state trace across layers can be treated as the solution of a nonlinear residual equation and then approximated by parallel Newton-style updates. In that sense, HCN is a solver-based layer-parallel approximation specialized to mHC blocks rather than a generic parallelization heuristic.

A common misunderstanding is to treat HCN as exact Newton inference. The paper explicitly distinguishes the two: exact Newton would require expensive Jacobian-vector products, whereas HCN replaces the exact layer Jacobians with a cheap surrogate built from residual mixing matrices. This makes HCN a structured Newton method rather than a full Newton solve.

2. Residual-system formulation

The mHC forward pass is rewritten as a root-finding problem. Writing $H=(h_1,\dots,h_L)$, letting $f_l$ denote the map computed by block $l$, and defining

$$G_l(H) = h_l - f_l(h_{l-1}), \qquad l=1,\dots,L,$$

the sequential forward pass satisfies $G(H) = 0$, where $G = (G_1,\dots,G_L)$ is a block-lower-bidiagonal system (Han et al., 18 May 2026).

This formulation is the mathematical basis for HCN. Rather than viewing the model purely as a composition of layers, the method views the full layer trace as the solution of a coupled nonlinear system. In exact Newton form, the Jacobian of $G$ would be

$$J_G(H) = \begin{pmatrix} I & & & \\ -J_2 & I & & \\ & \ddots & \ddots & \\ & & -J_L & I \end{pmatrix}$$

with $J_l = \partial f_l(h_{l-1}) / \partial h_{l-1}$.
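The residual view can be checked numerically on a toy stack of layers. The sketch below assumes simple residual maps in place of real Transformer blocks (the helper names `make_layer`, `sequential_trace`, and `residual` are hypothetical). It evaluates the residual on the sequential trace, then perturbs a single hidden state to show that only two residual blocks react, which is the lower-bidiagonal coupling described above.

```python
import numpy as np

def make_layer(W):
    # Toy residual layer standing in for a Transformer block (assumption).
    return lambda h: h + 0.1 * np.tanh(W @ h)

def sequential_trace(h0, layers):
    """Run the layers in order and return [h_1, ..., h_L]."""
    trace, h = [], h0
    for f in layers:
        h = f(h)
        trace.append(h)
    return trace

def residual(trace, h0, layers):
    """Blocks G_l = h_l - f_l(h_{l-1}) for a candidate trace."""
    inputs = [h0] + trace[:-1]
    return [h - f(x) for h, x, f in zip(trace, inputs, layers)]

rng = np.random.default_rng(0)
d, L = 16, 5
layers = [make_layer(rng.standard_normal((d, d)) / np.sqrt(d)) for _ in range(L)]
h0 = rng.standard_normal(d)

trace = sequential_trace(h0, layers)
G_seq = residual(trace, h0, layers)  # vanishes on the sequential trace

perturbed = [h.copy() for h in trace]
perturbed[1] = perturbed[1] + 1.0    # perturb h_2 only (list index 1)
G_pert = residual(perturbed, h0, layers)
changed = [bool(np.any(g != 0.0)) for g in G_pert]
print(changed)
```

Only the residual block of the perturbed state and the block of the layer that consumes it become nonzero, so each row of the Jacobian of the system has at most two nonzero blocks.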

The significance of this reformulation is methodological. It turns the serial layer dependency into a structured nonlinear solve, which in turn makes it possible to ask whether a small number of approximate Newton corrections can recover most of the sequential computation. The paper’s broader claim is that this perspective is principled, but that naïve fixed-point iterations are unstable on trained Transformers and exact Newton is too expensive. HCN occupies the middle ground created by that observation.

3. Surrogate Jacobian and HC Newton update

The defining approximation in HCN is the replacement of the exact layer Jacobian $J_l$ by a small surrogate matrix

$$\tilde{J}_l = H_l^{res,mlp}\,H_l^{res,attn},$$

which is only $M\times M$ (Han et al., 18 May 2026). The paper motivates this by decomposing

$$J_l = H_l^{res,mlp}\,H_l^{res,attn} + (\text{terms involving the Jacobians of the } \mathrm{Attn}_l \text{ and } \mathrm{MLP}_l \text{ branches}),$$

and then training the model so that these branch terms become small, making $J_l \approx \tilde{J}_l$.
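The decomposition can be made explicit by applying the chain rule to the two-stage block equations. In the sketch below, $A_l$ and $B_l$ denote the Jacobians of $\mathrm{Attn}_l$ and $\mathrm{MLP}_l$ at their inputs, and $\tilde{J}_l$ denotes the surrogate, assumed here to be the product of the two residual mixing matrices. These symbols are introduced for illustration and may differ from the paper's notation. Each stream-mixing matrix is understood as acting on the stream axis of the hidden state.

```latex
\begin{aligned}
\frac{\partial x'_l}{\partial h_{l-1}} &= H_l^{res,attn} + H_l^{post,attn}\,A_l\,H_l^{pre,attn}, \\
\frac{\partial h_l}{\partial x'_l} &= H_l^{res,mlp} + H_l^{post,mlp}\,B_l\,H_l^{pre,mlp}, \\
J_l = \frac{\partial h_l}{\partial x'_l}\,\frac{\partial x'_l}{\partial h_{l-1}}
&= \underbrace{H_l^{res,mlp}\,H_l^{res,attn}}_{\tilde{J}_l}
 + H_l^{res,mlp}\,H_l^{post,attn}\,A_l\,H_l^{pre,attn} \\
&\quad + H_l^{post,mlp}\,B_l\,H_l^{pre,mlp}\,H_l^{res,attn}
 + H_l^{post,mlp}\,B_l\,H_l^{pre,mlp}\,H_l^{post,attn}\,A_l\,H_l^{pre,attn}.
\end{aligned}
```

Every term other than the first contains $A_l$ or $B_l$, so the approximation error vanishes as the branch contributions shrink.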

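Newton’s method for $G(H) = 0$ would take the form

$$H^{(k+1)} = H^{(k)} - J_G\big(H^{(k)}\big)^{-1}\,G\big(H^{(k)}\big).$$

HCN replaces the exact Jacobian $J_G$ with a surrogate block matrix whose diagonal blocks are $I$ and whose off-diagonal blocks are the negated layer surrogates $-\tilde{J}_l$:

$$\tilde{J}_G = \begin{pmatrix} I & & & \\ -\tilde{J}_2 & I & & \\ & \ddots & \ddots & \\ & & -\tilde{J}_L & I \end{pmatrix},$$

$$H^{(k+1)} = H^{(k)} - \tilde{J}_G^{-1}\,G\big(H^{(k)}\big).$$

Because this surrogate system is block-lower-bidiagonal, applying the inverse reduces to a recurrence: $h_1^{(k+1)} = f_1(h_0)$ and for $l = 2,\dots,L$,

$$h_l^{(k+1)} = f_l\big(h_{l-1}^{(k)}\big) + \tilde{J}_l\,\big(h_{l-1}^{(k+1)} - h_{l-1}^{(k)}\big).$$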

This update is the operational core of HCN. It preserves the lower-bidiagonal structure of the original residual system while replacing the expensive $d\times d$ Jacobians with the model’s learned stream-mixing matrices. A plausible implication is that HCN is best understood as a structured preconditioned Newton step whose preconditioner is supplied by the architecture.
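The step from the surrogate linear system to a layerwise recurrence can be written out as a short derivation. The notation is assumed for illustration: $\tilde{J}_l$ is the surrogate of layer $l$, $\tilde{J}_G$ is the block matrix with identity diagonal blocks and $-\tilde{J}_l$ on the subdiagonal, and $\Delta = H^{(k+1)} - H^{(k)}$ is the Newton correction with blocks $\Delta_l$.

```latex
\begin{aligned}
\tilde{J}_G\,\Delta &= -G\big(H^{(k)}\big), \\
\text{row } l:\qquad \Delta_l - \tilde{J}_l\,\Delta_{l-1} &= f_l\big(h_{l-1}^{(k)}\big) - h_l^{(k)}, \qquad \Delta_0 = 0, \\
\Longrightarrow\qquad h_l^{(k+1)} = h_l^{(k)} + \Delta_l
&= f_l\big(h_{l-1}^{(k)}\big) + \tilde{J}_l\,\big(h_{l-1}^{(k+1)} - h_{l-1}^{(k)}\big).
\end{aligned}
```

Forward substitution therefore needs one nonlinear evaluation per layer, all of which depend only on the previous iterate, followed by a cheap sequential pass with the small surrogate matrices. Since $h_0$ is fixed, the first state is exact after one sweep, and each further sweep makes at least one more layer exact, so at most $L$ sweeps recover the sequential trace regardless of how good the surrogate is.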

4. Inference procedure and computational profile

The paper describes HCN inference in a prefix–suffix decomposition. A prefix of $P$ layers is run sequentially, and a suffix of $S$ layers is then processed by SNLP. With $L = P + S$, the prefix is computed by the ordinary sequential recurrence $h_l = f_l(h_{l-1})$ for $l = 1,\dots,P$. The suffix is initialized by setting $h_l^{(0)} = h_P$ for $l = P+1,\dots,L$ (Han et al., 18 May 2026).

Each HCN iteration then has two stages. First, the nonlinear blocks in the suffix are evaluated in parallel:

$$y_l = f_l\big(h_{l-1}^{(k)}\big), \qquad l = P+1,\dots,L.$$

Second, a sequential small-matrix correction is applied:

$$h_l^{(k+1)} = y_l + \tilde{J}_l\,\big(h_{l-1}^{(k+1)} - h_{l-1}^{(k)}\big), \qquad l = P+1,\dots,L,$$

where $h_P$ is held fixed at its prefix value and $\tilde{J}_l$ is the $M\times M$ surrogate of layer $l$. The paper states that, in practice, $K=1$ or $K=2$ iterations is enough on mHC models trained with the SNLP-aware loss.
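The prefix and suffix procedure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: layers are toy maps on a state of shape `(M, c)`, each surrogate is the $M\times M$ mixing matrix of the corresponding toy layer, the suffix is initialized from the last prefix state, and the "parallel" evaluations are written as a plain list comprehension. The names `hcn_infer`, `make_layer`, and `sequential` are hypothetical.

```python
import numpy as np

def make_layer(A, eps):
    # Toy layer: stream mixing plus a small nonlinear branch (assumption).
    return lambda h: A @ h + eps * np.tanh(h)

def sequential(h0, layers):
    h = h0
    for f in layers:
        h = f(h)
    return h

def hcn_infer(h0, layers, surrogates, prefix_len, num_iters):
    """Sequential prefix, then structured Newton sweeps over the suffix."""
    h = h0
    for f in layers[:prefix_len]:            # ordinary sequential recurrence
        h = f(h)
    anchor = h                               # last exactly computed state
    suffix, sur = layers[prefix_len:], surrogates[prefix_len:]
    if not suffix:
        return anchor
    states = [anchor.copy() for _ in suffix]  # prefix-state initialization
    for _ in range(num_iters):
        inputs = [anchor] + states[:-1]
        # Stage 1: evaluations depend only on the old iterate (parallelizable).
        evals = [f(x) for f, x in zip(suffix, inputs)]
        # Stage 2: cheap sequential correction with the small matrices.
        new = [evals[0]]
        for l in range(1, len(suffix)):
            new.append(evals[l] + sur[l] @ (new[l - 1] - states[l - 1]))
        states = new
    return states[-1]

rng = np.random.default_rng(0)
M, c, L = 4, 8, 6
mats = [np.eye(M) + 0.1 * rng.standard_normal((M, M)) for _ in range(L)]
layers = [make_layer(A, 0.05) for A in mats]
h0 = rng.standard_normal((M, c))

reference = sequential(h0, layers)
approx = hcn_infer(h0, layers, mats, prefix_len=2, num_iters=1)
print(float(np.max(np.abs(approx - reference))))
```

In this toy setting, one sweep is exact when the layers are purely linear mixtures, because the surrogate then equals the true Jacobian. With the nonlinear branch present, the number of sweeps needed for exact agreement is at most the suffix length.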

The computational argument for HCN is tied to this separation between expensive nonlinear block evaluations and cheap stream-mixing corrections. If $C$ denotes the cost of one Transformer block forward, sequential execution has cost approximately $L \cdot C$. HCN with $K$ iterations evaluates the $S$ suffix blocks in parallel, giving a suffix cost of approximately $K \cdot C$ when the layers are batched into one giant fused kernel, plus a correction loop over the suffix that involves only the small stream-mixing matrices. Since $M \ll d$, the paper characterizes this correction cost as negligible versus $C$. The memory footprint requires two copies of the suffix states, whereas sequential execution needs only the current state, so the overhead grows with the suffix length $S$ (Han et al., 18 May 2026).

5. SNLP-aware regularization

HCN is not presented only as an inference-time approximation. The paper also introduces SNLP-aware regularization to make the surrogate matrices $\tilde{J}_l$ better approximations to the true layer Jacobians. The training objective augments the standard language-model cross-entropy with a term that penalizes divergence between SNLP-produced hidden states and sequential hidden states:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \sum_{l \in \mathcal{S}} D\big(h_l^{(K)},\, h_l^{\mathrm{seq}}\big),$$

where $h_l^{(K)}$ is the state after $K$ HCN iterations, $h_l^{\mathrm{seq}}$ is the sequential state, and $D$ measures their divergence. In training, the paper uses $K=1$, while $\mathcal{S}$ selects a possibly strided subset of layers including the final layer (Han et al., 18 May 2026).

The stated purpose of this loss is twofold. First, it forces one HCN iteration to reproduce the sequential trace. Second, it pushes the surrogate $\tilde{J}_l$ toward the true layer Jacobian $J_l$. This means that HCN is not merely an inference wrapper placed on top of a frozen architecture; rather, it is part of a co-design in which the model is trained to admit accurate structured Newton corrections.
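A regularizer of this kind could be computed as in the sketch below. The exact form used in the paper is not reproduced here: the mean squared error, the `weight` parameter, and the stride convention are all assumptions made for illustration, and the function names are hypothetical. Whether gradients flow through the sequential target is likewise left open.

```python
import numpy as np

def strided_layers(num_layers, stride):
    """Every `stride`-th layer index (0-based), always including the final layer."""
    ids = list(range(stride - 1, num_layers, stride))
    if not ids or ids[-1] != num_layers - 1:
        ids.append(num_layers - 1)
    return ids

def snlp_regularizer(seq_trace, snlp_trace, layer_ids):
    """Mean squared gap between SNLP and sequential states on selected layers."""
    gaps = [np.mean((snlp_trace[l] - seq_trace[l]) ** 2) for l in layer_ids]
    return float(np.mean(gaps))

def total_loss(ce_loss, seq_trace, snlp_trace, layer_ids, weight):
    """Cross-entropy plus the weighted trace-matching penalty (assumed form)."""
    return ce_loss + weight * snlp_regularizer(seq_trace, snlp_trace, layer_ids)

# Toy usage: four layers, penalty applied to the final hidden state only.
seq = [np.full((2, 3), float(i)) for i in range(4)]
snlp = [s + 0.1 for s in seq]
final_only = strided_layers(num_layers=4, stride=4)
print(final_only, total_loss(2.0, seq, snlp, final_only, weight=0.5))
```

Choosing a stride equal to the number of layers matches only the final hidden state, which is the cheapest variant of the penalty; smaller strides constrain more of the intermediate trace.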

For the reported 0.5B mHC Nanochat experiment, the paper uses […] streams, […] layers, […], and stride […], with the regularizer matching only the final hidden state. The paper also states more broadly that SNLP regularization improves layer-parallel compatibility and can improve standard sequential perplexity, reducing baseline perplexity by […] on nanochat-scale Transformers.

6. Empirical results, scope, and limitations

The paper reports HCN-specific results on a 0.5B Nanochat-mHC model. The baseline sequential model has validation perplexity 73.24. After SNLP-aware regularization and one HCN iteration with prefix-state initialization, the sequential perplexity becomes 67.23, a reported improvement of […] (Han et al., 18 May 2026).

The detailed mHC Nanochat results are summarized below.

| Configuration | PPL | Speed |
|---|---|---|
| Seq PPL (No Reg) | 73.24 | |
| Seq PPL (HCN Reg) | 67.23 | |
| 20×F1-h0, […] | 66.56 | […] |
| 8×F1-h0, […] | 65.91 | […] |

These results show two distinct regimes. A chunkwise layer-fusion configuration labeled 20×F1-h0 with […] yields perplexity 66.56 and wall-clock speedup of approximately […]. A more quality-oriented configuration labeled 8×F1-h0 with […] yields perplexity 65.91 with speed near parity, approximately […]. The paper interprets these results as evidence that HCN can both improve sequential perplexity by “cleaning up the layer trace” and unlock approximately 20-30% wall-clock speedups on mHC models when chunked carefully (Han et al., 18 May 2026).
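The relative size of the perplexity changes can be worked out directly from the table. The percentages produced below are computed here from the tabulated perplexities and are not quoted from the paper; the dictionary keys are shortened labels chosen for illustration.

```python
# Perplexities copied from the table above; keys are shortened labels.
baseline = 73.24  # Seq PPL (No Reg)
results = {
    "Seq PPL (HCN Reg)": 67.23,
    "20xF1-h0": 66.56,
    "8xF1-h0": 65.91,
}

def relative_reduction(base, value):
    """Percentage by which `value` is lower than `base`."""
    return 100.0 * (base - value) / base

reductions = {name: relative_reduction(baseline, ppl) for name, ppl in results.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower perplexity than the unregularized baseline")
```

Most of the gap to the baseline is already present in the regularized sequential model, with the HCN inference configurations adding a smaller further reduction.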

At the framework level, the paper reports that SNLP combined with layer fusion and chunkwise decomposition reaches […] speedup on a 0.5B Nanochat model while still improving perplexity by […]. The same paper also characterizes several limitations. Off-the-shelf pretrained models are described as less amenable to the procedure. Exact convergence recovers the sequential computation rather than providing monotonic inference-time scaling. The paper therefore argues that layer-parallel inference should not be viewed solely as a numerical approximation to sequential execution, but can also act as a useful solver-induced inference bias. This suggests that the practical value of HCN lies not only in asymptotic parallelism, but in the joint interaction among architecture, training regularization, and truncated structured Newton inference.
