Hourglass MLPs: High-Dimensional Residual Refinement

Updated 19 January 2026
  • Hourglass MLPs are neural network architectures that invert conventional residual block designs by employing wide high-dimensional skip connections and narrow bottleneck paths.
  • They leverage fixed random projections to efficiently lift input vectors, reducing trainable parameters while preserving geometric properties for robust performance.
  • Empirical results in generative, denoising, and image restoration tasks highlight their superior expressivity and parameter efficiency over conventional MLPs.

Hourglass MLPs are multi-layer perceptron architectures characterized by an inversion of the conventional block shape, employing a wide–narrow–wide structure. In these designs, residual (skip) connections operate in an expanded high-dimensional latent space, while the learnable computation proceeds through a sequence of narrow bottlenecks. This configuration facilitates highly expressive incremental refinement within a rich latent representation, while optimizing parameter economy and efficiency. Hourglass MLPs leverage fixed random projections into high-dimensional spaces, yielding further savings in trainable parameters and memory bandwidth. Empirical studies demonstrate consistent superiority of Hourglass architectures over conventional MLPs in generative, denoising, and image restoration tasks, with distinctly different scaling behaviors as parameter budgets increase (Chen et al., 2 Oct 2025).

1. Architectural Principles and Motivation

Conventional residual MLP blocks employ a narrow–wide–narrow schema:

  • Input/output dimension $d_x$ corresponds to token or pixel-vector size.
  • Hidden expansion $d_h > d_x$.
  • Block operation: $x_{i+1} = x_i + W_2\, \sigma(W_1\, \mathrm{norm}(x_i))$, where $W_1 \in \mathbb{R}^{d_h \times d_x}$, $W_2 \in \mathbb{R}^{d_x \times d_h}$.
  • The skip connection operates at $d_x$, confining learnable residuals to the input/output space.

Hourglass MLP blocks reverse this configuration:

  • Use a high-dimensional latent space $d_z \gg d_x$ for the skip connection.
  • Employ a narrow bottleneck $d_h < d_z$ for the computation pathway.
  • Structured as:

    1. Input lift: $z_0 = W_{\text{in}} x_0$, $W_{\text{in}} \in \mathbb{R}^{d_z \times d_x}$.
    2. $L$ residual Hourglass blocks: $z_{i+1} = z_i + W_2\, \sigma(W_1\, \mathrm{norm}(z_i))$, $W_1 \in \mathbb{R}^{d_h \times d_z}$, $W_2 \in \mathbb{R}^{d_z \times d_h}$.
    3. Final projection: $y = W_{\text{out}} z_L$, $W_{\text{out}} \in \mathbb{R}^{d_x \times d_z}$.

This design enables residual pathways to live in richer, high-dimensional feature spaces, potentially allowing for more expressive incremental corrections. The bottleneck restricts the cost of each block, facilitating greater model depth under a fixed parameter budget.
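The three-stage structure above (input lift, residual bottleneck blocks, final projection) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU activation, the parameter-free layer normalization, the Gaussian initialization scales, and the names `init_hourglass` and `hourglass_forward` are all hypothetical choices made for the example.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Parameter-free normalization over the feature axis (assumed form of "norm").
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def init_hourglass(d_x, d_z, d_h, depth, seed=0):
    # Hypothetical initializer: Gaussian weights with fan-in scaling.
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(depth):
        W1 = rng.normal(0.0, 1.0 / np.sqrt(d_z), size=(d_h, d_z))  # d_z -> d_h
        W2 = rng.normal(0.0, 1.0 / np.sqrt(d_h), size=(d_z, d_h))  # d_h -> d_z
        blocks.append((W1, W2))
    return {
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(d_x), size=(d_z, d_x)),
        "blocks": blocks,
        "W_out": rng.normal(0.0, 1.0 / np.sqrt(d_z), size=(d_x, d_z)),
    }

def hourglass_forward(params, x):
    # x has shape (batch, d_x).
    z = x @ params["W_in"].T                      # lift to (batch, d_z)
    for W1, W2 in params["blocks"]:
        h = np.maximum(layer_norm(z) @ W1.T, 0.0)  # narrow bottleneck, (batch, d_h)
        z = z + h @ W2.T                           # wide skip connection at d_z
    return z @ params["W_out"].T                   # project back to (batch, d_x)
```

Note that the skip connection never leaves the $d_z$-dimensional space; only the learned update passes through the narrow $d_h$ bottleneck.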

2. Fixed Random Projection Strategies

Hourglass MLPs frequently employ a fixed random projection $W_{\text{in}}$ to lift input vectors into the expanded latent space. Theoretical foundations in reservoir computing, random-feature models, the Johnson–Lindenstrauss lemma, and compressive sensing indicate that such projections preserve essential geometric and discriminative properties with high probability, provided $d_z \gg d_x$.

Key benefits include:

  • Elimination of trainable parameters for $W_{\text{in}}$.

  • Reduced memory and bandwidth overhead, as random matrices can be generated on-the-fly.
  • Comparable empirical performance: In ImageNet-32 denoising, models with fixed versus trainable $W_{\text{in}}$ yield nearly identical PSNR curves.

Across evaluated tasks, Hourglass MLPs with fixed projections consistently align with the Pareto frontier of their fully trainable counterparts.
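The geometry-preservation argument can be checked numerically. The sketch below is an illustration only: it assumes a Gaussian lift with entries of variance $1/d_z$ (the source does not specify the distribution), and the function names, dimensions, and seeds are hypothetical. It regenerates the matrix from a seed rather than storing it, and measures how pairwise distances change after the lift.

```python
import numpy as np

def fixed_random_lift(d_x, d_z, seed):
    # Regenerated deterministically from the seed, so the matrix
    # does not need to be stored or trained (assumed Gaussian entries).
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 1.0 / np.sqrt(d_z), size=(d_z, d_x))

def distance_ratios(W, points):
    # Ratio of lifted to original Euclidean distance for every pair of points.
    lifted = points @ W.T
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = np.linalg.norm(points[i] - points[j])
            after = np.linalg.norm(lifted[i] - lifted[j])
            ratios.append(after / before)
    return np.array(ratios)

W = fixed_random_lift(d_x=64, d_z=4096, seed=0)
points = np.random.default_rng(1).normal(size=(20, 64))
ratios = distance_ratios(W, points)
print(ratios.min(), ratios.max())
```

With $d_z \gg d_x$ the ratios concentrate near 1, meaning the lift is close to an isometry on these points even though no entry of the matrix was learned.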

3. Parameter Budget and Computational Complexity

Let $d_x$ be the input/output dimension, $d_z$ the expanded latent dimension ($d_z \gg d_x$), $d_h$ the bottleneck width, and $L$ the stack depth. Ignoring biases and normalization parameters, the parameter count for an Hourglass MLP is:

  • Trainable: $d_z d_x$ (input lift) $+\; 2 L d_z d_h$ (per-block) $+\; d_x d_z$ (output projection).
  • With fixed $W_{\text{in}}$: $2 L d_z d_h + d_x d_z$.

Contrast with conventional MLPs (expansion $d_h > d_x$):

  • $2 L d_x d_h$, where $d_h$ and $L$ here denote the conventional model's hidden width and depth.

To match parameter budgets, Hourglass architectures select $d_z$, $d_h$, $L$ such that the Hourglass total approximately equals the conventional total.

Forward FLOPs per block:

  • Hourglass: $O(d_z d_h)$ per input vector.
  • Conventional: $O(d_x d_h)$ per input vector.

The bottleneck width $d_h$ and increased depth $L$ allow Hourglass MLPs to sustain cost parity while enhancing expressivity through deeper stacks operating in wider latent dimensions.
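The budget comparison can be made concrete with a small calculator. This sketch assumes bias-free weights: a $d_z \times d_x$ input lift, two matrices of $d_z d_h$ entries per Hourglass block, a $d_x \times d_z$ output projection, and two matrices of $d_x d_h$ entries per conventional block. The function names and the example widths and depths are hypothetical; only $d_x = 3072$ follows from the flattened ImageNet-32 image size ($32 \times 32 \times 3$).

```python
def hourglass_params(d_x, d_z, d_h, depth, fixed_lift=False):
    # Input lift is excluded from the trainable count when it is a fixed random matrix.
    lift = 0 if fixed_lift else d_z * d_x
    blocks = depth * 2 * d_z * d_h
    out = d_x * d_z
    return lift + blocks + out

def conventional_params(d_x, d_h, depth):
    # Each conventional block has an up-projection and a down-projection.
    return depth * 2 * d_x * d_h

def block_flops(d_skip, d_inner):
    # Two matrix-vector products, counting one multiply and one add per weight.
    return 4 * d_skip * d_inner

trainable = hourglass_params(3072, 4096, 256, depth=8)
with_fixed_lift = hourglass_params(3072, 4096, 256, depth=8, fixed_lift=True)
conventional = conventional_params(3072, 4096, depth=2)
print(trainable, with_fixed_lift, conventional)
print(block_flops(4096, 256), block_flops(3072, 4096))
```

In this hypothetical configuration, eight Hourglass blocks cost fewer parameters than two conventional blocks, and fixing the lift removes a further $d_z d_x$ trainable weights.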

4. Empirical Performance and Scaling Behavior

Hourglass MLPs have been empirically evaluated on image-generation, denoising, and super-resolution tasks using MNIST and ImageNet-32 datasets:

| Task | Dataset | Hourglass params | Conventional params | Hourglass PSNR (dB) | Conventional PSNR (dB) |
| --- | --- | --- | --- | --- | --- |
| Denoising | MNIST | 66 M | 75 M | 22.31 | 22.31 |
| Super-resolution | ImageNet-32 | 69 M | 87 M | 24.00 | 24.00 |

Metrics employed include PSNR (dB), SSIM for reconstruction, and classification accuracy via prototype generation.

Hourglass MLPs consistently achieve superior performance–parameter Pareto frontiers in all evaluated settings. Optimization under increasing parameter budgets consistently drives Hourglass designs toward very large $d_z$ (up to about 4 K) and moderate $d_h$ (up to about 300), while increasing network depth $L$ (up to about 8) rather than bottleneck width. This “wider skip + narrower bottleneck + deeper stack” scaling is not Pareto-optimal for conventional MLPs.
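A performance–parameter Pareto frontier can be extracted from a set of evaluated models with a simple dominance check. The sketch below is illustrative: the function name and the model labels are hypothetical, and the two data points are the MNIST denoising row reported above (parameters in millions, PSNR in dB).

```python
def pareto_frontier(models):
    # models: list of (name, params, psnr) tuples.
    # A model is dominated if another one uses no more parameters and reaches
    # at least the same PSNR, with a strict improvement in at least one of the two.
    frontier = []
    for name, p, q in models:
        dominated = any(
            (p2 <= p and q2 >= q) and (p2 < p or q2 > q)
            for _, p2, q2 in models
        )
        if not dominated:
            frontier.append((name, p, q))
    return sorted(frontier, key=lambda m: m[1])

mnist_denoising = [
    ("hourglass", 66, 22.31),
    ("conventional", 75, 22.31),
]
frontier = pareto_frontier(mnist_denoising)
print(frontier)
```

Here the Hourglass model reaches the same PSNR with fewer parameters, so it is the only point left on the frontier.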

5. Broader Implications and Application Extensions

Findings suggest reconsideration of skip connection dimensionality in residual networks. Replacing conventional feed-forward layers in Transformers with hourglass-style FFNs (wide–narrow–wide), and adapting self-attention mechanisms to operate within the expanded latent $d_z$ space, could yield parameter savings in large language models.

In architectures such as U-Nets and MLP-Mixers, injecting a fixed random lift into high-dimensional latent space and operating through narrow-bottleneck Hourglass blocks allows flexible adaptation for tasks including classification, segmentation, and generation. Any residual network currently employing skips at a narrow feature size may achieve increased expressivity and parameter efficiency by relocating skip connections into expanded spaces and routing learned incremental changes through cost-effective bottlenecks.

6. Practical Guidelines for Construction

Recommendations for Hourglass MLP configuration:

  1. Select $d_z \gg d_x$, reaching up to about 5 K in the reported settings, ensuring geometry preservation via random lifts.
  2. Set $d_h$ to a moderate range (up to about 300) to maintain per-block cost parity with conventional blocks.
  3. Utilize maximal depth $L$ as allowed by the parameter budget; empirically, a depth of up to about 8 sufficiently saturates performance gains.
  4. Employ fixed random $W_{\text{in}}$ to optimize parameter usage and memory bandwidth.
  5. Assess model selection along the performance–parameter frontier; Hourglass MLPs typically dominate across varied generative and classification benchmarks.
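Guidelines 1 to 4 amount to fixing $d_z$ and $d_h$ and then spending the remaining budget on depth. The helper below sketches that step, assuming bias-free weights with a $d_x \times d_z$ output projection and $2 d_z d_h$ parameters per block; the function name, the 30 M budget, and the example dimensions are hypothetical.

```python
def max_depth_under_budget(budget, d_x, d_z, d_h, fixed_lift=True):
    # Parameters that do not depend on depth: the output projection,
    # plus the input lift when it is trained rather than fixed.
    depth_independent = d_x * d_z + (0 if fixed_lift else d_z * d_x)
    per_block = 2 * d_z * d_h
    if budget < depth_independent:
        return 0
    return (budget - depth_independent) // per_block

depth_fixed = max_depth_under_budget(30_000_000, d_x=3072, d_z=4096, d_h=256)
depth_trained = max_depth_under_budget(
    30_000_000, d_x=3072, d_z=4096, d_h=256, fixed_lift=False
)
print(depth_fixed, depth_trained)
```

In this hypothetical setting, fixing the random lift frees enough budget for several additional blocks compared with training the lift.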

The scaling and architectural principles identified in Hourglass MLPs suggest wide applicability and invite further investigation into expanded skip-dimensionality and bottleneck routing within modern neural architectures (Chen et al., 2 Oct 2025).
