
RadiX-Nets: Sparse Neural Architectures

Updated 26 January 2026
  • RadiX-Nets are sparse feed-forward neural network architectures defined by mixed-radix factorizations that ensure uniform connectivity, path-connectedness, and tunable sparsity.
  • They achieve near-dense performance (e.g., 95.5% accuracy on MNIST at 90% sparsity) while significantly reducing computational load and parameter counts.
  • Their flexible design supports a combinatorial variety of topologies and maintains theoretical expressivity guarantees comparable to fully connected models.

RadiX-Nets are a class of sparse feed-forward neural network topologies whose connectivity patterns are generated deterministically from mixed-radix numeral systems, without reference to a dense parent network or post hoc pruning. These architectures aim to produce highly sparse, path-connected, and degree-uniform topologies that retain the expressive power and performance of their dense counterparts while reducing memory and computational requirements. RadiX-Nets generalize the explicit expander-inspired X-Nets by enabling much more diverse and tunable sparse structures through combinatorial factorizations, offering precise control over sparsity and degree distributions across layers (Robinett et al., 2019).

1. Formal Definition and Construction

A RadiX-Net is formally specified by assigning a mixed-radix factorization to each pair of consecutive layers. Given a layer $\ell$ with $n_\ell$ neurons and layer $\ell+1$ with $n_{\ell+1}$ neurons, a radix list

$$R^\ell = [r^\ell_1, r^\ell_2, \dots, r^\ell_{d_\ell}]$$

is chosen so that $\prod_{i=1}^{d_\ell} r^\ell_i = n_{\ell+1}/n_\ell$ and each $r^\ell_i \in \mathbb{Z}^+$. The adjacency matrix $M^\ell \in \{0,1\}^{n_\ell \times n_{\ell+1}}$ describing inter-layer connections is constructed via a Kronecker sum of cyclic permutation (circulant) blocks:

$$M^\ell = \sum_{i=1}^{d_\ell} I_{r^\ell_1} \otimes \cdots \otimes I_{r^\ell_{i-1}} \otimes A(r^\ell_i) \otimes I_{r^\ell_{i+1}} \otimes \cdots \otimes I_{r^\ell_{d_\ell}},$$

where, for integer $r$,

$$A(r)_{p,q} = \begin{cases} 1, & q \equiv p+1 \pmod r, \\ 0, & \text{otherwise}. \end{cases}$$

The resulting network is highly sparse, with $\|M^\ell\|_0 \ll n_\ell\, n_{\ell+1}$. The full network topology is the collection $\mathcal{M} = \{M^1, M^2, \dots, M^L\}$ over all $L$ layers (Kwak et al., 2023).
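The Kronecker-sum construction can be sketched in a few lines of NumPy for the square case $n_\ell = n_{\ell+1} = \prod_i r^\ell_i$; the function and variable names below are illustrative, not from the reference suite:

```python
import numpy as np

def cyclic_shift(r):
    """A(r): r x r circulant with A[p, q] = 1 iff q == (p + 1) mod r."""
    A = np.zeros((r, r), dtype=int)
    for p in range(r):
        A[p, (p + 1) % r] = 1
    return A

def radix_adjacency(radix_list):
    """Kronecker sum of blocks: sum_i I_{r_1} x ... x A(r_i) x ... x I_{r_d}."""
    n = int(np.prod(radix_list))
    M = np.zeros((n, n), dtype=int)
    for i in range(len(radix_list)):
        term = np.ones((1, 1), dtype=int)
        for j, r in enumerate(radix_list):
            block = cyclic_shift(r) if j == i else np.eye(r, dtype=int)
            term = np.kron(term, block)
        M += term
    return M

M = radix_adjacency([2, 3])  # 6 x 6 inter-layer adjacency
```

With the single-shift circulant $A(r)$ as displayed, each term of the sum is a permutation matrix, so every row and column of `M` carries one connection per radix digit; denser circulant blocks raise the degree accordingly.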

A generalization involves stacking multiple mixed-radix “blocks” and, via Kronecker product, incorporating arbitrarily sized dense widths between these structured sparse cores, resulting in flexible and scalable architectures (Robinett et al., 2019).
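This dense-width generalization amounts to taking a Kronecker product of a structured sparse core with an all-ones block; a hypothetical two-line illustration (names are mine, not the paper's):

```python
import numpy as np

# Hypothetical illustration: widen a structured sparse core by a dense
# 2 x 2 block via Kronecker product, preserving the core's sparsity pattern.
core = np.eye(3, dtype=int)                          # stand-in for a mixed-radix block
widened = np.kron(core, np.ones((2, 2), dtype=int))  # shape (6, 6), block-diagonal
```

Each nonzero of the core expands into a dense 2 x 2 sub-block, so the overall sparsity fraction of the core is preserved at the larger width.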

2. Structural Properties

RadiX-Nets are characterized by three principal structural properties:

  • Uniformity: Each neuron in layer $\ell$ has exactly $\sum_i r^\ell_i$ outgoing connections and each neuron in layer $\ell+1$ has the same in-degree, ruling out "dead" or degree-imbalanced neurons.
  • Path-connectedness & Equidistance: The Kronecker-sum construction ensures that the bipartite graph connecting adjacent layers is fully connected and that every input neuron has the same (minimal) graph distance to every output neuron. This equidistant, regular routing pattern underlies stable gradient propagation throughout the network (Kwak et al., 2023).
  • Sparsity Tunability: The per-layer and global sparsity levels,

$$s_\ell = 1 - \frac{\|M^\ell\|_0}{n_\ell\, n_{\ell+1}}, \qquad s_{\text{global}} = 1 - \frac{\sum_\ell \|M^\ell\|_0}{\sum_\ell n_\ell\, n_{\ell+1}},$$

are determined directly by the selected radix factorizations. Typical configurations achieve $s_{\text{global}} \approx 90\%$ or higher without loss of connectivity (Kwak et al., 2023; Robinett et al., 2019).
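These sparsity levels follow directly from the binary masks; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def layer_sparsity(M):
    """s_l = 1 - ||M||_0 / (n_l * n_{l+1}) for one binary layer mask."""
    return 1.0 - np.count_nonzero(M) / M.size

def global_sparsity(masks):
    """Global sparsity over all layer masks, weighted by layer size."""
    nnz = sum(np.count_nonzero(M) for M in masks)
    total = sum(M.size for M in masks)
    return 1.0 - nnz / total

# A 10 x 10 mask with one nonzero per row is 90% sparse.
M = np.zeros((10, 10), dtype=int)
M[np.arange(10), np.arange(10)] = 1
```

Because the global figure weights each layer by its dense size $n_\ell\, n_{\ell+1}$, it is not simply the average of the per-layer sparsities when layer widths differ.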

The combinatorial freedom in mixed-radix selection enables RadiX-Nets to realize an exponential diversity of topologies compared to prior X-Net/Cayley-graph designs, supporting arbitrary choices of blockwise radices and layer widths (Robinett et al., 2019).

3. Theoretical Expressivity

The expressive power of RadiX-Nets is formalized via a functional-analytic conjecture connecting their sparse, path-connected topologies to dense universal approximators. Let $\mathcal{S}_N$ denote the set of symmetric, path-connected sparse feed-forward topologies of depth $N$ (including RadiX-Nets), and $\mathcal{D}_N$ the corresponding class of all fully connected architectures.

For the class $C([0,1]^n)$ of continuous functions, the approximation width of a class $\mathcal{X}$ is given by

$$\delta(\mathcal{X}) = \sup_{f \in C([0,1]^n)} \inf_{g \in \mathcal{X}} \|f - g\|_\infty.$$

The conjecture asserts that if the fully connected family achieves $\delta(\mathcal{D}_N) = O(N^{-p})$, then $\delta(\mathcal{S}_N) = O(N^{-p})$ as well. Thus, RadiX-Nets would possess the same uniform approximation rates as dense nets, provided symmetry and path-connectedness hold, despite drastically fewer parameters (Robinett et al., 2019).

4. Empirical Performance and Training Dynamics

Experiments with the RadiX-Net TensorFlow 2.x testing suite, using the MNIST dataset and the LeNet-300-100 fully connected architecture, demonstrate that:

  • Validation accuracies remain within 2–3% of the dense baseline (98.2%) for global sparsities up to 90%.
  • Parameter counts and per-epoch runtimes decrease nearly in proportion to achieved sparsity (e.g., at 90% sparsity: 26,761 parameters, 0.24s/epoch, 95.5% val. accuracy).
  • For $s_{\text{global}} \gtrsim 75\%$, variance in training outcomes increases and seed sensitivity becomes pronounced, particularly above 90% sparsity.
  • Around 5% of high-sparsity models, typically those with at least one large radix ($\geq 10$), exhibit "strange" training dynamics: persistent plateaus at low accuracies (28%–80%), rapid plateauing, and near-insensitivity to initialization. Such pathologies are attributed to localized bottlenecks that violate uniform routing, concentrating paths through a small neuron subset and impeding optimization (Kwak et al., 2023).

A table summarizing representative results on MNIST is as follows:

| Model | Global Sparsity ($s_{\text{global}}$) | Val. Accuracy (%) | Epoch Time (s) | # Parameters |
|-------|---------------------------------------|-------------------|----------------|--------------|
| Dense | 0%                                    | 98.2              | 0.45           | 267,610      |
| RadiX | 50%                                   | 97.8              | 0.38           | 133,805      |
| RadiX | 75%                                   | 97.0              | 0.30           | 66,902       |
| RadiX | 90%                                   | 95.5              | 0.24           | 26,761       |
| RadiX | 99%                                   | 90.8              | 0.12           | 2,676        |

This suggests that properly parameterized RadiX-Nets closely track dense performance at up to 90% sparsity while yielding significant improvements in computational and storage workloads (Kwak et al., 2023, Robinett et al., 2019).
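Since masked weights are excluded from the parameter count, the sparse counts reported in the table scale as $(1 - s_{\text{global}})$ times the dense count; a quick consistency check of the reported numbers:

```python
# Consistency check of the table above: sparse = round((1 - s) * dense).
dense_params = 267_610  # reported LeNet-300-100 parameter count
reported = {0.50: 133_805, 0.75: 66_902, 0.90: 26_761, 0.99: 2_676}
for s, n in reported.items():
    assert round(dense_params * (1 - s)) == n
```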

5. Comparison with X-Nets

X-Nets, as introduced by Prabhu et al., are constructed using random expander graphs or explicit Cayley-graph (X-Linear) layers, the latter requiring constant out-degree and equal-size layers. This approach enables strong path-connectedness, but the diversity of possible topologies is severely constrained compared to RadiX-Nets (Robinett et al., 2019). RadiX-Net construction supports arbitrary layer widths and mixed-radix systems within and across layers, yielding a family that grows combinatorially in the number of radix choices and their arrangements.

Both frameworks empirically demonstrate dense-model-matching performance at densities as low as 5%–10%. However, RadiX-Nets' explicit symmetry and equidistance properties, flexibility of construction, and theoretical approximation guarantees set them apart in terms of structural generality and theoretical foundation.

6. Practical Implementation and Tooling

A dedicated TensorFlow-based suite implements RadiX-Net construction, visualization, and benchmarking (Kwak et al., 2023). The principal components are:

  1. Mask Function: Generates $n_\ell \times n_{\ell+1}$ sparse binary masks from specified radix lists, enabling zeroing of non-existent weights.
  2. RadixLayer: Subclasses tf.keras.layers.Layer to support RadiX-structured sparsity, applying the mask at each forward pass.
  3. CustomModel: Combines multiple RadixLayers according to user-specified mixed-radix lists and activations with visualization and training-curve comparison functionality.

Standard training employs the Adam optimizer, categorical cross-entropy loss, and explicit application of masks to enforce sparsity throughout optimization.
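The mask-then-multiply pattern can be sketched framework-agnostically in NumPy; the real suite subclasses tf.keras.layers.Layer, so the class below is a hypothetical stand-in for RadixLayer's forward pass only:

```python
import numpy as np

class MaskedDense:
    """Sketch of a masked dense layer: a fixed binary mask zeroes
    non-existent weights on every forward pass, so sparsity is enforced
    throughout optimization even though the weight array stays dense."""

    def __init__(self, mask, seed=0):
        rng = np.random.default_rng(seed)
        self.mask = mask.astype(float)                  # n_in x n_out binary mask
        self.W = rng.standard_normal(mask.shape) * 0.1  # trainable weights
        self.b = np.zeros(mask.shape[1])                # trainable biases

    def __call__(self, x):
        # Masked affine map followed by ReLU.
        return np.maximum(x @ (self.W * self.mask) + self.b, 0.0)

layer = MaskedDense(np.eye(4))   # identity mask: 75% sparse layer
y = layer(np.ones((2, 4)))
```

Applying the mask inside the forward pass (rather than only at initialization) matters because otherwise optimizer updates would re-populate the pruned positions.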

7. Limitations and Directions for Future Research

While RadiX-Nets deliver substantial theoretical and empirical efficiency gains, sparsity-induced pathologies (“strange models”) signal that not all radix factorizations are suitable. Large radices in particular can create narrow funnels and impede robust training due to loss of uniform mixing.

Outstanding research directions include:

  • Developing heuristics or optimization-based methods for mixed-radix selection to avoid topological bottlenecks.
  • Theoretical analysis of the spectral properties of mixed-radix adjacency matrices in relation to training stability and gradient flow.
  • Extending sparse topology construction to convolutional, transformer, and other architectures on larger-scale datasets (CIFAR-10/100, ImageNet).
  • Investigation of initialization schemes tailored to RadiX-induced sparsity patterns (e.g., modified Glorot or orthogonal approaches).

These avenues aim to transition RadiX-Nets into broadly deployable models for highly scalable sparse deep learning (Kwak et al., 2023, Robinett et al., 2019).
