Existence of a network achieving Bayes-optimal α^{-1} generalization while memorizing facts

Determine whether there exists a neural network architecture that simultaneously achieves the Bayes-optimal α^{-1} generalization rate on the teacher-rule task of the Rules-and-Facts (RAF) model and memorizes the random factual labels; if so, construct or rigorously analyze such an architecture (for example, a wide two-layer network with a trainable first layer) to establish this rate on RAF data.
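
Stated schematically (the symbols below are ours, not the paper's: ε(α) denotes the generalization error against the teacher rule at sample ratio α, f̂_α the trained predictor, and 𝓕 the set of fact examples), the question asks whether both conditions can hold simultaneously:

```latex
% epsilon(alpha): generalization error on the teacher rule at sample ratio alpha
% F: indices of training examples carrying random factual labels
\exists\ \text{architecture with trained predictor } \hat f_\alpha \ \text{such that}
\quad
\varepsilon(\alpha) = O\!\left(\alpha^{-1}\right) \ \text{(Bayes-optimal rate)}
\quad \text{and} \quad
\hat f_\alpha(x_i) = y_i \ \ \forall\, i \in \mathcal{F} \ \text{(fact memorization)}.
```

Kernel methods satisfy the memorization condition but, by the paper's result, only reach ε(α) = Θ(α^{-1/2}).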

Background

In the RAF setting, kernel ridge regression provably exhibits an α^{-1/2} decay of the generalization error, and numerical evidence indicates a similar rate for hinge-loss SVMs; both methods nonetheless memorize the facts. The Bayes-optimal predictor, by contrast, achieves the faster α^{-1} rate.
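
The following sketch illustrates the fixed-feature regime the paper analyzes. It is not the paper's RAF generative model: the data generator below (a linear teacher rule with a random "facts" subset relabeled at random) is a hypothetical stand-in, and the training-set size n plays the role of α. A near-interpolating kernel ridge regressor fits the random fact labels almost exactly while its error against the rule decays only slowly; the exact α^{-1/2} rate is the paper's asymptotic result, not something this toy run certifies.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_raf_like_data(n, d=20, fact_frac=0.2):
    """Hypothetical stand-in for RAF data: labels follow a linear teacher
    rule, except a random 'facts' subset receives random labels."""
    X = rng.standard_normal((n, d))
    w = np.ones(d) / np.sqrt(d)                       # teacher-rule direction
    y = X @ w
    fact_idx = rng.choice(n, size=int(fact_frac * n), replace=False)
    y[fact_idx] = rng.standard_normal(len(fact_idx))  # random factual labels
    return X, y, fact_idx, w

def rbf_kernel(A, B, gamma=0.05):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

for n in [200, 400, 800, 1600]:                       # n stands in for alpha
    X, y, fact_idx, w = make_raf_like_data(n)
    K = rbf_kernel(X, X)
    coef = np.linalg.solve(K + 1e-6 * np.eye(n), y)   # near-interpolating KRR
    # Memorization: near-zero error on the randomly labeled fact points.
    fact_mse = np.mean((rbf_kernel(X[fact_idx], X) @ coef - y[fact_idx]) ** 2)
    # Generalization: error against the teacher rule on fresh inputs.
    Xte = rng.standard_normal((2000, X.shape[1]))
    rule_mse = np.mean((rbf_kernel(Xte, X) @ coef - Xte @ w) ** 2)
    print(f"n={n:5d}  fact_mse={fact_mse:.2e}  rule_mse={rule_mse:.4f}")
```

Sweeping n over a wider range and fitting the exponent of the rule error's decay is the natural empirical check; the open problem is to prove which exponent a given architecture attains on actual RAF data.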

The authors suggest that feature learning (e.g., a wide two-layer network with a trainable first layer) might attain the α^{-1} rate while still memorizing the facts, but emphasize that analyzing such a network on RAF data is currently unresolved.
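
For concreteness, here is a minimal sketch of the candidate setup, again on the hypothetical RAF-like stand-in data from the previous block (not the paper's generative model). The only architectural point it encodes is the one the authors highlight: the first layer is trainable, so the network learns features rather than regressing on fixed random ones. Whether such training actually achieves the α^{-1} rate is precisely the open question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, width, n = 20, 2048, 800

# Hypothetical RAF-like training set: a linear teacher rule, with a
# random fifth of the points relabeled as random 'facts'.
X = torch.randn(n, d)
w_teacher = torch.ones(d) / d ** 0.5
y = X @ w_teacher
fact_idx = torch.randperm(n)[: n // 5]
y[fact_idx] = torch.randn(len(fact_idx))

# Wide two-layer network whose first layer is trainable (feature
# learning), as opposed to the fixed-feature / kernel regime the
# paper analyzes.
model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5000):
    opt.zero_grad()
    loss = ((model(X).squeeze(-1) - y) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    Xte = torch.randn(4000, d)
    rule_mse = ((model(Xte).squeeze(-1) - Xte @ w_teacher) ** 2).mean()
    fact_mse = ((model(X[fact_idx]).squeeze(-1) - y[fact_idx]) ** 2).mean()
print(f"fact_mse={fact_mse:.2e}  rule_mse={rule_mse:.4f}")
```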

References

This opens the following intriguing question: Does there exist a neural network that is able to reach a generalization rate of α^{-1} on data drawn from the RAF model and at the same time memorize the facts? Our work indicates that linear and kernel methods are insufficient for that purpose. It is possible that a wide two-layer neural network with a trainable first layer (as opposed to fixed features, as we considered in this work) will achieve this goal. However, an analysis of such a neural network learning on the RAF data remains a technically open problem, which we leave for future investigation.

The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks (2603.25579 - Farné et al., 26 Mar 2026) in Section 3.4 (The large-α generalization rate)