Generalization beyond the single benchmark

Determine whether the empirical improvements reported for Bilevel Autoresearch on the GPT pretraining benchmark generalize to other model sizes, training budgets, and tasks beyond the single evaluated setting (50M parameters, 300-second training budget, RTX 5090).

Background

The paper evaluates Bilevel Autoresearch on a single setup: GPT pretraining at 50M parameters with a fixed 300-second training budget on an RTX 5090 GPU. The reported gains primarily stem from Level 2 mechanism generation and injection, which produced a fivefold improvement over the baseline inner loop.

However, the authors explicitly note that this evaluation is limited to one benchmark and that it is not established whether the same improvements would hold across different model sizes, training budgets, or tasks. This leaves an open generalization gap that can only be closed by evaluating the method in broader settings.
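
A minimal sketch of what such a broader evaluation could look like is given below: a sweep over model sizes, training budgets, and tasks, comparing the method against the baseline inner loop at each grid point. All grid values other than the paper's evaluated point (50M parameters, 300-second budget, GPT pretraining), and the runner functions run_baseline and run_bilevel_autoresearch, are hypothetical placeholders, not the paper's protocol.

```python
# Hypothetical generalization sweep; the only anchored grid point is the paper's
# evaluated setting (50M parameters, 300 s, GPT pretraining). Everything else,
# including the runner functions, is an illustrative assumption.
from itertools import product

MODEL_SIZES = [50_000_000, 150_000_000, 350_000_000]   # parameters; 50M is the paper's setting
TRAIN_BUDGETS_S = [300, 1200, 3600]                     # seconds; 300 s is the paper's setting
TASKS = ["gpt_pretraining", "image_classification", "machine_translation"]  # assumed task set


def run_baseline(task, n_params, budget_s):
    """Placeholder: train the unmodified baseline inner loop and return a scalar score."""
    raise NotImplementedError


def run_bilevel_autoresearch(task, n_params, budget_s):
    """Placeholder: run the Level 2 mechanism-generation pipeline and return a scalar score."""
    raise NotImplementedError


def generalization_sweep():
    """Compare method vs. baseline on every grid point and record the relative improvement."""
    results = []
    for task, n_params, budget_s in product(TASKS, MODEL_SIZES, TRAIN_BUDGETS_S):
        base = run_baseline(task, n_params, budget_s)
        auto = run_bilevel_autoresearch(task, n_params, budget_s)
        # The paper reports roughly a fivefold improvement at its single evaluated point;
        # the open question is how this ratio behaves across the rest of the grid.
        results.append({
            "task": task,
            "params": n_params,
            "budget_s": budget_s,
            "improvement": auto / base if base else float("nan"),
        })
    return results
```

The grid structure is only one possible design; the essential point is that any generalization claim requires paired baseline and method runs at settings other than the single benchmark reported in the paper.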

References

Generalization to other model sizes, training budgets, or tasks is unproven.

Bilevel Autoresearch: Meta-Autoresearching Itself (2603.23420, Qu et al., 24 Mar 2026), Section "Discussion", Subsection "Limitations" (Single benchmark).