Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Published 15 May 2026 in cs.AI | (2605.15871v1)

Abstract: Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.


Authors (8)

Summary

The paper presents dual frameworks, AIRA-Compose and AIRA-Design, which autonomously search and refine neural architectures to outperform established models.
The paper uses ensembles of LLM-powered agents to explore large combinatorial design spaces; at the 1B scale, the best discovered models improve downstream accuracy over Llama 3.2 by 2.4% and 3.8%.
The paper frames the work as a step toward recursive self-improvement and reports that hybridizing attention, MLP, and Mamba primitives can improve compute efficiency and scaling behavior.

Agentic Discovery of Neural Architectures with AIRA-Compose and AIRA-Design

Introduction

This work presents a comprehensive study of agent-driven neural architecture search (NAS) and low-level mechanistic design targeting foundation models, specifically moving beyond the standard transformer paradigm. The proposed dual framework comprises AIRA-Compose (high-level architecture search) and AIRA-Design (low-level implementation and optimization), each leveraging ensembles of LLM-powered agents orchestrated through an agentic research harness. These approaches are motivated by the need to autonomously discover hybrid and non-obvious architectures in the exponentially growing design space for LLMs, as well as to facilitate recursive self-improvement (RSI) within the AI research workflow.

Agentic Architecture Search: AIRA-Compose

AIRA-Compose tasks 11 LLM agents with neural architecture design in a combinatorial search space defined by three computational primitives: Multi-Head Attention (mA), MLP, and Mamba SSM. Agents evaluate small, million-parameter proxy models (16 layers, an arrangement space of $10^4$ – $10^5$ candidates), and the best-performing patterns are scaled up to the 350M, 1B, and 3B parameter regimes. Candidate architectures are identified and refined within a 24-hour budget using iterative draft-debug-improve-analyze cycles, then aggregated and extrapolated using Composer-inspired procedures.
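The keep-if-better structure of an iterative improve cycle can be sketched in a few lines. This is only an illustration under loud assumptions: the real agents are LLMs that reason about each change, whereas the sketch uses random single-layer mutations, and `toy_proxy_loss` is a made-up stand-in for training a small proxy model. The labels `A`, `M`, `S` and all function names are hypothetical and do not come from the paper.

```python
import random

# Illustrative labels: A = attention, M = MLP, S = Mamba SSM.
PRIMITIVES = ("A", "M", "S")


def toy_proxy_loss(arch):
    """Made-up stand-in for training a proxy model; NOT the paper's metric.

    It simply penalizes adjacent layers that use the same primitive.
    """
    repeats = sum(1 for a, b in zip(arch, arch[1:]) if a == b)
    return 1.0 + 0.1 * repeats


def mutate(arch, rng):
    """Draft a new candidate by changing the primitive at one random layer."""
    i = rng.randrange(len(arch))
    choices = [p for p in PRIMITIVES if p != arch[i]]
    return arch[:i] + (rng.choice(choices),) + arch[i + 1:]


def greedy_search(num_layers=16, steps=200, seed=0):
    """Hill-climb over layer arrangements, keeping only strict improvements."""
    rng = random.Random(seed)
    best = tuple(rng.choice(PRIMITIVES) for _ in range(num_layers))
    best_loss = toy_proxy_loss(best)
    for _ in range(steps):
        candidate = mutate(best, rng)
        loss = toy_proxy_loss(candidate)
        if loss < best_loss:
            best, best_loss = candidate, loss
    return best, best_loss


best_arch, best_loss = greedy_search()
print("".join(best_arch), best_loss)
```

In a real harness the evaluation step would be a short training run, and the proposal step would be an agent editing the architecture with access to its earlier analysis.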

Empirical Results

Performance Superiority:

At the 1B scale and a fixed token budget, agent-discovered architectures consistently outperformed Llama 3.2 and Composer-found baselines. For instance, AIRAformer-D and AIRAhybrid-D improved accuracy over Llama 3.2 by 2.4% and 3.8%, respectively, across six downstream tasks.
Scaling analysis showed that attention-heavy variants (e.g., AIRAformer-C, AIRAformer-D) attain steep scaling slopes and favorable intercepts in isoFLOP analysis, indicating better compute efficiency. AIRAformer-C scaled 54% faster than Llama 3.2 and 71% faster than the best Composer-found Transformer, while AIRAhybrid-C outscaled Nemotron-2 by 23% and the best Composer-found hybrid by 37%.
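To make "scaling slope" and "intercept" concrete, the sketch below fits a straight line to log(loss) against log(FLOPs) and compares two fitted slopes. The summary does not state how the "X% faster" figures are computed, so reading them as a relative difference in slope is an assumption. The data points are synthetic, generated from exact power laws, and are not results from the paper.

```python
import math


def fit_loglog_slope(flops, losses):
    """Least-squares fit of log(loss) on log(FLOPs); returns (slope, intercept)."""
    xs = [math.log(f) for f in flops]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic compute budgets and losses (illustration only).
flops = [1e18, 1e19, 1e20]
baseline = [10.0 * f ** -0.050 for f in flops]
candidate = [12.0 * f ** -0.055 for f in flops]

b_slope, b_intercept = fit_loglog_slope(flops, baseline)
c_slope, c_intercept = fit_loglog_slope(flops, candidate)

# Assumed reading of "scales faster": relative increase in slope magnitude.
relative = (abs(c_slope) - abs(b_slope)) / abs(b_slope)
print(f"baseline slope {b_slope:.3f}, candidate slope {c_slope:.3f}")
print(f"candidate slope is {relative:.0%} steeper")
```

A steeper negative slope means loss falls faster as compute grows, while the intercept sets the loss level at a reference budget, so both matter when comparing architectures.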

Architecture Diversity:

The agentic search identified 14 architectures in two families, AIRAformers (Transformer-based) and AIRAhybrids (hybrid Transformer–Mamba), marked by nontrivial interleaving of primitives.
Pareto analysis of latency versus validation loss showed that AIRAhybrids (notably the D variants) advance the efficiency frontier relative to all baselines at the 1B scale.
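A Pareto analysis of this kind keeps only the models that no other model beats on both axes at once. The following minimal sketch shows the dominance check for latency and validation loss, where lower is better for both. The model names and numbers are hypothetical placeholders, not measurements from the paper.

```python
def pareto_front(points):
    """Return names of models not dominated on (latency, validation loss)."""
    front = []
    for name, latency, loss in points:
        dominated = any(
            other_lat <= latency
            and other_loss <= loss
            and (other_lat < latency or other_loss < loss)
            for _, other_lat, other_loss in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, latency in ms, validation loss) triples.
models = [
    ("model_a", 10.0, 2.50),
    ("model_b", 12.0, 2.40),
    ("model_c", 12.5, 2.45),  # slower and worse than model_b
    ("model_d", 15.0, 2.35),
]
front = pareto_front(models)
print(front)
```

A model "advances the frontier" when adding it to the set removes at least one previous frontier member or fills a region where no earlier model was optimal.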

Methodological Nuances

Agents are not limited to simple search heuristics; they explore designs in a semantically informed way and draw on prior architectural knowledge. In three-primitive setups they evaluate less than 0.01% of the combinatorial search space, which highlights the sample efficiency and hypothesis-driven nature of agentic search.
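The coverage figure can be put in absolute terms with a short calculation. The sketch assumes that each of the 16 proxy layers independently picks one primitive with no further constraints; the paper may restrict the space differently, so the counts are indicative only.

```python
NUM_LAYERS = 16  # proxy depth quoted in the summary


def space_size(num_primitives, num_layers=NUM_LAYERS):
    """Count layer arrangements when every layer picks exactly one primitive."""
    return num_primitives ** num_layers


two_primitive = space_size(2)    # e.g. attention and MLP only
three_primitive = space_size(3)  # attention, MLP, and Mamba SSM

# Upper bound on evaluated candidates at 0.01% coverage of the larger space.
budget = int(three_primitive * 0.0001)

print(two_primitive, three_primitive, budget)
```

Under this assumption, the two-primitive space has 65,536 arrangements, which falls inside the $10^4$ – $10^5$ range quoted for the arrangement space, and the three-primitive space has about 4.3 × 10^7, so 0.01% coverage corresponds to at most a few thousand evaluated candidates.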

Mechanistic Design and Optimization: AIRA-Design

AIRA-Design introduces agentic low-level code synthesis for two classes of challenges: constructing novel efficient attention mechanisms (targeting Long Range Arena; LRA) and optimizing training scripts for small LMs under fixed-time budgets (targeting Autoresearch).

Long Range Arena (LRA)

Multiple agents implemented and optimized sub-quadratic attention mechanisms compatible with JAX/Flax infrastructure. On LRA sequence modeling benchmarks (ListOps, IMDb/Text, Retrieval), the best agent-generated models came within 2.3% (document matching, i.e. Retrieval) and 2.6% (text classification, i.e. IMDb/Text) of the human-designed state of the art.
Analysis revealed that strong agents successfully adapted and recombined established primitives (e.g., Performer, Longformer) and produced competitive solutions across multiple LRA tasks. However, fundamentally novel attention mechanisms were not observed, indicating the current ceiling of LLM-powered agentic mechanistic innovation.
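As a concrete instance of the sub-quadratic family mentioned above, the sketch below implements sliding-window (local) attention, the core idea behind Longformer-style models: each position attends only to neighbors within a fixed window, so cost grows as O(n · w) instead of O(n²). It is written in plain Python for readability rather than in JAX/Flax, and it is not one of the agent-generated mechanisms; the function name and test inputs are illustrative.

```python
import math


def local_attention(q, k, v, window):
    """Single-head attention where position i sees only |i - j| <= window."""
    n, d = len(q), len(q[0])
    dv = len(v[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against the local neighborhood only.
        scores = [
            sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[lo + idx][t] for idx, w in enumerate(weights)) / total
            for t in range(dv)
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
v = [[1.0], [2.0], [3.0], [4.0]]

print(local_attention(q, k, v, window=1))
```

With `window=0` every position attends only to itself and the output equals `v`; with a window covering the whole sequence the function reduces to ordinary full attention.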

Autoresearch: Training Script Optimization

Under a fixed wall-clock training budget, agents optimized a GPT nanochat language modeling script; the best agent (Greedy Opus 4.5) achieved a validation bits-per-byte (BPB) of 0.968, below the published minimum.
Giving agents access to literature and reference code changed their search strategies and improved convergence for some agents, though not all. Because each agent step regenerates the full file, a single step typically bundles several modifications, which complicates ablation and attribution of performance gains.
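Bits-per-byte normalizes language modeling loss by the size of the raw text rather than by token count, which makes scores comparable across tokenizers. The conversion below is the standard one: sum the per-token negative log-likelihood in nats, convert nats to bits by dividing by ln 2, and divide by the number of UTF-8 bytes. The example numbers are invented for illustration and are unrelated to the 0.968 result.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert summed negative log-likelihood (nats) into bits per byte."""
    return total_nll_nats / (math.log(2) * num_bytes)


def bpb_from_mean_loss(mean_loss_nats, num_tokens, num_bytes):
    """Same conversion, starting from the usual mean per-token loss."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


# Invented example: 250 tokens covering 1000 bytes at 2.7 nats per token.
example = bpb_from_mean_loss(2.7, num_tokens=250, num_bytes=1000)
print(f"{example:.3f} bits per byte")
```

Because lower BPB is better, a greedy agent in this setting would keep a regenerated training script only when its validation BPB falls below the best value seen so far.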

Theoretical and Practical Implications

Recursive Self-Improvement

The two frameworks take a concrete step toward agentic recursive self-improvement for neural architectures. Agents do not merely tune known configurations; they explore architectural, algorithmic, and optimization changes within a flexible research harness. This reduces the human-in-the-loop bottleneck and the narrow exploration that comes from relying on human intuition.

Hybrid Model Proliferation

AIRA-Compose strengthens the evidence that hybridizing attention, MLP, and SSM primitives yields architectures with superior scaling properties and compute efficiency. The observed scaling behaviors suggest that agentic search can surpass both fixed, hand-designed architectures and conventional NAS approaches for foundation model development.

Engineering versus Scientific Innovation

While agent-driven code synthesis in AIRA-Design attained SOTA-adjacent results on LRA and Autoresearch, genuine algorithmic innovation remains elusive. Agents excelled at synthesizing and adapting prior exemplars but rarely produced fundamentally new algorithms. This marks a current limitation and points to open-ended agentic discovery as a direction for future research.

Benchmarking and Generalizability

All tasks conform to the AIRS-BENCH specification, facilitating consistent and modular benchmarking. The agentic templates are LLM-agnostic and harness-agnostic, supporting rapid extension to new NN search spaces, domains, and agent architectures.

Limitations and Forward Directions

Proxy–Target Performance Gap: The present reliance on small-scale proxies in AIRA-Compose introduces nontrivial extrapolation risk when scaling designs.
Search Space Coverage: Agents currently optimize over primitive arrangement with fixed hyperparameters; further opening the search space to normalization and architectural hyperparameters is needed to fully leverage their capabilities.
Agentic Aggregation: The aggregation and scaling-up steps remain non-agentic and represent a natural target for further research to realize end-to-end agentic discovery pipelines.
Mechanistic Novelty: LLM agents require architectural advances in code reasoning and search policy (e.g., integrating persistent state, tool-use, and robust debugging) to transcend engineering synthesis and approach genuine algorithmic discovery.
Generalization Beyond Provided Context: While code and literature augmentation conferred gains, systematic integration and validation of research knowledge during agentic optimization remains an open challenge.

Conclusion

AIRA-Compose and AIRA-Design show that agentic frameworks can autonomously discover neural architectures and algorithmic optimizations that match or surpass established baselines, both in high-level architecture search and in low-level mechanistic design. The methods provide a scalable, modular foundation for work toward recursive self-improvement and can be extended to other AI research domains. However, the boundary between engineering synthesis and scientific innovation has not yet been crossed; further advances in agentic reasoning, scaffolding, and interactive learning will be needed to move agent-based systems toward open-ended, autonomous AI research.