The FM Agent

Published 30 Oct 2025 in cs.AI | (2510.26144v1)

Abstract: LLMs are catalyzing the development of autonomous AI research agents for scientific and engineering discovery. We present FM Agent, a novel and general-purpose multi-agent framework that leverages a synergistic combination of LLM-based reasoning and large-scale evolutionary search to address complex real-world challenges. The core of FM Agent integrates several key innovations: 1) a cold-start initialization phase incorporating expert guidance, 2) a novel evolutionary sampling strategy for iterative optimization, 3) domain-specific evaluators that combine correctness, effectiveness, and LLM-supervised feedback, and 4) a distributed, asynchronous execution infrastructure built on Ray. Demonstrating broad applicability, our system has been evaluated across diverse domains, including operations research, machine learning, GPU kernel optimization, and classical mathematical problems. FM Agent reaches state-of-the-art results autonomously, without human interpretation or tuning: 1976.3 on ALE-Bench (+5.2%), 43.56% on MLE-Bench (+4.0pp), and up to 20x speedups on KernelBench. It also establishes new state-of-the-art (SOTA) results on several classical mathematical problems. Beyond academic benchmarks, FM Agent shows considerable promise for both large-scale enterprise R&D workflows and fundamental scientific research, where it can accelerate innovation, automate complex discovery processes, and deliver substantial engineering and scientific advances with broader societal impact.


Summary

The paper introduces a general-purpose framework that synergizes LLM-based reasoning with large-scale evolutionary search to autonomously drive scientific and engineering discovery.
It details a two-stage process—Cold Start and Evolve—with adaptive sampling, distributed asynchronous evaluation, and optional expert feedback to robustly optimize diverse tasks.
Key results include competitive performance on ML benchmarks with a 96.89% valid submission rate, improved combinatorial optimization scores, GPU kernel speedups of up to 20.77×, and new best-known solutions to mathematical problems.

FM Agent: A General-Purpose Multi-Agent Evolutionary Framework for Autonomous Scientific and Engineering Discovery

Introduction and Motivation

The FM Agent framework addresses the challenge of automating complex scientific and engineering discovery by integrating LLM-based reasoning with large-scale evolutionary search. The system is designed to operate across diverse domains, including machine learning, combinatorial optimization, GPU kernel generation, and mathematical problem solving. The core innovation lies in the synergistic orchestration of multi-agent LLMs, adaptive evolutionary strategies, domain-specific evaluators, and a distributed asynchronous infrastructure.

Figure 1: Workflow of the FM Agent system when tackling a complex algorithmic problem.

System Architecture

FM Agent is structured as a two-stage process: a Cold Start Stage and an Evolve Stage. The Cold Start Stage leverages multiple generative agents to rapidly construct a diverse, high-quality initial solution pool, optionally incorporating expert-in-the-loop guidance. This is followed by the Evolve Stage, which partitions the initial solutions into islands and applies large-scale, population-based evolutionary search with adaptive diversity-driven sampling and periodic inter-island communication.
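The Evolve Stage described above can be illustrated with a toy island model. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `evolve_islands`, the tournament selection, the replace-the-worst rule, and the ring migration topology are all hypothetical choices made for illustration, and candidate solutions are plain numbers rather than LLM-generated programs.

```python
import random


def evolve_islands(initial_pool, fitness, mutate, n_islands=4,
                   generations=50, migrate_every=10, seed=0):
    """Toy island-model search (illustrative; not FM Agent's actual algorithm)."""
    if len(initial_pool) < n_islands:
        raise ValueError("need at least one initial solution per island")
    rng = random.Random(seed)
    # Partition the cold-start pool round-robin into islands.
    islands = [list(initial_pool[i::n_islands]) for i in range(n_islands)]

    def worst_index(island):
        return min(range(len(island)), key=lambda j: fitness(island[j]))

    for gen in range(1, generations + 1):
        for island in islands:
            # Tournament of up to two members picks the parent.
            parent = max(rng.sample(island, k=min(2, len(island))), key=fitness)
            child = mutate(parent, rng)
            # The child replaces the island's worst member only if it is better.
            w = worst_index(island)
            if fitness(child) > fitness(island[w]):
                island[w] = child
        if gen % migrate_every == 0:
            # Periodic ring migration: island i receives the best of island i-1.
            bests = [max(island, key=fitness) for island in islands]
            for i, island in enumerate(islands):
                incoming = bests[(i - 1) % n_islands]
                w = worst_index(island)
                if fitness(incoming) > fitness(island[w]):
                    island[w] = incoming
    return max((s for island in islands for s in island), key=fitness)


if __name__ == "__main__":
    pool_rng = random.Random(1)
    pool = [pool_rng.uniform(-10, 10) for _ in range(16)]
    best = evolve_islands(pool,
                          fitness=lambda x: -(x - 3.0) ** 2,
                          mutate=lambda x, rng: x + rng.gauss(0, 0.5))
    print(best)
```

Because a member is only ever replaced by something strictly better than the island's worst, the best score in the population never decreases; migration is what lets a good solution found on one island influence the others.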

Figure 2: Framework of FM Agent with the Cold Start Stage and the Evolve Stage, both of which contribute to the final performance.

The distributed execution is orchestrated via Ray, enabling asynchronous, high-throughput evaluation and synthesis across a large cluster.
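The asynchronous pattern can be sketched on a single machine with the Python standard library. This is only a stand-in under stated assumptions: FM Agent uses Ray across a cluster, whereas the hypothetical `evaluate_async` helper below uses a local thread pool; in Ray, remote tasks and `ray.wait` would play the analogous role of submitting work and consuming results as they finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_async(candidates, evaluate, max_workers=4):
    """Score candidates concurrently, handling each result as soon as it is ready.

    Returns a list of scores aligned with `candidates`. A candidate whose
    evaluation raises is given a score of -inf instead of aborting the batch.
    """
    scores = [None] * len(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the candidate it evaluates.
        futures = {pool.submit(evaluate, c): i for i, c in enumerate(candidates)}
        for fut in as_completed(futures):  # completion order, not submission order
            i = futures[fut]
            try:
                scores[i] = fut.result()
            except Exception:
                scores[i] = float("-inf")
    return scores


if __name__ == "__main__":
    print(evaluate_async([1, 2, 3], lambda x: x * x))
```

Consuming results in completion order means a slow evaluation does not block faster ones, which is the property that matters when candidate programs have very different runtimes.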

Figure 3: Architecture of the Large-Scale Distributed Evolutionary Cluster.

A human-interactive feedback module is optionally available, supporting real-time expert intervention and knowledge base integration via RAG, further enhancing the system's adaptability and interpretability.

Domain Applications

Machine Learning Engineering

FM Agent automates the end-to-end ML workflow, including feature mining, feature combination, model fusion, and pipeline construction. The system demonstrates the ability to autonomously discover high-value features and construct competitive models, reducing the need for manual intervention.

Figure 4: Performance of agents on MLE-Bench, measured by medal rate (%), evaluating FM Agent across real-world machine learning tasks sourced from Kaggle competitions.

On MLE-Bench, FM Agent achieves a valid submission rate of 96.89%, surpasses the median human submission in 51.56% of tasks, and attains a gold medal rate of 22.67%, outperforming all other agents on the leaderboard, especially on medium and high complexity tasks.

Combinatorial Optimization

FM Agent is applied to NP-hard combinatorial optimization problems, autonomously designing novel heuristics, augmenting existing solvers, and directly constructing high-quality solutions. The evolutionary search is capable of discovering strategies that diverge from conventional human designs, as evidenced by performance on ALE-Bench.

Figure 5: Performance of agents on ALE-Bench Lite, showing the SOTA capability of FM Agent on challenging heuristic-driven tasks from AtCoder competitions.

FM Agent achieves a mean overall score of 1976.3 on ALE-Bench Lite, exceeding the specialized ALE-Agent by 5.2% and the iterative refinement baseline by 64.6%. It reaches the expert "Yellow" tier on 50% of tasks, indicating robust long-horizon optimization capabilities.

GPU Kernel Generation

The system reformulates kernel optimization as an autonomous, data-driven process, iteratively generating and evaluating CUDA kernels. FM Agent achieves 2.08× to 20.77× speedups over torch.compile on KernelBench, maintaining strict numerical accuracy and outperforming both agentic and RL-based SOTA baselines.
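An evaluator of the kind this loop relies on has two parts: a correctness gate and a speedup measurement. The sketch below is an assumption-laden illustration in plain Python, not the paper's evaluator: `evaluate_candidate`, its tolerances, and the best-of-N timing scheme are hypothetical, and ordinary Python functions stand in for CUDA kernels and the torch.compile baseline.

```python
import math
import time


def evaluate_candidate(candidate, reference, inputs, rel_tol=1e-6, repeats=5):
    """Return the candidate's speedup over the reference, or 0.0 if outputs differ."""
    # Correctness gate: every output element must match within tolerance.
    for x in inputs:
        expected, actual = reference(x), candidate(x)
        if len(expected) != len(actual):
            return 0.0
        if not all(math.isclose(e, a, rel_tol=rel_tol, abs_tol=1e-9)
                   for e, a in zip(expected, actual)):
            return 0.0

    def best_time(fn):
        # Best-of-N wall-clock time over all inputs, to damp timing noise.
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for x in inputs:
                fn(x)
            best = min(best, time.perf_counter() - start)
        return best

    # Speedup > 1 means the candidate is faster than the reference.
    return best_time(reference) / max(best_time(candidate), 1e-12)


if __name__ == "__main__":
    data = [[float(i) for i in range(1000)]]
    reference = lambda xs: [v * 2.0 for v in xs]
    wrong = lambda xs: [v * 2.0 + 1.0 for v in xs]
    print(evaluate_candidate(wrong, reference, data))      # rejected by the gate
    print(evaluate_candidate(reference, reference, data))  # roughly parity
```

Scoring incorrect candidates as 0.0 keeps a fast but wrong kernel from ever outranking a correct one during search.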

Figure 6: Comparison of speedup achieved relative to torch.compile. The dashed line at 1 indicates parity with torch.compile.

Mathematical Problem Solving

FM Agent demonstrates the ability to autonomously discover near-optimal solutions to open mathematical problems, such as circle packing, uncertainty inequalities, and geometric minimization tasks. The system outperforms AlphaEvolve and previous best-known results on several benchmarks.

Figure 7: Final Solution for the 26-Circle Packing Problem
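A candidate packing like the one in Figure 7 can be checked mechanically. The sketch below assumes the commonly used formulation of this problem, which the summary does not spell out: circles must lie inside the unit square without overlapping, and the objective is the sum of their radii. The function name `packing_score` and the tolerance are hypothetical.

```python
import math


def packing_score(circles, tol=1e-12):
    """Return the sum of radii if `circles` is a valid packing, else None.

    `circles` is a list of (x, y, r) tuples. Assumed validity rules: every
    circle has a positive radius, lies inside the unit square [0, 1] x [0, 1],
    and no two circles overlap (touching is allowed).
    """
    for i, (x, y, r) in enumerate(circles):
        inside = (x - r >= -tol and x + r <= 1 + tol and
                  y - r >= -tol and y + r <= 1 + tol)
        if r <= 0 or not inside:
            return None
        for x2, y2, r2 in circles[i + 1:]:
            # Overlap if the centres are closer than the sum of the radii.
            if math.hypot(x - x2, y - y2) < r + r2 - tol:
                return None
    return sum(r for _, _, r in circles)


if __name__ == "__main__":
    grid = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25),
            (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
    print(packing_score(grid))
```

Returning None for invalid packings separates the correctness check from the objective, so a search loop can discard infeasible candidates before comparing scores.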

Case Studies

Automated Feature Engineering

On the American Express Default Prediction task, FM Agent incrementally constructs feature sets that yield continuous performance improvements, with a cumulative score increase of +0.003 under a fixed downstream model. The agent autonomously discovers non-trivial temporal and risk-aware features, demonstrating the effectiveness of evolutionary feature mining.
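The idea of growing a feature set under a fixed downstream model can be sketched as a greedy accept-if-better loop. This is a simplified illustration, not the agent's actual procedure: `incremental_feature_search` is a hypothetical name, `score_fn` stands in for training and scoring the fixed model, and the feature names and score increments in the usage example are invented toy values.

```python
def incremental_feature_search(candidates, score_fn, base_features=()):
    """Greedily keep each candidate feature only if it improves the score.

    `score_fn` takes a list of feature names and returns the evaluation score
    of a fixed downstream model trained on those features (higher is better).
    Returns the selected features and the history of accepted scores.
    """
    selected = list(base_features)
    best = score_fn(selected)
    history = [best]
    for feature in candidates:
        trial = selected + [feature]
        score = score_fn(trial)
        if score > best:  # accept only strict improvements
            selected, best = trial, score
            history.append(best)
    return selected, history


if __name__ == "__main__":
    # Toy stand-in for a real model: each feature adds a fixed increment.
    gain = {"feat_a": 0.002, "feat_b": -0.001, "feat_c": 0.001}
    toy_score = lambda feats: 0.79 + sum(gain[f] for f in feats)
    print(incremental_feature_search(["feat_a", "feat_b", "feat_c"], toy_score))
```

Holding the downstream model fixed means any score change can be attributed to the feature set alone, which is what makes the convergence curve interpretable.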

Figure 8: Convergence of the evaluation score on the American Express Default Prediction task, where the score is the task’s original metric and higher values indicate better performance.

Kernel Optimization in Production

In optimizing the CosyVoice2-0.5B Flow Matching Decoder, FM Agent rapidly identifies operator fusion and loop unrolling strategies for simple kernels, and persistently explores tiling and shared memory strategies for complex GEMM kernels, achieving rapid and sustained speedup improvements.

Figure 9: Speedup convergence of kernels in the CosyVoice2-0.5B Flow Matching Decoder against the official PyTorch-based implementation. FeedForward (fusion) and SinusoidalPosEmb (unrolling) converge quickly, while TimestepEmbedding shows slower convergence as it requires exploration of shared memory tiling.

Mathematical Discovery

FM Agent achieves new SOTA results on the 26-circle packing problem, uncertainty inequalities, and geometric minimization, demonstrating the capacity to integrate domain knowledge, symbolic reasoning, and evolutionary search.

Ablation Studies

Comprehensive ablation studies on the ahc016 task confirm that each architectural component—adaptive sampling, cold start, and island model—contributes significantly to performance. The full system achieves the highest final score, with adaptive sampling yielding a 10.99% improvement over top-k sampling and a 58.26% improvement over random sampling. FM Agent also demonstrates faster convergence and higher robustness compared to open-source baselines.
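The three sampling strategies compared in this ablation can be contrasted with a short sketch. Top-k and random sampling are standard; the summary does not specify the adaptive rule, so `adaptive_sample` below is a hypothetical stand-in that mixes the two based on how spread out the population's scores are. It should not be read as the paper's sampling strategy.

```python
import random


def top_k_sample(population, fitness, k, rng):
    """Exploit: pick a parent uniformly from the k highest-scoring members."""
    elite = sorted(population, key=fitness, reverse=True)[:k]
    return rng.choice(elite)


def random_sample(population, fitness, k, rng):
    """Explore: pick a parent uniformly from the whole population."""
    return rng.choice(population)


def adaptive_sample(population, fitness, k, rng):
    """Hypothetical adaptive rule: explore more when score diversity is low."""
    scores = [fitness(p) for p in population]
    spread = max(scores) - min(scores)
    scale = max(abs(s) for s in scores) or 1.0
    diversity = min(1.0, spread / scale)  # 0 = collapsed, 1 = widely spread
    if rng.random() < 1.0 - diversity:
        return random_sample(population, fitness, k, rng)
    return top_k_sample(population, fitness, k, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    population = [1, 2, 3, 4, 5]
    score = lambda x: x
    print(top_k_sample(population, score, 2, rng))
    print(adaptive_sample(population, score, 2, rng))
```

Pure top-k sampling risks premature convergence and pure random sampling wastes evaluations on weak parents; an adaptive rule tries to shift between the two as the search progresses.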

Figure 10: Ablation study results of FM Agent on the ahc016 task. Each curve displays the performance of a different experimental setting, averaged over five independent runs (where a higher combined score is better). The shading indicates the standard deviation.

Figure 11: Left: Comparison of different sampling methods on the ahc016 task. Right: Comparison with open-source baselines on the ahc016 task. Each curve displays the performance of a different experimental setting, averaged over five independent runs (where a higher combined score is better). The shading indicates the standard deviation.

Implications and Future Directions

FM Agent demonstrates that the integration of LLM-based reasoning with large-scale evolutionary search and domain-specific evaluation can autonomously achieve and surpass SOTA performance across a wide range of scientific and engineering domains. The system's architecture is highly modular and extensible, supporting both fully autonomous operation and expert-in-the-loop augmentation. The results suggest that such agentic frameworks can serve as general-purpose research accelerators, automating complex discovery processes and reducing reliance on human expertise.

Theoretical implications include the validation of open-ended, multi-agent evolutionary paradigms as a viable approach for automating algorithmic and scientific innovation. Practically, FM Agent's distributed, asynchronous infrastructure and adaptive sampling strategies provide a blueprint for scaling agentic research systems to industrial and scientific workloads.

Future developments may focus on tighter integration of symbolic reasoning, more sophisticated knowledge retrieval and transfer mechanisms, and further scaling of distributed evolutionary computation. The demonstrated ability to autonomously discover novel algorithms, optimize system-level performance, and solve open mathematical problems positions FM Agent as a foundational system for the next generation of AI-driven scientific discovery.

Conclusion

FM Agent establishes a robust, general-purpose framework for autonomous research and engineering, combining LLM-driven reasoning, evolutionary search, and scalable distributed infrastructure. The system achieves strong empirical results across machine learning, combinatorial optimization, kernel generation, and mathematics, with ablation studies confirming the efficacy of its architectural components. FM Agent's design and results have significant implications for the automation of scientific discovery and the development of self-improving AI research agents.
