optimize_anything: A Universal API for Optimizing any Text Parameter
Abstract: Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa .
Explain it Like I'm 14
What this paper is about
The paper introduces "optimize_anything," a tool that helps an AI improve almost any text-based blueprint (like code, prompts, or even a full agent's design) by trying ideas, getting scored, reading feedback, and then making smarter versions. The big idea is simple: if you can write something as text and build a way to score it, this system can help optimize it.
The main questions the paper asks
- Can one AI-based system improve very different kinds of things (like GPU code, math prompts, or puzzle-solving agents) using the same method?
- Is it better to learn from several related tasks at once rather than solving each one alone?
- Does detailed feedback (not just a number score) help the AI learn faster and do better?
- Can the system create solutions that work well on brand-new, unseen examples?
How the system works (in everyday terms)
Think of the process like revising a school project with a helpful teacher:
- You start with a draft (a "text artifact" like code, a prompt, or an agent's rules).
- You hand it to a strict judge (the "evaluator") that:
  - Gives a score ("How good is this?").
  - Gives feedback notes ("What went wrong?"), called Side Information (SI).
- An AI "proposer" reads the notes and writes a better draft.
- Repeat until you get strong results.
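The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the optimize_anything implementation: the names `optimize`, `toy_evaluate`, and `toy_propose` are hypothetical, the "artifact" is just a string that should match a target, and a simple rule-based function stands in for the AI proposer.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical types for this sketch: side information (SI) is a plain dict.
SideInfo = Dict[str, Any]
Evaluator = Callable[[str], Tuple[float, SideInfo]]
Proposer = Callable[[str, float, SideInfo], str]


def optimize(seed: str, evaluate: Evaluator, propose: Proposer,
             budget: int = 10) -> Tuple[str, float]:
    """Draft -> judge -> read notes -> rewrite, keeping the best draft so far."""
    best = seed
    best_score, best_si = evaluate(seed)
    for _ in range(budget):
        candidate = propose(best, best_score, best_si)
        score, si = evaluate(candidate)
        if score > best_score:  # keep only strict improvements
            best, best_score, best_si = candidate, score, si
    return best, best_score


TARGET = "hello world"  # toy goal, an assumption of this example


def toy_evaluate(candidate: str) -> Tuple[float, SideInfo]:
    """Score = fraction of matching characters; SI = first wrong position."""
    padded = candidate.ljust(len(TARGET))[: len(TARGET)]
    matches = sum(a == b for a, b in zip(padded, TARGET))
    for i, (a, b) in enumerate(zip(padded, TARGET)):
        if a != b:
            return matches / len(TARGET), {"first_mismatch": i, "expected": b}
    return 1.0, {}


def toy_propose(candidate: str, score: float, si: SideInfo) -> str:
    """Stand-in for the AI proposer: apply the fix that the SI points to."""
    if "first_mismatch" not in si:
        return candidate
    i = si["first_mismatch"]
    padded = candidate.ljust(i + 1)
    return padded[:i] + si["expected"] + padded[i + 1:]


best, best_score = optimize("jello wurld", toy_evaluate, toy_propose, budget=5)
print(best, best_score)
```

Because the judge says where the draft is wrong, each round makes a targeted fix instead of a blind guess. In the real system the proposer is a language model reading much richer notes.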
Here's what makes it practical:
- One simple interface for many problem types: If it can be represented as text (code, policy rules, prompts), this system can try to improve it.
- Three ways to search for better solutions:
  - Single-task: Improve one thing (like one circle-packing program).
  - Multi-task: Improve several related things together, so tricks learned on one can help the others (like many GPU kernels).
  - Generalization: Improve one artifact so it works well on new, unseen examples (like prompts or agents tested on new problems).
- Side Information (SI) is like a teacher's margin notes: not just "You got 70%," but "Line 12 has a bug," or "This part is slow," or even a picture showing what went wrong.
- Pareto-based selection is like keeping a team of specialists, not only the one with the best average: the system keeps candidates that are the best in at least one area (e.g., fastest, most accurate, least costly), so different strengths can be combined over time.
- "Seedless" mode can start from a plain English goal if you don't have any starting code or prompt.
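The "team of specialists" idea can be made concrete with a small sketch. It assumes the simplified rule stated above (keep every candidate that is best in at least one area), which is not a full Pareto non-dominated sort and is not taken from the optimize_anything code. The candidate names and scores are made up for illustration.

```python
from typing import Dict, List


def specialists_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Keep every candidate that has the top score on at least one metric."""
    names = list(scores)
    n_metrics = len(next(iter(scores.values())))
    keep: List[str] = []
    for j in range(n_metrics):
        top = max(scores[name][j] for name in names)
        for name in names:
            if scores[name][j] == top and name not in keep:
                keep.append(name)
    return keep


# Hypothetical scores per candidate: [speed, accuracy, cost-efficiency].
scores = {
    "fast":     [0.90, 0.50, 0.40],
    "accurate": [0.40, 0.95, 0.50],
    "cheap":    [0.30, 0.50, 0.90],
    "mediocre": [0.60, 0.60, 0.60],
}

frontier = specialists_frontier(scores)
best_by_average = max(scores, key=lambda n: sum(scores[n]) / len(scores[n]))
print(frontier, best_by_average)
```

Picking only by average would keep a single candidate and discard the fastest and cheapest ones. The frontier keeps all three specialists, so a later proposal can try to combine their strengths.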
What they tested and what they found
The authors tried the system on very different tasks and reported strong results:
- Agent architecture for ARC-AGI puzzles (like pattern and logic puzzles): Evolved a simple agent into a 4-stage system and raised accuracy from 32.5% to 89.5%, nearly tripling it.
- Cloud scheduling (deciding how to run jobs cheaply without missing deadlines): Cut costs by up to about 40% by inventing smarter, provider-aware strategies.
- GPU code (CUDA kernels) for speeding up PyTorch operations: 87% of generated kernels matched or beat the baseline speed; many were 10–20% faster.
- Math contest prompting (AIME): A better prompt boosted a small model's score from 46.67% to 60.00% on a new test year, beating a known prompt-optimization method.
- Circle packing (fitting circles in a square without overlaps): Beat a previous system's best result on a well-known instance (n=26), and did it with fewer tries.
- Images via SVG/CAD: Produced designs that human judges preferred over basic attempts, with much higher "quality" scores from a vision model.
Two extra takeaways stood out:
- Detailed feedback (SI) matters a lot: Across multiple tests, having actionable notes (like compiler errors or per-aspect scores) made learning 4–6× faster and led to better final scores than just giving a single number.
- Learning across tasks helps when tasks share patterns: In GPU code, optimizing many related kernels together outperformed solving each alone with the same budget. However, for unrelated tasks (like circle packing with different numbers of circles), multi-task learning didn't help and could even add noise.
Why these results are important
- One method, many domains: Instead of building a different optimizer for each specialty (prompts, code, agents, images), you can use the same "write text → score → read feedback → improve" loop everywhere.
- Faster progress with better feedback: Turning "You failed" into "Here's exactly why you failed" lets the AI propose targeted fixes, much like a student improves faster with specific teacher comments.
- Practical gains: Cheaper cloud runs, faster GPU code, smarter agents, and better prompts are immediately useful in real systems.
- Accessible experimentation: Because artifacts are just text, the same interface works across tasks. Non-experts in optimization can still contribute by writing good evaluators that return helpful feedback.
What this could mean going forward
If many problems can be expressed as "improve this text and score it," then a single, open-source tool can accelerate progress across fields (coding, systems, science, and design) without custom-made optimizers for each one. That said, results depend on the quality of the AI and the evaluator, and evaluations can be costly. Still, the paper shows a promising path: using detailed feedback and a diverse pool of strong candidates, one unified system can find creative, high-performing solutions across very different kinds of challenges.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of concrete gaps and unresolved questions that future work could address to strengthen or extend the claims in the paper.
- Reproducibility with open models: Most results rely on closed, premium models (e.g., GPT-5, Gemini 3, Claude 4.6). It remains unclear how well the system performs with strong open-source LLMs/VLMs (e.g., Llama, Qwen, Phi) across all domains and modes.
- Proposer–executor model mismatch: Beyond the coding-skills transfer across two Claude variants, the paper does not systematically study how performance changes when the proposer differs from the runtime model (e.g., proposer = small open model, executor = larger or different family), or when the proposer and agent models are intentionally decoupled.
- Sensitivity to evaluator noise and nondeterminism: Optimization stability under noisy evaluators (e.g., VLM-based image scores, GPU benchmarking variance) is not quantified. There is no analysis of repeated trials, variance, or robustness to stochastic evaluators.
- Mis-specification and reward hacking: The system could exploit evaluator loopholes (e.g., reading ground truth from SI, hardcoding answers, test leakage, code-injection or unsafe syscalls). Methods to detect and prevent reward hacking and to sandbox evaluations are not presented.
- Safety and security of code execution: The framework executes generated code (e.g., exec(), CUDA kernels) but does not detail sandboxing, resource limits, network isolation, or permission controls to mitigate malicious or accidental harm.
- SI quality and failure modes: While SI improves performance, the effects of low-quality, misleading, incomplete, or adversarial SI are not studied. There is no guidance on SI design trade-offs or automated SI curation/summarization for long traces.
- Generalizable SI schemas: The "typed SI primitive" is proposed, but there is no systematic evaluation of reusable, cross-domain SI schemas or tools for auto-extracting SI from logs/telemetry with minimal domain engineering.
- Ablations on algorithmic components: The paper does not isolate the contribution of several core design choices, including:
  - Pareto frontier vs. simpler selection (e.g., top-k by average score).
  - Minibatch reflection (2–3 examples) vs. full-batch reflection.
  - The refiner step's effect on failure rate and sample efficiency.
  - Content-addressed caching vs. naive deduplication.
- Theoretical guarantees: There is no theoretical analysis of convergence, regret, or sample complexity for Pareto-based text optimization with SI, especially under noisy or non-stationary evaluators.
- Scaling limits of multi-task search: The approach is shown on up to ~200 tasks, but there is no study of scaling to thousands of tasks, frontier growth dynamics, memory/computation overheads, or strategies to keep the frontier tractable.
- Automatic detection of negative transfer: Multi-task search can hurt when tasks are independent (e.g., circle packing with different N). The framework does not include task-relatedness estimation, task clustering, or adaptive mode switching to avoid negative transfer.
- Overfitting to validation in generalization mode: Repeated validation evaluation can lead to implicit overfitting. Guardrails (e.g., early stopping, restricted val queries, nested splits) are not described or evaluated.
- CUDA kernel generality and baselines: Kernel results are reported on a V100 and 31 ops; portability to other GPUs/architectures (A100/H100/consumer GPUs), tensor shapes, and batch sizes is untested. Comparisons against specialized autotuners (e.g., TVM/Ansor, Triton autotuning) and across more comprehensive kernel suites are missing.
- Realism of cloud scheduling evaluation: Results are from ADRS-like benchmarks; real-world deployment factors (failures, forecasting errors, bursty workloads, fairness/SLAs, multi-tenant interference) and robustness to shifting distributions are not investigated.
- Constraint handling and multi-objective trade-offs: The framework uses Pareto selection but lacks a principled way to enforce hard constraints (e.g., deadlines, correctness) while optimizing other metrics, or to specify and tune trade-offs among competing objectives.
- ARC-AGI evaluation rigor: The 89.5% result is promising, but:
  - Baselines against strong agent frameworks and state-of-the-art ARC systems are not provided.
  - Potential training-data contamination or model priors are not discussed.
  - Robustness across alternative ARC splits or related benchmarks (e.g., ARC-DA, different train/val/test partitions) is not evaluated.
- Image evaluation validity: VLM-based scoring can be biased or gamed. Human evaluation used only five raters; no inter-rater reliability, larger blinded studies, or use of standardized perceptual metrics are reported.
- Seedless mode reliability: Only a single 3D modeling anecdote is presented. The consistency, success rate, and failure cases of seedless optimization across tasks remain unquantified.
- Cost and energy accounting: Reported dollar costs exclude detailed compute/energy metrics and do not decompose costs across propose vs. evaluate steps per domain, making it hard to predict operational expense at scale.
- Frontier diversity metrics: Although Pareto diversity is claimed to prevent premature convergence, there are no quantitative diversity measures (e.g., edit distances, behavioral diversity) or analyses of diversity–performance trade-offs.
- Backend-agnostic claims: The interface is said to be backend-agnostic, but experiments primarily use a GEPA-derived backend. Comparative studies with alternative optimizers (e.g., evolutionary strategies, Bayesian optimization over edits, gradient-based methods like TextGrad) are absent.
- Fair cross-framework comparisons: Apart from a controlled circle-packing run, most comparisons with prior systems are not performed under matched budgets, hardware, or identical evaluators/models, leaving fairness and effect sizes uncertain.
- Task grouping and curriculum in multi-task mode: How to select, order, or cluster tasks to maximize transfer is not explored. There is no study of curricula, task-weighting, or adaptive sampling.
- Handling continuous and non-text artifacts: The framework assumes text-serializable artifacts. Effectiveness and fidelity of text proxies for inherently continuous/binary artifacts (e.g., weights, bitstreams) are not evaluated.
- Robustness to long SI/logs: Strategies for truncation, summarization, or structured extraction from very long SI (e.g., deep traces, profiler logs) are not provided, raising questions about scalability of reflection prompts.
- Data governance and copyright: Generated code/algorithms may reflect training data. Licensing, attribution, and compliance implications are not addressed.
- Ethical considerations: Optimizing policies (e.g., cloud scheduling) can impact fairness, energy footprint, and cost distribution. The paper does not discuss ethical guardrails or auditing for unintended consequences.
- Automatic SI design: There is no method to learn or auto-suggest the most actionable SI for a given domain (e.g., via ablation-based attribution of SI components), which would reduce dependence on domain experts.
- Noisy benchmark protocols for kernels and agents: The number of runs, warm-up strategies, and statistical testing for performance measurements (e.g., GPU timing variance, flaky tests in agents) are not described.
- Stopping criteria and budget allocation: Guidelines for stopping conditions, adaptive budget allocation across tasks, and balancing exploration vs. exploitation are not specified.
- Transferability across hardware and environments: Apart from coding-skills transfer across two Claude variants, cross-hardware transfer (e.g., kernels across GPUs/CPUs) and cross-environment transfer (e.g., schedulers across cloud providers) are not empirically studied.
- Detecting and mitigating catastrophic proposals: The framework adds a โrefiner step,โ but there is no analysis of catastrophic proposal rates (e.g., broken kernels, infinite loops) or recovery strategies beyond simple filtering.
- Integration with human-in-the-loop feedback: While SI can include arbitrary feedback, the paper does not explore mechanisms for incorporating human ratings/edits efficiently (e.g., active learning, preference optimization) or measuring the gains from occasional human interventions.
Practical Applications
Immediate Applications
The paper's unified "text artifact + evaluator + side information (SI)" approach can be deployed now in many settings by wrapping existing workflows with evaluate(candidate) → (score, diagnostics) and letting optimize_anything iterate. Below are concrete, sector-linked use cases, including likely tools/workflows and key dependencies.
- Repository-specific coding "skills" for AI code assistants (Software/Dev Productivity)
- What: Evolve natural-language "skills" and best-practice snippets tailored to a codebase to boost AI coding agents' task completion and reduce time-to-fix (as demonstrated on Bleve).
- How: Integrate optimize_anything into a repoโs dev container; evaluator runs agent on task suites; SI captures tool-call traces, compile/test results, and error logs.
- Tools/workflows: VS Code/JetBrains extension; CI job that re-trains skills nightly; LangChain/DSPy pipeline with GEPA backend.
- Dependencies/assumptions: Stable task harness; representative task set; access to capable LLMs; sandboxed execution for safety.
- CI/CD failure triage and flaky-test reduction (Software Engineering)
- What: Optimize agent prompts/flows that diagnose and fix failing tests faster using SI (stack traces, logs).
- How: Evaluator replays failures in isolated environments; SI includes error signatures and coverage gaps; search evolves "triage playbooks."
- Tools/workflows: GitHub Actions/GitLab CI plugin; Pareto selection across failure categories.
- Dependencies: Reliable repro harness; log/trace collection; test isolation to avoid side effects.
- Enterprise prompt optimization for LLM apps (Software, Customer Support, Education)
- What: Automatically tune system prompts and few-shot exemplars to improve accuracy and reduce token cost (e.g., +13.3pp on AIME-2025).
- How: Evaluator grades task performance on held-out examples; SI includes sub-scores (reasoning steps, error types).
- Tools/workflows: "PromptOps" service integrated with LangSmith, DSPy, or in-house prompt store; A/B rollout gates.
- Dependencies: Ground-truth datasets or rubric-based auto-graders; cost budget for proposer/evaluator calls.
- Cloud cost/performance policy tuning (Cloud/DevOps)
- What: Discover scheduling policies for multi-cloud data transfers and spot/on-demand mix to cut costs (e.g., 40%+ for routing; 7.8% on spot policies).
- How: Evaluator simulates workloads; SI surfaces egress cost breakdowns, utilization timelines, SLA misses.
- Tools/workflows: Kubernetes scheduler plugin, Airflow policy tuner, Dataplane policy CRDs with nightly optimization runs.
- Dependencies: Realistic simulators or shadow-mode evaluation; access to price and availability traces; safe rollout strategy.
- GPU kernel auto-generation and tuning (ML Systems/HPC)
- What: Generate CUDA kernels that meet or beat PyTorch baselines (87% matched/faster) for custom ops.
- How: Evaluator compiles, verifies correctness vs. references, benchmarks performance; SI provides NVCC errors, profiler hints, speedup ratios.
- Tools/workflows: PyTorch/Triton plugin that swaps in optimized kernels; multi-task runs across operator sets; regression guardrails.
- Dependencies: Hardware access; correctness oracle; sandboxed compiler; allowances for driver/toolchain updates.
- Agent architecture optimization for task-specific assistants (Software/RPA)
- What: Auto-design multi-stage agent pipelines (e.g., analyze → code → verify → fallback) that generalize to new tasks (e.g., ARC-AGI jump from 32.5% → 89.5%).
- How: Evaluator runs agent on train/val tasks; SI includes per-submodule traces, error/fallback counts, cost metrics.
- Tools/workflows: "Agent Architect" workflow on top of DSPy; templated harnesses for customer support, research agents, or ETL automations.
- Dependencies: Robust task harness; budget for evaluation rollouts; secure code execution; red-team testing for safety.
- Hyperparameter and solver-heuristic evolution (ML/Optimization)
- What: Evolve solver code/heuristics and hyperparameter schedules that outperform generic HPO (paper reports wins vs. Optuna in a black-box suite).
- How: Evaluator returns objective values and subscores/violations; SI includes constraint diagnostics, convergence traces.
- Tools/workflows: Optuna/Weights & Biases integration with custom evaluator; nightly auto-tuner for pipelines.
- Dependencies: Deterministic evaluation or sufficient repeated trials; proper metric caching.
- SVG/CAD asset generation for design teams (Design/Marketing/Front-end)
- What: Produce on-brand icons, diagrams, and simple CAD shapes rated by a VLM on multiple visual aspects; human judges preferred the optimized outputs over the baseline.
- How: Evaluator renders candidates and queries a VLM; SI provides per-aspect scores and annotated diff screenshots.
- Tools/workflows: Figma/Onshape plugin; batch optimization across a design system with multi-task Pareto selection.
- Dependencies: VLM access; brand/style rubrics; rendering pipeline; IP review.
- Academic algorithm prototyping and discovery (Academia/Operations Research)
- What: Rapidly prototype algorithmic ideas for hard problems (e.g., circle packing bilevel solvers) using SI-guided code evolution.
- How: Evaluator scores objective and returns constraint/geometry diagnostics and images; SI drives targeted shifts (e.g., switch to LP/SLP).
- Tools/workflows: Jupyter/Colab integration; versioned frontier artifacts; artifact caching for reproducibility.
- Dependencies: Accurate simulators/objective functions; compute budget; willingness to audit generated methods.
- Multi-service policy patterns via multi-task search (Cross-cutting DevOps/ML)
- What: Co-optimize families of similar tasks (e.g., GPU ops, microservice autoscaling rules) to discover reusable patterns.
- How: Shared Pareto frontier across tasks; SI highlights per-task wins/failures; each task exports its own best artifact.
- Tools/workflows: Organization-wide "pattern frontier" for ops and ML teams; templated evaluators.
- Dependencies: Tasks must share underlying structure; evaluation isolation; governance for cross-service adoption.
- SI Instrumentation SDK for evaluators (Platform/Tooling)
- What: Standardize capturing compiler errors, profiling summaries, sub-scores, and images as SI to accelerate convergence (4–6× speed-ups shown).
- How: Lightweight Python/Go hooks to collate logs/metrics into a side_info dict; optional oa.Image for visuals.
- Tools/workflows: Shared org SDK; conventions for per-metric scoring; dashboards for frontier evolution.
- Dependencies: Domain expertise to pick actionable diagnostics; privacy/PII filtering.
- Personal prompt and automation tuning (Daily Life/Productivity)
- What: Optimize prompts and small scripts for personal assistants (task lists, email sorting, study aids) to reduce friction.
- How: Evaluator measures task success/time saved on sample scenarios; SI includes failure categories and misclassifications.
- Tools/workflows: Local sandbox with API keys; scheduled re-optimization as habits change.
- Dependencies: Safe execution; clear success metrics; modest LLM budget.
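The evaluate(candidate) → (score, diagnostics) contract that the use cases above wrap around can be sketched as follows. This is a minimal illustration under stated assumptions, not the optimize_anything API: the function name `evaluate`, the `side_info` keys, the test cases, and the requirement that the candidate define a function called `add` are all hypothetical. The candidate is run with a bare `exec`, which is unsafe outside a sandbox.

```python
import traceback
from typing import Any, Dict, Tuple

# Hypothetical task: the candidate text must define add(a, b).
TEST_CASES = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]


def evaluate(candidate_source: str) -> Tuple[float, Dict[str, Any]]:
    """Return a scalar score plus side information explaining what went wrong."""
    side_info: Dict[str, Any] = {"errors": [], "failures": []}
    namespace: Dict[str, Any] = {}
    try:
        # WARNING: unsandboxed execution, acceptable only in a toy example.
        exec(candidate_source, namespace)
        func = namespace["add"]
    except Exception:
        # Syntax errors and missing definitions become actionable diagnostics.
        side_info["errors"].append(traceback.format_exc(limit=1))
        return 0.0, side_info

    passed = 0
    for args, expected in TEST_CASES:
        try:
            got = func(*args)
        except Exception as exc:
            side_info["failures"].append({"args": args, "error": repr(exc)})
            continue
        if got == expected:
            passed += 1
        else:
            side_info["failures"].append(
                {"args": args, "expected": expected, "got": got}
            )
    return passed / len(TEST_CASES), side_info


buggy_score, buggy_si = evaluate("def add(a, b):\n    return a - b\n")
fixed_score, fixed_si = evaluate("def add(a, b):\n    return a + b\n")
broken_score, broken_si = evaluate("def add(a, b) return a + b")
print(buggy_score, buggy_si["failures"])
```

The score alone says the buggy candidate passes one of three cases; the failure records show which inputs went wrong and by how much, which is the kind of detail a proposer can act on.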
Long-Term Applications
These leverage the same paradigm but require more research, robust verifications, broader integration, or regulatory clarity.
- Autonomous performance compiler integrated with toolchains (Software/HPC)
- What: Always-on kernel/pipeline optimizer embedded in compilers (CUDA/Triton/TVM), evolving code with SI from profilers.
- Potential products: "KernelSmith Pro" compiler pass; IDE-in-the-loop autotuning.
- Dependencies: Strong correctness guarantees; regression test suites; deterministic profiling; vendor cooperation.
- First-class cloud scheduler optimizer for Kubernetes and cloud providers (Cloud/DevOps)
- What: Optimize cluster-wide policies (bin-packing, spot preemption, data egress routing) with train/val simulation and safe rollout.
- Potential products: "K8s Scheduler Tuner" with shadow-mode evaluation; cloud-native managed service.
- Dependencies: High-fidelity simulators; safety guardrails; multi-tenant SLA compliance; change-management workflows.
- Safety-critical decision policy tuning (Healthcare/Transportation/Energy)
- What: Optimize scheduling/triage/dispatch policies using rich SI (e.g., wait times, fairness, risk), with formal constraints.
- Potential products: Co-optimizer for OR staffing, EMS dispatch, grid load shedding.
- Dependencies: Regulatory approvals; interpretability and audit trails; rigorous offline evaluation and formal verification; bias/fairness safeguards.
- Robotics controller and planning code evolution (Robotics/Industrial Automation)
- What: Evolve high-level controller code and planning heuristics using SI (collision metrics, latency, energy).
- Potential products: "Planner Architect" for warehouse/AGV fleets; sim-to-real adapters.
- Dependencies: High-fidelity simulators; safety envelopes; real-time constraints; robust sandboxing.
- Finance strategy and risk policy optimization (Finance)
- What: Optimize rule-based strategies and risk controls with SI (PnL attribution, drawdowns, VaR breaches).
- Potential products: "Strategy Tuner" for backtesting engines; policy evolution for credit risk models.
- Dependencies: Strict compliance; out-of-sample validation; market regime shifts; data access and privacy.
- Public policy and urban systems planning (Policy)
- What: Optimize routing, zoning, or subsidy allocation policies with simulation feedback as SI (equity metrics, congestion, emissions).
- Potential products: Civic "Policy Lab" integrating traffic/agent-based models; participatory what-if tooling.
- Dependencies: Stakeholder governance; transparent objectives; robust, unbiased simulators; legal frameworks.
- Education content and tutor pipeline optimization (Education)
- What: Evolve curricula prompts/agent flows for tutoring systems to improve learning outcomes on held-out cohorts.
- Potential products: "Tutor Architect" that tunes hinting strategies and assessment rubrics.
- Dependencies: Ethical study designs; ground-truth outcomes; privacy-preserving data; pedagogy-informed SI.
- Cross-organization multi-task pattern sharing (Platform/Enterprise)
- What: Federated optimization across business units to share strategies via Pareto frontier without sharing raw data.
- Potential products: Frontier registry with privacy constraints; pattern "pull requests."
- Dependencies: Federated evaluation; IP and privacy agreements; standard SI schemas.
- Standards for SI and evaluator contracts (Ecosystem/Policy)
- What: Define interoperable SI taxonomies for domains (compilers, schedulers, agents) to enable plug-and-play optimizers.
- Potential products: Open specs, conformance suites, and reference SDKs.
- Dependencies: Industry consortia; domain consensus; security/PII guidelines.
- General-purpose "optimize-anything" marketplace (Platform)
- What: Catalog of artifacts (prompts, policies, kernels, agents) with provenance and validation scores; one-click deployment.
- Potential products: Artifact registry with automated retesting on updates; license-aware distribution.
- Dependencies: IP/licensing frameworks; reproducibility pipelines; trust signals and audits.
- Verified agent architectures for critical workflows (Software/GRC)
- What: Auto-designed, formally checked agent pipelines with enforced fallbacks and limits.
- Potential products: "Verified Agent Architect" integrating with policy-as-code (e.g., OPA) and formal methods.
- Dependencies: Formal verification integration; runtime monitors; red-teaming; provable safety constraints.
Cross-cutting assumptions and dependencies (affecting feasibility across applications)
- Evaluator quality is pivotal: must produce reliable scalar scores and actionable SI; poor SI reduces gains.
- Task relatedness matters for multi-task benefits; unrelated tasks can degrade results (as observed in circle packing).
- Access to capable LLMs/VLMs and compute budget; evaluation often dominates cost.
- Secure sandboxing for executing generated code; strong test oracles for correctness.
- Text-serialization of artifacts is required; non-text domains need faithful text proxies.
- Governance: audit trails, explainability, bias checks, and change management for production integration.
- Data privacy and compliance for domains with sensitive information (healthcare, finance, public sector).
These applications translate the paper's core finding (that many optimization problems can be reframed as text artifact search guided by rich diagnostics) into concrete tools and workflows that teams can adopt today, while outlining ambitious but attainable directions as evaluators, standards, and verification mature.
Glossary
- ADRS benchmark: A benchmark suite for evaluating cloud infrastructure algorithms. "We optimize two cloud infrastructure algorithms from the ADRS benchmark [6]."
- ADAS: Automated Design of Agentic Systems; a method for searching agent architectures. "ADAS [11] and AFlow [29] search over agent architectures."
- AFlow: A framework for agent architecture search. "ADAS [11] and AFlow [29] search over agent architectures."
- AIME: The American Invitational Mathematics Examination; used here as a benchmark for prompt optimization. "We optimize a system prompt for GPT-4.1-mini on AIME (American Invitational Mathematics Examination) competition problems."
- AlphaEvolve: An LLM-evolution framework using MAP-Elites to discover algorithms. "AlphaEvolve [18] pioneered the LLM-evolution paradigm, using Gemini models with island-based MAP-Elites [17] to discover algorithms for Google's infrastructure."
- ARC-AGI: A benchmark of abstraction and reasoning tasks assessing general intelligence-like capabilities. "The optimization objective is for the artifact to generalize to unseen ARC-AGI [7] puzzles."
- Bayesian optimizer: A Bayesian optimization method for black-box functions. "For example, one cannot show a Bayesian optimizer a stack trace."
- Bilevel optimizer: An optimization scheme with upper- and lower-level problems solved in tandem. "The optimized algorithm is a bilevel optimizer: an LP over radii with dual-variable gradients for L-BFGS-B center optimization, augmented by CMA-ES exploration and diverse seeding strategies."
- build123d: A Python CAD modeling library used to generate parametric models. "We generate SVG code and CAD models (via build123d) for four image goals (Table 10 in Appendix H)."
- Circle packing: The optimization problem of arranging circles within a region to maximize packed size without overlap. "Circle Packing (num circles = 26)"
- CMA-ES: Covariance Matrix Adaptation Evolution Strategy, a derivative-free optimization algorithm. "augmented by CMA-ES exploration and diverse seeding strategies."
- content-addressed evaluation caching: A caching technique that avoids re-evaluating identical artifacts by hashing content. "content-addressed evaluation caching to avoid redundant expensive rollouts;"
- CUDA kernel: A GPU-executed function written for NVIDIA's CUDA platform. "We generate CUDA kernels for 31 reference PyTorch operations from KernelBench [20]"
- Dijkstra routing: Shortest-path routing based on Dijkstra's algorithm. "CloudCast achieves 40.2% cost savings over Dijkstra routing (Figure 3a),"
- EVOLVE-BLOCK markers: Special prompt markers used by prior frameworks to delimit evolvable code regions. "Specifically, optimize_anything doesn't require mutation prompts, task-specific templates, island configurations, or EVOLVE-BLOCK markers (all common in prior frameworks)."
- float4 vectorization: Using 4-element floating-point vectors to increase memory and compute throughput. "The evolved kernels employ techniques such as float4 vectorization, two-pass algorithms (compute statistics, then normalize), warp shuffle reductions, and shared memory tiling."
- GEPA: A reflective prompt evolution algorithm leveraging Pareto-based search. "GEPA [3] achieves state-of-the-art prompt optimization with generalization to unseen inputs, but is limited to prompts;"
- GEPAAdapter: An adapter applying GEPA-style optimization to full agent programs. "building on an earlier proof-of-concept with GEPAAdapter [1]."
- Generalization mode: An optimization setting where artifacts are tuned on training data to perform on unseen validation data. "Both use generalization mode with training/validation splits over infrastructure scenarios."
- KernelBench: A benchmark suite of PyTorch operations for evaluating generated GPU kernels. "We generate CUDA kernels for 31 reference PyTorch operations from KernelBench [20]"
- L-BFGS-B: Limited-memory BFGS algorithm with bound constraints. "an LP over radii with dual-variable gradients for L-BFGS-B center optimization"
- Linear programming (LP): Optimization of a linear objective subject to linear constraints. "an LP over radii"
- MAP-Elites: A quality-diversity evolutionary algorithm that explores diverse high-performing solutions. "island-based MAP-Elites [17]"
- memory coalescing: Aligning GPU memory accesses across threads to maximize bandwidth utilization. "insights discovered for one problem (e.g., how to handle memory coalescing) transfer to others"
- MIPROv2: A method targeting prompt and few-shot selection optimization. "MIPROv2 [19] similarly targets prompt and few-shot selection."
- Multi-task search: Jointly optimizing across multiple related tasks to enable cross-task transfer. "No prior system supports multi-task search, where solving a batch of related problems together enables cross-transfer of discovered optimization patterns."
- NVCC: NVIDIA CUDA Compiler for compiling CUDA kernels. "SI includes: (i) NVCC compiler errors with line numbers,"
- ON_DEMAND instances: Reliable, non-preemptible cloud compute instances. "deciding when to use cheap preemptible SPOT instances versus reliable ON_DEMAND instances to meet deadlines."
- OpenEvolve: An open-source, model-agnostic reimplementation of AlphaEvolve. "OpenEvolve [24] provides an open-source reimplementation with model-agnostic support."
- Optuna: A hyperparameter optimization framework for black-box optimization. "matching and outperforming Optuna in numerical optimization,"
- Pareto-based search: Selection strategy that maintains candidates excelling across different objectives without collapsing to an average. "We achieve these results by extending the Pareto-based search of Agrawal et al."
- Pareto dominance: A multi-objective relation in which one solution is at least as good as another on every objective and strictly better on at least one. "per-example or per-metric Pareto dominance rather than aggregate scores,"
- Pareto frontier: The set of nondominated solutions in multi-objective optimization. "maintains a Pareto frontier: any candidate that is the best at something survives, even if its average is suboptimal."
- Reflexion: A self-correction technique using verbal reinforcement for agents. "Reflexion [25] uses verbal reinforcement for agent self-correction."
- Refiner step: A pre-evaluation pass that fixes common LLM generation errors to prevent failed executions. "a refiner step that catches common LLM generation artifacts (malformed code blocks, import errors, syntax issues) before evaluation"
- Scalable Vector Graphics (SVGs): A text-based vector image format for 2D graphics. "Scalable Vector Graphics (SVGs), or a system prompt, the structure is the same:"
- Seedless mode: Starting optimization without a seed artifact by bootstrapping from a natural-language objective. "Seedless mode makes the system accessible to users who can specify what they want but not implement it."
- Self-Refine: An approach where models iteratively improve their outputs using self-generated feedback. "Self-Refine [15] applies iterative self-feedback."
- Shared memory tiling: Organizing data into tiles in GPU shared memory to improve locality and throughput. "shared memory tiling."
- Side Information (SI): Diagnostic feedback returned by evaluators to guide LLM-driven revisions. "The evaluator returns a score plus Side Information (SI) - domain context explaining why."
- Slack ratio: A measure of scheduling slack relative to deadlines, used for policy decisions. "graduated decision thresholds based on slack ratio."
- SPOT instances: Preemptible, lower-cost cloud instances with availability interruptions. "deciding when to use cheap preemptible SPOT instances versus reliable ON_DEMAND instances to meet deadlines."
- Steiner tree: A minimum-cost tree that connects a set of required nodes (terminals), optionally routing through extra non-terminal nodes. "provider-aware Steiner tree approach"
- SLP: Sequential Linear Programming; an iterative linearization method for nonlinear optimization. "SLP on centers + dual-like constraint push"
- TextGrad: A technique using LLM-generated "gradients" to guide text optimization. "TextGrad [28] uses LLM-generated "gradients" for text optimization."
- two-pass algorithms: Algorithms that process data in two stages to improve correctness or performance. "two-pass algorithms (compute statistics, then normalize)"
- Vision-capable LLMs (VLM): Multimodal LLMs that can process visual inputs. "images (via oa.Image) for Vision-capable LLMs (VLM)."
- Warp shuffle reductions: GPU intra-warp operations that aggregate values efficiently via warp shuffle instructions. "warp shuffle reductions"
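The Pareto dominance and Pareto frontier entries above can be made concrete with a small sketch. This is an illustrative example only, not the selection logic used by optimize_anything or GEPA: the function names `dominates` and `pareto_frontier`, the candidate labels, and the scores are all hypothetical. It assumes higher scores are better and that each candidate has one score per evaluation example.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (higher is better).

    `a` must be at least as good as `b` on every objective and
    strictly better on at least one.
    """
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_frontier(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example scores for four candidate artifacts.
candidates = {
    "A": [0.9, 0.2, 0.5],  # best on example 0, lowest average among survivors
    "B": [0.4, 0.8, 0.5],  # best on example 1
    "C": [0.4, 0.2, 0.5],  # dominated by A
    "D": [0.6, 0.6, 0.6],  # best average
}

frontier = pareto_frontier(candidates)
print(frontier)
```

In this toy data, candidate A stays on the frontier even though its average is lower than D's, because nothing beats it on the first example. That mirrors the glossary's point that a candidate which is best at something survives even when its aggregate score is suboptimal.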