optimize_anything: A Universal API for Optimizing any Text Parameter

Published 19 May 2026 in cs.CL, cs.AI, cs.LG, cs.NE, and cs.SE | (2605.19633v1)

Abstract: Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa .

Summary

  • The paper introduces a unified LLM-based evolutionary search framework that optimizes any text parameter across diverse tasks.
  • It employs structured side information and Pareto-based candidate selection to accelerate convergence and enhance multi-task performance.
  • Empirical results show significant improvements in domains like scheduling, kernel generation, and agent design, highlighting its broad applicability.

Universal LLM-based Text Optimization: The optimize_anything Framework

Motivation and Context

The "optimize_anything: A Universal API for Optimizing any Text Parameter" (2605.19633) paper establishes a unified, domain-agnostic optimization paradigm for any artifact representable as text, leveraging LLM-driven evolutionary search. Prior frameworks, such as AlphaEvolve, GEPA, and FunSearch, achieved strong results within their respective domains (program synthesis, prompt engineering, or code generation), but none demonstrated state-of-the-art performance across domains as diverse as agent design, CUDA kernel optimization, scheduling, and image/CAD generation with a single interface. The work posits that any problem expressible as maximizing an evaluator $f$ over string artifacts $x$ can be handled by LLM-augmented optimization, embedding diagnostic "side information" (SI) into the search loop. This reframes LLMs not only as generative agents but as universal problem solvers via guided text artifact evolution.

System Design and Optimization Modes

The core insight is the reduction of broad optimization problems (ranging from code, policies, and prompts to numeric configurations and even image representations) into a common API contract:

  • The user provides a seed artifact (string), an evaluator returning a score and SI, and optionally datasets for multi-task/generalization.
  • The framework orchestrates the search, selection, and candidate proposal, decoupled from hand-crafted mutation templates or domain-specific "evolve blocks" common in prior art.

Crucially, SI is treated as a first-class channel: the evaluator emits not just scalars but failure diagnostics (e.g., compiler errors, reasoning traces, visualizations), enabling the LLM proposer to make semantically informed refinements akin to engineer iteration rather than black-box numeric tweaking.
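As a minimal sketch of this contract, assume the candidate is Python source text that must define a function called `solve`. The names `evaluate`, `solve`, and `CASES` are illustrative assumptions and are not taken from the optimize_anything API; the point is only that the evaluator returns a score together with diagnostics the proposer can read.

```python
import traceback
from typing import Any

# Hypothetical test cases: (input, expected output) for a doubling function.
CASES = [(1, 2), (3, 6), (10, 20)]


def evaluate(candidate: str) -> tuple[float, dict[str, Any]]:
    """Score candidate source text and return diagnostics as side information."""
    side_info: dict[str, Any] = {"errors": [], "failed_cases": []}
    namespace: dict[str, Any] = {}
    try:
        # NOTE: exec runs arbitrary code; a real evaluator should sandbox this.
        exec(compile(candidate, "<candidate>", "exec"), namespace)
    except Exception:
        # Compile-time or import-time failure: report it instead of a bare 0.
        side_info["errors"].append(traceback.format_exc(limit=1))
        return 0.0, side_info

    solve = namespace.get("solve")
    if not callable(solve):
        side_info["errors"].append("candidate does not define solve()")
        return 0.0, side_info

    passed = 0
    for arg, expected in CASES:
        try:
            got = solve(arg)
        except Exception as exc:
            side_info["failed_cases"].append({"input": arg, "error": repr(exc)})
            continue
        if got == expected:
            passed += 1
        else:
            side_info["failed_cases"].append(
                {"input": arg, "expected": expected, "got": got}
            )
    return passed / len(CASES), side_info
```

A score-only evaluator would return just the fraction of passing cases; here the `failed_cases` and `errors` entries play the role of SI.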

The system supports three distinct search modes under a single API:

  • Single-task: Directly optimizes an artifact for a single target metric.
  • Multi-task: Joint optimization across batches of related tasks; the Pareto frontier is shared, allowing cross-transfer of optimization patterns. This is uniquely operational in optimize_anything and is critical for domains such as kernel generation, where architectural motifs generalize across instances.
  • Generalization: Seeks artifacts that validate robustly on held-out data, extending beyond prompt optimization to agent architectures and scheduling policies.
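To make the generalization mode concrete, the sketch below selects among candidates using held-out examples only. It is a toy illustration under stated assumptions: `pick_generalizing` and `prefix_match` are hypothetical names, and the selection rule (highest mean validation score) is a simplification rather than the procedure the paper uses.

```python
from statistics import mean
from typing import Callable, Optional, Sequence

Evaluator = Callable[[str, str], float]  # (candidate, example) -> score


def pick_generalizing(
    candidates: Sequence[str],
    evaluate: Evaluator,
    train: Sequence[str],
    val: Sequence[str],
) -> tuple[str, float, float]:
    """Return (candidate, mean train score, mean val score), chosen on val only."""
    best: Optional[tuple[str, float, float]] = None
    for cand in candidates:
        train_score = mean(evaluate(cand, ex) for ex in train)
        val_score = mean(evaluate(cand, ex) for ex in val)
        if best is None or val_score > best[2]:
            best = (cand, train_score, val_score)
    if best is None:
        raise ValueError("no candidates supplied")
    return best


def prefix_match(candidate: str, example: str) -> float:
    """Toy evaluator: a candidate 'solves' an example if it is a prefix of it."""
    return 1.0 if example.startswith(candidate) else 0.0


# "ap" fits the training examples but fails on held-out data; "a" generalizes.
result = pick_generalizing(
    ["ap", "a"], prefix_match,
    train=["apple", "apricot"], val=["avocado", "almond"],
)
```

Both candidates score perfectly on the training examples, so only the held-out split separates them.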

Key Algorithmic Innovations

The optimization workflow is fundamentally evolutionary, with explicit architectural advances:

  • Pareto-based Candidate Selection: Instead of collapsing feedback into aggregate values, the system tracks per-task/subscore metrics over candidates and preserves any artifact excelling in at least one dimension. This maintains a diverse "frontline" of specialist solutions and prevents premature convergence.
  • Structured Reflection with SI: LLMs contextualize failures (not just scores) during dedicated "reflection" phases, using SI to target improvements. For multi-module artifacts (e.g., agent + refiner prompt), Pareto leapfrogging enables mutual advancement.
  • Seedless and Flexible API: For tasks where even a poor initial seed is infeasible (e.g., 3D modeling), natural language objectives suffice, with LLMs bootstrapping candidate generation.
  • Adapter Layer and Backend Agnosticism: The architecture is modular with respect to the optimization backend. GEPA's reflective mutation/Pareto selection is leveraged by default but can be substituted as externally improved algorithms emerge.
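The Pareto-based selection rule in the list above can be sketched as follows. This is an illustration of the description given here (keep every candidate that is best on at least one task), not the project's actual implementation; the function name and the example scores are made up.

```python
def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that attains the best score on at least one task.

    `scores` maps a candidate name to its per-task scores (same task order
    for every candidate). Ties on a task keep all tied candidates.
    """
    if not scores:
        return set()
    n_tasks = len(next(iter(scores.values())))
    keep: set[str] = set()
    for task in range(n_tasks):
        best = max(per_task[task] for per_task in scores.values())
        keep.update(
            name for name, per_task in scores.items() if per_task[task] == best
        )
    return keep


# Hypothetical per-task scores for four candidates on three tasks.
example_scores = {
    "A": [0.9, 0.2, 0.5],
    "B": [0.4, 0.8, 0.5],
    "C": [0.6, 0.6, 0.6],  # highest average, but best on no single task
    "D": [0.1, 0.1, 0.9],  # lowest average, but best on task 2
}
frontier = pareto_frontier(example_scores)
```

In this example a top-by-average rule would keep only C, while the per-task rule keeps the three specialists A, B, and D.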

Empirical Results and Claims

The paper reports strong, quantifiable improvements across six diverse domains, validated with cross-framework ablations and matched-budget reruns:

Domain | Optimization Mode | LLM | Key Result
--- | --- | --- | ---
Agent Skills (Bleve repo) | Generalization | Claude Opus | Pass rate: 79.3% → 98.3% (Haiku); transferable skills
Cloud Scheduling | Generalization | Gemini 3 Pro | Cost savings: up to 40.2% (CloudCast benchmark)
ARC-AGI (agent architecture) | Generalization | Gemini Flash | Test accuracy: 32.5% → 89.5%; 4-stage pipeline learned
AIME Prompt Optimization | Generalization | GPT-4.1 mini | Test: 46.7% → 60.0%, +13.3pp over baseline
CUDA Kernel Generation | Multi-task | GPT-5 | 87% match/beat PyTorch baseline; 25% are 20%+ faster
Circle Packing (n=26) | Single-task | GPT-5.1 | Sum radii: 2.6360 (beats AlphaEvolve/OpenEvolve)

Ablation studies demonstrate:

  • Inclusion of SI (vs. score-only feedback) yields 4–6× faster convergence and large improvements in final quality; for example, mean kernel speedups of 4.11× (with SI) vs. 1.15× (score only).
  • Multi-task mode consistently outperforms per-task independent optimization for related problems, confirming the benefit of cross-task pattern transfer via the shared Pareto set.
  • Proposer LLM quality directly impacts performance/cost tradeoffs, but even cost-optimized LLMs outperform strong hand-tuned baselines.

The qualitative analysis reveals the system's ability to auto-discover nontrivial algorithmic patterns such as break-even resource allocation in scheduling, verify-then-fallback meta-controllers in agents, and hybridized mathematical solvers for circle packing. The experiments also identify tasks where multi-task search is counterproductive (e.g., independent circle packing instances), which indicates that cross-task transfer helps only when tasks share underlying structure.

Implications and Theoretical Ramifications

The results provide concrete evidence for the thesis that LLM-based evolutionary search, when equipped with structured SI and Pareto-based exploration, forms a general-purpose paradigm for programmatic optimization. In effect, this collapses problem-specific optimizer design (e.g., by prompt engineers for LLMs, kernel hackers for GPU code, or combinatorialists for algorithmic puzzles) into data/model selection and SI contract design.

Practically, this framework democratizes optimization for any text-serializable domain, abstracting away meta-model selection, algorithmic tuning, and low-level mutation engineering from end-users. The declarative API (comprising artifact, evaluator, and optionally a dataset) suggests future systems where domain experts only specify goals and diagnostic knowledge, delegating the optimization loop to universal LLMs.

Theoretically, the work situates SI as an analog to gradients in continuous optimization, but richer: SI can subsume structured data, textual explanations, and multimodal feedback, enabling LLM-driven evolution to traverse problem landscapes inaccessible to classical numerical search.

Future Directions

Several open questions and promising research trajectories follow:

  • Model-based Search: As LLMs and VLMs advance, the coupling of text-driven optimization with perception and control could further unify disparate AI tasks (e.g., hardware, robotics, design synthesis).
  • SI Automation: Automating the generation or extraction of high-value SI from evaluators could reduce the remaining domain expertise bottleneck. Learning "what to explain" to the LLM proposer is a critical meta-optimization target.
  • Artifact Modality Expansion: Extending the abstraction to non-text artifacts (binaries, graph structures) via serialization and evaluator-provided SI translation.
  • Resource-efficient Optimization: As evaluation cost remains a constraint for complex domains, further work is needed in budget-aware candidate prioritization, model distillation for proposal steps, and meta-learning optimal SI extraction.
  • Back-end Diversity: Integration of alternative evolutionary or hybrid optimizers and real-time selection/adaptation of backends in response to domain idiosyncrasies.

Conclusion

"optimize_anything" (2605.19633) demonstrates, with robust quantitative and qualitative evidence, that universal LLM-based text optimization is a viable, high-performance alternative to domain-specific optimizers for a broad family of tasks. The system's advances, particularly the SI-based reflection mechanism, unified multi-mode optimization, and API simplicity, eliminate the need for handcrafted evolutionary infrastructure per task. This constitutes a substantial step toward general-purpose, explainably guided artifact evolution, with deep implications for both the theory and practice of automated program and system design.

Explain it Like I'm 14

What this paper is about

The paper introduces "optimize_anything," a tool that helps an AI improve almost any text-based blueprint (like code, prompts, or even a full agent's design) by trying ideas, getting scored, reading feedback, and then making smarter versions. The big idea is simple: if you can write something as text and build a way to score it, this system can help optimize it.

The main questions the paper asks

  • Can one AI-based system improve very different kinds of things (like GPU code, math prompts, or puzzle-solving agents) using the same method?
  • Is it better to learn from several related tasks at once rather than solving each one alone?
  • Does detailed feedback (not just a number score) help the AI learn faster and do better?
  • Can the system create solutions that work well on brand-new, unseen examples?

How the system works (in everyday terms)

Think of the process like revising a school project with a helpful teacher:

  1. You start with a draft (a "text artifact" like code, a prompt, or an agent's rules).
  2. You hand it to a strict judge (the "evaluator") that:
    • Gives a score ("How good is this?").
    • Gives feedback notes ("What went wrong?"), called Side Information (SI).
  3. An AI "proposer" reads the notes and writes a better draft.
  4. Repeat until you get strong results.
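The four steps above can be sketched as a tiny loop. Everything here is a stand-in chosen for illustration: the draft is a text like `threshold=50`, the judge knows a hidden best value, and the "proposer" is a simple rule that reads the feedback note instead of a real LLM.

```python
HIDDEN_OPTIMUM = 37  # made-up value known only to the evaluator


def evaluate(candidate: str) -> tuple[float, dict]:
    """Judge a draft like 'threshold=50': return a score and a feedback note."""
    value = int(candidate.split("=")[1])
    error = value - HIDDEN_OPTIMUM
    note = "ok" if error == 0 else ("too high" if error > 0 else "too low")
    return -abs(error), {"note": note}


def propose(candidate: str, side_info: dict, lo: int, hi: int) -> tuple[str, int, int]:
    """Stand-in for the LLM proposer: narrow the range using the feedback note."""
    value = int(candidate.split("=")[1])
    if side_info["note"] == "too high":
        hi = value - 1
    elif side_info["note"] == "too low":
        lo = value + 1
    return f"threshold={(lo + hi) // 2}", lo, hi


def optimize(seed: str, budget: int = 20) -> tuple[str, float, int]:
    """Run the draft -> judge -> revise loop; return (best, score, evaluations)."""
    lo, hi = 0, 100  # assumed search range for the toy problem
    current = best = seed
    score, side_info = evaluate(current)
    best_score, evals = score, 1
    while best_score < 0 and evals < budget:
        current, lo, hi = propose(current, side_info, lo, hi)
        score, side_info = evaluate(current)
        evals += 1
        if score > best_score:
            best, best_score = current, score
    return best, best_score, evals


result = optimize("threshold=50")
```

Starting from `threshold=50`, the notes "too high" and then "too low" steer the drafts to 24 and then 37, so the loop stops after three evaluations. With only the score and no note, the proposer would not know which direction to move.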

Here's what makes it practical:

  • One simple interface for many problem types: If it can be represented as text (code, policy rules, prompts), this system can try to improve it.
  • Three ways to search for better solutions:
    • Single-task: Improve one thing (like one circle-packing program).
    • Multi-task: Improve several related things together, so tricks learned on one can help the others (like many GPU kernels).
    • Generalization: Improve one artifact so it works well on new, unseen examples (like prompts or agents tested on new problems).
  • Side Information (SI) is like a teacher's margin notes: not just "You got 70%," but "Line 12 has a bug," or "This part is slow," or even a picture showing what went wrong.
  • Pareto-based selection is like keeping a team of specialists, not only the one with the best average: the system keeps candidates that are the best in at least one area (e.g., fastest, most accurate, least costly), so different strengths can be combined over time.
  • โ€œSeedlessโ€ mode can start from a plain English goal if you donโ€™t have any starting code or prompt.

What they tested and what they found

The authors tried the system on very different tasks and reported strong results:

  • Agent architecture for ARC-AGI puzzles (like pattern and logic puzzles): Evolved a simple agent into a 4-stage system and raised accuracy from 32.5% to 89.5%, nearly triple.
  • Cloud scheduling (deciding how to run jobs cheaply without missing deadlines): Cut costs by up to about 40% by inventing smarter, provider-aware strategies.
  • GPU code (CUDA kernels) for speeding up PyTorch operations: 87% of generated kernels matched or beat the baseline speed; many were 10–20% faster.
  • Math contest prompting (AIME): A better prompt boosted a small model's score from 46.67% to 60.00% on a new test year, beating a known prompt-optimization method.
  • Circle packing (fitting circles in a square without overlaps): Beat a previous system's best result on a well-known instance (n=26), and did it with fewer tries.
  • Images via SVG/CAD: Produced designs that human judges preferred over basic attempts, with much higher "quality" scores from a vision model.

Two extra takeaways stood out:

  • Detailed feedback (SI) matters a lot: Across multiple tests, having actionable notes (like compiler errors or per-aspect scores) made learning 4–6× faster and led to better final scores than just giving a single number.
  • Learning across tasks helps when tasks share patterns: In GPU code, optimizing many related kernels together outperformed solving each alone with the same budget. However, for unrelated tasks (like circle packing with different numbers of circles), multi-task learning didn't help and could even add noise.

Why these results are important

  • One method, many domains: Instead of building a different optimizer for each specialty (prompts, code, agents, images), you can use the same "write text → score → read feedback → improve" loop everywhere.
  • Faster progress with better feedback: Turning "You failed" into "Here's exactly why you failed" lets the AI propose targeted fixes, much like a student improves faster with specific teacher comments.
  • Practical gains: Cheaper cloud runs, faster GPU code, smarter agents, and better prompts are immediately useful in real systems.
  • Accessible experimentation: Because artifacts are just text, the same interface works across tasks. Non-experts in optimization can still contribute by writing good evaluators that return helpful feedback.

What this could mean going forward

If many problems can be expressed as "improve this text and score it," then a single, open-source tool can accelerate progress across fields (coding, systems, science, and design) without custom-made optimizers for each one. That said, results depend on the quality of the AI and the evaluator, and evaluations can be costly. Still, the paper shows a promising path: using detailed feedback and a diverse pool of strong candidates, one unified system can find creative, high-performing solutions across very different kinds of challenges.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of concrete gaps and unresolved questions that future work could address to strengthen or extend the claims in the paper.

  • Reproducibility with open models: Most results rely on closed, premium models (e.g., GPT-5, Gemini 3, Claude 4.6). It remains unclear how well the system performs with strong open-source LLMs/VLMs (e.g., Llama, Qwen, Phi) across all domains and modes.
  • Proposer–executor model mismatch: Beyond the coding-skills transfer across two Claude variants, the paper does not systematically study how performance changes when the proposer differs from the runtime model (e.g., proposer = small open model, executor = larger or different family), or when the proposer and agent models are intentionally decoupled.
  • Sensitivity to evaluator noise and nondeterminism: Optimization stability under noisy evaluators (e.g., VLM-based image scores, GPU benchmarking variance) is not quantified. There is no analysis of repeated trials, variance, or robustness to stochastic evaluators.
  • Mis-specification and reward hacking: The system could exploit evaluator loopholes (e.g., reading ground truth from SI, hardcoding answers, test leakage, code-injection or unsafe syscalls). Methods to detect and prevent reward hacking and to sandbox evaluations are not presented.
  • Safety and security of code execution: The framework executes generated code (e.g., exec(), CUDA kernels) but does not detail sandboxing, resource limits, network isolation, or permission controls to mitigate malicious or accidental harm.
  • SI quality and failure modes: While SI improves performance, the effects of low-quality, misleading, incomplete, or adversarial SI are not studied. There is no guidance on SI design trade-offs or automated SI curation/summarization for long traces.
  • Generalizable SI schemas: The "typed SI primitive" is proposed, but there is no systematic evaluation of reusable, cross-domain SI schemas or tools for auto-extracting SI from logs/telemetry with minimal domain engineering.
  • Ablations on algorithmic components: The paper does not isolate the contribution of several core design choices, including:
    • Pareto frontier vs. simpler selection (e.g., top-k by average score).
    • Minibatch reflection (2–3 examples) vs. full-batch reflection.
    • The refiner step's effect on failure rate and sample efficiency.
    • Content-addressed caching vs. naive deduplication.
  • Theoretical guarantees: There is no theoretical analysis of convergence, regret, or sample complexity for Pareto-based text optimization with SI, especially under noisy or non-stationary evaluators.
  • Scaling limits of multi-task search: The approach is shown on up to ~200 tasks, but there is no study of scaling to thousands of tasks, frontier growth dynamics, memory/computation overheads, or strategies to keep the frontier tractable.
  • Automatic detection of negative transfer: Multi-task search can hurt when tasks are independent (e.g., circle packing with different N). The framework does not include task-relatedness estimation, task clustering, or adaptive mode switching to avoid negative transfer.
  • Overfitting to validation in generalization mode: Repeated validation evaluation can lead to implicit overfitting. Guardrails (e.g., early stopping, restricted val queries, nested splits) are not described or evaluated.
  • CUDA kernel generality and baselines: Kernel results are reported on a V100 and 31 ops; portability to other GPUs/architectures (A100/H100/consumer GPUs), tensor shapes, and batch sizes is untested. Comparisons against specialized autotuners (e.g., TVM/Ansor, Triton autotuning) and across more comprehensive kernel suites are missing.
  • Realism of cloud scheduling evaluation: Results are from ADRS-like benchmarks; real-world deployment factors (failures, forecasting errors, bursty workloads, fairness/SLAs, multi-tenant interference) and robustness to shifting distributions are not investigated.
  • Constraint handling and multi-objective trade-offs: The framework uses Pareto selection but lacks a principled way to enforce hard constraints (e.g., deadlines, correctness) while optimizing other metrics, or to specify and tune trade-offs among competing objectives.
  • ARC-AGI evaluation rigor: The 89.5% result is promising, but:
    • Baselines against strong agent frameworks and state-of-the-art ARC systems are not provided.
    • Potential training-data contamination or model priors are not discussed.
    • Robustness across alternative ARC splits or related benchmarks (e.g., ARC-DA, different train/val/test partitions) is not evaluated.
  • Image evaluation validity: VLM-based scoring can be biased or gamed. Human evaluation used only five raters; no inter-rater reliability, larger blinded studies, or use of standardized perceptual metrics are reported.
  • Seedless mode reliability: Only a single 3D modeling anecdote is presented. The consistency, success rate, and failure cases of seedless optimization across tasks remain unquantified.
  • Cost and energy accounting: Reported dollar costs exclude detailed compute/energy metrics and do not decompose costs across propose vs. evaluate steps per domain, making it hard to predict operational expense at scale.
  • Frontier diversity metrics: Although Pareto diversity is claimed to prevent premature convergence, there are no quantitative diversity measures (e.g., edit distances, behavioral diversity) or analyses of diversity–performance trade-offs.
  • Backend-agnostic claims: The interface is said to be backend-agnostic, but experiments primarily use a GEPA-derived backend. Comparative studies with alternative optimizers (e.g., evolutionary strategies, Bayesian optimization over edits, gradient-based methods like TextGrad) are absent.
  • Fair cross-framework comparisons: Apart from a controlled circle-packing run, most comparisons with prior systems are not performed under matched budgets, hardware, or identical evaluators/models, leaving fairness and effect sizes uncertain.
  • Task grouping and curriculum in multi-task mode: How to select, order, or cluster tasks to maximize transfer is not explored. There is no study of curricula, task-weighting, or adaptive sampling.
  • Handling continuous and non-text artifacts: The framework assumes text-serializable artifacts. Effectiveness and fidelity of text proxies for inherently continuous/binary artifacts (e.g., weights, bitstreams) are not evaluated.
  • Robustness to long SI/logs: Strategies for truncation, summarization, or structured extraction from very long SI (e.g., deep traces, profiler logs) are not provided, raising questions about scalability of reflection prompts.
  • Data governance and copyright: Generated code/algorithms may reflect training data. Licensing, attribution, and compliance implications are not addressed.
  • Ethical considerations: Optimizing policies (e.g., cloud scheduling) can impact fairness, energy footprint, and cost distribution. The paper does not discuss ethical guardrails or auditing for unintended consequences.
  • Automatic SI design: There is no method to learn or auto-suggest the most actionable SI for a given domain (e.g., via ablation-based attribution of SI components), which would reduce dependence on domain experts.
  • Noisy benchmark protocols for kernels and agents: The number of runs, warm-up strategies, and statistical testing for performance measurements (e.g., GPU timing variance, flaky tests in agents) are not described.
  • Stopping criteria and budget allocation: Guidelines for stopping conditions, adaptive budget allocation across tasks, and balancing exploration vs. exploitation are not specified.
  • Transferability across hardware and environments: Apart from coding-skills transfer across two Claude variants, cross-hardware transfer (e.g., kernels across GPUs/CPUs) and cross-environment transfer (e.g., schedulers across cloud providers) are not empirically studied.
  • Detecting and mitigating catastrophic proposals: The framework adds a "refiner step," but there is no analysis of catastrophic proposal rates (e.g., broken kernels, infinite loops) or recovery strategies beyond simple filtering.
  • Integration with human-in-the-loop feedback: While SI can include arbitrary feedback, the paper does not explore mechanisms for incorporating human ratings/edits efficiently (e.g., active learning, preference optimization) or measuring the gains from occasional human interventions.

Practical Applications

Immediate Applications

The paper's unified "text artifact + evaluator + side information (SI)" approach can be deployed now in many settings by wrapping existing workflows with evaluate(candidate) → (score, diagnostics) and letting optimize_anything iterate. Below are concrete, sector-linked use cases, including likely tools/workflows and key dependencies.

  • Repository-specific coding "skills" for AI code assistants (Software/Dev Productivity)
    • What: Evolve natural-language "skills" and best-practice snippets tailored to a codebase to boost AI coding agents' task completion and reduce time-to-fix (as demonstrated on Bleve).
    • How: Integrate optimize_anything into a repo's dev container; evaluator runs agent on task suites; SI captures tool-call traces, compile/test results, and error logs.
    • Tools/workflows: VS Code/JetBrains extension; CI job that re-trains skills nightly; LangChain/DSPy pipeline with GEPA backend.
    • Dependencies/assumptions: Stable task harness; representative task set; access to capable LLMs; sandboxed execution for safety.
  • CI/CD failure triage and flaky-test reduction (Software Engineering)
    • What: Optimize agent prompts/flows that diagnose and fix failing tests faster using SI (stack traces, logs).
    • How: Evaluator replays failures in isolated environments; SI includes error signatures and coverage gaps; search evolves "triage playbooks."
    • Tools/workflows: GitHub Actions/GitLab CI plugin; Pareto selection across failure categories.
    • Dependencies: Reliable repro harness; log/trace collection; test isolation to avoid side effects.
  • Enterprise prompt optimization for LLM apps (Software, Customer Support, Education)
    • What: Automatically tune system prompts and few-shot exemplars to improve accuracy and reduce token cost (e.g., +13.3pp on AIME-2025).
    • How: Evaluator grades task performance on held-out examples; SI includes sub-scores (reasoning steps, error types).
    • Tools/workflows: "PromptOps" service integrated with LangSmith, DSPy, or in-house prompt store; A/B rollout gates.
    • Dependencies: Ground-truth datasets or rubric-based auto-graders; cost budget for proposer/evaluator calls.
  • Cloud cost/performance policy tuning (Cloud/DevOps)
    • What: Discover scheduling policies for multi-cloud data transfers and spot/on-demand mix to cut costs (e.g., 40%+ for routing; 7.8% on spot policies).
    • How: Evaluator simulates workloads; SI surfaces egress cost breakdowns, utilization timelines, SLA misses.
    • Tools/workflows: Kubernetes scheduler plugin, Airflow policy tuner, Dataplane policy CRDs with nightly optimization runs.
    • Dependencies: Realistic simulators or shadow-mode evaluation; access to price and availability traces; safe rollout strategy.
  • GPU kernel auto-generation and tuning (ML Systems/HPC)
    • What: Generate CUDA kernels that meet or beat PyTorch baselines (87% matched/faster) for custom ops.
    • How: Evaluator compiles, verifies correctness vs. references, benchmarks performance; SI provides NVCC errors, profiler hints, speedup ratios.
    • Tools/workflows: PyTorch/Triton plugin that swaps in optimized kernels; multi-task runs across operator sets; regression guardrails.
    • Dependencies: Hardware access; correctness oracle; sandboxed compiler; allowances for driver/toolchain updates.
  • Agent architecture optimization for task-specific assistants (Software/RPA)
    • What: Auto-design multi-stage agent pipelines (e.g., analyze → code → verify → fallback) that generalize to new tasks (e.g., ARC-AGI jump from 32.5%→89.5%).
    • How: Evaluator runs agent on train/val tasks; SI includes per-submodule traces, error/fallback counts, cost metrics.
    • Tools/workflows: "Agent Architect" workflow on top of DSPy; templated harnesses for customer support, research agents, or ETL automations.
    • Dependencies: Robust task harness; budget for evaluation rollouts; secure code execution; red-team testing for safety.
  • Hyperparameter and solver-heuristic evolution (ML/Optimization)
    • What: Evolve solver code/heuristics and hyperparameter schedules that outperform generic HPO (paper reports wins vs. Optuna in a black-box suite).
    • How: Evaluator returns objective values and subscores/violations; SI includes constraint diagnostics, convergence traces.
    • Tools/workflows: Optuna/Weights & Biases integration with custom evaluator; nightly auto-tuner for pipelines.
    • Dependencies: Deterministic evaluation or sufficient repeated trials; proper metric caching.
  • SVG/CAD asset generation for design teams (Design/Marketing/Front-end)
    • What: Produce on-brand icons, diagrams, and simple CAD shapes rated by a VLM on multiple visual aspects; humans preferred optimized outputs over baseline.
    • How: Evaluator renders candidates and queries a VLM; SI provides per-aspect scores and annotated diff screenshots.
    • Tools/workflows: Figma/Onshape plugin; batch optimization across a design system with multi-task Pareto selection.
    • Dependencies: VLM access; brand/style rubrics; rendering pipeline; IP review.
  • Academic algorithm prototyping and discovery (Academia/Operations Research)
    • What: Rapidly prototype algorithmic ideas for hard problems (e.g., circle packing bilevel solvers) using SI-guided code evolution.
    • How: Evaluator scores objective and returns constraint/geometry diagnostics and images; SI drives targeted shifts (e.g., switch to LP/SLP).
    • Tools/workflows: Jupyter/Colab integration; versioned frontier artifacts; artifact caching for reproducibility.
    • Dependencies: Accurate simulators/objective functions; compute budget; willingness to audit generated methods.
  • Multi-service policy patterns via multi-task search (Cross-cutting DevOps/ML)
    • What: Co-optimize families of similar tasks (e.g., GPU ops, microservice autoscaling rules) to discover reusable patterns.
    • How: Shared Pareto frontier across tasks; SI highlights per-task wins/failures; each task exports its own best artifact.
    • Tools/workflows: Organization-wide "pattern frontier" for ops and ML teams; templated evaluators.
    • Dependencies: Tasks must share underlying structure; evaluation isolation; governance for cross-service adoption.
  • SI Instrumentation SDK for evaluators (Platform/Tooling)
    • What: Standardize capturing compiler errors, profiling summaries, sub-scores, and images as SI to accelerate convergence (4–6× speed-ups shown).
    • How: Lightweight Python/Go hooks to collate logs/metrics into a side_info dict; optional oa.Image for visuals.
    • Tools/workflows: Shared org SDK; conventions for per-metric scoring; dashboards for frontier evolution.
    • Dependencies: Domain expertise to pick actionable diagnostics; privacy/PII filtering.
  • Personal prompt and automation tuning (Daily Life/Productivity)
    • What: Optimize prompts and small scripts for personal assistants (task lists, email sorting, study aids) to reduce friction.
    • How: Evaluator measures task success/time saved on sample scenarios; SI includes failure categories and misclassifications.
    • Tools/workflows: Local sandbox with API keys; scheduled re-optimization as habits change.
    • Dependencies: Safe execution; clear success metrics; modest LLM budget.
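
To make the "SI Instrumentation SDK" idea above concrete, here is a minimal sketch of an evaluator that returns a scalar score together with a `side_info` dict. The function name `evaluate`, the dict keys, and the convention that a candidate defines a function `f` are all hypothetical choices for illustration; they are not the optimize_anything API.

```python
import ast
from typing import Any


def evaluate(candidate_source: str, cases: list[tuple[int, int]]) -> tuple[float, dict[str, Any]]:
    """Score a candidate that must define `f(x)`; return (score, side_info)."""
    # Hypothetical SI layout: free-form errors, per-case failures, sub-scores.
    side_info: dict[str, Any] = {"errors": [], "failures": [], "sub_scores": {}}

    # Catch syntax errors up front so the optimizer sees the line number.
    try:
        ast.parse(candidate_source)
    except SyntaxError as exc:
        side_info["errors"].append(f"SyntaxError at line {exc.lineno}: {exc.msg}")
        return 0.0, side_info

    namespace: dict[str, Any] = {}
    # NOTE: exec is for illustration only; a real evaluator needs a sandbox.
    try:
        exec(candidate_source, namespace)
        f = namespace["f"]
    except Exception as exc:
        side_info["errors"].append(f"{type(exc).__name__}: {exc}")
        return 0.0, side_info

    passed = 0
    for x, expected in cases:
        try:
            got = f(x)
        except Exception as exc:
            side_info["failures"].append({"input": x, "error": f"{type(exc).__name__}: {exc}"})
            continue
        if got == expected:
            passed += 1
        else:
            # Record what went wrong, not just that something did.
            side_info["failures"].append({"input": x, "expected": expected, "got": got})

    score = passed / len(cases) if cases else 0.0
    side_info["sub_scores"]["pass_rate"] = score
    return score, side_info
```

The point of the design is that a failing candidate still yields diagnostics (a line number, a mismatched input/output pair) rather than a bare zero, which is the kind of actionable feedback the bullets above describe.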

Long-Term Applications

These leverage the same paradigm but require more research, robust verification, broader integration, or regulatory clarity.

  • Autonomous performance compiler integrated with toolchains (Software/HPC)
    • What: Always-on kernel/pipeline optimizer embedded in compilers (CUDA/Triton/TVM), evolving code with SI from profilers.
    • Potential products: “KernelSmith Pro” compiler pass; IDE-in-the-loop autotuning.
    • Dependencies: Strong correctness guarantees; regression test suites; deterministic profiling; vendor cooperation.
  • First-class cloud scheduler optimizer for Kubernetes and cloud providers (Cloud/DevOps)
    • What: Optimize cluster-wide policies (bin-packing, spot preemption, data egress routing) with train/val simulation and safe rollout.
    • Potential products: “K8s Scheduler Tuner” with shadow-mode evaluation; cloud-native managed service.
    • Dependencies: High-fidelity simulators; safety guardrails; multi-tenant SLA compliance; change-management workflows.
  • Safety-critical decision policy tuning (Healthcare/Transportation/Energy)
    • What: Optimize scheduling/triage/dispatch policies using rich SI (e.g., wait times, fairness, risk), with formal constraints.
    • Potential products: Co-optimizer for OR staffing, EMS dispatch, grid load shedding.
    • Dependencies: Regulatory approvals; interpretability and audit trails; rigorous offline evaluation and formal verification; bias/fairness safeguards.
  • Robotics controller and planning code evolution (Robotics/Industrial Automation)
    • What: Evolve high-level controller code and planning heuristics using SI (collision metrics, latency, energy).
    • Potential products: “Planner Architect” for warehouse/AGV fleets; sim-to-real adapters.
    • Dependencies: High-fidelity simulators; safety envelopes; real-time constraints; robust sandboxing.
  • Finance strategy and risk policy optimization (Finance)
    • What: Optimize rule-based strategies and risk controls with SI (PnL attribution, drawdowns, VaR breaches).
    • Potential products: “Strategy Tuner” for backtesting engines; policy evolution for credit risk models.
    • Dependencies: Strict compliance; out-of-sample validation; market regime shifts; data access and privacy.
  • Public policy and urban systems planning (Policy)
    • What: Optimize routing, zoning, or subsidy allocation policies with simulation feedback as SI (equity metrics, congestion, emissions).
    • Potential products: Civic “Policy Lab” integrating traffic/agent-based models; participatory what-if tooling.
    • Dependencies: Stakeholder governance; transparent objectives; robust, unbiased simulators; legal frameworks.
  • Education content and tutor pipeline optimization (Education)
    • What: Evolve curricula prompts/agent flows for tutoring systems to improve learning outcomes on held-out cohorts.
    • Potential products: “Tutor Architect” that tunes hinting strategies and assessment rubrics.
    • Dependencies: Ethical study designs; ground-truth outcomes; privacy-preserving data; pedagogy-informed SI.
  • Cross-organization multi-task pattern sharing (Platform/Enterprise)
    • What: Federated optimization across business units to share strategies via Pareto frontier without sharing raw data.
    • Potential products: Frontier registry with privacy constraints; pattern “pull requests.”
    • Dependencies: Federated evaluation; IP and privacy agreements; standard SI schemas.
  • Standards for SI and evaluator contracts (Ecosystem/Policy)
    • What: Define interoperable SI taxonomies for domains (compilers, schedulers, agents) to enable plug-and-play optimizers.
    • Potential products: Open specs, conformance suites, and reference SDKs.
    • Dependencies: Industry consortia; domain consensus; security/PII guidelines.
  • General-purpose “optimize-anything” marketplace (Platform)
    • What: Catalog of artifacts (prompts, policies, kernels, agents) with provenance and validation scores; one-click deployment.
    • Potential products: Artifact registry with automated retesting on updates; license-aware distribution.
    • Dependencies: IP/licensing frameworks; reproducibility pipelines; trust signals and audits.
  • Verified agent architectures for critical workflows (Software/GRC)
    • What: Auto-designed, formally checked agent pipelines with enforced fallbacks and limits.
    • Potential products: “Verified Agent Architect” integrating with policy-as-code (e.g., OPA) and formal methods.
    • Dependencies: Formal verification integration; runtime monitors; red-teaming; provable safety constraints.

Cross-cutting assumptions and dependencies (affecting feasibility across applications)

  • Evaluator quality is pivotal: must produce reliable scalar scores and actionable SI; poor SI reduces gains.
  • Task relatedness matters for multi-task benefits; unrelated tasks can degrade results (as observed in circle packing).
  • Access to capable LLMs/VLMs and compute budget; evaluation often dominates cost.
  • Secure sandboxing for executing generated code; strong test oracles for correctness.
  • Text-serialization of artifacts is required; non-text domains need faithful text proxies.
  • Governance: audit trails, explainability, bias checks, and change management for production integration.
  • Data privacy and compliance for domains with sensitive information (healthcare, finance, public sector).
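
Because evaluation often dominates cost, avoiding repeated evaluation of identical artifacts is a natural saving. The following is a minimal sketch of content-addressed caching, assuming artifacts are plain strings and the evaluator is deterministic; `CachedEvaluator` and its attributes are hypothetical names, not part of optimize_anything.

```python
import hashlib
from typing import Callable


class CachedEvaluator:
    """Wrap an evaluator so identical artifact text is scored only once."""

    def __init__(self, evaluate: Callable[[str], float]) -> None:
        self._evaluate = evaluate
        self._cache: dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(artifact: str) -> str:
        # The cache key is derived from the content itself, not from a
        # candidate ID, so two candidates with the same text share one entry.
        return hashlib.sha256(artifact.encode("utf-8")).hexdigest()

    def __call__(self, artifact: str) -> float:
        k = self.key(artifact)
        if k in self._cache:
            self.hits += 1
            return self._cache[k]
        self.misses += 1
        score = self._evaluate(artifact)
        self._cache[k] = score
        return score
```

This only pays off when the evaluator is deterministic; with noisy evaluators (for example, timing-based kernel benchmarks) a cached score may hide run-to-run variance.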

These applications translate the paper’s core finding, that many optimization problems can be reframed as text artifact search guided by rich diagnostics, into concrete tools and workflows that teams can adopt today, while outlining ambitious but attainable directions as evaluators, standards, and verification mature.

Glossary

  • ADRS benchmark: A benchmark suite for evaluating cloud infrastructure algorithms. "We optimize two cloud infrastructure algorithms from the ADRS benchmark [6]."
  • ADAS: Automated Design of Agentic Systems; a method for searching agent architectures. "ADAS [11] and AFlow [29] search over agent architectures."
  • AFlow: A framework for agent architecture search. "ADAS [11] and AFlow [29] search over agent architectures."
  • AIME: The American Invitational Mathematics Examination; used here as a benchmark for prompt optimization. "We optimize a system prompt for GPT-4.1-mini on AIME (American Invitational Mathematics Examination) competition problems."
  • AlphaEvolve: An LLM-evolution framework using MAP-Elites to discover algorithms. "AlphaEvolve [18] pioneered the LLM-evolution paradigm, using Gemini models with island-based MAP-Elites [17] to discover algorithms for Google's infrastructure."
  • ARC-AGI: A benchmark of abstraction and reasoning tasks assessing general intelligence-like capabilities. "The optimization objective is for the artifact to generalize to unseen ARC-AGI [7] puzzles."
  • Bayesian optimizer: A Bayesian optimization method for black-box functions. "For example, one cannot show a Bayesian optimizer a stack trace."
  • Bilevel optimizer: An optimization scheme with upper- and lower-level problems solved in tandem. "The optimized algorithm is a bilevel optimizer: an LP over radii with dual-variable gradients for L-BFGS-B center optimization, augmented by CMA-ES exploration and diverse seeding strategies."
  • build123d: A Python CAD modeling library used to generate parametric models. "We generate SVG code and CAD models (via build123d) for four image goals (Table 10 in Appendix H)."
  • Circle packing: The optimization problem of arranging circles within a region to maximize packed size without overlap. "Circle Packing (num circles = 26)"
  • CMA-ES: Covariance Matrix Adaptation Evolution Strategy, a derivative-free optimization algorithm. "augmented by CMA-ES exploration and diverse seeding strategies."
  • content-addressed evaluation caching: A caching technique that avoids re-evaluating identical artifacts by hashing content. "content-addressed evaluation caching to avoid redundant expensive rollouts;"
  • CUDA kernel: A GPU-executed function written for NVIDIA’s CUDA platform. "We generate CUDA kernels for 31 reference PyTorch operations from KernelBench [20]"
  • Dijkstra routing: Shortest-path routing based on Dijkstra’s algorithm. "CloudCast achieves 40.2% cost savings over Dijkstra routing (Figure 3a),"
  • EVOLVE-BLOCK markers: Special prompt markers used by prior frameworks to delimit evolvable code regions. "Specifically, optimize_anything doesn't require mutation prompts, task-specific templates, island configurations, or EVOLVE-BLOCK markers (all common in prior frameworks)."
  • float4 vectorization: Using 4-element floating-point vectors to increase memory and compute throughput. "The evolved kernels employ techniques such as float4 vectorization, two-pass algorithms (compute statistics, then normalize), warp shuffle reductions, and shared memory tiling."
  • GEPA: A reflective prompt evolution algorithm leveraging Pareto-based search. "GEPA [3] achieves state-of-the-art prompt optimization with generalization to unseen inputs, but is limited to prompts;"
  • GEPAAdapter: An adapter applying GEPA-style optimization to full agent programs. "building on an earlier proof-of-concept with GEPAAdapter [1]."
  • Generalization mode: An optimization setting where artifacts are tuned on training data to perform on unseen validation data. "Both use generalization mode with training/validation splits over infrastructure scenarios."
  • KernelBench: A benchmark suite of PyTorch operations for evaluating generated GPU kernels. "We generate CUDA kernels for 31 reference PyTorch operations from KernelBench [20]"
  • L-BFGS-B: Limited-memory BFGS algorithm with bound constraints. "an LP over radii with dual-variable gradients for L-BFGS-B center optimization"
  • Linear programming (LP): Optimization of a linear objective subject to linear constraints. "an LP over radii"
  • MAP-Elites: A quality-diversity evolutionary algorithm that explores diverse high-performing solutions. "island-based MAP-Elites [17]"
  • memory coalescing: Aligning GPU memory accesses across threads to maximize bandwidth utilization. "insights discovered for one problem (e.g., how to handle memory coalescing) transfer to others"
  • MIPROv2: A method targeting prompt and few-shot selection optimization. "MIPROv2 [19] similarly targets prompt and few-shot selection."
  • Multi-task search: Jointly optimizing across multiple related tasks to enable cross-task transfer. "No prior system supports multi-task search, where solving a batch of related problems together enables cross-transfer of discovered optimization patterns."
  • NVCC: NVIDIA CUDA Compiler for compiling CUDA kernels. "SI includes: (i) NVCC compiler errors with line numbers,"
  • ON_DEMAND instances: Reliable, non-preemptible cloud compute instances. "deciding when to use cheap preemptible SPOT instances versus reliable ON_DEMAND instances to meet deadlines."
  • OpenEvolve: An open-source, model-agnostic reimplementation of AlphaEvolve. "OpenEvolve [24] provides an open-source reimplementation with model-agnostic support."
  • Optuna: A hyperparameter optimization framework for black-box optimization. "matching and outperforming Optuna in numerical optimization,"
  • Pareto-based search: Selection strategy that maintains candidates excelling across different objectives without collapsing to an average. "We achieve these results by extending the Pareto-based search of Agrawal et al."
  • Pareto dominance: A multi-objective relation where one solution is at least as good in all objectives and better in at least one. "per-example or per-metric Pareto dominance rather than aggregate scores,"
  • Pareto frontier: The set of nondominated solutions in multi-objective optimization. "maintains a Pareto frontier: any candidate that is the best at something survives, even if its average is suboptimal."
  • Reflexion: A self-correction technique using verbal reinforcement for agents. "Reflexion [25] uses verbal reinforcement for agent self-correction."
  • Refiner step: A pre-evaluation pass that fixes common LLM generation errors to prevent failed executions. "a refiner step that catches common LLM generation artifacts (malformed code blocks, import errors, syntax issues) before evaluation"
  • Scalable Vector Graphics (SVGs): A text-based vector image format for 2D graphics. "Scalable Vector Graphics (SVGs), or a system prompt, the structure is the same:"
  • Seedless mode: Starting optimization without a seed artifact by bootstrapping from a natural-language objective. "Seedless mode makes the system accessible to users who can specify what they want but not implement it."
  • Self-Refine: An approach where models iteratively improve their outputs using self-generated feedback. "Self-Refine [15] applies iterative self-feedback."
  • Shared memory tiling: Organizing data into tiles in GPU shared memory to improve locality and throughput. "shared memory tiling."
  • Side Information (SI): Diagnostic feedback returned by evaluators to guide LLM-driven revisions. "The evaluator returns a score plus Side Information (SI) - domain context explaining why."
  • Slack ratio: A measure of scheduling slack relative to deadlines, used for policy decisions. "graduated decision thresholds based on slack ratio."
  • SPOT instances: Preemptible, lower-cost cloud instances with availability interruptions. "deciding when to use cheap preemptible SPOT instances versus reliable ON_DEMAND instances to meet deadlines."
  • Steiner tree: A graph structure that connects required nodes (terminals) possibly via extra nodes to minimize total cost. "provider-aware Steiner tree approach"
  • SLP: Sequential Linear Programming; an iterative linearization method for nonlinear optimization. "SLP on centers + dual-like constraint push"
  • TextGrad: A technique using LLM-generated “gradients” to guide text optimization. "TextGrad [28] uses LLM-generated "gradients" for text optimization."
  • two-pass algorithms: Algorithms that process data in two stages to improve correctness or performance. "two-pass algorithms (compute statistics, then normalize)"
  • Vision-capable LLMs (VLM): Multimodal LLMs that can process visual inputs. "images (via oa.Image) for Vision-capable LLMs (VLM)."
  • Warp shuffle reductions: GPU intra-warp operations that aggregate values efficiently via warp shuffle instructions. "warp shuffle reductions"
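
The Pareto dominance and Pareto frontier entries above can be illustrated with a short sketch over per-task scores. It uses the textbook nondominated-set definition with higher-is-better scores; the candidate names and numbers are invented for illustration, and the system's actual selection rule may differ in detail.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    ]


# Hypothetical per-task scores for four candidates on three tasks.
candidates = {
    "A": [0.9, 0.2, 0.1],   # specialist on task 0
    "B": [0.5, 0.5, 0.5],   # best average
    "C": [0.4, 0.4, 0.4],   # dominated by B on every task
    "D": [0.1, 0.1, 0.95],  # specialist on task 2
}

frontier = pareto_frontier(candidates)
print(frontier)  # ['A', 'B', 'D']
```

Selecting by average alone would keep only B; the frontier also retains the specialists A and D, whose per-task strengths would otherwise be lost, and discards only C.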
