EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs

Published 17 Apr 2026 in cs.LG and cs.AI | (2604.15787v1)

Abstract: We introduce EVIL (EVolving Interpretable algorithms with LLMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitude faster and remaining fully interpretable.

Authors (1)

David Berghaus

Summary

The paper introduces EVIL, an LLM-driven evolutionary pipeline that evolves interpretable, zero-shot Python/NumPy heuristics for dynamical systems inference.
It reports accuracy that is competitive with, and sometimes better than, neural models on next-event prediction, rate matrix estimation, and time series imputation, at a much lower inference cost.
Evolved algorithms are transparent and data-efficient, enabling rapid deployment and scientific auditability without dataset-specific retraining.

LLM-Guided Evolution of Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series

Introduction

The paper "EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs" (2604.15787) proposes EVIL, an LLM-guided evolutionary pipeline for discovering interpretable, zero-shot inference algorithms for dynamical systems. EVIL uses LLMs to guide evolutionary search over programs, yielding one pure Python/NumPy heuristic per task. Unlike neural network-based approaches that require dataset-specific training, the evolved programs are designed to operate without retraining, generalizing across datasets and output dimensions within a task family.

This approach is demonstrated on three core inference tasks: prediction in marked temporal point processes (MTPPs), rate matrix estimation for Markov jump processes (MJPs), and time series imputation. The resulting algorithms are not only competitive with, and sometimes outperform, state-of-the-art neural baselines, but also provide orders-of-magnitude efficiency gains and full interpretability.

Figure 1: Overview of the EVIL approach. The same EVIL evolutionary procedure is applied separately to temporal point processes, Markov jump processes, and time series imputation, yielding one interpretable Python/NumPy inference function per task that generalizes across datasets in a zero-shot, in-context manner.

Methodology

EVIL employs LLM-driven program evolution in the style of OpenEvolve/AlphaEvolve (2604.15787) to discover compact, interpretable Python/NumPy routines for zero-shot amortized inference. Solutions are evolved to optimize a task-specific fitness function (e.g., prediction accuracy, log-likelihood, MAE), computed on small evaluation splits subsampled from benchmark datasets or on data drawn from fully synthetic priors.

Key methodological features:

Evolutionary Search: An ensemble of LLMs proposes code changes; candidate programs are evaluated, and the search is organized via MAP-Elites with island-based populations for diversity.
Strict Simplicity/Interpretability Constraints: Only Python/NumPy is allowed; no deep learning libraries, and the returned function must be short and fully readable.
Zero-Shot Setting: There is no per-dataset finetuning; a single algorithm is discovered per task, meant to work out-of-the-box on any dataset or mark dimension within the problem class.
Evaluation Protocol: For real datasets, only partial training splits are used for evolution, while the test set and full training data remain unseen. For the "synthetic prior" variants, evolution uses synthetic data only, and evaluation measures transfer to real-world datasets.
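
The search loop described above can be illustrated with a toy MAP-Elites archive split across islands. This is a minimal sketch under stated assumptions, not the OpenEvolve implementation: the names `evolve`, `fitness`, `descriptor`, and `propose` are hypothetical, the "programs" are plain coefficient vectors rather than Python source, and `propose` is a random perturbation standing in for the LLM edit step.

```python
import numpy as np

def evolve(fitness, descriptor, propose, seed_program,
           n_islands=3, n_iters=200, migrate_every=50, rng=None):
    """Toy MAP-Elites loop with island populations (illustrative only).

    fitness(p)      -> float, higher is better
    descriptor(p)   -> hashable cell key (here: a complexity measure)
    propose(p, rng) -> new candidate; stands in for an LLM-proposed edit
    """
    rng = np.random.default_rng(rng)
    seed_entry = (fitness(seed_program), seed_program)
    # One archive per island, mapping cell key -> (fitness, program)
    islands = [{descriptor(seed_program): seed_entry} for _ in range(n_islands)]
    for it in range(1, n_iters + 1):
        archive = islands[it % n_islands]
        # Pick a parent uniformly among this island's current elites
        elites = list(archive.values())
        _, parent = elites[rng.integers(len(elites))]
        child = propose(parent, rng)
        f, key = fitness(child), descriptor(child)
        # Keep the child only if its cell is empty or it beats the elite there
        if key not in archive or f > archive[key][0]:
            archive[key] = (f, child)
        # Periodically copy each island's best elite to the next island
        if it % migrate_every == 0:
            bests = [max(a.values(), key=lambda e: e[0]) for a in islands]
            for i, (bf, bp) in enumerate(bests):
                target = islands[(i + 1) % n_islands]
                k = descriptor(bp)
                if k not in target or bf > target[k][0]:
                    target[k] = (bf, bp)
    # Return the best (fitness, program) pair found on any island
    return max((e for a in islands for e in a.values()), key=lambda e: e[0])

# Toy task: fit y = 2x + 0.5 with a degree-2 polynomial
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 0.5

def fitness(p):
    return -float(np.mean((np.polyval(p, x) - y) ** 2))

def descriptor(p):
    return int(np.count_nonzero(p))  # "complexity" = number of active terms

def propose(p, rng):
    q = np.array(p, dtype=float)
    i = rng.integers(len(q))
    q[i] = 0.0 if rng.random() < 0.2 else q[i] + rng.normal(scale=0.3)
    return q

best_fitness, best_program = evolve(fitness, descriptor, propose, np.zeros(3), rng=0)
```

Binning elites by a complexity descriptor is one simple way to keep short programs alive next to longer ones, which matches the simplicity constraint listed above; the descriptors actually used in the paper are not stated here.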

Marked Temporal Point Processes

For MTPP next-event and long-horizon prediction, evolutionary search discovers algorithms that utilize context mark transition statistics and robust interval estimates. The heuristics combine recency-weighted inter-event gaps, mark-conditioned averages, and context-derived transition matrices. Notably, the discovered programs:

Match neural baselines in next-mark prediction accuracy (often 91%) and inter-event time error (sMAPE, RMSE), and sometimes surpass them (e.g., on Taxi, EVIL achieves a lower sMAPE than all neural baselines).
Provide nontrivial generalization to datasets with far more marks than seen during evolution (e.g., MIMIC-II with 75 mark types), where neural TPP models typically require retraining for new output dimensions.
Achieve evaluation speeds approximately 200x faster than transformer-based foundation models, running in seconds on CPU versus tens of minutes on GPU.

Figure 2: Illustration of mark alternation in the Taxi dataset, a structure directly exploited by evolved EVIL heuristics.

In contrast to deep neural models, the logic of the evolved solution can be directly inspected, typically blending local sequence statistics with context-level regularities via recency-weighting, smoothing, and explicit fallbacks for underrepresented transitions.
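
To make the ingredients named in this section concrete, the following is a minimal sketch of a next-event predictor that combines a smoothed, context-derived mark transition matrix with a recency-weighted mean of inter-event gaps. It is not the evolved EVIL program: the function name `predict_next_event` and the parameters `decay` and `alpha` are assumptions chosen for illustration.

```python
import numpy as np

def predict_next_event(times, marks, n_marks, decay=0.8, alpha=1.0):
    """Predict (next_time, next_mark) from a single context sequence.

    Hypothetical sketch: `decay` controls recency weighting of gaps and
    `alpha` is an additive smoothing count for unseen mark transitions.
    """
    times = np.asarray(times, dtype=float)
    marks = np.asarray(marks, dtype=int)

    # Mark transition counts observed in the context, with additive smoothing
    counts = np.full((n_marks, n_marks), alpha, dtype=float)
    np.add.at(counts, (marks[:-1], marks[1:]), 1.0)
    probs = counts / counts.sum(axis=1, keepdims=True)
    next_mark = int(np.argmax(probs[marks[-1]]))

    # Recency-weighted mean of inter-event gaps (most recent gap has weight 1)
    gaps = np.diff(times)
    if gaps.size == 0:
        # Fallback: with a single event there is no gap information
        return float(times[-1]), next_mark
    weights = decay ** np.arange(gaps.size - 1, -1, -1)
    next_gap = float(np.sum(weights * gaps) / np.sum(weights))
    return float(times[-1] + next_gap), next_mark

# Alternating marks, similar in spirit to the Taxi structure in Figure 2
t_next, m_next = predict_next_event([0, 1, 2, 3, 4], [0, 1, 0, 1, 0], n_marks=2)
```

Because the transition matrix is rebuilt from each context and sized by `n_marks`, a function of this shape applies unchanged to datasets with a different number of mark types, which is the property the section highlights for MIMIC-II.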

Markov Jump Processes

For continuous-time MJPs, EVIL evolves global estimators for the initial-state distribution and rate matrix from discretized, noisy observations. The inferred rate matrices are constructed through robust exposure and transition counting, with adaptive smoothing and exposure clipping for noise suppression.
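
As a point of reference for exposure and transition counting, the sketch below implements the classical count-over-exposure rate estimate on grid-sampled paths, with a simple prior acting as smoothing. It is an assumption-laden illustration rather than the evolved estimator: `estimate_mjp`, `prior_count`, and `prior_time` are hypothetical names, dwell time is attributed to the state at the left end of each interval, and jumps that occur between grid points are missed.

```python
import numpy as np

def estimate_mjp(paths, n_states, prior_count=0.0, prior_time=1e-12):
    """Estimate (initial distribution, rate matrix) from sampled paths.

    paths: iterable of (times, states) pairs observed on a time grid.
    prior_count / prior_time: hypothetical smoothing terms added to the
    transition counts and exposures (prior_time also avoids division by zero).
    """
    exposure = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    init = np.zeros(n_states)
    for times, states in paths:
        times = np.asarray(times, dtype=float)
        states = np.asarray(states, dtype=int)
        init[states[0]] += 1.0
        src, dst = states[:-1], states[1:]
        # Exposure: time spent in each state, using the left-endpoint rule
        exposure += np.bincount(src, weights=np.diff(times), minlength=n_states)
        # Count observed transitions between distinct states
        jump = src != dst
        np.add.at(counts, (src[jump], dst[jump]), 1.0)
    # Off-diagonal rates: smoothed counts divided by smoothed exposure
    Q = (counts + prior_count) / (exposure + prior_time)[:, None]
    np.fill_diagonal(Q, 0.0)
    # Diagonal entries make every row sum to zero
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return init / init.sum(), Q

p0, Q = estimate_mjp([([0, 1, 2, 3, 4], [0, 0, 1, 1, 0])], n_states=2)
```

In this example each state accumulates two time units of exposure and one outgoing jump, so both off-diagonal rates come out at about 0.5. Handling observation noise, which the section attributes to adaptive smoothing and exposure clipping, would need additional logic that is not shown.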

Major observations:

On the discrete flashing ratchet (DFR) synthetic benchmark, EVIL (synthetic prior) tracks ground-truth dynamic observables such as entropy production and latent rates closely, often outperforming neural and foundation model baselines on key physical parameters (e.g., better recovery of the ratchet asymmetry parameter $V$).
In downstream rollout quality, measured via time-averaged Hellinger distance between simulated and true path distributions, the evolved algorithms consistently achieve lower error on challenging real-world datasets (e.g., ion channel switching, molecular kinetics) than neural and FIM-MJP baselines.

Figure 3: Entropy production tracking on DFR; the EVIL heuristic, evolved on synthetic data only, matches or exceeds FIM-MJP in ground truth adherence as rates become disparate.

Importantly, all rate estimation is transparent: the evolved logic for noise filtering, exposure reweighting, and rate assignment is human-auditable, facilitating scientific trust in applications.
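
The rollout metric mentioned above, a time-averaged Hellinger distance, can be sketched as follows. The exact definition used in the paper is not given in this summary, so this version rests on assumptions: both path sets are sampled on the same time grid, the state distribution at each time is the empirical histogram over paths, and the distance uses the standard form $H(p, q) = \sqrt{\tfrac{1}{2} \sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. The name `time_averaged_hellinger` is hypothetical.

```python
import numpy as np

def time_averaged_hellinger(paths_a, paths_b, n_states):
    """Mean over time of the Hellinger distance between state histograms.

    paths_a, paths_b: integer arrays of shape (n_paths, n_times), both
    sampled on the same time grid (an assumption of this sketch).
    """
    def state_hist(paths):
        paths = np.asarray(paths, dtype=int)
        # One-hot encode states, then average over paths -> (n_times, n_states)
        return np.eye(n_states)[paths].mean(axis=0)

    p, q = state_hist(paths_a), state_hist(paths_b)
    # Hellinger distance at each time point, in [0, 1]
    h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=1))
    return float(h.mean())

simulated = [[0, 0, 1], [0, 1, 1]]
reference = [[0, 1, 1], [0, 0, 1]]
distance = time_averaged_hellinger(simulated, reference, n_states=2)
```

A value of 0 means the two path ensembles have identical state occupancies at every grid time, and 1 means they never share a state.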

Time Series Imputation

For multivariate time-series imputation, the evolved heuristics are hybrid, employing motif retrieval for large missing windows and linear interpolation for short, local gaps. The motif retrieval mechanism searches the series for recurring patterns with matching pre-gap context and pastes the continuation, applying a level shift to preserve continuity.
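
The hybrid strategy described above can be sketched for a single channel as follows. This is an illustration under assumptions, not the evolved program: `impute`, `max_interp`, and `context` are hypothetical names and thresholds, matching uses squared error on mean-centered context windows, and a multivariate series would be handled by applying the function per channel.

```python
import numpy as np

def impute(x, max_interp=3, context=8):
    """Fill NaNs in a 1-D series: linear interpolation for short gaps,
    motif retrieval with a level shift for long gaps (illustrative sketch).
    Assumes the series contains at least one observed value."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    obs = ~np.isnan(x)
    idx = np.arange(x.size)
    # Baseline fill: linear interpolation through the observed points
    linear = np.interp(idx, idx[obs], x[obs])
    # Locate runs of missing values as [start, stop) index pairs
    edges = np.diff(np.concatenate(([0], (~obs).astype(int), [0])))
    starts, stops = np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)
    for a, b in zip(starts, stops):
        gap = b - a
        out[a:b] = linear[a:b]
        # Short gaps, or gaps without a fully observed pre-gap context,
        # keep the linear fill
        if gap <= max_interp or a < context or not obs[a - context:a].all():
            continue
        query = x[a - context:a]
        best, best_err = None, np.inf
        for s in range(x.size - context - gap + 1):
            seg = x[s:s + context + gap]
            if np.isnan(seg).any():
                continue  # candidate must be fully observed
            cand = seg[:context]
            # Compare window shapes after removing each window's mean level
            err = np.sum(((cand - cand.mean()) - (query - query.mean())) ** 2)
            if err < best_err:
                best, best_err = s, err
        if best is not None:
            # Level shift so the pasted continuation joins the last observed value
            shift = x[a - 1] - x[best + context - 1]
            out[a:b] = x[best + context:best + context + gap] + shift
    return out

t = np.arange(100)
series = np.sin(2 * np.pi * t / 20)
corrupted = series.copy()
corrupted[40:50] = np.nan   # long window gap -> motif retrieval
corrupted[5] = np.nan       # isolated point -> linear interpolation
filled = impute(corrupted)
```

On a periodic signal like this one, the retrieved motif is an earlier or later cycle with the same phase, so the pasted continuation reproduces the missing window almost exactly, whereas linear interpolation would flatten it.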

Performance characteristics:

On 50% missing point-wise imputation settings, EVIL matches or outperforms learned imputation models and foundation models on both training and held-out datasets, exhibiting robust zero-shot transfer.
On challenging window-missing tasks (e.g., motion capture with long gaps), EVIL is competitive, but foundation models pretrained on vast ODE solutions (e.g., FIM-$\ell$) retain a margin.

Figure 4: MAE performance on motion capture data with 20% windowed missingness across various models, including EVIL.

Motif-based strategies have emerged in concurrent work, such as context parroting [zhang2026context], supporting the hypothesis that motif retrieval provides a strong inductive bias for imputation in dynamical systems.

Search Dynamics and Program Discovery

Extended evolutionary runs indicate that most of the validation performance is obtained within the first 100 search iterations. Continued evolution does find new, sometimes more complex programs, but gains plateau quickly and the simplicity bias persists: most high-performing solutions remain short and parsimonious, with little evidence that much larger or more intricate heuristics are necessary for benchmark-level performance.

Figure 5: Discovery trajectory of new best validation-set programs over an 800-iteration run, highlighting continuous algorithmic innovation.

Discussion and Implications

The core empirical finding is that LLM-guided program evolution can reliably discover compact, interpretable algorithms that rival or exceed the predictive performance of state-of-the-art neural models in zero-shot, in-context settings across multiple dynamical systems domains. The evolved solutions are generically applicable, highly data-efficient (often requiring only small synthetic or subsample-based evolution datasets), rapid to evaluate, and fully transparent.

The demonstration that such amortized inference procedures exist, and are discoverable by LLM-driven search, has theoretical implications for algorithmic induction and interpretable machine learning. It reinforces the Occam bias induced by program induction, and it suggests that the prior encoded in pretrained LLMs is strong enough to yield effective heuristics for rich, history-dependent processes in a form that can be edited and audited directly.

Practically, EVIL can serve as a lightweight, trustworthy baseline for scientific, industrial, and medical domains where interpretability, speed, and cross-dataset transferability are paramount. For rapid prototyping, exploratory data analysis, or deployment on resource-constrained hardware, EVIL-discovered programs provide an alternative to compute-heavy, opaque neural architectures. For iterative scientific discovery, the transparency of the program space facilitates direct integration of domain knowledge and rapid modification.

Conclusion

This work validates that LLM-guided evolutionary search is a practical method for automated scientific heuristic discovery. By focusing the search on the space of transparent, parameter-free algorithms, EVIL achieves strong (and at times superior) zero-shot inference across diverse dynamical systems tasks with minimal data, compute, and engineering overhead. Future avenues include extending the framework to tasks that require explicit modeling of epistemic uncertainty, augmenting the search space with probabilistic constructs, and integrating richer forms of domain-specific prior knowledge through prompt engineering. The results prompt a reevaluation of when deep neural architectures are necessary and underscore the continuing value of interpretable, computable baselines for the modern AI pipeline.