MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models

Published 9 Apr 2026 in cs.SE and cs.AI | (2604.07752v1)

Abstract: Modern video games are complex, non-deterministic systems that are difficult to test automatically at scale. Although prior work shows that personality-driven LLM agents can improve behavioural diversity and test coverage, existing tools largely remain research prototypes and lack cross-game reusability. This tool paper presents MIMIC-Py, a Python-based automated game-testing tool that transforms personality-driven LLM agents into a reusable and extensible framework. MIMIC-Py exposes personality traits as configurable inputs and adopts a modular architecture that decouples planning, execution, and memory from game-specific logic. It supports multiple interaction mechanisms, enabling agents to interact with games via exposed APIs or synthesized code. We describe the design of MIMIC-Py and show how it enables deployment to new game environments with minimal engineering effort, bridging the gap between research prototypes and practical automated game testing. The source code and a demo video are available on our project webpage: https://mimic-persona.github.io/MIMIC-Py-Home-Page/.


Authors (3)

Summary

The paper introduces MIMIC-Py, a modular framework that employs personality-driven LLM agents to automate game testing.
It utilizes a hybrid planning strategy and the PathOS personality model to generate behaviorally diverse action plans across varied game environments.
Empirical results demonstrate improved branch and interaction-level coverage, underscoring the framework's potential for scalable quality assurance.

MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing Using LLM Agents

Introduction

Automated testing of video games presents unique challenges, mainly due to the non-deterministic, complex nature of modern game environments and the large, highly variable state and action spaces involved. Traditional machine learning approaches such as Reinforcement Learning (RL) and Imitation Learning (IL) require highly tuned reward functions or costly expert demonstrations and offer limited extensibility across diverse games. Recent successes with LLMs in game-playing agents have shown improved adaptability; however, existing solutions frequently lack mechanisms for cross-game deployment and do not adequately capture the behavioral diversity of human players.

MIMIC-Py addresses these limitations by systematizing personality-driven LLM agents within a reusable, extensible testing framework. By operationalizing configurable personality traits and decoupling planning, execution, and memory mechanisms, MIMIC-Py enables scalable deployment of behaviorally diverse agents across heterogeneous game environments with minimal adaptation overhead.

Figure 1: Overview of the MIMIC-Py framework, illustrating the modular pipeline comprising Planner, Action Executor, Action Summarizer, and Memory System for configurable, personality-guided game testing.

System Architecture

The MIMIC-Py framework is a modular pipeline with four core components: the Planner, Action Executor, Action Summarizer, and Memory System. The system runs iteratively: given a testing objective and a specified personality trait, the Planner generates an action plan, the Action Executor carries it out through the game's interface, and the Action Summarizer logs the outcomes and updates structured memory so that the next planning step is context-aware.
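
The iteration described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not MIMIC-Py's actual code: the names `Memory`, `run_session`, and the three toy callables are hypothetical, and the planner, executor, and summarizer are plain functions standing in for LLM calls and a game connection.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    """Hypothetical stand-in for the Memory System: a list of step summaries."""
    records: List[str] = field(default_factory=list)

    def recall(self, limit: int = 3) -> List[str]:
        # Return the most recent summaries as planning context.
        return self.records[-limit:]


def run_session(
    objective: str,
    personality: str,
    planner: Callable[[str, str, List[str]], str],
    executor: Callable[[str], str],
    summarizer: Callable[[str, str], str],
    memory: Memory,
    max_steps: int = 3,
) -> Memory:
    """Run the plan -> execute -> summarize -> remember loop."""
    for _ in range(max_steps):
        context = memory.recall()
        plan = planner(objective, personality, context)   # Planner
        outcome = executor(plan)                          # Action Executor
        memory.records.append(summarizer(plan, outcome))  # Action Summarizer
    return memory


# Toy components (assumptions) so the sketch runs without an LLM or a game.
def toy_planner(objective: str, personality: str, context: List[str]) -> str:
    return f"[{personality}] {objective} (given {len(context)} prior records)"


def toy_executor(plan: str) -> str:
    return "ok"


def toy_summarizer(plan: str, outcome: str) -> str:
    return f"{plan} -> {outcome}"


session = run_session(
    "open the chest", "Curiosity",
    toy_planner, toy_executor, toy_summarizer, Memory(),
)
for record in session.records:
    print(record)
```

Passing the components in as callables mirrors the decoupling the text describes: a game-specific executor can be swapped in without touching the loop.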

Planner: Hybrid, Personality-Driven Reasoning

The Planner uses an LLM conditioned on the PathOS personality trait model, a set of seven behavioral archetypes synthesized from empirical player modeling studies. Personality traits shape the action plan through configurable text injected into the planning prompt. Hybrid planning combines bottom-up (fine-grained, reactive) and top-down (task-structured, hierarchical) strategies, so the agent can handle both exploratory environments and tasks with long-horizon dependencies.

This hybrid approach is designed to overcome the limitations of the next-action-only policies typical of LLM-based agents, as discussed in prior work. The architecture also includes plan validation and revision mechanisms that keep plans feasible and consistent with the semantics of the game's action space.
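
As a rough illustration of personality prompt injection and plan validation, the sketch below builds a planning prompt from a trait description and flags plan steps that fall outside a game's action set. The trait wording, the prompt layout, and the function names `build_planner_prompt` and `validate_plan` are assumptions made for this example; the source does not give the real prompts or the validation logic.

```python
from typing import Dict, List, Set

# Hypothetical trait descriptions; the tool's real prompt wording is not given here.
TRAIT_PROMPTS: Dict[str, str] = {
    "Caution": "Avoid risky encounters and prefer safe routes.",
    "Curiosity": "Explore unvisited areas and interact with unfamiliar objects.",
}


def build_planner_prompt(objective: str, trait: str) -> str:
    """Inject the trait description into an otherwise game-agnostic prompt."""
    if trait not in TRAIT_PROMPTS:
        raise KeyError(f"unknown trait: {trait}")
    return (
        f"You are a game tester with the {trait} personality.\n"
        f"Personality guidance: {TRAIT_PROMPTS[trait]}\n"
        f"Objective: {objective}\n"
        "Return a numbered list of actions."
    )


def validate_plan(plan: List[str], allowed_actions: Set[str]) -> List[str]:
    """Return the steps whose leading verb is not in the game's action space."""
    invalid = []
    for step in plan:
        words = step.split()
        if not words or words[0] not in allowed_actions:
            invalid.append(step)
    return invalid


prompt = build_planner_prompt("reach the exit", "Caution")
rejected = validate_plan(["move north", "fly up", "attack rat"], {"move", "attack"})
print(rejected)  # ['fly up']
```

In a revision step, the rejected entries could be fed back to the LLM with a request to replace them; that feedback loop is omitted here.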

Personality Model and Abstractions

The PathOS model encapsulates personality via seven predefined traits: Achievement, Adrenaline, Aggression, Caution, Completion, Curiosity, and Efficiency. Personality-conditioning is abstracted through user-editable prompt configurations. This, combined with lightweight one-to-one mappings of PathOS entities to in-game concepts, permits low-friction adaptation to novel game environments. Such abstraction underpins the generalizability and reusability claims of MIMIC-Py.
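
The one-to-one mapping idea can be made concrete with a short sketch. The seven trait names come from the text; the entity names, the game concepts, and both helper functions are illustrative assumptions rather than the tool's actual configuration format.

```python
from typing import Dict

# The seven PathOS traits listed in the text.
PATHOS_TRAITS = (
    "Achievement", "Adrenaline", "Aggression", "Caution",
    "Completion", "Curiosity", "Efficiency",
)

# Illustrative mapping only: entity names and game concepts are assumptions.
ENTITY_MAP: Dict[str, str] = {
    "enemy": "mob",
    "collectible": "item drop",
    "goal": "exit staircase",
}


def check_one_to_one(mapping: Dict[str, str]) -> bool:
    """A mapping is one-to-one when no two entities share a game concept."""
    return len(set(mapping.values())) == len(mapping)


def describe_entities(mapping: Dict[str, str]) -> str:
    """Render the mapping as prompt text a planner could read."""
    return "\n".join(
        f"- {entity}: appears in this game as '{concept}'"
        for entity, concept in mapping.items()
    )


assert check_one_to_one(ENTITY_MAP)
print(describe_entities(ENTITY_MAP))
```

Keeping the mapping as plain data is what would let a new game be supported by editing a table rather than planner code.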

Action Executor

To address interface diversity across games, the Action Executor exposes two primary mechanisms:

Plan-to-Parameters Translator: For environments with structured, high-level APIs, action plans are directly mapped to API parameters for efficient execution.
Plan-to-Code Translator: For games such as Minecraft, which expose only low-level APIs, the Translator synthesizes executable code snippets (Skills) leveraging example code and natural language guidance. Iterative refinement is supported via Action Summarizer feedback, and new Skills are accumulated in a reusable library.

Custom translators are supported via configuration, allowing for middleware layers that bridge non-standard or proprietary interfaces with the MIMIC-Py core.
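
One way to picture configurable translators is a small registry keyed by a configuration string. This is a sketch under assumptions: the registry, the decorator `register_translator`, the dictionary shapes, and the mode names are hypothetical, and the code translator is a placeholder where a real one would call an LLM to synthesize a Skill.

```python
from typing import Callable, Dict

# A translator turns one plan step into something a game bridge can execute.
Translator = Callable[[str], dict]
_TRANSLATORS: Dict[str, Translator] = {}


def register_translator(name: str) -> Callable[[Translator], Translator]:
    """Decorator that registers a translator under a configuration key."""
    def wrap(fn: Translator) -> Translator:
        _TRANSLATORS[name] = fn
        return fn
    return wrap


@register_translator("plan_to_parameters")
def plan_to_parameters(step: str) -> dict:
    """Map 'verb arg1 arg2' to a structured API call description."""
    words = step.split()
    if not words:
        raise ValueError("empty plan step")
    return {"kind": "api_call", "action": words[0], "args": words[1:]}


@register_translator("plan_to_code")
def plan_to_code(step: str) -> dict:
    """Placeholder: a real translator would ask an LLM to write the Skill."""
    return {"kind": "code", "source": f"# Skill to be synthesized for: {step}"}


def translate(step: str, mode: str) -> dict:
    """Dispatch a plan step to the translator selected in configuration."""
    if mode not in _TRANSLATORS:
        raise ValueError(f"no translator registered for mode '{mode}'")
    return _TRANSLATORS[mode](step)


print(translate("move north 2", "plan_to_parameters"))
# {'kind': 'api_call', 'action': 'move', 'args': ['north', '2']}
```

A custom middleware translator for a proprietary interface would be one more decorated function, with no change to `translate`.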

Memory System: Retrieval-Augmented Experience

The Memory System archives all past interaction traces, contextual information about the environment, and reusable Skill code (for code-centric games). Planning and Skill retrieval use Retrieval-Augmented Generation (RAG) on top of similarity search with a ChromaDB backend, which keeps prompts within a manageable size while still making use of prior data. Memory records are annotated with personality preferences and are selectively retrieved according to personality, situational, and skill relevance, using vector similarity and textual descriptions.
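
The retrieval step can be approximated without any external dependency. The sketch below deliberately swaps ChromaDB and learned embeddings for a bag-of-words cosine similarity, and it treats personality as a hard filter; both choices are simplifying assumptions, as are the names `MemoryRecord` and `retrieve`.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryRecord:
    text: str         # summary of a past interaction
    personality: str  # personality annotation attached to the record


def _vector(text: str) -> Counter:
    # Bag-of-words term counts: a crude stand-in for an embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, records: List[MemoryRecord], personality: str, k: int = 2) -> List[MemoryRecord]:
    """Return the k records most similar to the query for one personality."""
    q = _vector(query)
    candidates = [r for r in records if r.personality == personality]
    ranked = sorted(candidates, key=lambda r: cosine(q, _vector(r.text)), reverse=True)
    return ranked[:k]


records = [
    MemoryRecord("opened chest found key", "Curiosity"),
    MemoryRecord("fought rat lost health", "Aggression"),
    MemoryRecord("walked corridor nothing found", "Curiosity"),
]
top = retrieve("find key in chest", records, "Curiosity", k=1)
print(top[0].text)  # opened chest found key
```

Capping the result at `k` records is what keeps the planning prompt bounded regardless of how long the session has run.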

Extensibility and Practical Deployment

One of MIMIC-Py's primary contributions is its operational separation of game-specific engineering from core agentic logic. Extensibility is realized through three principal adaptation loci:

Personality Profiles: Personality-induced behavior is configured through editable prompts, facilitating both expansion and n-way behavioral ablation studies with no code changes.
Plan-to-Action Translators: API-driven or code-centric translation logic is confined to environment- and interaction-specific modules. Boilerplate for the socket protocol and feedback exchange is provided, which reduces integration effort.
Prompt and State Representation Templating: Game-specific prompt templates and abstracted state representations isolate all remaining adaptation work, allowing users to register a new environment by updating fewer than 130 lines of code on average (demonstrated empirically on three different games).

The workflow is highly automatable, and walkthroughs, templates, and reference implementations are provided.

Figure 2: Confirmation message during LAN server initialization, critical for connecting the Action Executor with Minecraft via Plan-to-Code interface.

Figure 3: Successful in-game connection to MIMIC-Py in the Minecraft environment, demonstrating robust cross-process interaction.

Agent Deployment Examples

MIMIC-Py’s flexibility was validated in deployments across Dungeon Adventures, Shattered Pixel Dungeon, and Minecraft, with action execution pipelines ranging from direct API invocation to dynamic script generation. The framework consistently required only minor game-specific changes centered on configuration and the execution bridge, with core planning, memory, and deliberation components untouched.

Empirical Strengths and Claims

Although comprehensive experimental results are reported elsewhere, MIMIC (the precursor framework) has produced up to a 1.3× improvement in branch coverage and a 14.46× improvement in interaction-level coverage over random agents in large-scale settings. Notably, when compared on Minecraft against state-of-the-art agents (e.g., ODYSSEY), the system completed more complex multi-step objectives and showed greater behavioral diversity, which supports the advantage of personality-driven planning.

Limitations and Future Directions

Despite its robust architecture, MIMIC-Py exhibits significant action latency (12.4 seconds/action on average) and incurs non-negligible model invocation costs ($0.06 per action with code generation). This precludes use in time-constrained genres (e.g., FPS games) and limits large-batch viability without further optimization. The authors propose future work around locally fine-tuned models for improved efficiency and scaling, as well as extending the modular system for UI and HCI testing domains where user behavioral diversity is similarly critical.
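
To make these figures tangible, the following back-of-the-envelope sketch scales the reported averages to a whole session. It assumes actions run strictly one after another and that the $0.06 figure applies to every action, which overstates cost for actions that need no code generation; `session_budget` is a hypothetical helper.

```python
SECONDS_PER_ACTION = 12.4  # average latency reported in the text
DOLLARS_PER_ACTION = 0.06  # reported cost per action with code generation


def session_budget(actions: int) -> tuple:
    """Return (hours, dollars) for a sequential session of `actions` steps."""
    hours = actions * SECONDS_PER_ACTION / 3600
    dollars = actions * DOLLARS_PER_ACTION
    return hours, dollars


hours, dollars = session_budget(1000)
print(f"{hours:.2f} h, ${dollars:.2f}")  # 3.44 h, $60.00
```

Under these assumptions a single 1,000-action session takes roughly three and a half hours and costs about $60, which illustrates why large batches are hard to run without further optimization.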

Conclusion

MIMIC-Py offers a practical, extensible framework for deploying personality-driven, LLM-based agents for automated game testing. By decoupling planning, execution, and memory systems and minimizing environment-specific requirements through prompt engineering and lightweight interaction bridges, MIMIC-Py transforms previous research prototypes into a genuinely reusable tool suitable for a wide range of game environments. The underlying architectural principles are readily transferable to broader classes of interactive systems beyond digital games, positioning MIMIC-Py as a foundation for large-scale, behaviorally rich, automated quality assurance.

Reference: "MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with LLMs" (2604.07752)
