- The paper presents a novel offline reinforcement learning approach that enables a visual perception-based non-embedded agent to play card-based RTS games.
- It introduces a generative dataset for training YOLOv8 detectors and employs delayed continuous action prediction to tackle sparse in-game actions.
- Experimental results reveal robust detection performance (68.8 mAP) and real-time decision-making efficacy on an RTX 4060 Laptop GPU.
This paper presents a method for training a non-embedded AI agent to play the card-based Real-Time Strategy (RTS) game Clash Royale using offline reinforcement learning (RL) based on visual inputs. Unlike embedded agents that access the game state directly, this agent perceives the game through screen captures, mimicking human play; because such screen-based interaction with the environment is slow, online RL training is impractical, which motivates the offline approach.
Problem:
Developing AI for complex card-based RTS games like Clash Royale is challenging due to vast state spaces, sparse rewards, incomplete information, and the need for real-time decision-making. Existing high-performing game AIs are often "embedded," directly accessing internal game data, which differs significantly from human visual perception. Creating "non-embedded" agents that rely on visual input is difficult, especially concerning the slow interaction speed that hinders online RL training and the lack of suitable datasets for visual perception models.
Proposed Approach:
The authors propose a comprehensive framework for a non-embedded agent:
- Visual Perception: Capturing game screens from a mobile device.
- Feature Extraction: Using object detection (YOLOv8) and Optical Character Recognition (OCR) to identify units (troops, towers, buildings), their positions, factions, health, hand cards, elixir count, and game time.
- Offline RL Decision Making: Training a decision-making model using an offline RL algorithm on pre-collected expert gameplay data.
- Control: Executing the agent's decisions (card selection and placement) on the mobile device.
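A rough sketch of how this perception, feature extraction, decision, and control loop could be wired together; every function and method name below (capture, detect, policy, controller.tap_card, ...) is a hypothetical placeholder, not the authors' actual interface:

```python
import time

def agent_loop(capture, detect, read_ocr, fuse, policy, controller, period_s=0.4):
    """One perception -> feature fusion -> decision -> control cycle, repeated.

    Every callable passed in (capture, detect, read_ocr, fuse, policy, controller)
    is a hypothetical placeholder for the corresponding component.
    """
    while True:
        t0 = time.time()
        frame = capture()                       # screenshot from the mobile device
        units = detect(frame)                   # YOLOv8: troops, towers, buildings
        cards, elixir, clock = read_ocr(frame)  # hand cards, elixir count, game time
        state = fuse(units, cards, elixir, clock)
        action = policy(state)                  # offline-RL model: card + position
        if action is not None:
            controller.tap_card(action["card_index"])
            controller.tap_position(*action["position"])
        # keep a roughly fixed decision cadence
        time.sleep(max(0.0, period_s - (time.time() - t0)))
```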
Key Implementation Details:
- Generative Object Detection Dataset: Due to the lack of a public dataset for Clash Royale object detection, the authors created a novel generative dataset approach.
- They collected image slices of different game units (troops, towers, etc.).
- They developed an algorithm (Algo. 1) to procedurally generate training images by placing these slices onto background arena images with data augmentation (random positions, overlaps, background elements). This involves layering units correctly (e.g., ground units below air units) and filtering based on overlap thresholds.
- Segment Anything Model (SAM) (Kirillov et al., 2023) was used initially with manual filtering to extract unit slices from gameplay videos.
- This generative dataset was used to train YOLOv8 models.
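A simplified sketch of this slice-compositing idea follows; it is not the paper's Algorithm 1, and the overlap threshold, slice count, and label format are illustrative assumptions:

```python
import random
from PIL import Image

MAX_OVERLAP = 0.4  # assumed overlap threshold between pasted slices (not the paper's value)

def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area; boxes are (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (ix * iy) / smaller if smaller else 0.0

def compose_image(background, slices, n_units=10):
    """Paste unit slices onto an arena background and emit YOLO-format labels.

    `slices` is a list of (RGBA PIL image, class_id, is_air) tuples, e.g. cut out
    with SAM. Ground units are pasted first so air units end up layered on top.
    """
    canvas = background.copy()  # e.g. background = Image.open("arena.png").convert("RGBA")
    placed, labels = [], []
    for img, cls, is_air in sorted(random.sample(slices, n_units), key=lambda s: s[2]):
        w, h = img.size
        x = random.randint(0, canvas.width - w)
        y = random.randint(0, canvas.height - h)
        box = (x, y, x + w, y + h)
        if any(overlap_ratio(box, other) > MAX_OVERLAP for other in placed):
            continue  # reject placements that would hide an already-placed unit
        canvas.paste(img, (x, y), img)  # alpha channel acts as the paste mask
        placed.append(box)
        # YOLO label: class, x_center, y_center, width, height (all normalised)
        labels.append((cls, (x + w / 2) / canvas.width, (y + h / 2) / canvas.height,
                       w / canvas.width, h / canvas.height))
    return canvas, labels
```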
- Object Detection Model:
- They experimented with YOLOv5 and YOLOv8 architectures.
- To handle the large number of classes (150) and varying object sizes, they proposed splitting the detection task across multiple YOLOv8-l models (YOLOv8-l x2, YOLOv8-l x3), where each model specializes in detecting objects within a specific size range (determined by average slice area, see Fig. 3).
- The YOLOv8-l x3 model achieved the best mAP (68.8), especially for small objects (mAP(S) 48.3), validating the generative dataset and multi-detector approach.
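A hedged sketch of how such a size-specialised ensemble could be run with the ultralytics API; the weight file names and class grouping are assumptions, not the paper's configuration:

```python
from ultralytics import YOLO

# Several YOLOv8-l models, each trained on a subset of classes grouped by
# average slice area, with their detections merged at inference time.
DETECTOR_WEIGHTS = ["yolov8l_small.pt", "yolov8l_medium.pt", "yolov8l_large.pt"]  # placeholders
detectors = [YOLO(w) for w in DETECTOR_WEIGHTS]

def detect_all(frame):
    """Run every size-specialised detector on the frame and concatenate results."""
    merged = []
    for model in detectors:
        result = model(frame, verbose=False)[0]
        for box in result.boxes:
            merged.append({
                "cls": int(box.cls),           # index in that detector's own label space
                "conf": float(box.conf),
                "xyxy": box.xyxy[0].tolist(),  # pixel coordinates (x0, y0, x1, y1)
            })
    # in practice, per-detector class indices would be mapped back to the
    # global 150-class label set before downstream use
    return merged
```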
- State and Action Representation:
- State (S): Divided into image features (Simg) and card/elixir features (scard). Simg is an 18×32×15 grid, encoding unit category, faction, health, etc., at each location. scard includes hand card indices and total elixir.
- Action (A): Composed of deployment position (apos) and selected hand card index (aselect).
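A rough sketch of how the state could be assembled from the fused perception outputs; the per-channel layout is an assumption, since only the 18×32×15 shape and the kinds of encoded attributes are specified:

```python
import numpy as np

GRID_H, GRID_W, N_CH = 18, 32, 15  # grid shape stated in the paper

def build_state(detections, hand_cards, elixir, frame_h, frame_w):
    """Rasterise detections into the image-feature grid and pack card features.

    The channel layout below (category / faction / health in channels 0-2) is
    an illustrative assumption.
    """
    s_img = np.zeros((GRID_H, GRID_W, N_CH), dtype=np.float32)
    for det in detections:
        x0, y0, x1, y1 = det["xyxy"]
        gy = min(GRID_H - 1, int((y0 + y1) / 2 / frame_h * GRID_H))
        gx = min(GRID_W - 1, int((x0 + x1) / 2 / frame_w * GRID_W))
        s_img[gy, gx, 0] = det["cls"] / 150.0              # normalised unit category
        s_img[gy, gx, 1] = 1.0 if det["ally"] else -1.0    # faction
        s_img[gy, gx, 2] = det.get("health", 1.0)          # health-bar reading, if any
        # ...remaining channels would carry further per-unit attributes
    s_card = np.asarray(list(hand_cards) + [elixir], dtype=np.float32)  # hand indices + elixir
    return s_img, s_card
```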
- Reward Function (r): Designed to guide the agent towards winning objectives:
- rtower: Change in tower health (positive for damaging enemy, negative for taking damage).
- rdestroy: Large reward/penalty for destroying/losing towers.
- ractivate: Small penalty for activating the enemy king tower prematurely.
- relixir: Penalty for letting elixir overflow.
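Putting these terms together, the shaped reward could be computed roughly as follows; every coefficient here is a placeholder rather than the paper's value:

```python
def compute_reward(prev, curr):
    """Shaped reward between consecutive fused states.

    `prev`/`curr` are dicts of tower health, towers alive, king-tower activation
    and elixir. All weights (10.0, 0.5, 0.1) are placeholders, not the paper's.
    """
    # r_tower: change in tower health (reward damage dealt, penalise damage taken)
    r_tower = (prev["enemy_tower_hp"] - curr["enemy_tower_hp"]) \
            - (prev["own_tower_hp"] - curr["own_tower_hp"])
    # r_destroy: large bonus for destroying an enemy tower, large penalty for losing one
    r_destroy = 10.0 * ((prev["enemy_towers"] - curr["enemy_towers"])
                        - (prev["own_towers"] - curr["own_towers"]))
    # r_activate: small penalty for waking the enemy king tower prematurely
    r_activate = -0.5 if curr["enemy_king_active"] and not prev["enemy_king_active"] else 0.0
    # r_elixir: penalty whenever elixir sits at the cap and overflows
    r_elixir = -0.1 if curr["elixir"] >= 10 else 0.0
    return r_tower + r_destroy + r_activate + r_elixir
```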
- Decision Model:
- They adapt the Decision Transformer (DT) (Chen et al., 2021) concept, specifically using the StARformer (Wang et al., 2022) architecture, which processes sequences of states, actions, and returns-to-go ($R_t = \sum_{t'=t}^{T} r_{t'+1}$).
- The StARformer uses spatial cross-attention (within a timestep) and temporal causal attention (across timesteps) (Fig. 5).
- They compared StARformer-2L, StARformer-3L (different input structures), and a standard DT-4L. StARformer-3L performed best.
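The returns-to-go that condition the DT-style model are precomputed over each recorded game; a minimal sketch:

```python
import numpy as np

def returns_to_go(rewards):
    """R_t = sum_{t'=t}^{T} r_{t'+1}: reverse cumulative sum of the per-step rewards.

    `rewards[t]` is taken to be the reward received after step t, following the
    standard Decision Transformer convention.
    """
    rtg = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg
```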
- Handling Sparse Actions: Actions occur infrequently (4% of frames). To address this imbalance in the offline dataset:
- Delayed Action Prediction: Instead of predicting discrete actions (play/don't play), the model predicts a continuous value: the time (in frames) until the next action, capped at a threshold Tdelay (Fig. 6). An action is triggered when the predicted delay is close to zero. This improved performance significantly (37% reward increase compared to discrete prediction).
- Resampling: During training, trajectories ending near an action frame are sampled more frequently, using a weighting scheme defined in the paper, to mitigate the model's bias towards inaction.
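A minimal sketch of the delay relabelling (the paper's resampling weights are not reproduced here; the cap and trigger threshold are assumptions):

```python
import numpy as np

T_DELAY = 30  # assumed cap (in frames) on the predicted delay; the paper's T_delay may differ

def delay_labels(action_mask):
    """Turn a per-frame binary 'an action was played here' mask into the
    continuous frames-until-next-action target, capped at T_DELAY."""
    labels = np.full(len(action_mask), T_DELAY, dtype=np.float32)
    next_action = None
    for t in reversed(range(len(action_mask))):
        if action_mask[t]:
            next_action = t
        if next_action is not None:
            labels[t] = min(next_action - t, T_DELAY)
    return labels

def should_act(predicted_delay, threshold=0.5):
    """At inference time an action is triggered when the predicted delay is
    close to zero; the threshold here is an assumption."""
    return predicted_delay < threshold
```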
Experiments and Results:
- Dataset: Manually collected expert dataset of 105 games (vs. built-in AI), ~114k frames. Open-sourced generative dataset (~4.6k slices, 154 categories) and validation set (~7k images, ~117k boxes).
- Detection Performance: YOLOv8-l x3 achieved 85.2 AP50 and 68.8 mAP on the validation set (Tab. 2). Real-time tracking achieved 10 FPS on an RTX 4060 Laptop.
- Decision Performance: The agent was evaluated by playing against the built-in AI. StARformer-3L (L=50) with continuous action prediction achieved the highest average reward (-4.7) and number of actions (207.8), with a 5% win rate (Tab. 3). StARformer-3L (L=30) achieved the highest win rate (10%).
- Real-time System: The full pipeline (perception, fusion, decision) runs at ~360ms per cycle (120ms decision, 240ms perception/fusion) on an RTX 4060 Laptop GPU.
Conclusion:
The paper successfully demonstrates a non-embedded offline RL agent for Clash Royale using visual input. Key contributions include the generative dataset technique for object detection, the application of offline RL (StARformer) with delayed action prediction and resampling for a non-embedded visual agent, and the open-sourcing of code and datasets. While the agent can defeat the built-in AI, its performance is not yet at human level and relies on fixed card decks. Future work includes exploring online RL, improving perception/decision models, and handling varied decks.