Learning Next Action Predictors from Human-Computer Interaction

This presentation explores how AI can anticipate user behavior by learning from rich, longitudinal interaction histories. We examine the next action prediction task, introduce the NAPsack data collection pipeline that captured 1.9 million screenshots from 20 users over a month, and show how the LongNAP architecture improves on supervised fine-tuning by up to 79% by reasoning over unbounded multimodal context to predict what users will do next.

Script

Most AI systems react to what you ask them. But what if they could anticipate what you'll do next? This research introduces next action prediction: training models on the full history of your interactions (screenshots, clicks, and sensor data) to forecast your upcoming sequence of actions before you take them.

The authors formalize next action prediction as modeling a temporal stream of events, each containing an action and optional visual context. The challenge is predicting future event trajectories from extended, unbounded histories of real user behavior. This shifts AI from reactive task completion to proactive understanding of individual patterns.
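
To make the formalization concrete, here is a minimal Python sketch of an event stream and of the split between history and the future trajectory to predict. The class name, field names, and the `horizon` parameter are illustrative assumptions, not the authors' notation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Event:
    """One step in the interaction stream (field names are hypothetical)."""
    timestamp: float                    # seconds since the start of recording
    action: str                         # text description of what the user did
    screenshot: Optional[bytes] = None  # optional visual context


def split_history_future(events: Sequence[Event], now: float, horizon: int):
    """Split a stream into (history, future target).

    history: every event at or before `now`, with no length limit.
    future:  the next `horizon` events after `now`, i.e. the trajectory
             a predictor would be asked to forecast.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    history = [e for e in ordered if e.timestamp <= now]
    future = [e for e in ordered if e.timestamp > now][:horizon]
    return history, future
```

The history list is deliberately unbounded, which is what makes the task hard: a real system cannot feed all of it to a model at once and has to choose what to look at.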

To train such models, you need authentic behavioral data at scale.

The researchers introduce NAPsack, an open-source pipeline that passively records user interactions, compresses them intelligently using event-driven heuristics, and automatically generates action labels via vision-language models. Applied to a month of data from 20 users, it produced 360,000 action descriptions spanning 1,800 hours, with validation confirming the annotations align with human judgment.
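
One way to picture event-driven compression is to keep only the screenshots captured close to an input event such as a click or keypress. The sketch below is an assumption for illustration only: the script does not describe NAPsack's actual heuristics, and the function name and the `window` parameter are hypothetical.

```python
from bisect import bisect_left


def keep_frames_near_events(frame_times, event_times, window=0.5):
    """Return indices of frames within `window` seconds of any input event.

    Illustrative heuristic only; not NAPsack's documented behavior.
    """
    event_times = sorted(event_times)
    kept = []
    for i, t in enumerate(frame_times):
        j = bisect_left(event_times, t)
        # Only the events just before and just after t can be the closest.
        neighbours = event_times[max(j - 1, 0): j + 1]
        if any(abs(t - e) <= window for e in neighbours):
            kept.append(i)
    return kept
```

With one frame per second and two input events, most frames are dropped and only those near an event survive, which is the general idea behind recording a month of activity without storing every frame.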

The authors' model, LongNAP, reasons in two phases. First, it generates chain-of-thought traces that serve as semantic queries against a memory bank of prior interactions. Second, it integrates the retrieved traces with the current context to predict the user's next actions. The entire pipeline, including the discrete reasoning and retrieval steps, is optimized end-to-end using policy gradients, with rewards from a temporal judge that compares predicted trajectories to what actually happened.
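
The two phases can be sketched as retrieve, then assemble the input for prediction. This is a toy illustration under stated assumptions: bag-of-words cosine similarity stands in for whatever semantic retrieval the model really uses, the function names are hypothetical, and the language model call and the policy-gradient training loop are left out entirely.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, memory, k=2):
    """Phase 1: rank stored traces by similarity to the reasoning query."""
    q = bow(query)
    ranked = sorted(memory, key=lambda m: cosine(q, bow(m)), reverse=True)
    return ranked[:k]


def build_prompt(context, retrieved):
    """Phase 2 input: retrieved traces joined with the current context."""
    lines = ["Past traces:"] + [f"- {r}" for r in retrieved]
    lines += ["Current context:", context, "Predict the next actions:"]
    return "\n".join(lines)
```

In a full system the query would itself be generated by the model as a reasoning trace, and the prompt would be passed back to the model to produce the predicted actions.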

When trained on a single user's data, LongNAP achieves a 79% improvement over supervised fine-tuning and 39% over the strongest prompted baseline. Cross-user performance is more modest but still significant, with a 13% improvement over few-shot baselines when generalizing to new users. Calibration analysis reveals that high-confidence predictions are substantially more accurate, and user-level predictability varies dramatically, suggesting some behavioral patterns are inherently more forecastable than others.
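
A calibration check of this kind can be sketched by grouping predictions into confidence bins and averaging the judge's score in each bin. The sketch assumes each prediction comes with a confidence in [0, 1] and a score in [0, 1]; how the paper actually measures either quantity is not stated here, and the function name is hypothetical.

```python
def accuracy_by_confidence(confidences, scores, n_bins=4):
    """Mean score per equal-width confidence bin over [0, 1].

    Returns one value per bin, or None for a bin with no predictions.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for c, s in zip(confidences, scores):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        sums[b] += s
        counts[b] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]
```

If the model is well calibrated, the averages rise from the low-confidence bins to the high-confidence ones, which matches the pattern the script describes.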

Next action prediction transforms AI from reactive responders into anticipatory collaborators that understand the rhythm of how you work. To explore this research further or create your own research videos, visit EmergentMind.com.