- The paper introduces GUMs, an architecture that transforms unstructured computer-use data into confidence-weighted natural language propositions about user behavior.
- The design integrates five modules (Observe, Audit, Propose, Retrieve, and Revise) to continuously update and refine user models while respecting privacy.
- Evaluations demonstrate high proposition accuracy (up to 79%) with few privacy violations, indicating the framework's promise for proactive, context-aware computing.
This paper introduces General User Models (GUMs), a novel architecture designed to create comprehensive, dynamic, and private computational representations of a user's behavior, knowledge, beliefs, and preferences. The core problem GUMs address is that current user models are fragmented and narrow, which limits the ability of AI systems to understand users deeply and act proactively across different contexts.
GUM Architecture and Functionality
GUMs learn by observing any unstructured interaction a user has with their computer, such as device screenshots, file system activity, or notifications. The architecture processes these inputs to construct and refine a set of natural language, confidence-weighted propositions about the user. For instance, from a screenshot of a message thread about a wedding, a GUM might infer "User is likely going to friend's wedding in Chicago" (Confidence: 0.8) and "User doesn't own any suitable formal wear" (Confidence: 0.6).
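To make the shape of these propositions concrete, here is a minimal sketch of one way to represent them; the field names and example values are illustrative, not a schema from the paper.

```python
from dataclasses import dataclass

@dataclass
class Proposition:
    """A confidence-weighted, natural-language statement about the user.

    Illustrative only; the paper does not prescribe a concrete schema.
    """
    text: str            # the inference itself, in plain language
    confidence: float    # 0-1 weight assigned when the proposition is generated
    reasoning: str = ""  # trace linking the observation to the inference

wedding = Proposition(
    text="User is likely going to friend's wedding in Chicago",
    confidence=0.8,
    reasoning="A screenshotted message thread discusses the wedding.",
)
```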
The GUM architecture consists of several key modules:
- Observe: This module ingests raw, unstructured user interaction data. The paper details Observers for screen content (using a Vision-LLM like Qwen 2.5 VL to transcribe screenshots and user actions) and OS notifications.
- Audit: Before processing, observations are audited for privacy. This module uses the GUM itself to infer user-specific contextual integrity norms (based on Nissenbaum's theory). It estimates if the user would expect the observed information to be recorded, filtering out sensitive data (e.g., banking credentials).
- Propose: Validated observations are transformed into natural language propositions. This step generates a reasoning trace that links each observation to its proposition and assigns the proposition a confidence score (0-1). The underlying LLM (e.g., Llama 3.3 70B) is prompted to produce these propositions along with a proposition-specific decay factor indicating how quickly each one might become stale.
- Retrieve: To provide context for new inferences and revisions, this module efficiently searches existing propositions using a two-stage approach: retrieve (BM25 for initial candidates), then rerank (LLM-based). Relevance scores are adjusted for recency using each proposition's decay factor, and diversity is enforced via Maximum Marginal Relevance (MMR); a rough sketch follows this list.
- Revise: New propositions, along with retrieved similar propositions, are fed into this module. It re-evaluates and refines existing propositions by potentially merging, updating (including confidence and reasoning), or marking them as contradictory. This continuous revision ensures the GUM adapts over time.
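To illustrate the Retrieve stage referenced above, the sketch below rescores BM25 candidates with an exponential recency decay and then diversifies the result set with greedy MMR. The exponential form and all parameter values are assumptions; the paper specifies only that relevance is adjusted by each proposition's decay factor and that MMR is used for diversity.

```python
import math
from datetime import datetime

def decayed_relevance(bm25_score: float, decay: float,
                      created_at: datetime, now: datetime) -> float:
    """Down-weight a candidate's BM25 relevance by its age.

    The exponential form is an assumption; the paper says only that
    relevance is adjusted for recency via a per-proposition decay factor.
    """
    age_days = (now - created_at).total_seconds() / 86400
    return bm25_score * math.exp(-decay * age_days)

def mmr_select(candidates: list, similarity, k: int = 5, lam: float = 0.7) -> list:
    """Greedy Maximum Marginal Relevance over (proposition, relevance) pairs.

    similarity(a, b) -> 0-1 overlap between two propositions; lam trades
    relevance against redundancy with already-selected items.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(item):
            prop, rel = item
            redundancy = max((similarity(prop, s) for s, _ in selected), default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return selected
```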
All data is intended to stay on the user's device, enabling local inference on capable hardware, with a focus on open-source models to maintain privacy. The GUM exposes an API for applications to query its propositions, allowing them to access this rich user context.
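A hypothetical client call against such an API might look like the snippet below; the endpoint, query parameters, and response shape are all invented for illustration, since the paper describes the API only at a conceptual level.

```python
import requests  # assumes a local HTTP transport; the paper does not specify one

def query_gum(text: str, limit: int = 5,
              endpoint: str = "http://localhost:8765/propositions") -> list[dict]:
    """Ask a locally hosted GUM for the propositions most relevant to `text`.

    The endpoint and response shape are hypothetical. Each returned item is
    assumed to carry the proposition text, confidence, and reasoning trace.
    """
    resp = requests.get(endpoint, params={"q": text, "limit": limit})
    resp.raise_for_status()
    return resp.json()
```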
Applications of GUMs
The paper outlines several applications for GUMs:
- Augmenting LLM Prompting: GUMs can supply LLMs with relevant user context (e.g., recent documents, ongoing tasks) when a user makes an underspecified query like "help me with this section," letting the LLM generate more personalized, helpful responses without the user explicitly rebuilding context (see the first sketch after this list).
- Enhancing Operating Systems and Applications:
  - Calm Technology: An OS could use a GUM to filter notifications, surfacing only genuinely important ones based on the user's current context and priorities (e.g., a conference deadline over a recipe email during work).
  - Interface Agents: GUMs can provide the continuously learned user model needed for proactive interface agents (e.g., a movie recommender that knows now is a bad time because the user is on a deadline).
  - Context-Aware Computing: GUMs offer a flexible way to unify context from diverse inputs (screenshots, sensor data, text logs) for any application to query and adapt to.
- GUMBO, a Proactive Assistant: a demonstration application built on GUMs. It continuously captures screenshots, builds a GUM, and then:
  - Discovers Suggestions: Generates candidate suggestions based on new propositions and related context from the GUM (e.g., "Search for cheap suit rentals in Chicago" based on wedding-invite and budget propositions).
  - Determines Utility of Interruption: Uses the GUM to estimate the benefit of a suggestion, the cost of a false positive, and the cost of a false negative, applying Horvitz's mixed-initiative interaction principles to decide if and when to surface it; a token-bucket algorithm rate-limits suggestions (see the second sketch after this list).
  - Executes Suggestions: If a suggestion is deemed useful and executable without irreversible side effects, GUMBO attempts to complete it (e.g., searching for suit rentals) and presents preliminary results. It can delegate tasks to tools (e.g., web search via Gemini 2.0 Flash, local file search).
  - Incorporates Feedback: User feedback (thumbs up/down, natural language) is fed back into the GUM as new observations.
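As a sketch of the prompt-augmentation application above, the snippet below prepends high-confidence propositions (fetched with the hypothetical `query_gum` client from earlier) to an underspecified request; the prompt format and the 0.6 confidence cutoff are illustrative choices, not the paper's.

```python
def augment_prompt(user_query: str) -> str:
    """Wrap an underspecified request, e.g. "help me with this section",
    with relevant GUM context before sending it to an LLM."""
    props = query_gum(user_query, limit=5)
    context = "\n".join(
        f"- {p['text']} (confidence {p['confidence']:.1f})"
        for p in props if p["confidence"] >= 0.6  # illustrative cutoff
    )
    return f"Known context about the user:\n{context}\n\nUser request: {user_query}"
```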
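GUMBO's interruption decision can be read as an expected-utility comparison in the spirit of Horvitz's mixed-initiative principles, paired with a token bucket for rate limiting. The sketch below is one plausible operationalization; the decision rule and all parameter values are assumptions, not GUMBO's published implementation.

```python
import time

def should_surface(p_useful: float, benefit: float,
                   cost_fp: float, cost_fn: float) -> bool:
    """Surface a suggestion only if the expected utility of acting exceeds
    that of staying silent. The GUM supplies the probability and the three
    utility estimates; this rule is an illustrative reading of Horvitz's
    mixed-initiative principles, not GUMBO's exact formula.
    """
    eu_act = p_useful * benefit - (1 - p_useful) * cost_fp
    eu_silent = -p_useful * cost_fn  # staying silent forfeits a useful suggestion
    return eu_act > eu_silent

class TokenBucket:
    """Rate-limit surfaced suggestions; capacity/refill values are invented."""
    def __init__(self, capacity: float = 3, refill_per_min: float = 0.5):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_min
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) / 60 * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```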
Implementation and Evaluation
- Models: Qwen 2.5 VL for screen observation, Llama 3.3 70B for proposition generation/revision.
- Evaluation 1: Accuracy and Calibration (Email Data)
  - N=18 participants exported their last 200 emails, and GUMs were trained on this data.
  - Participants judged proposition accuracy and ranked outputs from the full GUM against two ablations: one without Retrieve and Revise, and one without Retrieve alone.
  - Results: Full GUM propositions were significantly preferred. They were accurate (76.15% overall) and well calibrated (Brier score 0.17; see the sketch after this list), tending to be underconfident when wrong. Propositions with maximal confidence (1.0) were 100% accurate. Both Retrieve and Revise proved critical.
- Evaluation 2: Privacy Audit Module (Email Data)
  - N=18 participants identified propositions that violated their privacy preferences and rated the GUM's contextual integrity reasoning.
  - Results: Only 7 of 180 propositions from the full GUM were flagged as violations, and participants generally agreed with the GUM's contextual integrity assessments. However, some GUMs were manipulated by phishing emails or made uncomfortable social inferences.
- Evaluation 3: End-to-End via GUMBO (Screen Capture Data)
  - N=5 participants used GUMBO for 5 days (a 24-hour burn-in, then 4 days of active use); GUMBO observed their screen interactions.
  - Data were collected via annotations and semi-structured interviews.
  - Results: GUMs remained accurate (79%) and calibrated (Brier score 0.28). GUMBO produced at least one strong-to-excellent suggestion for every participant (overall, 25% of suggestions were rated 6 or 7 on a 7-point Likert scale). Useful suggestions ranged from immediate, low-level assistance to help in areas users had not considered using AI for.
  - Some propositions were perceived as "too accurate" or judgmental (e.g., "P1 is struggling with fixing bugs").
  - Errors included GUMBO attempting tasks beyond its capabilities ("General Clippy Models") or being too eager.
  - Privacy was a major concern, though some users habituated; trust in the researchers was key to participation.
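For reference, the Brier scores reported in these evaluations measure calibration as the mean squared gap between a proposition's stated confidence and whether it was judged correct; lower is better. A minimal computation over invented judgments:

```python
def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared difference between stated confidence and outcome (0/1)."""
    return sum((c - float(o)) ** 2
               for c, o in zip(confidences, correct)) / len(confidences)

# Toy data, not the study's: four propositions with participant judgments.
print(brier_score([0.9, 0.6, 0.8, 0.3], [True, True, False, False]))  # 0.225
```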
Discussion and Limitations
- Reflecting on Propositions: The candid nature of GUM propositions led to varied user reactions, from feeling judged to engaging in self-reflection. This raises design questions about proposition visibility and framing.
- The Privacy Paradox: While initially concerned about data collection, some users considered sharing even more data to improve GUM performance, highlighting a privacy-utility trade-off.
- Ethical Risks:
  - Persuasion/Surveillance: The paper emphasizes local hosting and user control to mitigate misuse.
  - Manipulation: GUMs can be "poisoned" by malicious inputs (e.g., spam emails), requiring robust filtering or detection mechanisms.
  - Bias: GUMs inherit biases from the underlying LLMs.
- Limitations:
  - Sampling Bias: Evaluations primarily involved technical users familiar with AI.
  - Hallucinations: LLMs can still generate incorrect propositions, though confidence weighting keeps errors somewhat calibrated.
  - Narrow Context: Screen interactions are a proxy and do not capture the user's full context.
- Future Work: Integrating more modalities (e.g., audio, health sensors) and adopting more capable underlying models.
The paper concludes that GUMs offer a promising framework for creating general, rich user models that can understand users across various contexts, paving the way for more intelligent and proactive human-computer interaction.