
CodeWatcher: Telemetry for VS Code

Updated 20 October 2025
  • CodeWatcher is a lightweight, unobtrusive system that captures millisecond-level VS Code events to study AI and human coding interactions.
  • It integrates a VS Code plugin, Python REST API, and MongoDB backend to log detailed events such as insertions, deletions, copy-paste, and focus changes.
  • The platform enables nuanced behavioral analytics and empirical research on responsible AI, productivity, and code provenance in modern software development.

CodeWatcher is a lightweight, unobtrusive telemetry infrastructure designed for real-time capture and analysis of developer interactions within the Visual Studio Code (VS Code) environment, with special emphasis on understanding and evaluating the use of code generation tools (CGTs) powered by LLMs. Its architecture comprises an in-editor telemetry plugin, a containerized backend consisting of a Python-based REST API, and a MongoDB database, enabling continuous logging of semantically meaningful events (insertions, deletions, copy, paste, editor focus changes) with millisecond precision. The system is intended to support reconstruction of detailed coding sessions and enable nuanced behavioral analyses, which are crucial for research on responsible AI, productivity, and human-centered evaluation of automated code generation workflows (Basha et al., 13 Oct 2025).

1. Objectives and Role

CodeWatcher’s principal function is to provide fine-grained, real-time telemetry of developer actions during coding sessions without disrupting the user’s workflow. Unlike coarse-grained logging or ex post facto analysis, the system records detailed chronological events—including precise code insertions (user- or CGT-originated), deletions, copy-paste actions, and focus/unfocus transitions—allowing for process-level studies of how LLM-powered tools are integrated into modern development. The system is targeted at researchers, practitioners, and educators seeking to empirically examine code provenance, AI-human code co-creation patterns, and derivations of productivity or authorship within software engineering processes.

2. System Architecture and Components

The architecture centers on a modular, microservices-inspired approach comprising:

  • VS Code Plugin (Client):
    • Implemented in JavaScript, the plugin leverages VS Code’s extension API to monitor and capture low-level editor events. Each logged action is packaged with contextual metadata (event type, millisecond timestamp, text content, affected line).
    • Event types include: Start, End, Insertion, Deletion, Focus, Unfocus, Copy, and Paste.
  • Python RESTful API Server:
    • Built with FastAPI, the server layer handles client communication and routes incoming events to persistent storage. Input validation is enforced via Pydantic schemas, ensuring strict adherence to data contracts.
    • The API supports endpoints for logging interactions, registering users, updating permissions, and querying logs.
  • MongoDB Backend:
    • Event data is persistently stored in a schemaless, JSON-friendly format. Collections are divided into users and interaction_logs, supporting flexible event schemas and scalable growth as new event types arise.
    • This architecture is stateless and horizontally scalable, supporting deployment on remote Ubuntu servers via Docker containers.

This infrastructure enables deployments spanning local academic labs to cloud-hosted industrial telemetry solutions, with minimal friction in installation and operation.

3. Data Capture Semantics and Analysis

CodeWatcher captures and structures telemetry with high temporal resolution, generating a chronological stream that allows for post-hoc reconstruction of complete coding sessions. Each entry logs:

  • Event Type: (e.g., Insertion, Deletion, Copy, Paste, Focus, Unfocus)
  • Timestamp: Millisecond precision, enabling precise interval and sequence analysis.
  • Text: Actual code or input; for insertions and deletions, the code snippet.
  • Line: For contextualizing where in the document the change occurred.
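To make the event shape concrete, here is an invented fragment of such a chronological stream, with the four logged attributes per entry. The records and timestamps are illustrative, not taken from the paper:

```python
# Illustrative (invented) event records in the shape described above.
events = [
    {"type": "Focus",     "time_ms": 1_697_200_000_000, "text": None,       "line": None},
    {"type": "Insertion", "time_ms": 1_697_200_000_842, "text": "def f():", "line": 10},
    {"type": "Deletion",  "time_ms": 1_697_200_004_120, "text": "pass",     "line": 11},
    {"type": "Unfocus",   "time_ms": 1_697_200_009_500, "text": None,       "line": None},
]

# Millisecond timestamps permit interval analysis, e.g. time between events:
intervals = [b["time_ms"] - a["time_ms"] for a, b in zip(events, events[1:])]
```

Interval sequences like this are the raw material for the timing analyses discussed below, such as distinguishing rapid CGT acceptance from slower manual editing.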

A salient feature is the ability to differentiate between code written directly by the user, code inserted by a CGT (such as an LLM auto-completer), and AI-generated code subsequently modified by the user. The process computes a fuzzy similarity score S(L_f, L_h) between each line L_f of the final submission and the corresponding line L_h from snapshots in the historical logs:

$$
\begin{cases}
S \geq 95 & \rightarrow \text{“AI-Generated”} \\
80 \leq S < 95 & \rightarrow \text{“AI-Modified”} \\
S < 80 & \rightarrow \text{“User-Written”}
\end{cases}
$$

This labeling supports detailed provenance tracing and allows quantification of CGT engagement (e.g., empirical findings from a provided demo show 68% of code as AI-generated, 7% AI-modified, 25% user-written).
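The thresholded rule above can be sketched as follows. The paper does not specify which fuzzy similarity function is used, so stdlib `difflib`'s ratio (scaled to 0-100) stands in for S here as an assumption:

```python
# Sketch of the provenance-labeling rule with the stated 95/80 thresholds.
# difflib's ratio is a stand-in for the paper's unspecified similarity metric.
from difflib import SequenceMatcher

def similarity(final_line: str, historical_line: str) -> float:
    """Fuzzy similarity S on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, final_line, historical_line).ratio()

def label(final_line: str, cgt_snapshot_line: str) -> str:
    """Classify a final line against the CGT snapshot from the logs."""
    s = similarity(final_line, cgt_snapshot_line)
    if s >= 95:
        return "AI-Generated"
    if s >= 80:
        return "AI-Modified"
    return "User-Written"
```

Running this over every line of a submission, against the closest matching snapshot in the logs, yields the per-category percentages reported in the demo.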

4. Research and Educational Applications

CodeWatcher provides research-grade process data enabling investigations well beyond traditional outcome-based or coarse static analyses:

  • Human–AI Interaction Studies: Continuous, unobtrusive telemetry enables the analysis of when, how frequently, and in what context developers accept LLM suggestions, subsequently edit generated code, or revert to manual coding.
  • Behavioral and Productivity Analytics: Fine-grained timing and action data support studies of task segmentation, time spent editing versus accepting suggestions, and potential productivity bottlenecks or accelerators introduced by CGT integration.
  • Educational Feedback: Instructors and evaluators can analyze engagement: for example, distinguishing over-reliance on AI completions from active problem solving, enabling nuanced formative feedback.
  • Authorship and Provenance Auditing: Accurate record-keeping and event logging offer foundational data for disputes regarding code authorship, compliance, and accountability in regulated or collaborative environments.

5. Technical Validation and Demonstration

A demonstration instance of CodeWatcher, described in the paper, validates its accuracy and utility:

| Event Type      | Logged Fields          | Use Case Example                                  |
|-----------------|------------------------|---------------------------------------------------|
| Insertion       | type, time, text, line | Accepting an LLM's suggestion at line 42          |
| Deletion        | type, time, text, line | Removing a bug in a previously suggested block    |
| Focus / Unfocus | type, time             | Tracking attention or external resource switching |
| Copy / Paste    | type, time, text, line | Monitoring copy-paste coding strategies           |

The system supports empirical studies by reconstructing session timelines and computing code provenance. For example, code segments in an algorithmic assignment can be labeled post hoc (AI-generated, AI-modified, user-written) using the fuzzy similarity thresholds above.
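Session reconstruction itself amounts to replaying the event stream in timestamp order. The minimal sketch below recovers a final buffer state from insertion and deletion events; the event shape follows Section 3, but the replay logic is a simplified assumption (real edits can span partial lines):

```python
# Minimal sketch of session replay: recover the final buffer from a
# chronological event stream. Whole-line edits are assumed for simplicity.
def replay(events: list[dict]) -> list[str]:
    lines: dict[int, str] = {}
    for ev in sorted(events, key=lambda e: e["time_ms"]):
        if ev["type"] == "Insertion":
            lines[ev["line"]] = ev["text"]
        elif ev["type"] == "Deletion":
            lines.pop(ev["line"], None)
    return [lines[n] for n in sorted(lines)]

session = [
    {"type": "Insertion", "time_ms": 2, "line": 2, "text": "    return n * 2"},
    {"type": "Insertion", "time_ms": 1, "line": 1, "text": "def double(n):"},
    {"type": "Insertion", "time_ms": 3, "line": 3, "text": "pass"},
    {"type": "Deletion",  "time_ms": 4, "line": 3, "text": "pass"},
]
```

Intermediate snapshots for provenance comparison fall out of the same loop: truncating the stream at any timestamp and replaying yields the buffer as it stood at that moment.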

Validation experiments confirm that approximately two-thirds of code in a typical session may be directly LLM-generated, providing a strong basis for further research on the cognitive processes and workflow impacts of CGT use.

6. Implications for Human-Centered AI and Future Directions

CodeWatcher’s unobtrusive, detailed logging infrastructure lays the foundation for rigorous empirical analyses essential to responsible AI research, adaptive tooling, and precision education in programming. The system is well positioned for:

  • Integration with NLP modules for deeper semantic analysis of natural language comments and intent extraction.
  • Extension to comparative studies (as recommended in related research (Javahar et al., 13 Oct 2025)), supporting cross-tool evaluation (LLM vs. human, CGT vs. CGT).
  • Serving as a backend for feedback-driven interventions in IDEs, such as adaptive suggestion filtering or student support cues based on detected engagement patterns.

A plausible implication is that with further cross-correlation (e.g., window switches to external documentation), CodeWatcher could help quantify when and why LLM suggestions are insufficient, providing insights that feed back into the iterative design of more context-aware, effective CGTs.

7. Summary and Significance

CodeWatcher represents a robust, research-oriented solution for collecting and analyzing IDE telemetry data at a resolution suitable for modern LLM-powered software development. Through its detailed architecture, precise event capture, and empirical validation, it establishes a basis for empirical studies on code generation workflows, human–AI co-production, and educational interventions. This infrastructure is central to progress in understanding, and ultimately improving, the intersection of human expertise and automated code generation in contemporary programming environments (Basha et al., 13 Oct 2025).
