Vibe Checker: Evaluating Human-Centric Systems
- Vibe Checker is a framework that quantifies human experience and sensorimotor states across coding, emotion recognition, and prosthetic systems.
- It integrates automated frameworks like VeriCode to combine functional correctness with stylistic and usability metrics in code evaluation.
- Applications include ambient emotion detection using sensor data and enhancing prosthetic embodiment through precise vibrotactile feedback.
Vibe Checker refers to a broad class of systems and methodologies that quantify, infer, or evaluate subtleties in human experience, preference, or sensorimotor state—whether through direct user interaction, ambient physiological signals, or the process of automated code evaluation. Recent research elucidates three distinct but thematically convergent technologies: evaluation of generative code through alignment with human preference in "vibe coding," unobtrusive emotion recognition from footstep-induced floor vibrations, and the augmentation of tactile feedback and embodiment via embedded prosthetic actuators. Across these domains, a "vibe check" typically denotes a composite assessment that extends beyond raw functional success, encompassing style, usability, affect, and user-centered feedback.
1. Vibe Check in Generative Coding Systems
The concept of vibe checking in code generation is operationalized as an evaluative standard that extends past functional correctness, capturing human-centric attributes valued during iterative code refinement. Users engaging in "vibe coding" interact with LLMs via natural language to produce or edit code until the solution passes their subjective vibe check—a process encompassing stylistic appropriateness, intention preservation, readability, and authenticity, alongside correct execution.
Traditional code evaluation frameworks—primarily metrics like pass@k—assess only the completion of unit tests or program outputs. Vibe Checker systems supplement this with non-functional instruction adherence, quantifying aspects such as line length, logical structure, documentation form, and error handling. These criteria are encoded in the VeriCode taxonomy, which systematizes 30 verifiable code instructions derived from industrial best practices, organizing them into Coding Style, Logic Patterns, Documentation, Error Management, and API Constraints.
Functional correctness and instruction following are blended using a composite score:

$$S = \lambda \cdot \mathrm{IF} + (1 - \lambda) \cdot \mathrm{FC}$$

where $\mathrm{IF}$ is an instruction following metric, $\mathrm{FC}$ is the functional correctness score, and $\lambda \in [0, 1]$ weights the two components. This composite correlates maximally with human preference, as shown in large-scale evaluations against human votes on the LMArena platform (Zhong et al., 8 Oct 2025).
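The blending of the two axes can be sketched as a small scoring helper. The linear form and the weight `lam` are illustrative assumptions for exposition, not the exact formula from the Vibe Checker paper:

```python
def composite_vibe_score(if_rate: float, fc_rate: float, lam: float = 0.5) -> float:
    """Blend an instruction-following rate and a functional-correctness rate.

    Both inputs are fractions in [0, 1]; `lam` trades off style adherence
    against test-passing behavior. lam=0.5 weights them equally.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * if_rate + (1.0 - lam) * fc_rate

# A model that passes 80% of instruction checks and 70% of unit tests:
score = composite_vibe_score(0.8, 0.7, lam=0.5)
```

Under this sketch, a model strong on correctness but weak on style is penalized relative to one balanced on both, which is the intuition behind pairing pass@k with instruction adherence.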
2. Taxonomies and Automated Verification: The VeriCode Framework
VeriCode operationalizes the assessment of code beyond output validation by furnishing a set of 30 binary-verifiable instructions, categorized and paired with deterministic, tool-driven verifiers. These include:
| Category | Example Instruction | Verification Tool |
|---|---|---|
| Coding Style | Max line length (e.g. 79 chars) | Ruff linter |
| Documentation | Docstring format (e.g. NumPy) | AST/parser checks |
| Error Management | Canonical exception names | AST/semantic check |
| Library API | Use pathlib, not os/open | AST substitution |
Each instruction is parameterizable (e.g., line length, number of branches), promptable in distinct formats (single-turn, multi-turn), and mapped to an automated pass/fail mechanism. Evaluation can be performed per instruction or at the task level.
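The table's pattern of deterministic, binary verifiers can be illustrated with two minimal checks built on Python's standard `ast` module. The function names and the specific checks are hypothetical stand-ins in the spirit of VeriCode, not its actual implementation:

```python
import ast

def verify_max_line_length(source: str, limit: int = 79) -> bool:
    """Binary Coding Style check: every line fits within `limit` characters."""
    return all(len(line) <= limit for line in source.splitlines())

def verify_has_docstrings(source: str) -> bool:
    """Binary Documentation check: every function definition has a docstring."""
    tree = ast.parse(source)
    funcs = [node for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return all(ast.get_docstring(fn) is not None for fn in funcs)

snippet = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
```

Because each verifier returns a plain boolean from static analysis, results are reproducible and require no model-based judging, which is the property that makes per-instruction scoring tractable at scale.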
Functional regression, a decrease in pass@1 when non-functional constraints are imposed, quantifies the challenge models face: introducing five instructions typically reduces scores by 5–6% relative to base performance, and satisfying all instructions simultaneously drops success below 50% for many state-of-the-art LLMs.
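Since pass@1 is the reference point for measuring this regression, it is worth recalling how pass@k is estimated. The following is the standard unbiased estimator widely used in code-generation benchmarks (given n samples of which c pass, the probability that at least one of k drawn samples passes); it is generic background, not specific to the Vibe Checker paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k samples drawn without
    replacement from n generations passes), given c passing generations."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 of which pass the unit tests:
p1 = pass_at_k(10, 3, 1)   # expected single-sample success rate
```

A 5–6% functional regression then corresponds to this quantity shrinking once the same generations must also satisfy the non-functional checks.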
Position bias emerges: in single-turn settings, instructions at the start or end of a list are more likely to be followed (a "lost-in-the-middle" effect), while multi-turn editing shows a recency bias.
3. Vibe Assessment from Ambient Sensor Modalities
The use of ambient sensor data (e.g., floor vibrations) to detect internal states represents an alternative application of the vibe checker paradigm. EmotionVibe (Wu et al., 6 Mar 2025) leverages geophone sensors monitoring floor vibrations induced by footsteps to infer users' emotional states—specifically valence and arousal—by relating features of gait and vibration to affective states.
Key components include:
- Segmentation of impulse signals from continuous vibration data using wavelet coefficients and event windows.
- Feature engineering across gait (step frequency, FWHM, peak ratios), and vibration domains (spectral, temporal, cepstral features).
- Personalization through a gait similarity index (GSI), computed as the inverse of the average Euclidean distance between the target user's feature embedding and those in the training set: $\mathrm{GSI} = \left( \frac{1}{N} \sum_{j=1}^{N} \lVert \mathbf{f}_{\mathrm{target}} - \mathbf{f}_j \rVert_2 \right)^{-1}$
- Weighted neural network losses during fine-tuning that prioritize training data from individuals with similar gait patterns.
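The similarity-based personalization steps above can be sketched in NumPy. The inverse-mean-distance GSI follows the description in the text; the per-sample weighting scheme (inverse distance, normalized to sum to one) is an illustrative assumption for how weighted losses might prioritize similar-gait users, not EmotionVibe's exact scheme:

```python
import numpy as np

def gait_similarity_index(target: np.ndarray, train_feats: np.ndarray) -> float:
    """GSI: inverse of the mean Euclidean distance between the target user's
    gait embedding and each training embedding (larger = more similar)."""
    dists = np.linalg.norm(train_feats - target, axis=1)
    return 1.0 / dists.mean()

def similarity_weights(target: np.ndarray, train_feats: np.ndarray) -> np.ndarray:
    """Per-sample loss weights favoring training users with similar gait."""
    dists = np.linalg.norm(train_feats - target, axis=1)
    w = 1.0 / (dists + 1e-8)   # inverse-distance weighting; epsilon avoids /0
    return w / w.sum()         # normalize so the weights sum to one
```

During fine-tuning, these weights would scale each training sample's loss term, so data from gait-similar individuals dominates the gradient.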
Experimental results across 37,001 footsteps from 20 participants yield mean absolute errors (MAE) of 1.11 (valence) and 1.07 (arousal) when personalized, representing error reductions of 19.0% and 25.7% over unpersonalized baselines.
4. Sensorimotor Vibe Checking in Prosthetic Embodiment
Vibe Checker technology in prosthetic systems is exemplified by VIBES (Vibro-Inertial Bionic Enhancement System) (Ivani et al., 22 Aug 2024), which delivers high-frequency, real-time vibrotactile stimulation to the skin inside a prosthetic socket. This system comprises planar actuators, IMUs for force measurement, and custom signal processing pipelines (DFT321 algorithm for dimensionality reduction, noise filtering, and real-time PWM mapping).
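The DFT321 reduction mentioned above collapses three-axis vibration into a single drivable signal while preserving per-frequency spectral energy. A minimal NumPy sketch of the core idea (output magnitude from the summed per-axis energy, phase taken from the summed spectrum) is below; the exact filtering and mapping in VIBES' pipeline are not reproduced here:

```python
import numpy as np

def dft321(ax: np.ndarray, ay: np.ndarray, az: np.ndarray) -> np.ndarray:
    """Collapse 3-axis vibration into one real signal.

    Per frequency bin, the output magnitude preserves the combined spectral
    energy of the three axes; the phase is taken from the summed spectrum.
    """
    X, Y, Z = np.fft.rfft(ax), np.fft.rfft(ay), np.fft.rfft(az)
    mag = np.sqrt(np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2)
    phase = np.angle(X + Y + Z)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(ax))
```

The single output channel can then be noise-filtered and mapped to a PWM duty cycle for the planar actuators, as in the real-time pipeline described above.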
Psychophysical characterization is conducted using the method of constant stimuli, computing Just Noticeable Differences (JND) and Points of Subjective Equality (PSE) through logistic regression:

$$P(x) = \frac{1}{1 + e^{-(x - \alpha)/\beta}}$$

with $\mathrm{PSE} = \alpha$ and $\mathrm{JND} = \beta \ln 3$ (half the distance between the 25% and 75% points of the fitted curve).
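These psychophysical quantities can be illustrated with a small numeric sketch. The logistic parameterization $P(x) = 1/(1 + e^{-(x-\alpha)/\beta})$ with PSE $= \alpha$, and the JND defined as half the 25%-to-75% spread (which works out to $\beta \ln 3$), are standard conventions assumed here rather than details taken from the VIBES paper:

```python
import math

def psychometric(x: float, alpha: float, beta: float) -> float:
    """Logistic psychometric function: probability the comparison stimulus
    at level x is judged stronger. alpha is the PSE (P = 0.5); beta sets
    the slope of the curve."""
    return 1.0 / (1.0 + math.exp(-(x - alpha) / beta))

def jnd_from_slope(beta: float) -> float:
    """JND as half the 25%-to-75% spread of the logistic: beta * ln 3."""
    return beta * math.log(3.0)
```

Fitting alpha and beta to responses from the method of constant stimuli (via logistic regression on judged-stronger outcomes) then yields PSE and JND directly.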
Empirical findings include:
- JND of 58.79 µm (95% CI: 54.41–66.19 µm) and 64.10 µm (95% CI: 58.19–69.42 µm) for two forearm configurations; in a prosthesis user, JND is 44.07 µm.
- Improvements in texture identification accuracy (able-bodied: 50% → 62%; prosthesis user: 40% → 52%) with VIBES feedback.
- No significant effect or mild benefit in slippage detection and fragile object manipulation, with minor changes in reaction time and error rates.
Prosthetic embodiment is assessed using the Rubber Hand Illusion (RHI). Vibrotactile feedback, both synchronous (SF) and asynchronous (AF, 500 ms delay), yields statistically higher ownership ratings compared to controls, indicating a measurable enhancement of prosthesis-user embodiment.
5. Applications, Constraints, and Broader Impact
Vibe Checker paradigms unify several research domains through their focus on subjective and sensorimotor alignment:
- In coding, the methodology enables the development and benchmarking of generative models that better match end-user preferences, offering objective mechanisms to incentivize instruction adherence alongside correctness.
- In affective computing, nonintrusive, privacy-preserving emotion monitoring via ambient sensor infrastructure offers advances in smart environments and wellness monitoring, with implications for early mental health intervention.
- In prosthetics, augmenting tactile feedback not only improves operational dexterity but also contributes to embodied cognition and acceptance—addressing key issues in prosthesis rejection and phantom phenomena.
Challenges persist: functional regression under compounded constraints in LLM code generation, between-person variability in sensor-based emotion inference, and optimization of actuator placement or signal algorithms for prosthesis feedback.
A plausible implication is that future systems integrating vibe checking will need to incorporate more sophisticated personalization, hybrid evaluation metrics, and multi-modal feedback, aiming for improved user experience and robust objective alignment with subjective expectations.
6. Future Directions and Limitations
Ongoing research is exploring:
- Scaling up the Vibe Checker testbed to include broader and harder code constraints, and integrating feedback signals into model training.
- Expanding subject pools in prosthetic and ambient emotion detection experiments for greater statistical power and generalizability.
- Investigating complementary modalities (e.g., force feedback in prosthetics) and multi-sensor fusion in emotion monitoring.
- Addressing signal processing advancements for unobtrusive and context-sensitive feedback in daily use.
It is evident from recent experimental and benchmarking results that vibe checking, as an operational principle, is indispensable for systems aiming to harmonize objective functionality with diverse human preferences across software, sensorimotor, and affective domains.