Bot Usability Scale (BUS) Overview
- BUS is a standardized tool for assessing Voice User Interface usability using 10 Likert-scale items across four sub-scales.
- It operationalizes usability through four sub-scales (System Performance, Speed, Understanding, and System-State) and includes two reverse-coded items.
- Developed through iterative PCA and empirical testing, BUS supports both between-subject and within-subject study designs in user experience research.
The Bot Usability Scale (BUS) refers to the Usability component within the PUEVA Inventory, a standardized assessment tool created to evaluate subjective qualities—including usability—of contemporary Voice User Interfaces (VUIs). The BUS operationalizes usability as a multidimensional construct, quantified through a battery of 10 Likert-scale questionnaire items covering four distinct sub-scales: System Performance, Speed, Understanding, and System-State. Derived through iterative item reduction and factor-analytic procedures, the BUS offers a compact, empirically optimized means of capturing end-user perceptions of VUI usability in both between-subject and within-subject experimental paradigms (Li et al., 2021).
1. Conceptual Foundations and Overview
The BUS, as defined in the PUEVA Inventory, emerges from a meta-analytic and empirical approach to assessing the non-functional qualities of voice assistants. PUEVA divides assessment into Personality, Usability, and Enjoyability; the BUS constitutes the Usability section. This construct is explicitly operationalized as a set of 10 statements requiring respondent agreement on a 5-point Likert scale, where higher scores generally indicate more favorable usability perceptions. Each questionnaire item is mapped to one of four sub-scales, reflecting separate aspects relevant to human-agent interaction quality.
2. Item Structure and Sub-Scale Composition
The BUS comprises ten items, each elucidating a facet of usability. Items are grouped into four sub-scales, each probing a theoretically motivated dimension of the VUI experience. All responses employ a 5-point Likert scale (implicit anchors assumed: 1 = Strongly Disagree, 5 = Strongly Agree).
| Sub-Scale | Items (verbatim wording) | Reverse-coded |
|---|---|---|
| System Performance | 1. “The SYSTEM didn’t always do what I wanted.”<br>2. “The SYSTEM didn’t always do what I expected.”<br>3. “A high level of concentration is required when using the SYSTEM.”<br>4. “I think I can use the SYSTEM without any help.” | 1, 2 (asterisked) |
| Speed | 5. “I found the interaction with the SYSTEM to be fast.”<br>6. “The SYSTEM always gives me an immediate response.” | None |
| Understanding | 7. “The SYSTEM understands difficult conversation topics.”<br>8. “The SYSTEM adapts well to new situations.” | None |
| System-State | 9. “I was able to recover easily from errors.”<br>10. “The SYSTEM is well-organized.” | None |
Within the System Performance scale, items 1 and 2 are explicitly designated for reverse-coding.
3. Scale Structure and Factor Model
The overall structure of the BUS was determined through a series of iterative principal component analyses (PCA) with direct oblimin rotation. The development process began with the administration of a 103-item superscale to a convenience sample (N = 13), with each participant interacting for approximately 10–15 minutes with three commercial voice agents (Google Home, Alexa Dot, Apple HomePod) before rating items. The factor analytic reduction employed PCA eigenvalue thresholding (eigenvalues > 1), with the authors retaining only items loading at or above 0.6 on their assigned component. The final scree plot documented 14 components accounting for over 95% of the variance. Four distinct and theoretically coherent usability sub-scales emerged and were preserved in the condensed 35-item PUEVA version (Li et al., 2021).
No confirmatory factor analysis or further goodness-of-fit indices are reported. The only published psychometric threshold is a minimum PCA loading of 0.6 for each retained item.
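The retention procedure described above can be sketched in code. The following is an illustrative reconstruction, not the authors' analysis script: it uses simulated Likert responses, omits the direct oblimin rotation step (working from unrotated loadings), and exists only to show the Kaiser criterion (eigenvalue > 1) combined with the 0.6 loading threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Likert responses: 20 respondents x 10 items, values 1-5.
X = rng.integers(1, 6, size=(20, 10)).astype(float)

# PCA via eigendecomposition of the item correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # sort components by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Kaiser criterion: keep components with eigenvalue > 1.
keep = eigvals > 1.0

# Loadings: each eigenvector scaled by the square root of its eigenvalue.
loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Retain an item only if its strongest (unrotated) loading is >= 0.6 in magnitude.
retained_items = np.where(np.abs(loadings).max(axis=1) >= 0.6)[0]
```

In the published procedure the loadings would be obliquely rotated before thresholding, which generally changes which items clear the 0.6 cutoff; the sketch shows only the mechanics of the criterion.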
4. Administration and Scoring Protocols
The BUS is designed for brief administration, with the full PUEVA inventory (35 items) requiring approximately 5 minutes for participant completion; the 10 usability items constitute a subset. The authors provide no precise scoring algorithm or mathematical formulae. By standard practice, reverse-coding is applied to the items specifically flagged (System Performance items 1 and 2), sub-scale scores are calculated as the mean (or sum) of their component items, and an overall usability score may be derived by averaging the four sub-scale scores. The PUEVA documentation neither mandates a global usability aggregate nor supplies a formula for one. No procedural departures from conventional Likert-scale survey administration are prescribed.
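A minimal sketch of this conventional scoring, assuming the standard 5-point reverse-coding rule (recoded value = 6 − raw value) and mean-based aggregation; the response values are invented for illustration, and the overall average is one common convention rather than anything mandated by PUEVA:

```python
# Hypothetical responses for one participant: items 1-10 on a 1-5 Likert scale.
responses = {1: 2, 2: 1, 3: 4, 4: 5, 5: 4, 6: 3, 7: 2, 8: 3, 9: 4, 10: 5}

REVERSE = {1, 2}  # System Performance items flagged for reverse-coding
SUBSCALES = {
    "System Performance": [1, 2, 3, 4],
    "Speed": [5, 6],
    "Understanding": [7, 8],
    "System-State": [9, 10],
}

def recode(item: int, value: int) -> int:
    """Reverse-code flagged items on a 5-point scale (1<->5, 2<->4)."""
    return 6 - value if item in REVERSE else value

# Sub-scale score: mean of (recoded) component items.
subscale_means = {
    name: sum(recode(i, responses[i]) for i in items) / len(items)
    for name, items in SUBSCALES.items()
}

# Optional global score: mean of the four sub-scale means (a convention,
# not a requirement of the PUEVA documentation).
overall = sum(subscale_means.values()) / len(subscale_means)
```

Summing rather than averaging items yields an equivalent ordering of participants, so the choice between the two only matters for reporting conventions.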
5. Validation and Empirical Testing
Validation of the BUS occurred within a within-subjects laboratory design. Thirteen participants formed a convenience sample; no demographic data are reported. Each subject interacted sequentially with three mainstream voice agents, then completed the 103-item preliminary instrument. Statistical validation was limited to the iterative PCA process as described above. No Cronbach’s alpha, CFA loadings, or other internal consistency metrics are provided. The only psychometric criterion reported is the item-retention threshold (PCA loading ≥ 0.6). Caution is therefore warranted in generalizing the measure’s properties, given the preliminary sample size and the absence of broader reliability or validity evidence.
6. Use Cases and Methodological Considerations
The BUS is suitable for both between-subject and within-subject research designs examining VUI usability perceptions. The instrument is compatible with laboratory, field, or comparative studies, although specific procedural recommendations (such as timing relative to interaction or counterbalancing protocol) are not provided in the referenced documentation. The overall PUEVA toolkit is intended as a standardized comparative framework for subjective qualities, supporting the controlled evaluation of user experience across different conversational agents (Li et al., 2021). A plausible implication is that the BUS offers a tractable means for benchmarking human-agent usability differences, provided methodological limitations regarding sample size and scale validation are acknowledged.