AI Research Assistant for Computer Scientists

Synthesize the latest research on any AI/ML/CS topic

The user is asking about studies regarding frustration or aggression caused by voice assistants mishearing trigger words. The best approach to answer this is to search for relevant research papers. The arXiv search tool is suitable for this purpose. I will use keywords related to voice assistants, trigger words, mishearing, and user frustration in the search query.
Searching arXiv for "voice assistant trigger word errors user frustration"
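For readers who want to reproduce a keyword search like this outside the assistant, the sketch below queries the public arXiv Atom API directly. The endpoint and query parameters are arXiv's; the `search_arxiv` helper and its result handling are hypothetical conveniences, and this is not the assistant's internal search tool.

```python
# Minimal sketch: keyword search against the public arXiv API (Atom feed).
# Illustrative only -- not the tool actually used by the assistant above.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def search_arxiv(query: str, max_results: int = 10):
    """Return (arxiv_id, title, summary) tuples for a free-text query."""
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    results = []
    for entry in feed.findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="").strip()
        arxiv_id = entry.findtext(f"{ATOM}id", default="").rsplit("/", 1)[-1]
        summary = entry.findtext(f"{ATOM}summary", default="").strip()
        results.append((arxiv_id, title, summary))
    return results

for arxiv_id, title, _ in search_arxiv("voice assistant trigger word errors user frustration"):
    print(arxiv_id, title)
```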
I have reviewed the abstracts from the initial search and identified several papers that appear highly relevant to the user's query about frustration or aggression caused by voice assistants mishearing trigger words or experiencing other errors. To get a detailed answer, I need to examine the full content of these papers. I will use the paper prompter tool to extract specific findings related to user frustration, aggression, or negative psychological responses resulting from voice assistant errors, prioritizing the papers that seem most promising based on their titles and abstracts. I will start by prompting the top three most relevant papers.
Prompting "A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures": "Summarize the findings related to user frustration or negative reactions resulting from voice assistant failures, specifically including failures caused by mishearing or overcapturing user input." (2303.00164)
Prompting "Can Voice Assistants Be Microaggressors? Cross-Race Psychological Responses to Failures of Automatic Speech Recognition": "Describe the psychological responses observed in users when voice assistants experience automatic speech recognition failures, focusing on any findings related to frustration, aggression, or negative emotional states." (2302.12326)
Prompting "Look at Me When I Talk to You: A Video Dataset to Enable Voice Assistants to Recognize Errors": "Detail any findings or observations in this paper regarding user frustration caused by voice assistant errors and the potential for recognizing such frustration or errors from user reactions." (2104.07153)
I have conducted an initial search and reviewed the abstracts. I then prompted the full content of three highly relevant papers that discuss user frustration and negative psychological responses stemming from voice assistant errors, including mishearing and overcapture. The information gathered from these steps provides a solid basis to answer the user's query regarding studies on frustration caused by voice assistants mishearing trigger words. While aggression was not a primary focus in the papers I prompted, frustration is well-documented. I have gathered sufficient information from the most promising sources found in the initial search. Therefore, I will now synthesize this information to generate the final response.
Synthesized final response with models/gemini-2.5-flash-preview-04-17
Done in 1m 2s
Processed 122,265 tokens in 5 steps

Numerous studies have investigated user interactions with voice assistants, and a subset of this research specifically addresses the negative consequences of system failures, including those stemming from misheard trigger words or erroneous Automatic Speech Recognition (ASR). While the term "aggression" is not commonly used in the examined literature to describe user responses, "frustration" and other negative psychological states are frequently reported outcomes of such technical deficiencies.

User Frustration and Negative Reactions to Voice Assistant Errors

Research indicates that voice assistant errors are a significant source of user frustration and can detrimentally impact user trust and overall experience. A mixed-methods paper investigating user trust after voice assistant failures categorized different types of failures, including those related to the system incorrectly capturing user input, referred to as "Perception" failures (Baughan et al., 2023). These Perception failures encompass issues like noisy channels interfering with input, overcapture (the system listening for too long), truncation (the system cutting off input too early), and transcription errors (mishearing words).

Quantitative analysis in this paper revealed that Overcapture failures were particularly harmful, resulting in the lowest reported trust scores regarding the voice assistant's perceived ability and benevolence compared to other failure types (Baughan et al., 2023). Qualitatively, users described Overcapture as aggravating, annoying, and a waste of time. Transcription errors, where the system mishears words or struggles with variations in speech (accents, names, foreign languages), were also found to negatively impact perceptions of benevolence and contributed to frustration and annoyance (Baughan et al., 2023). While Transcription errors did not statistically impact perceived ability as severely as Incorrect Actions, they were more detrimental to trust than errors due to noisy channels or truncation. Truncation errors, where the system stops listening prematurely, were also described as aggravating and annoying, increasing the time users needed to complete tasks (Baughan et al., 2023).

The aggregate effect of these Perception failures, stemming directly from the system's inability to accurately process user speech (including trigger words and subsequent commands), leads users to sometimes abandon the task for a period or simplify their interactions to avoid scenarios prone to such errors (Baughan et al., 2023).

Psychological Responses to ASR Failures

Beyond general frustration, studies have explored deeper psychological responses to ASR failures, particularly considering known biases in these systems. Research has shown that language technologies, including ASR, can exhibit differential error rates across demographic groups, notably higher rates for Black speakers compared to white speakers (Wenzel et al., 2023). A paper examining cross-race psychological responses to ASR failures found that Black participants interacting with a voice assistant exhibiting a high error rate reported significantly lower levels of positive affect, higher levels of self-consciousness, and reduced individual and collective self-esteem compared to Black participants in a low error rate condition (Wenzel et al., 2023). These findings were interpreted as consistent with the psychological impact of experiencing racial microaggressions, where persistent errors are perceived as subtle acts of bias reinforcing marginalization (Wenzel et al., 2023). While the paper used a Negative Affect scale (which includes items related to frustration), it did not find a statistically significant difference in negative affect between Black and white participants in the high error condition, although errors generally increased negative feelings for both groups. Aggression was not a measured response in this specific paper (Wenzel et al., 2023).
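The disparity at issue here is usually quantified as word error rate (WER) computed separately per speaker group. The sketch below is a minimal illustration of such a per-group comparison, assuming hypothetical reference/hypothesis transcript pairs and generic group labels; it is not the evaluation pipeline of any cited study.

```python
# Illustrative sketch: comparing ASR word error rate (WER) across speaker groups.
# Transcripts below are hypothetical; real audits use matched, representative corpora.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical (reference, ASR hypothesis) pairs, keyed by speaker group.
samples = {
    "group_a": [("turn on the kitchen lights", "turn on the kitchen lights")],
    "group_b": [("turn on the kitchen lights", "turn of the kitchen light")],
}
for group, pairs in samples.items():
    wer = sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
    print(f"{group}: mean WER = {wer:.2f}")
```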

Another paper noted that users are "often frustrated by voice assistants' frequent errors" and explored the feasibility of recognizing these errors and the resulting user frustration from visual cues such as facial reactions (Cuadra et al., 2021). While recognizing errors and frustration from soundless video was challenging, the paper suggested that it is possible and warrants further investigation, potentially through multimodal analysis combining audio and visual data (Cuadra et al., 2021).

Implications for Voice Assistant Design

The documented user frustration and negative psychological responses highlight the critical need for robust and accurate wake word detection and ASR systems. Ongoing research focuses on improving these core components to mitigate errors. Efforts include developing more efficient and accurate voice trigger detection models (Higuchi et al., 2020; Zhang et al., 2022; R et al., 2021), improving performance in noisy environments (Bonet et al., 2021), mitigating false triggers caused by acoustically similar sounds (Chen et al., 2021; Garg et al., 2021), and improving ASR performance for diverse speakers, including those with dysfluent speech or different linguistic backgrounds (Mitra et al., 2021; Wu et al., 2020). Furthermore, research into conversational error recovery mechanisms, such as allowing users to repeat or reformulate commands, aims to provide pathways for users and systems to recover gracefully from misinterpretations (Nguyen et al., 2021; Fazel-Zarandi et al., 2019; Galbraith et al., 2023). Addressing issues like overcapture and truncation through improved endpoint detection is also crucial for reducing frustration associated with input processing failures (Mallidi et al., 2018; Buddi et al., 2023).
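To make the engineering target concrete, the following is a minimal PyTorch sketch of the kind of small-footprint wake-word classifier these papers optimize: a few 1D convolutions over log-Mel features producing a trigger probability. The `TinyWakeWordNet` name, the layer sizes, and the 40-band/100-frame input are assumptions for illustration, not a reimplementation of any cited architecture.

```python
# Minimal sketch of a small-footprint wake-word (voice trigger) classifier.
# Generic illustration in PyTorch; not a reimplementation of any cited paper.
import torch
import torch.nn as nn

class TinyWakeWordNet(nn.Module):
    """1D-convolutional classifier over log-Mel frames -> P(wake word)."""

    def __init__(self, n_mels: int = 40, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            # Depthwise + pointwise pair keeps the parameter count small.
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2, groups=channels),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels, 1)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, n_mels, time) -> probability the clip contains the trigger phrase.
        h = self.encoder(log_mel)   # (batch, channels, time')
        h = h.mean(dim=-1)          # global average pool over time
        return torch.sigmoid(self.head(h)).squeeze(-1)

# Usage: score a hypothetical 1-second clip of 40-band log-Mel features (100 frames).
model = TinyWakeWordNet()
clip = torch.randn(1, 40, 100)
print(model(clip))  # fire the assistant only if this exceeds a tuned threshold
```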

In conclusion, research confirms that voice assistant errors, including the mishearing of trigger words and subsequent commands, are a significant cause of user frustration and contribute to negative user experiences and reduced trust. While aggression is not a commonly documented outcome, the psychological impacts can be more profound, particularly for users who experience differential error rates. Continued advancements in ASR, wake word detection, and dialogue management are essential to mitigate these issues and improve the usability and perceived fairness of voice assistants.

References (16)
1.
A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures
Baughan et al., 2023 | 11 citations
Despite huge gains in performance in natural language understanding via large language models in recent years, voice assistants still often fail to meet user expectations. In this study, we conducted a mixed-methods analysis of how voice assistant failures affect users' trust in their voice assistants. To illustrate how users have experienced these failures, we contribute a crowdsourced dataset of 199 voice assistant failures, categorized across 12 failure sources. Relying on interview and survey data, we find that certain failures, such as those due to overcapturing users' input, derail user trust more than others. We additionally examine how failures impact users' willingness to rely on voice assistants for future tasks. Users often stop using their voice assistants for specific tasks that result in failures for a short period of time before resuming similar usage. We demonstrate the importance of low stakes tasks, such as playing music, towards building trust after failures.
2.
Can Voice Assistants Be Microaggressors? Cross-Race Psychological Responses to Failures of Automatic Speech Recognition
Wenzel et al., 2023 | 8 citations
Language technologies have a racial bias, committing greater errors for Black users than for white users. However, little work has evaluated what effect these disparate error rates have on users themselves. The present study aims to understand if speech recognition errors in human-computer interactions may mirror the same effects as misunderstandings in interpersonal cross-race communication. In a controlled experiment (N=108), we randomly assigned Black and white participants to interact with a voice assistant pre-programmed to exhibit a high versus low error rate. Results revealed that Black participants in the high error rate condition, compared to Black participants in the low error rate condition, exhibited significantly higher levels of self-consciousness, lower levels of self-esteem and positive affect, and less favorable ratings of the technology. White participants did not exhibit this disparate pattern. We discuss design implications and the diverse research directions to which this initial study aims to contribute.
3.
Look at Me When I Talk to You: A Video Dataset to Enable Voice Assistants to Recognize Errors
Cuadra et al., 2021 | 8 citations
People interacting with voice assistants are often frustrated by voice assistants' frequent errors and inability to respond to backchannel cues. We introduce an open-source video dataset of 21 participants' interactions with a voice assistant, and explore the possibility of using this dataset to enable automatic error recognition to inform self-repair. The dataset includes clipped and labeled videos of participants' faces during free-form interactions with the voice assistant from the smart speaker's perspective. To validate our dataset, we emulated a machine learning classifier by asking crowdsourced workers to recognize voice assistant errors from watching soundless video clips of participants' reactions. We found trends suggesting it is possible to determine the voice assistant's performance from a participant's facial reaction alone. This work posits elicited datasets of interactive responses as a key step towards improving error recognition for repair for voice assistants in a wide variety of applications.
4.
Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
Higuchi et al., 2020 | 17 citations
We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device. Therefore, having small memory and compute cost is crucial for a voice trigger detection system. Recently, singular value decomposition filters (SVDFs) has been used for end-to-end voice trigger detection. The SVDFs approximate a fully-connected layer with a low rank approximation, which reduces the number of model parameters. In this work, we propose S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieve 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improve the FRR up to 12.2% relative.
5.
WakeUpNet: A Mobile-Transformer based Framework for End-to-End Streaming Voice Trigger
Zhang et al., 2022 | 2 citations
End-to-end models have gradually become the main technical stream for voice trigger, aiming to achieve an utmost prediction accuracy but with a small footprint. In present paper, we propose an end-to-end voice trigger framework, namely WakeupNet, which is basically structured on a Transformer encoder. The purpose of this framework is to explore the context-capturing capability of Transformer, as sequential information is vital for wakeup-word detection. However, the conventional Transformer encoder is too large to fit our task. To address this issue, we introduce different model compression approaches to shrink the vanilla one into a tiny one, called mobile-Transformer. To evaluate the performance of mobile-Transformer, we conduct extensive experiments on a large public-available dataset HiMia. The obtained results indicate that introduced mobile-Transformer significantly outperforms other frequently used models for voice trigger in both clean and noisy scenarios.
6.
EfficientWord-Net: An Open Source Hotword Detection Engine based on One-shot Learning
R et al., 2021 | 3 citations
Voice assistants like Siri, Google Assistant, Alexa etc. are used widely across the globe for home automation, these require the use of special phrases also known as hotwords to wake it up and perform an action like "Hey Alexa!", "Ok Google!" and "Hey Siri!" etc. These hotwords are detected with lightweight real-time engines whose purpose is to detect the hotwords uttered by the user. This paper presents the design and implementation of a hotword detection engine based on one-shot learning which detects the hotword uttered by the user in real-time with just one or few training samples of the hotword. This approach is efficient when compared to existing implementations because the process of adding a new hotword in the existing systems requires enormous amounts of positive and negative training samples and the model needs to retrain for every hotword. This makes the existing implementations inefficient in terms of computation and cost. The architecture proposed in this paper has achieved an accuracy of 94.51%.
7.
Speech Enhancement for Wake-Up-Word detection in Voice Assistants
Bonet et al., 2021 | 10 citations
Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants. A very common issue of voice assistants is that they get easily activated by background noise like music, TV or background speech that accidentally triggers the device. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises. The SE model is a fully-convolutional denoising auto-encoder at waveform level and is trained using a log-Mel Spectrogram and waveform reconstruction losses together with the BCE loss of a simple WUW classification network. A new database has been purposely prepared for the task of recognizing the WUW in challenging conditions containing negative samples that are very phonetically similar to the keyword. The database is extended with public databases and an exhaustive data augmentation to simulate different noises and environments. The results obtained by concatenating the SE with a simple and state-of-the-art WUW detectors show that the SE does not have a negative impact on the recognition rate in quiet environments while increasing the performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.
8.
FakeWake: Understanding and Mitigating Fake Wake-up Words of Voice Assistants
Chen et al., 2021 | 19 citations
In the area of Internet of Things (IoT) voice assistants have become an important interface to operate smart speakers, smartphones, and even automobiles. To save power and protect user privacy, voice assistants send commands to the cloud only if a small set of pre-registered wake-up words are detected. However, voice assistants are shown to be vulnerable to the FakeWake phenomena, whereby they are inadvertently triggered by innocent-sounding fuzzy words. In this paper, we present a systematic investigation of the FakeWake phenomena from three aspects. To start with, we design the first fuzzy word generator to automatically and efficiently produce fuzzy words instead of searching through a swarm of audio materials. We manage to generate 965 fuzzy words covering 8 most popular English and Chinese smart speakers. To explain the causes underlying the FakeWake phenomena, we construct an interpretable tree-based decision model, which reveals phonetic features that contribute to false acceptance of fuzzy words by wake-up word detectors. Finally, we propose remedies to mitigate the effect of FakeWake. The results show that the strengthened models are not only resilient to fuzzy words but also achieve better overall performance on original training datasets.
9.
Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
Garg et al., 2021 | 9 citations
We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device. We propose a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features. The proposed joint model yields an average 18% relative reduction in false reject rate (FRR) for the VTD task at a given false alarm rate. Moreover, our model suppresses 95% of the false triggers with an additional one second of post-trigger audio. Finally, on-device measurements show 32% reduction in runtime memory and 56% reduction in inference time compared to non-streaming version of the model.
10.
Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
Mitra et al., 2021 | 24 citations
Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
11.
See what I'm saying? Comparing Intelligent Personal Assistant use for Native and Non-Native Language Speakers
Wu et al., 2020 | 59 citations
Limited linguistic coverage for Intelligent Personal Assistants (IPAs) means that many interact in a non-native language. Yet we know little about how IPAs currently support or hinder these users. Through native (L1) and non-native (L2) English speakers interacting with Google Assistant on a smartphone and smart speaker, we aim to understand this more deeply. Interviews revealed that L2 speakers prioritised utterance planning around perceived linguistic limitations, as opposed to L1 speakers prioritising succinctness because of system limitations. L2 speakers see IPAs as insensitive to linguistic needs resulting in failed interaction. L2 speakers clearly preferred using smartphones, as visual feedback supported diagnoses of communication breakdowns whilst allowing time to process query results. Conversely, L1 speakers preferred smart speakers, with audio feedback being seen as sufficient. We discuss the need to tailor the IPA experience for L2 users, emphasising visual feedback whilst reducing the burden of language production.
12.
User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems
Nguyen et al., 2021 | 3 citations
Recognition errors are common in human communication. Similar errors often lead to unwanted behaviour in dialogue systems or virtual assistants. In human communication, we can recover from them by repeating misrecognized words or phrases; however in human-machine communication this recovery mechanism is not available. In this paper, we attempt to bridge this gap and present a system that allows a user to correct speech recognition errors in a virtual assistant by repeating misunderstood words. When a user repeats part of the phrase the system rewrites the original query to incorporate the correction. This rewrite allows the virtual assistant to understand the original query successfully. We present an end-to-end 2-step attention pointer network that can generate the rewritten query by merging together the incorrectly understood utterance with the correction follow-up. We evaluate the model on data collected for this task and compare the proposed model to a rule-based baseline and a standard pointer network. We show that rewriting the original query is an effective way to handle repetition-based recovery and that the proposed model outperforms the rule based baseline, reducing Word Error Rate by 19% relative at 2% False Alarm Rate on annotated data.
13.
Investigation of Error Simulation Techniques for Learning Dialog Policies for Conversational Error Recovery
Fazel-Zarandi et al., 2019 | 14 citations
Training dialog policies for speech-based virtual assistants requires a plethora of conversational data. The data collection phase is often expensive and time consuming due to human involvement. To address this issue, a common solution is to build user simulators for data generation. For the successful deployment of the trained policies into real world domains, it is vital that the user simulator mimics realistic conditions. In particular, speech-based assistants are heavily affected by automatic speech recognition and language understanding errors, hence the user simulator should be able to simulate similar errors. In this paper, we review the existing error simulation methods that induce errors at audio, phoneme, text, or semantic level; and conduct detailed comparisons between the audio-level and text-level methods. In the process, we improve the existing text-level method by introducing confidence score prediction and out-of-vocabulary word mapping. We also explore the impact of audio-level and text-level methods on learning a simple clarification dialog policy to recover from errors to provide insight on future improvement for both approaches.
14.
An Analysis of Dialogue Repair in Virtual Voice Assistants
Galbraith et al., 2023 | 2 citations
Language speakers often use what are known as repair initiators to mend fundamental disconnects that occur between them during verbal communication. Previous research in this field has mainly focused on the human-to-human use of repair initiator. We proposed an examination of dialogue repair structure wherein the dialogue initiator is human and the party that initiates or responds to the repair is a virtual assistant. This study examined the use of repair initiators in both English and Spanish with two popular assistants, Google Assistant and Apple's Siri. Our aim was to codify the differences, if any, in responses by voice assistants to dialogues in need of repair as compared to human-human dialogues also in need of repair. Ultimately the data demonstrated that not only were there differences between human-assistant and human-human dialogue repair strategies, but that there were likewise differences among the assistants and the languages studied.
15.
Device-directed Utterance Detection
Mallidi et al., 2018 | 48 citations
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false wake-ups or unintended interactions as well as enabling wake-word free follow-up queries. Consider the example interaction: "Computer, play music", "Computer, reduce the volume". In this interaction, the user needs to repeat the wake-word (Computer) for the second query. To allow for more natural interactions, the device could immediately re-enter listening state after the first query (without wake-word repetition) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks trained on acoustic features and automatic speech recognition (ASR) 1-best hypotheses, respectively. A feed-forward deep neural network (DNN) is then trained to combine the acoustic and 1-best embeddings, derived from the LSTMs, with features from the ASR decoder. Experimental results show that ASR decoder, acoustic embeddings, and 1-best embeddings yield an equal-error-rate (EER) of 9.3%, 10.9% and 20.1%, respectively. Combination of the features resulted in a 44% relative improvement and a final EER of 5.2%.
16.
Efficient Multimodal Neural Networks for Trigger-less Voice Assistants
Buddi et al., 2023 | 4 citations
The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) Better understands temporal correlation between audio and gesture data, leading to precise invocations (2) Generalizes to a wide range of environments and scenarios (3) Is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times (4) Improves productivity in asset development processes.
Are there any studies that refer to the phenomenon of criminal activity beginning with lower-stakes deviant behavior, such as serial killers who start off hurting animals? Could there be a risk of VBA users getting frustrated and engaging in low-risk aggression at the VBA, and that aggression eventually seeping into other areas of their lives, where it impacts their interpersonal relationships?
The user is asking about the phenomenon of criminal activity starting with lower-stakes deviant behavior and whether frustration with voice assistants (VBAs) and subsequent low-risk aggression towards them could potentially escalate and affect interpersonal relationships. Given the available tools are focused on searching and analyzing arXiv papers (primarily computer science), the first part of the question about criminal behavior progression is likely outside the scope of what I can find directly. However, the second part, regarding user frustration and aggression towards VBAs and potential transfer to interpersonal relationships, relates to Human-Computer Interaction (HCI) and psychology, which might be covered in relevant arXiv papers. I will perform two parallel searches on arXiv: one for 'user frustration voice assistant' to address the core issue of negative user experiences with VBAs and another for 'aggression human computer interaction' to see if the concept of aggression towards technology or in HCI contexts is discussed. This will help determine if there are studies that touch upon the user's concern about the potential for low-risk aggression towards VBAs having broader impacts.
Searching arXiv for "user frustration voice assistant"
Searching arXiv for "aggression human computer interaction"
I have identified a paper (1808.01076) from the previous search that directly addresses the potential link between aggression towards artificial agents (robots) and broader antisocial tendencies, which is highly relevant to the user's second question about aggression towards VBAs impacting interpersonal relationships. While the first part of the user's question about criminal behavior progression is likely outside the scope of arXiv, understanding the findings of paper 1808.01076 is crucial for the second part. Therefore, I will use the paper prompter tool to get more detailed information about this paper.
Prompting "Verbal Disinhibition towards Robots is Associated with General Antisociality": "Summarize the methodology and key findings of this paper regarding the link between verbal disinhibition/aggression towards artificial agents (robots) and general antisociality in broader interactions. Does it suggest a causal link or a correlation?" (1808.01076)
I have gathered information about user frustration with voice assistants and explored research on aggression in human-computer interaction, specifically focusing on a paper that correlates aggression towards robots with general antisocial behavior. I need to synthesize this information to answer the user's query, acknowledging the limitations of the search to the arXiv database and the specificity of the user's hypothetical scenario.
Synthesized final response with models/gemini-2.5-flash-preview-04-17
We ran into an issue with gemini-2.5-flash-preview-04-17, retrying with gemini-2.5-pro-preview-03-25...
Synthesized final response with models/gemini-2.5-pro-preview-05-06
Done in 1m 40s
Processed 130,995 tokens in 4 steps

The query poses two distinct questions. The first concerns the "gateway hypothesis" in criminology: the idea that engagement in lower-stakes deviant behavior can precede more serious criminal acts, exemplified by serial killers purportedly starting with animal cruelty. The second asks whether frustration stemming from Voice-Based Assistants (VBAs) mishearing trigger words, and the low-risk aggression towards the VBA that may follow, could escalate or transfer, thereby adversely impacting users' interpersonal relationships.

The provided research corpus, primarily sourced from arXiv, focuses on Human-Computer Interaction (HCI), user experiences with voice assistants, and aggression in digital contexts. As such, it can substantially address the second part of your query. However, the criminological "gateway hypothesis" falls outside the typical scope of this research domain.

The "Gateway Hypothesis" in Criminology

The specific criminological theory regarding a progression from lower-stakes deviant behaviors (e.g., animal cruelty) to more severe criminal acts (e.g., serial homicide) is a complex subject extensively studied within criminology, psychology, and forensic science. The available arXiv research, which centers on computational and HCI-related topics, does not contain studies directly investigating or validating this criminological hypothesis. Answering this part of your query would necessitate a review of literature from those specialized fields, which is beyond the purview of the current dataset.

User Frustration and Negative Affect from Voice-Based Assistant (VBA) Malfunctions

Research extensively documents that malfunctions in VBAs, including mishearing trigger words or commands, are a significant source of user frustration and can lead to a range of negative psychological outcomes.

A mixed-methods paper (Baughan et al., 2023) investigating user trust after voice assistant failures identified various failure types, including "Perception" failures, which encompass the system incorrectly capturing user input. These include issues like noisy channels, overcapture (system listening too long), truncation (system cutting off input too early), and transcription errors (mishearing words). The paper found that Overcapture failures were particularly detrimental, resulting in the lowest reported trust scores concerning the VBA's perceived ability and benevolence. Users qualitatively described Overcapture as "aggravating," "annoying," and a "waste of time." Transcription errors, where the system mishears words or struggles with speech variations (e.g., accents, names, foreign languages), also negatively impacted perceptions of benevolence and contributed to user "frustration" and "annoyance." While not impacting perceived ability as severely as "Incorrect Action" failures, transcription errors were more damaging to trust than errors due to noisy channels or truncation. Truncation errors were similarly described as "aggravating" and "annoying," increasing task completion time. The cumulative effect of these Perception failures often led users to temporarily abandon tasks or simplify their interactions to avoid error-prone scenarios.

Furthermore, research into cross-race psychological responses to failures in Automatic Speech Recognition (ASR) systems (Wenzel et al., 2023) highlights more profound impacts. Given that ASR can exhibit disparate error rates across demographic groups (e.g., higher for Black speakers than white speakers), this paper explored the psychological effects. Black participants interacting with a high-error-rate voice assistant reported significantly lower positive affect, higher self-consciousness, and reduced individual and collective self-esteem compared to Black participants in a low-error-rate condition. These findings were interpreted as consistent with the psychological impact of experiencing racial microaggressions, where persistent system errors are perceived as subtle acts of bias reinforcing marginalization. While the paper used a Negative Affect scale (which includes frustration-related items), it underscores that the psychological toll of ASR failures can extend beyond simple frustration to impact self-perception and emotional well-being, particularly when these failures intersect with societal biases.

These studies confirm that VBA malfunctions are a potent source of negative user experience, leading to frustration, reduced trust, and, in some contexts, more complex adverse psychological responses.

Aggression Towards Artificial Agents and its Relation to Broader Antisocial Tendencies

The question of whether aggression directed towards technology, such as VBAs or robots, might correlate with or indicate broader antisocial tendencies has been explored. One paper, "Verbal Disinhibition towards Robots is Associated with General Antisociality" (Strait et al., 2018), directly investigated this.

Methodology:

The researchers aimed to determine if verbal aggression towards robots was an isolated phenomenon or part of a larger pattern of antisocial behavior. They used Twitter as a data source for unsupervised human-agent interactions, focusing on two high-profile robots with Twitter accounts (Bina48 and Sophia). Forty independent Twitter users were selected: 20 who had posted at least one "abusive" tweet directed at one of the robots and 20 who had posted "non-abusive" tweets at them. "Abusive" content was defined as dehumanizing material, including objectification, sexualization, racist remarks, or generally offensive comments (e.g., calling the robot stupid, expressing violent/hostile intent). For each of these 40 users, 50 additional tweets (25 before and 25 after the target robot-directed tweet) were collected, totaling 2,000 tweets for analysis. These general tweets were then coded for abusiveness using the same criteria to determine the frequency of abuse in each user's broader Twitter communication. An ANOVA was conducted to compare the frequency of abusive content in general tweets between the "abusive towards robots" group and the "non-abusive towards robots" group.

Key Findings:

The paper found a significant association: users who were abusive towards the robots were significantly more frequently abusive in their general tweeting (M = .15 frequency of abuse in general tweets) compared to users who were non-abusive towards the robots (M = .03 frequency of abuse in general tweets). There was a significant main effect of "user type" on the frequency of dehumanizing content in users' broader Twitter communications.
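To illustrate the shape of this comparison, the sketch below runs a one-way ANOVA (via scipy) on fabricated per-user abuse frequencies chosen only to sit near the reported group means; it is not the study's data or its exact analysis.

```python
# Toy illustration of the between-groups comparison described above.
# Per-user abuse frequencies are fabricated, chosen only to sit near the
# reported group means (M = .15 vs M = .03); not the study's actual data.
from scipy import stats

abusive_toward_robots     = [0.10, 0.22, 0.08, 0.18, 0.14, 0.20, 0.12, 0.16]
non_abusive_toward_robots = [0.02, 0.00, 0.06, 0.04, 0.03, 0.01, 0.05, 0.03]

f_stat, p_value = stats.f_oneway(abusive_toward_robots, non_abusive_toward_robots)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates the two user types differ in how often their
# general tweets contain abusive content -- an association, not a causal claim.
```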

Nature of the Link (Correlation vs. Causation):

It is critical to note that this paper demonstrates a correlation (or association), not a causal link. The research design was observational; it identified individuals based on pre-existing behavior (abusing a robot) and then observed their other behaviors (general tweeting). This methodology can identify that two behaviors co-occur or are linked but cannot establish that one causes the other. The authors themselves discuss that the observed association could stem from aggression towards robots being linked to a more stable antisocial personality trait or resulting from a temporary state of general negative affect. Both interpretations suggest an underlying factor that is correlated with both types of abusive behavior, rather than aggression towards robots directly causing or leading to broader antisocial tweeting.
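The point can be made concrete with a toy simulation: if a single latent trait drives both robot-directed abuse and general abusiveness, the two behaviors correlate even though neither causes the other. The weights and sample size below are arbitrary assumptions for illustration only.

```python
# Toy simulation: a shared latent trait induces correlation between two behaviors
# without any causal path between them. Parameters are arbitrary and illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users = 10_000

antisociality = rng.normal(size=n_users)                         # latent trait
robot_abuse   = 0.8 * antisociality + rng.normal(size=n_users)   # behavior 1
general_abuse = 0.8 * antisociality + rng.normal(size=n_users)   # behavior 2 (no direct link to behavior 1)

r = np.corrcoef(robot_abuse, general_abuse)[0, 1]
print(f"correlation despite no direct causal path: r = {r:.2f}")  # roughly 0.4 with these weights
```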

Examining the Potential for Escalation and Impact on Interpersonal Relationships

The core of your second question is whether frustration-induced, low-risk aggression towards a VBA could escalate or transfer ("seep into other areas"), negatively impacting interpersonal relationships. Based on the provided research, there is no direct evidence to support this specific escalatory pathway.

The paper by Strait et al. (2018) suggests that individuals exhibiting verbal aggression towards robots may already possess broader antisocial tendencies. This implies that aggression towards technology might be another manifestation of a pre-existing disposition, rather than the technology interaction serving as a catalyst or training ground that creates or escalates aggression which then transfers to human interactions. The paper did not investigate whether interacting aggressively with a robot causes an increase in subsequent interpersonal aggression or if this behavior starts with technology and then spreads.

While research clearly shows that VBA malfunctions cause user frustration and negative affect (Baughan et al., 2023; Wenzel et al., 2023), the leap from this frustration to overt, low-risk aggression towards the VBA, and then a subsequent escalation and transfer of this aggression to interpersonal relationships, is a multi-step hypothesis not directly substantiated by the available studies. The psychological impacts noted in Wenzel et al. (2023) (e.g., lower self-esteem, negative affect) are significant and could conceivably have indirect ramifications on an individual's mood and well-being, which might, in turn, affect their interactions. However, a direct causal chain leading to increased interpersonal aggression as a learned or escalated behavior from VBA interactions is not established in this corpus.

The frustration experienced by users is a response to system failure. While this frustration is a negative emotional state, the current research does not offer evidence that venting this frustration on a VBA acts as a "gateway" behavior that then cultivates or normalizes aggression in human-to-human contexts. It is more plausible, based on Strait et al. (2018), that individuals with pre-existing aggressive or antisocial tendencies might express these tendencies towards various targets, including technology.

Another paper, "This robot stinks! Differences between perceived mistreatment of robot and computer partners" (Carlson et al., 2017), found that human observers perceived mistreatment directed by a confederate towards a robot differently than towards a computer, feeling more sympathy for the robot and believing it to be more emotionally capable. This suggests that humans may ascribe some level of social presence or animacy to robots, which could influence reactions to their "mistreatment," but it doesn't directly address user-initiated frustration-aggression cycles and their transference.

Conclusion

In summary, while the criminological "gateway hypothesis" is outside the scope of the provided HCI-focused research, studies on voice-based assistants robustly confirm that system malfunctions, such as mishearing trigger words, are a significant source of user frustration and can lead to various negative psychological states. Research has also found a correlation between verbal aggression directed at artificial agents (like robots on social media) and a higher propensity for abusive language in users' general online communications. However, this is an association, suggesting that such behaviors may stem from common underlying antisocial traits rather than indicating a causal pathway where aggression towards technology trains or escalates into interpersonal aggression. The specific hypothesis that frustration-induced, low-risk aggression towards a VBA could directly lead to an escalation of aggressive behaviors impacting interpersonal relationships is not directly supported by the current body of research presented.

References (4)
1.
A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures
Baughan et al., 2023 | 11 citations
Despite huge gains in performance in natural language understanding via large language models in recent years, voice assistants still often fail to meet user expectations. In this study, we conducted a mixed-methods analysis of how voice assistant failures affect users' trust in their voice assistants. To illustrate how users have experienced these failures, we contribute a crowdsourced dataset of 199 voice assistant failures, categorized across 12 failure sources. Relying on interview and survey data, we find that certain failures, such as those due to overcapturing users' input, derail user trust more than others. We additionally examine how failures impact users' willingness to rely on voice assistants for future tasks. Users often stop using their voice assistants for specific tasks that result in failures for a short period of time before resuming similar usage. We demonstrate the importance of low stakes tasks, such as playing music, towards building trust after failures.
2.
Can Voice Assistants Be Microaggressors? Cross-Race Psychological Responses to Failures of Automatic Speech Recognition
Wenzel et al., 2023 | 8 citations
Language technologies have a racial bias, committing greater errors for Black users than for white users. However, little work has evaluated what effect these disparate error rates have on users themselves. The present study aims to understand if speech recognition errors in human-computer interactions may mirror the same effects as misunderstandings in interpersonal cross-race communication. In a controlled experiment (N=108), we randomly assigned Black and white participants to interact with a voice assistant pre-programmed to exhibit a high versus low error rate. Results revealed that Black participants in the high error rate condition, compared to Black participants in the low error rate condition, exhibited significantly higher levels of self-consciousness, lower levels of self-esteem and positive affect, and less favorable ratings of the technology. White participants did not exhibit this disparate pattern. We discuss design implications and the diverse research directions to which this initial study aims to contribute.
3.
Verbal Disinhibition towards Robots is Associated with General Antisociality
Strait et al., 2018 | 4 citations
The emergence of agentic technologies (e.g., robots) in increasingly public realms (e.g., social media) has revealed surprising antisocial tendencies in human-agent interactions. In particular, there is growing indication of people's propensity to act aggressively towards such systems - without provocation and unabashedly so. Towards understanding whether this aggressive behavior is anomalous or whether it is associated with general antisocial tendencies in people's broader interactions, we examined people's verbal disinhibition towards two artificial agents. Using Twitter as a corpus of free-form, unsupervised interactions, we identified 40 independent Twitter users who tweeted abusively or non-abusively at one of two high-profile robots with Twitter accounts (TMI's Bina48 and Hanson Robotics' Sophia). Analysis of 50 of each user's tweets most proximate to their tweet at the respective robot (N=2,000) shows people's aggression towards the robots to be associated with more frequent abuse in their general tweeting. The findings thus suggest that disinhibition towards robots is not necessarily a pervasive tendency, but rather one driven by individual differences in antisociality. Nevertheless, such unprovoked abuse highlights a need for attention to the reception of agentic technologies in society, as well as the necessity of corresponding capacities to recognize and respond to antisocial dynamics.
4.
This robot stinks! Differences between perceived mistreatment of robot and computer partners
Carlson et al., 2017 | 6 citations
Robots (and computers) are increasingly being used in scenarios where they interact socially with people. How people react to these agents is telling about the perceived animacy of such agents. Mistreatment of robots (or computers) by co-workers might provoke such telling reactions. The purpose of this study was to discover if people perceived mistreatment directed towards a robot any differently than toward a computer. This will provide some understanding of how people perceive robots in collaborative social settings. We conducted a between-subjects study with 80 participants. Participants worked cooperatively with either a robot or a computer which acted as the "recorder" for the group. A confederate either acted aggressively or neutrally towards the "recorder." We hypothesized that people would not socially accept mistreatment towards an agent that they felt was intelligent and similar to themselves; that participants would perceive the robot as more similar in appearance and emotional capability to themselves than a computer; and would observe more mistreatment. The final results supported our hypothesis; the participants observed mistreatment in the robot, but not the computer. Participants felt significantly more sympathetic towards the robot and also believed that it was much more emotionally capable.