Position-Independent Clicking Sounds
- Position-independent clicking sounds are nonverbal, transient signals produced by mouth or teeth clicks that operate independently of speech context.
- Audio-based detection uses temporal convolutional networks and log-mel spectrograms to achieve high precision with minimal false positives.
- Vibration-based methods like STEALTHsense use accelerometer data and a temporal broadcasting neural network for real-time, hands-free click detection in smart glasses.
Position-independent clicking sounds are nonverbal acoustic or vibrational signals often generated by the mouth or teeth that can be reliably detected and employed as control events in human-computer interaction, regardless of their phonetic or anatomical origin. These signals—such as tongue clicks or teeth clicks—exhibit distinct transient signatures and are characterized by their lack of dependence on speech context or production location. Two prominent methodologies for their detection are grounded in recent work: (1) audio-based temporal convolutional modeling for mouth clicks in voice-assistive applications (Lea et al., 2022), and (2) vibration-based neural modeling for teeth clicks as hands-free controls for smart glasses (Mohapatra et al., 21 Aug 2024). Both paradigms prioritize robustness to user variation, ambient interference, and signal position within various behavioral sequences.
1. Signal Properties of Position-Independent Clicking Sounds
Position-independent clicking sounds may be produced via tongue snaps, teeth contacts, or similar impulsive gestures, resulting in brief, broadband transients in audio or vibrational sensor modalities. These events typically last between 15 ms and 100 ms and exhibit distinct spectral peaks and sharp temporal onsets. Crucially, unlike speech phonemes, their occurrence is not constrained by linguistic context, articulatory location, or anatomical factors, permitting encoding of intent independently of verbal behavior.
Distinct classes include:
- Mouth clicks (e.g., tongue snap, palate slap)
- Teeth clicks (e.g., dental occlusion, single/double clicks)

Measurement modalities are either acoustic (microphone-derived log-mel spectrograms) or vibrotactile (accelerometer-derived spectral–temporal features).
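As a concrete illustration of the acoustic modality, the sketch below computes the 64-d log-mel features at a 100 Hz frame rate used by the audio pipeline in Section 2; the 16 kHz sample rate and 40 ms window are assumptions not fixed by the source.

```python
import numpy as np
import librosa

def logmel_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """64-d log-mel spectrogram at a 100 Hz frame rate (hop = sr / 100).

    The sample rate and window length are illustrative; the source
    specifies only 64 mel bands and 100 Hz frames.
    """
    hop = sr // 100                       # 10 ms hop -> 100 frames/s
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=4 * hop, hop_length=hop, n_mels=64)
    return librosa.power_to_db(mel).T     # shape: (frames, 64)
```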
2. Model Architectures for Detection
Audio-based Detection (Voice Assistants)
The approach for detecting mouth-generated clicking sounds employs a Temporal Convolutional Network (TCN) inspired by QuartzNet (Lea et al., 2022). Pipeline details:
Step | Description | Output |
---|---|---|
Audio preprocessing | Log-mel spectrogram (64-d), 100 Hz frame rate | Feature frames |
TCN backbone | Stacked Conv1D (kernel 5, grouped, residual) | Frame embeddings |
Output layer | Sigmoid over $C$ classes (incl. click, speech, background) | Per-frame probabilities $p_{t,c}$ |
Post-processing | Event detection: threshold + temporal gating | Click events |
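Before the event-detection step, the backbone itself can be sketched in a few lines. Below is a minimal PyTorch sketch of the TCN backbone and output-layer rows; channel width, depth, and the class count are illustrative assumptions (the source specifies only 64-d inputs, kernel-5 grouped convolutions with residual connections, sigmoid outputs, and a 270 ms receptive field).

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """Grouped Conv1D block with a residual connection (kernel 5)."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=5,
                              padding=2, groups=groups)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                   # x: (batch, channels, frames)
        return self.act(self.norm(self.conv(x)) + x)

class ClickTCN(nn.Module):
    """Stacked TCN over log-mel frames -> per-frame class probabilities."""
    def __init__(self, n_mels: int = 64, channels: int = 128,
                 n_blocks: int = 6, n_classes: int = 3):
        # Width, depth, and n_classes are illustrative assumptions.
        super().__init__()
        self.stem = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.blocks = nn.Sequential(
            *[TCNBlock(channels) for _ in range(n_blocks)])
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, mels):                # mels: (batch, 64, frames)
        h = self.blocks(self.stem(mels))
        return torch.sigmoid(self.head(h))  # (batch, n_classes, frames)
```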
A click event is defined for class $c$ by:
- if $p_{t,c} \ge \tau$ for all frames $t$ in a window of $W$ consecutive frames,
- and no event for class $c$ was emitted in the past $50$ frames,
- then generate the event (click, $t$).

Typical hyperparameters: detection threshold $\tau$, $W = 7$ frames (70 ms), receptive field 270 ms.
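The rule above maps directly onto a small streaming post-processor. The sketch below assumes per-frame click probabilities at 100 Hz and leaves $\tau$ as a parameter, since the source does not fix its value.

```python
import numpy as np

def detect_clicks(probs: np.ndarray, tau: float, window: int = 7,
                  refractory: int = 50) -> list[int]:
    """Threshold + temporal gating over per-frame click probabilities.

    probs: (frames,) probabilities for the click class at 100 Hz.
    Fires at frame t when the last `window` frames all exceed `tau`
    and no event was emitted within the previous `refractory` frames.
    """
    events, last_event = [], -refractory
    for t in range(window - 1, len(probs)):
        if t - last_event < refractory:
            continue                           # temporal gating
        if np.all(probs[t - window + 1 : t + 1] >= tau):
            events.append(t)                   # event (click, t)
            last_event = t
    return events
```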
Vibration-based Detection (Smart Glasses)
Detection of teeth clicks using STEALTHsense is anchored in a temporal broadcasting-based neural network (Mohapatra et al., 21 Aug 2024). Pipeline:
Step | Description | Output |
---|---|---|
Filtering | Notch (60 Hz), bandpass (300 Hz–5 kHz) | Denoised signals |
Feature extraction | 13 log-mel + $\Delta$ + $\Delta\Delta$ + ZCR + energy | $41$-dim features |
NN architecture | Temporal encoder + temporal broadcasting | Pattern probabilities |
Output layer | Classifier on broadcasted features | Click/no-click pattern |
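The filtering row of this pipeline can be sketched with scipy; the filter order and Q factor are assumptions (the source specifies only the 60 Hz notch and the 300 Hz–5 kHz passband, which implies a sampling rate above 10 kHz).

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def denoise(x: np.ndarray, sr: float) -> np.ndarray:
    """60 Hz notch followed by a 300 Hz-5 kHz bandpass.

    Filter order and Q are illustrative; sr must exceed 10 kHz for
    the 5 kHz band edge to be valid.
    """
    b, a = iirnotch(w0=60.0, Q=30.0, fs=sr)   # remove mains interference
    x = filtfilt(b, a, x)
    b, a = butter(4, [300.0, 5000.0], btype="bandpass", fs=sr)
    return filtfilt(b, a, x)
```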
Broadcasting operation:

$$h'_t = h_t \oplus \bar{h}, \qquad \bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t,$$

where $h_t$ is the per-frame feature vector, $\bar{h}$ is temporally summarized by pooling over all $T$ frames and broadcast back to each frame, and $\oplus$ denotes concatenation.
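One plausible reading of this operation, sketched in PyTorch below: a mean-pooled summary of the window is broadcast back to every frame and concatenated with the per-frame features. The pooling and combination operators are assumptions; the source states only that a temporally summarized representation is broadcast.

```python
import torch

def temporal_broadcast(h: torch.Tensor) -> torch.Tensor:
    """h: (batch, frames, features) -> (batch, frames, 2 * features).

    Mean-pools over time, broadcasts the summary h_bar to every frame,
    and concatenates it with the per-frame features h_t (one plausible
    reading of the broadcasting operation above).
    """
    h_bar = h.mean(dim=1, keepdim=True)   # temporally summarized
    h_bar = h_bar.expand_as(h)            # broadcast across frames
    return torch.cat([h, h_bar], dim=-1)
```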
STEALTHsense is computationally efficient (88K parameters, 7.14 MMAC per inference), allowing real-time operation on embedded devices.
3. Robustness to Position, User, and Environmental Variation
Both audio and vibration-based methods are evaluated on datasets incorporating extensive user variation:
- Mouth click detection: 710 speakers, diverse ages, device types, recording distances; augmented with speech and background “aggressor” data (Lea et al., 2022).
- Teeth click detection: 21 participants, unique dental anatomies; robust training includes both click and no-click patterns with speech, chewing, and motion artifacts (Mohapatra et al., 21 Aug 2024).
Model generalization is tested via cross-user and cross-environment evaluation:
- For audio approaches, segment-level precision and recall are 88.6%/88.4%, with near-zero false positives for clicks in speech corpora (LibriSpeech).
- For STEALTHsense, balanced accuracy for click detection reaches 0.93 (clean) and 0.91 (noisy), outperforming SVC/XGBoost (0.74–0.76).
Data augmentation (noise, gain, temporal shift) contributes a 5% improvement on “hard” samples.
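These augmentations are standard waveform transforms; a minimal sketch, with the gain range, SNR range, and shift fraction chosen as assumptions:

```python
import numpy as np

def augment(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Noise, gain, and temporal-shift augmentation for click windows."""
    # Random gain between -6 and +6 dB (illustrative range).
    x = x * 10.0 ** (rng.uniform(-6.0, 6.0) / 20.0)
    # Additive white noise at a random SNR between 5 and 30 dB.
    snr_db = rng.uniform(5.0, 30.0)
    noise_rms = np.sqrt(np.mean(x ** 2)) / 10.0 ** (snr_db / 20.0)
    x = x + rng.normal(0.0, noise_rms, size=x.shape)
    # Circular temporal shift of up to +/-10% of the window.
    shift = int(rng.integers(-(len(x) // 10), len(x) // 10 + 1))
    return np.roll(x, shift)
```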
4. Personalization for Improved Detection
To optimize detection for users whose signal production deviates from the population norm:
- Audio-based models: Personalization leverages 256-d embeddings from the pre-trained TCN. Few-shot learning (1–5 samples) fine-tunes the final classifier layer using frame-wise binary cross-entropy; a minimal fine-tuning sketch appears after this list. This procedure improved F1 scores by over 60% for clicks and similar short events in cases where the generic model failed (Lea et al., 2022).
- STEALTHsense: Future work is proposed for few-shot and personalized fitting, to handle dental and click style diversity more effectively (Mohapatra et al., 21 Aug 2024).
Attempts to employ meta-learning paradigms (MAML, ProtoNets) did not yield additional gains in the audio approach, suggesting further investigation is needed.
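A minimal sketch of the audio-model personalization step, assuming precomputed 256-d TCN embeddings and a single linear head; the optimizer, learning rate, and step count are assumptions.

```python
import torch
import torch.nn as nn

def personalize(head: nn.Linear, embeddings: torch.Tensor,
                labels: torch.Tensor, steps: int = 100,
                lr: float = 1e-3) -> None:
    """Few-shot fine-tuning of the final classifier layer only.

    embeddings: (frames, 256) outputs of the frozen pre-trained TCN
    for 1-5 user-provided examples; labels: (frames,) binary targets.
    """
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()           # frame-wise binary CE
    for _ in range(steps):
        opt.zero_grad()
        logits = head(embeddings).squeeze(-1)  # (frames,)
        loss_fn(logits, labels.float()).backward()
        opt.step()

# Hypothetical usage: head = nn.Linear(256, 1), then
# personalize(head, tcn_embeddings, frame_labels).
```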
5. Practical Applications and Real-world Impact
Accessibility (Voice Assistants)
Position-independent click detection enables alternative input mechanisms for those with speech disorders, stuttering, or limited motor control. The described system supports access to existing technology platforms by enabling click-based segment selection and command input (Lea et al., 2022). Precision and recall make it viable for routine interaction; false positive rates are negligible on typical speech data.
Smart Glasses Control
STEALTHsense is applied for discreet, hands-free interaction. Live user tests demonstrated robust music playback control via single/double teeth clicks; users reported an adoption score of $3.74/5$ and rated accuracy at $3+$ on a $5$-point scale (Mohapatra et al., 21 Aug 2024). Balanced accuracy remains high in the presence of noise and motion.
Authentication and Extension
Preliminary analyses reveal person-identification accuracy of 0.94 for vibration patterns in the STEALTHsense context. However, users express reservations regarding their use for sensitive functions (e.g., financial authentication), highlighting privacy considerations and the need for federated learning.
6. Limitations, Challenges, and Prospects
Limitations
- Overlap with everyday sounds: Clicks may resemble non-target transients; training with “aggressor” classes reduces false positives from 304/hr to 4.65/hr on speech test data (Lea et al., 2022).
- Temporal alignment: Precise, low-latency detection is achieved by extending segment label boundaries to cover click onsets.
- Position-independence: Ensuring invariance to arbitrary sequence position is challenging; click events may occur solo or within running speech. Additional post-processing (e.g., enforced silence interval post-event) could further reduce spurious triggers, at some latency cost (Lea et al., 2022).
Prospects
Future directions encompass:
- Improved personalization via meta-learning and few-shot adaptation.
- Adaptive post-processing (thresholding, context-based gating).
- Open-set learning to register new gesture patterns incrementally (Mohapatra et al., 21 Aug 2024).
- Extension across languages, dental anatomies, and specific involuntary conditions (e.g., bruxism).
- Privacy-preserving identification and signal processing via federated models.
7. Comparative Evaluation and Generalization
Both approaches are quantitatively benchmarked:
Method | Performance | Parameters | Modality |
---|---|---|---|
TCN (audio clicks) | 88.6% / 88.4% (precision/recall) | n/a | Microphone |
STEALTHsense | 0.93 balanced accuracy (clean) | 88K | Accelerometer |
SVC/XGBoost | 0.74–0.76 balanced accuracy | n/a | Accelerometer |
STEALTHsense’s temporal broadcasting outperforms feature-axis broadcasting and attention-enhanced models (which increase complexity without noise robustness benefits). Loss functions for learning are cross-entropy based, e.g.,

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c} y_{i,c} \log \hat{y}_{i,c},$$

where $N$ is the batch size, $y_{i,c}$ the target label, and $\hat{y}_{i,c}$ the predicted probability for class $c$.
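In PyTorch terms, this batch-averaged loss corresponds to the built-in cross-entropy; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 2)               # batch of N = 8, two classes
targets = torch.randint(0, 2, (8,))      # click / no-click labels
loss = F.cross_entropy(logits, targets)  # mean over the batch, as above
```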
A plausible implication is that temporal context modeling and gating are necessary for both high precision and low false positive rates in position-independent click detection across signal types.
Position-independent clicking sound detection now underpins alternative interaction paradigms for accessibility technology and emerging smart glass platforms. Both audio and vibration-based approaches demonstrate that robust, user-invariant click detection is feasible at scale, with ongoing research focusing on personalization, adaptability, and privacy in diverse real-world contexts.