- The paper presents a self-attention transformer model that synthesizes wearable data by predicting next-day metrics, producing sequences that score well on cosine similarity and dynamic time warping (DTW) distance.
- It employs aggregated FitBit data and autoregressive predictions to generate multi-modal health metrics including heart rate, sleep, and step counts.
- The approach offers a practical solution for overcoming data scarcity and privacy issues, enhancing simulation and tool development in healthcare research.
Synthetic Generation of Wearable Health Data Using Self-Attention Models
Introduction
Healthcare research depends on high-quality health data, which is often scarce and costly to obtain. Recent work by Evidation Health and Sage Bionetworks addresses this challenge by generating synthetic wearable data. Using a multi-task self-attention model, the paper demonstrates the generation of realistic data covering resting heart rate, sleep, and step counts of the kind collected from consumer wearables.
Methodology
Data Preparation
The foundation of this model is a dataset sourced from the DiSCover Project, featuring day-level data from FitBit trackers worn by 10,000 participants over a year. Pre-processing involved aggregating minute-level data to daily summaries, handling missing data through imputation, and encoding the continuous variables for modeling.
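The three pre-processing steps can be illustrated with a minimal sketch. The summary does not say which imputation or encoding scheme the authors used, so the mean-fill imputation and equal-width binning below are assumptions, and the function names `aggregate_daily`, `impute_missing`, and `discretize` are hypothetical.

```python
from statistics import mean


def aggregate_daily(minute_records):
    """Collapse (day_index, steps) minute-level records into per-day totals."""
    totals = {}
    for day, steps in minute_records:
        totals[day] = totals.get(day, 0) + steps
    return totals


def impute_missing(daily, n_days):
    """Fill days with no data using the mean of observed days.

    Mean-fill is an assumed strategy, not the paper's documented method.
    """
    fill = mean(daily.values())
    return [daily.get(day, fill) for day in range(n_days)]


def discretize(values, n_bins):
    """Encode continuous values as integer bin indices (equal-width bins).

    Equal-width binning is an assumption made for illustration.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values match
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `impute_missing(aggregate_daily(records), n_days)` yields one value per day, which `discretize` then turns into token-like integers that a sequence model can consume.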
Model Architecture
The core of the proposed synthetic data generator is a transformer architecture adapted for time-series data. The model is built from decoder layers that use self-attention and are trained for autoregressive prediction. Its design accommodates the generation of multi-modal data (heart rate, steps, sleep) by predicting future activity from historical data, a task made possible by causally masking future information during training.
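Causal masking can be shown with a small single-head attention sketch. It is a generic scaled dot-product attention in plain Python, not the paper's implementation: the function name `causal_self_attention` is hypothetical, and learned projections, multiple heads, and the multi-task output heads are all omitted.

```python
import math


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t sees only positions <= t."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the query against past and present keys only; future
        # positions are never visited, which is the effect of a causal mask.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the visible value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because the output for day t never depends on later days, the model can be trained on full sequences at once while still learning to predict each next day from history alone.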
Models were trained on varying amounts of data and compared, demonstrating that larger datasets improve performance. New sequences were generated through autoregressive prediction, using a prompt sequence from a held-out set to kick off prediction.
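The generation loop can be sketched as follows. The `predict_next` argument stands in for the trained model, and `mean_of_last_three` is a hypothetical, deterministic placeholder used only so the example runs; the paper's model and its decoding procedure are not reproduced here.

```python
def generate(predict_next, prompt, n_days):
    """Extend a prompt sequence by n_days, feeding each prediction back in."""
    sequence = list(prompt)
    for _ in range(n_days):
        # Each new day is predicted from everything generated so far.
        sequence.append(predict_next(sequence))
    # Return only the newly generated days, not the prompt.
    return sequence[len(prompt):]


def mean_of_last_three(history):
    """Hypothetical stand-in for a trained model: average of recent days."""
    recent = history[-3:]
    return sum(recent) / len(recent)
```

Calling `generate(mean_of_last_three, held_out_prompt, 7)` would produce a week of synthetic daily values seeded by a real prompt, which mirrors the role the held-out prompt plays in the described setup.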
Results
Evaluation Metrics
The paper assesses the model's performance using several measures:
- Prediction accuracy against real-world data
- Visual comparison of real and synthetically generated sequences
- Quantitative assessment through cosine similarity and dynamic time warping (DTW) distances
- Distribution analysis on a UMAP manifold
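The two quantitative measures in the list above can be sketched in a few lines. These are textbook definitions, assuming non-zero input vectors for cosine similarity and an absolute-difference local cost for DTW; the summary does not state which DTW variant or cost function the paper used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[len(a)][len(b)]
```

Cosine similarity compares the overall shape of two sequences of equal length, while DTW tolerates shifts in timing, so a pattern that occurs a day earlier or later in one sequence is not penalized as heavily.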
Key Findings
The model performed best when trained on the full dataset, showing notable improvement in predictive accuracy for next-day activity metrics. Visual and quantitative comparisons support the realism of the generated data, with similarity scores approaching those observed between real sequences. Furthermore, the manifold analysis showed that synthetic sequences align closely with the real data's distribution, albeit with some discrepancies in density that are likely attributable to sampling bias.
Implications and Future Directions
Practical Applications
Synthetic data generation holds promise for healthcare research, offering a pathway to overcome data scarcity and privacy concerns. This approach supports study simulations, tool development, and the exploration of rare conditions through generated datasets. Moreover, it allows for privacy-compliant testing across various research environments.
Theoretical Contributions
This paper underscores the potential of transformers in synthesizing wearable data, contributing to the broader field of generative models in healthcare. By demonstrating the feasibility and effectiveness of this approach, it paves the way for future advancements in synthetic data research.
Future Research
Potential directions include enhancing the model to generate data conditional on specific attributes, scaling the model with more extensive datasets, and instituting provable privacy guarantees. The development of standardized benchmarks for evaluating synthetic data quality in healthcare also presents an area for further exploration.
Conclusion
The creation of a self-attention model for generating synthetic wearable data represents a significant stride in addressing the challenges of health data scarcity and privacy. By leveraging comprehensive training data and sophisticated modeling techniques, this work offers a foundation for future innovations in synthetic data generation and its application in health research and beyond.