AI-READI Clinical Dataset

Updated 28 January 2026

The AI-READI Clinical Dataset is a comprehensive multimodal resource featuring high-temporal resolution CGM and wearable sensor data for real-time glycemic forecasting.
It integrates continuous glucose monitoring, physical activity metrics, and detailed participant metadata from healthy, prediabetic, and T2DM cohorts.
Benchmark studies using the AttenGluco model report RMSE and MAE improvements over a 1D-CNN+LSTM baseline, underscoring the dataset's potential for enhancing clinical decision support systems.

The AI-READI Clinical Dataset is a large-scale, multimodal, and high-temporal-resolution resource enabling advanced research in glycemic forecasting, wearable sensor fusion, and cohort-based analysis for metabolic disease. Jointly collected by the AI-READI Consortium across three U.S. clinical sites, this flagship dataset comprises high-quality continuous glucose monitoring (CGM) data, physical activity measures, and extensive participant metadata spanning healthy, prediabetic, and type 2 diabetes (T2DM) populations. Its design, preprocessing, and benchmark applications provide a reproducible platform for AI in digital health and clinical time-series modeling (Farahmand et al., 14 Feb 2025).

1. Dataset Composition and Cohort Structure

The AI-READI dataset’s analytic cohort contains 896 adult participants partitioned across four diagnostic bins:

Cohort	N
Healthy controls	323
Pre-diabetes	207
T2DM (oral agents)	258
T2DM (insulin users)	108

Initial enrollment comprised 1,067 subjects; participants with excessive missing data were excluded to preserve analytic integrity, leaving the 896-participant analytic cohort. Recruitment produced a balanced representation by sex, race, and disease severity. The dataset is thus well suited to comparative machine learning across metabolic phenotypes and therapeutic regimens (Farahmand et al., 14 Feb 2025).

2. Data Modalities, Temporal Resolution, and Preprocessing

Two primary modalities are included in the scope of published AI benchmark studies:

Continuous Glucose Monitor (CGM): Dexcom G6 devices sampled capillary-equivalent glucose values ( $\mathbf{x}_g$ ) at uniform 5-minute intervals, providing dense coverage of diurnal glycemic variation.
Physical Activity Tracker: Garmin Vivosmart 5 supplied both step count ( $\mathbf{x}_{ws}$ ) and extracted walking intervals ( $\mathbf{x}_{wi}$ ), with activity data characterized by irregular sampling due to device off-cycles or transient loss of contact.

Additional sensor channels—such as heart-rate variability, environmental factors, and stress indices—are present in the raw dataset but were not utilized in the initial AttenGluco benchmark. Data preprocessing employed the following sequence:

Imputation: Occasional CGM gaps (5–10 minutes) were linearly interpolated; short-duration activity dropouts were imputed by forward-fill, while extended absences were coded as zero activity.
Normalization: Z-score normalization ( $z = (x - \mu)/\sigma$ ) was applied on a per-subject, per-channel basis.

This approach ensures regularity and comparability across subjects for transformer-based time-series modeling (Farahmand et al., 14 Feb 2025).
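
The imputation and normalization steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the benchmark's actual preprocessing code: the function names, the use of `None` for missing samples, the `max_ffill` threshold separating "short" from "extended" activity dropouts, and the choice of population standard deviation are all hypothetical, since the source does not specify them.

```python
from statistics import mean, pstdev

def interpolate_short_gaps(values, max_gap=2):
    """Linearly interpolate runs of None no longer than max_gap samples.

    With 5-minute sampling, max_gap=2 covers gaps of up to 10 minutes.
    Longer gaps, and gaps touching either end of the series, stay None.
    """
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        gap = i - start
        if start > 0 and i < n and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out

def impute_activity(values, max_ffill=3):
    """Forward-fill dropouts of up to max_ffill samples; code longer ones as 0.

    max_ffill is an assumed threshold, not a value taken from the source.
    """
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        gap = i - start
        fill = out[start - 1] if (start > 0 and gap <= max_ffill) else 0
        for k in range(start, i):
            out[k] = fill
    return out

def zscore(values):
    """Z-score one subject's channel: z = (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; an assumption
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Toy values for illustration only (not taken from the dataset).
cgm = interpolate_short_gaps([100, None, None, 130])
steps = impute_activity([5, None, 7, None, None, None, None, 2])
cgm_z = zscore(cgm)
```

Applying `zscore` separately to each subject's CGM, step-count, and walking-interval series corresponds to the per-subject, per-channel normalization described above.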

3. Metadata, Clinical Variables, and Curation Workflow

Each case includes structured metadata:

Cohort assignment: healthy, pre-diabetes, T2DM on oral therapy, or T2DM on insulin.
Demographics: age, sex, ethnicity, and cohort-specific identifiers.
Device logs: CGM and wearable-derived signal completeness.

The curation process involved exclusion of subjects with excessive missing sensor data, outliers, or synchronization failures between modalities, resulting in a robust clinical time-series cohort (Farahmand et al., 14 Feb 2025).

4. Machine Learning Benchmarks and Evaluation Protocols

AttenGluco, a cross-modal transformer architecture, serves as the principal published benchmark exploiting the AI-READI dataset (Farahmand et al., 14 Feb 2025). The evaluation framework encompasses several clinically relevant forecasting scenarios:

Sliding-window input: 400 minutes (80 CGM steps) history per prediction.
Prediction horizons: 5, 30, and 60 minutes into the future ( $m \in \{1,6,12\}$ ).
Loss function: Mean Squared Error (MSE).
Splits:
- Isolated subject: 85% train, 15% test, model re-initialized per subject.
- Cohort-wise fine-tuning: Sequential training/fine-tuning across subjects within a cohort, evaluating adaptation.
- Continual learning: Sequential fine-tuning across clinical cohorts without reinitialization, quantifying catastrophic forgetting.
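
The windowing protocol above can be illustrated with a short sketch. It assumes a single regularly sampled CGM series held in a Python list. The helper names are hypothetical, and the chronological 85/15 split is an assumption, because the source gives only the proportions and not how the split is drawn.

```python
HISTORY_STEPS = 80      # 400 minutes of history at 5-minute sampling
HORIZONS = (1, 6, 12)   # 5, 30, and 60 minutes ahead

def chronological_split(series, train_fraction=0.85):
    """Split one subject's series into an earlier train part and a later test part."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]

def make_windows(series, history=HISTORY_STEPS, horizons=HORIZONS):
    """Build (inputs, targets) pairs from one series.

    Each input holds `history` consecutive samples; each target lists the
    values m steps after the last sample of that input, for every m in horizons.
    """
    max_h = max(horizons)
    inputs, targets = [], []
    for end in range(history, len(series) - max_h + 1):
        inputs.append(series[end - history:end])
        targets.append([series[end - 1 + m] for m in horizons])
    return inputs, targets

# Toy series: sample index used as the value so that targets are easy to read.
series = list(range(200))
train, test = chronological_split(series)
X_train, y_train = make_windows(train)
```

Splitting before windowing, as done here, keeps test samples out of every training window. Whether the benchmark follows the same order is not stated in the source.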

Primary Performance Metrics

Metric	Formula
RMSE	$\sqrt{\tfrac{1}{N}\sum_{i=1}^N(\hat y_i - y_i)^2}$
MAE	$\tfrac{1}{N}\sum_{i=1}^N\lvert\hat y_i - y_i\rvert$
Pearson correlation	$r=\frac{\sum_i(\hat y_i-\bar{\hat y})(y_i-\bar y)}{\sqrt{\sum_i(\hat y_i-\bar{\hat y})^2\sum_i(y_i-\bar y)^2}}$
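
The three metrics in the table translate directly into code. The sketch below uses only the Python standard library, and the function names and toy glucose values are illustrative assumptions rather than part of the benchmark.

```python
import math

def rmse(y_pred, y_true):
    """Root mean squared error between predictions and targets."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

def mae(y_pred, y_true):
    """Mean absolute error between predictions and targets."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

def pearson_r(y_pred, y_true):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(y_true)
    mean_p = sum(y_pred) / n
    mean_t = sum(y_true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(y_pred, y_true))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    return cov / math.sqrt(var_p * var_t)

# Made-up values for illustration (not results from the dataset).
y_true = [100, 110, 120, 130]
y_pred = [102, 108, 123, 129]
scores = (rmse(y_pred, y_true), mae(y_pred, y_true), pearson_r(y_pred, y_true))
```

`pearson_r` is undefined when either series is constant, because the denominator is then zero; a production implementation would need to guard that case.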

AttenGluco exhibits consistent improvement over a multimodal 1D-CNN+LSTM baseline across all cohorts and tasks. For example, RMSE reductions of 9.1%–13.2% and MAE reductions of 9.7%–14.6% were observed cohort-wise, with substantive correlation increases (e.g., 0.59→0.67 in T2DM on insulin) (Farahmand et al., 14 Feb 2025).

5. Multimodal Fusion and Architectural Considerations

The dataset’s fusion potential arises from CGM’s dense regular sampling and activity’s sporadic/event-based reporting. AttenGluco addresses this through a dual-branch cross-attention mechanism, treating CGM as queries and walking activity as keys/values, enabling identification of temporal dependencies between physical activity and glycemic excursions. A multi-scale self-attention block captures both short- and long-term sequence motifs, supporting robust forecasting up to 60 minutes ahead with mitigated RMSE degradation compared to LSTM-based approaches (Farahmand et al., 14 Feb 2025).

This suggests the AI-READI dataset is well suited to investigating novel transformer-based fusion architectures and temporal reasoning under irregular multimodal sampling.
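
To make the query/key/value roles concrete, the following sketch implements plain single-head scaled dot-product cross-attention, with CGM tokens as queries and activity tokens as keys and values. It is not the AttenGluco implementation. It omits learned projection matrices, multiple heads, positional encodings, and the multi-scale self-attention block, and all names and toy embeddings are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dimensional vectors (here: CGM tokens)
    keys:    list of d-dimensional vectors (here: activity tokens)
    values:  list of vectors, one per key (here: activity tokens)
    Returns (outputs, weights); weights[i][j] is how strongly CGM token i
    attends to activity token j.
    """
    d = len(queries[0])
    value_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(value_dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy embeddings: three CGM tokens attend over two activity tokens.
cgm_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
activity_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused, attn = cross_attention(cgm_tokens, activity_tokens, activity_tokens)
```

The two sequences may have different lengths, which is one reason cross-attention is a natural fit when one modality is densely sampled and the other is sporadic.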

6. Clinical and Research Implications

The dataset’s detailed clinical stratification and high-fidelity time series enable research into real-time glycemic control, closed-loop insulin delivery, and patient-specific lifestyle interventions. The observed performance of AttenGluco, with RMSE and MAE reductions of roughly 9–15% relative to the 1D-CNN+LSTM baseline, has implications for improving the safety and reliability of prediction-driven clinical decision support systems. Sequential fine-tuning demonstrates adaptation capacity but exposes catastrophic forgetting when the model progresses through diverse metabolic populations, motivating future work on continual-learning regularization (Farahmand et al., 14 Feb 2025).

A plausible implication is that advanced methods leveraging this dataset can support individualized therapeutic regimens and alert systems by accurately forecasting glucose perturbations in real-world, wearable-dominated contexts.

7. Access, Licensing, and Research Usage

The AI-READI Flagship Dataset, version 2.0.0, is publicly downloadable for bona fide research under open-access terms as detailed in benchmark publications (Farahmand et al., 14 Feb 2025). It is intended as a reproducible foundation for the training, validation, and benchmarking of multimodal AI in digital health, with recommendations including subject-wise train/validation splits, careful cross-cohort evaluation, and modality-specific normalization. The dataset structure, quality controls, and accompanying analytic code establish a model for future clinical-grade, sensor-rich AI resources in endocrinology and beyond.

References (1)

Farahmand et al. (2025). AttenGluco: Multimodal Transformer-Based Blood Glucose Forecasting on AI-READI Dataset.
