Actigraphy Encoder Overview
- An actigraphy encoder is a parameterized mapping that converts raw wearable time series data into compressed, interpretable features for health phenotyping and behavioral analysis.
- Probabilistic models like BCHMM and deep neural architectures such as convolutional autoencoders and CNN-LSTMs enable robust state estimation and predictive classification.
- Transformer-based and multimodal pipelines extend these methods by leveraging self-supervised learning and scalable feature extraction for comprehensive circadian and clinical analysis.
An actigraphy encoder is a parameterized mapping that transforms raw actigraphy time series—typically derived from wearable accelerometer devices—into tractable, compressed, or interpretable feature representations suitable for downstream analysis, classification, or generative modeling. Actigraphy encoders have evolved from classical probabilistic models, such as hidden Markov models, to deep neural architectures including convolutional autoencoders and foundation-level masked transformers. Their outputs can encode circadian structure, behavioral state, and predictive signatures for health phenotyping.
1. Probabilistic Encoders and the Bayesian Circadian HMM
The Bayesian Circadian Hidden Markov Model (BCHMM) establishes a generative encoding for 24-hour actigraphy by positing count-valued observations $Y_t$ as conditional on a $K$-state hidden Markov chain $S_t \in \{1,\dots,K\}$, where each hidden state represents a discrete activity mode (e.g., rest, moderate, high). The encoding consists of the filtered state probabilities $\gamma_t(k) = P(S_t = k \mid Y_{1:t})$, $k = 1,\dots,K$, which summarize the instantaneous likelihood of being in each latent activity state at time $t$.
The BCHMM specifically incorporates circadian structure by modeling the transition probabilities as time-varying sinusoidal functions of clock time, $\mathrm{logit}\, p_{ij}(t) = \beta_{ij} + a_{ij}\sin(2\pi t/24) + b_{ij}\cos(2\pi t/24)$. Emission probabilities are Gaussian, with label switching prevented by a positive ordering constraint on the means: $0 < \mu_1 < \mu_2 < \cdots < \mu_K$.
Posterior inference is conducted via Hamiltonian Monte Carlo (Stan, No-U-Turn sampler), with convergence assessed via the effective sample fraction and split-$\hat{R}$ diagnostics (Lu et al., 2023).
This type of encoder yields smooth, probabilistically coherent 24-hour rest-activity profiles $\gamma_t(k)$. Empirically, the BCHMM outperforms time-homogeneous HMMs in state recovery (e.g., mean absolute bias 0.033 vs. 0.267), Kullback–Leibler divergence (0.25 vs. 0.76 for state 2), and aligns encoded rest-activity regularity with clinical outcomes such as diabetes risk (Lu et al., 2023).
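As a concrete illustration of how a circadian HMM produces the filtered profile $\gamma_t(k)$, the following NumPy sketch runs forward filtering with a sinusoidal time-varying transition matrix and Gaussian emissions. The three-state setup and all parameter values are illustrative assumptions, not the estimates reported by Lu et al. (2023).

```python
import numpy as np

def circadian_transition(t_hours, beta, a, b):
    """Time-varying transition matrix: row-wise softmax of a sinusoidal logit."""
    logits = beta + a * np.sin(2 * np.pi * t_hours / 24) + b * np.cos(2 * np.pi * t_hours / 24)
    expl = np.exp(logits - logits.max(axis=1, keepdims=True))
    return expl / expl.sum(axis=1, keepdims=True)

def forward_filter(y, mu, sigma, beta, a, b, epoch_min=5):
    """Filtered probabilities gamma[t, k] = P(S_t = k | Y_{1:t})."""
    K, T = len(mu), len(y)
    gamma = np.zeros((T, K))
    # Gaussian emission likelihoods for every epoch and state
    lik = np.exp(-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    gamma[0] = lik[0] / lik[0].sum()                      # flat initial state prior
    for t in range(1, T):
        P = circadian_transition(t * epoch_min / 60.0, beta, a, b)
        pred = gamma[t - 1] @ P                           # one-step state prediction
        post = pred * lik[t]
        gamma[t] = post / post.sum()
    return gamma

# Illustrative 3-state example (rest / moderate / high); all parameters are made up.
rng = np.random.default_rng(0)
mu, sigma = np.array([1.0, 4.0, 8.0]), np.array([0.5, 1.0, 1.5])   # positively ordered means
beta = np.zeros((3, 3)); a = rng.normal(0, 0.5, (3, 3)); b = rng.normal(0, 0.5, (3, 3))
y = rng.normal(4.0, 2.0, 288)                              # one day of 5-min epochs
gamma = forward_filter(y, mu, sigma, beta, a, b)
print(gamma.shape)                                          # (288, 3): a T x K rest-activity profile
```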
2. Deep Neural Actigraphy Encoders
2.1 Convolutional Variational Autoencoders
A convolutional variational autoencoder (VAE) architecture encodes four-week actigraphy maps (28 days, 2880 epochs/day at 30 s per epoch) into deterministic latent vectors $z$. The encoder stack consists of two Conv2D layers (16 and 32 filters), batch normalization, flattening, and dense layers producing mean and log-variance heads for the approximate posterior $q(z \mid x) = \mathcal{N}(\mu(x), \operatorname{diag}(\sigma^2(x)))$. The latent $z$, taken deterministically at the posterior mean, serves as the compressed actigraphy code.
Training maximizes the ELBO $\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x)\,\|\,p(z))$, with the reconstruction term computed via mean squared error. Encoded features can then be used by logistic regression to classify outcomes such as PTSD or depression, achieving AUCs of 0.61–0.64 for classifiers on the extracted features (Cakmak et al., 2020).
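A minimal PyTorch sketch of such a convolutional VAE encoder is shown below. The filter counts follow the description above, while the kernel sizes, strides, and latent dimensionality are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Encode a 28 x 2880 actigraphy map into mean/log-variance heads for q(z|x)."""
    def __init__(self, latent_dim=16):                           # latent_dim is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 16 filters
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32 filters
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * 7 * 720                # 28 x 2880 downsampled twice by stride 2
        self.mu_head = nn.Linear(feat_dim, latent_dim)
        self.logvar_head = nn.Linear(feat_dim, latent_dim)

    def forward(self, x):                      # x: (batch, 1, 28, 2880)
        h = self.features(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return z, mu, logvar

z, mu, logvar = ConvVAEEncoder()(torch.randn(4, 1, 28, 2880))
print(mu.shape)            # torch.Size([4, 16]); mu is the compressed actigraphy code
```

Training would add a decoder and minimize MSE reconstruction plus the KL term computed from `mu` and `logvar`, matching the ELBO above.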
2.2 CNN-LSTM Feature Encoders for Sleep-Wake Detection
A 1D-CNN-LSTM encoder, designed for device-agnostic sleep-wake classification, processes sequences of 41-dimensional feature vectors per 30 s epoch (derived from triaxial accelerometry and temperature) as follows:
- Four layers of 1D causal dilated convolutions (kernel size 4, 64 filters), batch-normalized and ReLU-activated, with dropout.
- A single-layer LSTM with 128 hidden units, followed by Gaussian noise, ReLU, and dropout.
- Fully-connected dense layer for 3-way classification (sleep, arousal, wake), collapsed via a decision tree to binary sleep/wake.
Features include ENMO, Anglez, Anglex, PIM, ZCR, TAT, frequency-domain, and band-power measures, all standardized per device (Montazeri et al., 1 Dec 2025). Tested across three devices and multiple sleep disorders, this approach yields F1 = 0.86, sleep sensitivity 0.87, and wake specificity 0.78.
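A hedged PyTorch sketch of this encoder stack appears below. The kernel size, filter count, and LSTM width follow the description above; the dilation schedule, dropout rate, noise scale, and sequence length are assumptions.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D causal dilated convolution: left-pad so outputs never see the future."""
    def __init__(self, c_in, c_out, k, dilation):
        super().__init__()
        self.pad = (k - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, k, dilation=dilation)
    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class CnnLstmEncoder(nn.Module):
    """4x causal dilated Conv1d (k=4, 64 filters) + 128-unit LSTM + 3-way head."""
    def __init__(self, n_features=41, n_classes=3, p_drop=0.2):
        super().__init__()
        blocks, c_in = [], n_features
        for i in range(4):                             # dilations 1, 2, 4, 8 (assumed)
            blocks += [CausalConv1d(c_in, 64, k=4, dilation=2 ** i),
                       nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p_drop)]
            c_in = 64
        self.cnn = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.head = nn.Sequential(nn.ReLU(), nn.Dropout(p_drop), nn.Linear(128, n_classes))

    def forward(self, x):                              # x: (batch, epochs, features)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)
        if self.training:                              # Gaussian noise regularization
            h = h + 0.01 * torch.randn_like(h)
        return self.head(h)                            # per-epoch sleep/arousal/wake logits

logits = CnnLstmEncoder()(torch.randn(2, 120, 41))     # 2 recordings x 120 epochs x 41 features
print(logits.shape)                                    # torch.Size([2, 120, 3])
```

The 3-way logits would then be collapsed to binary sleep/wake by the downstream decision tree described above.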
2.3 Sequential and Multi-Task 1D-CNNs
Earlier encoders (Granovsky et al., 2018) aggregate 6 h windows (721 points) of 30 s activity norm time series and pass them through a stack of 1D convolutional layers (e.g., 64 filters, kernel size 18), max-pooling, and fully-connected layers. Single-branch and multi-task variants exist: the latter decouple state estimation for the center epoch from window-wide sleep/awake distribution prediction. The resulting 128-dim embedding after the final FC layer is used for sleep/wake state classification or further statistical analysis.
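A compact sketch of the multi-task variant, emphasizing its two output heads, is given below. The window length, filter count, kernel size, and 128-dim embedding follow the description above; the pooling layout and the exact form of the window-wide head are assumptions.

```python
import torch
import torch.nn as nn

class MultiTask1DCnn(nn.Module):
    """Encode a 6 h window (721 x 30 s epochs) into a 128-dim embedding with two heads."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=18), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=18), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=18), nn.ReLU(), nn.MaxPool1d(2),
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),                 # 128-dim embedding
        )
        self.center_head = nn.Linear(128, 2)               # sleep/wake at the center epoch
        self.dist_head = nn.Linear(128, 1)                 # window-wide sleep fraction (assumed form)

    def forward(self, x):                                   # x: (batch, 1, 721)
        emb = self.backbone(x)
        return self.center_head(emb), torch.sigmoid(self.dist_head(emb)), emb

center_logits, sleep_frac, emb = MultiTask1DCnn()(torch.randn(8, 1, 721))
print(emb.shape)                                            # torch.Size([8, 128])
```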
3. Transformer-Based Foundation Encoders
The Pretrained Actigraphy Transformer (PAT) pioneers large-scale, foundation-level actigraphy encoding using a masked autoencoding paradigm (Ruan et al., 2024). PAT segments minute-level actigraphy for a 7-day window ($T = 10{,}080$ minutes) into non-overlapping patches, linearly projects each patch to a $d$-dimensional token, and adds sinusoidal positional encodings. This sequence is processed by a stack of transformer encoder blocks, with model size scaling up to the largest variant, PAT-L.
The MAE pretext task randomly masks 90% of patches; a separate decoder reconstructs the full sequence from the unmasked patch encodings. The pretraining loss is the mean squared error over all time points, $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} (x_t - \hat{x}_t)^2$. After pretraining, encoder tokens or their averages can serve as compact actigraphy representations. PAT-L achieves AUC 0.771 for benzodiazepine-usage classification on held-out test sets, surpassing baseline models (Ruan et al., 2024).
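The patch-and-mask mechanics can be illustrated with the short PyTorch sketch below. The 18-minute patch length is inferred from the table in Section 5 (10,080 minutes over 560 patches); the token width, transformer depth, and simplified decoder are assumptions rather than the published PAT configuration (positional encodings are omitted for brevity).

```python
import torch
import torch.nn as nn

PATCH, D, MASK = 18, 96, 0.9      # 18-min patches (inferred), 96-dim tokens (assumed), 90% masking

class TinyMAE(nn.Module):
    """Patch minute-level actigraphy, encode visible patches, reconstruct all of them."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PATCH, D)
        self.mask_token = nn.Parameter(torch.zeros(D))
        enc_layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.to_patch = nn.Linear(D, PATCH)

    def forward(self, x):                              # x: (batch, 10080)
        patches = x.view(x.size(0), -1, PATCH)         # (batch, 560, 18)
        tokens = self.proj(patches)
        n = tokens.size(1)
        n_keep = round(n * (1 - MASK))                 # keep 10% of patches visible
        keep = torch.randperm(n)[:n_keep]
        enc = self.encoder(tokens[:, keep])            # encode only the visible patches
        full = self.mask_token.expand(x.size(0), n, D).clone()
        full[:, keep] = enc                            # splice visible encodings back in
        recon = self.to_patch(self.decoder(full))      # reconstruct every patch
        return nn.functional.mse_loss(recon, patches), enc

loss, visible_tokens = TinyMAE()(torch.randn(2, 10080))
print(loss.item(), visible_tokens.shape)               # visible_tokens: (2, 56, 96)
```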
4. Actigraphy Encoders for Generative and Multimodal Pipelines
MotionTeller integrates a frozen PAT encoder into a multimodal generative pipeline for converting raw actigraphy (1,440 minutes/day) into natural-language daily summaries (Zhang et al., 25 Dec 2025). Minute-level sequences are split into 80 non-overlapping patches, embedded to 96-dim tokens, and passed through stacked transformer encoders. The resulting 80×96 token matrix is projected, via a lightweight embedding head, into the token space of a decoder-only LLM (e.g., Gemma-2B). This projection is trained end-to-end with the LLM parameters frozen. After training, the PAT embeddings demonstrate increased cluster cohesion and more distinct circadian-behavioral archetypes, as quantified via PCA and silhouette analysis on the projected feature space (Zhang et al., 25 Dec 2025).
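A minimal sketch of this frozen-encoder-plus-projection design is given below. The two-layer MLP projector, the LLM hidden size of 2048, and the placeholder encoder are assumptions for illustration, not MotionTeller's published embedding head.

```python
import torch
import torch.nn as nn

class ActigraphyToLLM(nn.Module):
    """Frozen actigraphy encoder + trainable projection into an LLM's token space."""
    def __init__(self, encoder, d_enc=96, d_llm=2048):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():            # freeze the pretrained encoder
            p.requires_grad = False
        self.projector = nn.Sequential(                # lightweight embedding head (assumed form)
            nn.Linear(d_enc, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, day):                            # day: (batch, 1440) minute-level counts
        with torch.no_grad():
            tokens = self.encoder(day)                 # (batch, 80, 96) PAT-style tokens
        return self.projector(tokens)                  # (batch, 80, d_llm) soft prompts

# Placeholder standing in for a pretrained PAT: patchify + linear embed only.
class DummyPAT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(18, 96)                 # 80 patches x 18 minutes per day
    def forward(self, x):
        return self.embed(x.view(x.size(0), 80, 18))

soft_prompts = ActigraphyToLLM(DummyPAT())(torch.randn(4, 1440))
print(soft_prompts.shape)                              # torch.Size([4, 80, 2048])
# These soft prompts would be prepended to the text embeddings of the frozen decoder-only LLM.
```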
5. Comparative Architectural and Methodological Summary
| Encoder Type | Input Shape/Window | Architecture | Output Dim/Code |
|---|---|---|---|
| BCHMM (Lu et al., 2023) | 24 h, T epochs (e.g., 5 min) | Bayesian HMM with circadian transitions | T×K state-prob trajectories |
| ConvVAE (Cakmak et al., 2020) | 28×2880×1 (4 weeks) | 2D Conv → dense → VAE | latent vector z |
| 1D-CNN-LSTM (Montazeri et al., 1 Dec 2025) | sequence of 41-dim features | 4× 1D Conv (K=4, 64f), 128-LSTM | 128/3-way+binary outputs |
| 1D-CNN (Granovsky et al., 2018) | 6 h (721 points, 30 s) | 3× Conv1D, FC(128)+softmax | 128/4-way outputs |
| PAT (Ruan et al., 2024) | 10 080×1 (week, 1 min) | Patching+Linear+12L Transformer, MAE | N×d (e.g., 560×512) |
| PAT in MotionTeller (Zhang et al., 25 Dec 2025) | 1 440×1 (day) | Patching+Linear+L×Transformer | 80×96 |
Each approach balances context size, sequence modeling capacity, and output interpretability. Transformer-based encoders (PAT) are notable for their data scale, flexible downstream use, and principled pretraining. Probabilistic encoders offer strong interpretability in circadian biomedical domains, while deep neural encoders dominate classification and regression performance in large, multi-device studies.
6. Evaluation, Performance, and Interpretability
Actigraphy encoder evaluation encompasses reconstruction fidelity (e.g., VAE reconstruction loss, masked-autoencoder reconstruction error), downstream predictive performance (AUC for diagnosis, sleep-wake F1), and interpretability (filtered state profiles, attention heatmaps, clustering in latent space). For PAT, patch-level attention and token importances are directly extractable and can be mapped to the original time axis. In clinical and research settings, encoders such as BCHMM provide direct probabilistic linkage to physiological states, while transformer encoders offer scalable, general-purpose feature representations suitable for further fusion with LLMs or other modalities (Lu et al., 2023, Ruan et al., 2024, Zhang et al., 25 Dec 2025).
7. Implementation Practices and Empirical Takeaways
State-of-the-art actigraphy encoders utilize rigorous preprocessing (e.g., median smoothing, segmentation, normalization), modular encoder stacks (convolution, attention, recurrence), and large-scale self-supervised or Bayesian training. Modern practice emphasizes:
- Explicit handling of device-specific variation (device-specific scalers, robust feature extraction; see the sketch after this list) (Montazeri et al., 1 Dec 2025).
- Masked autoencoder objectives for representation learning (Ruan et al., 2024, Zhang et al., 25 Dec 2025).
- Probabilistic label identification without arbitrary postprocessing (ordering constraints in BCHMM) (Lu et al., 2023).
- Evaluation on diverse, realistic datasets (e.g., NHANES, multi-device, multi-disorder).
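As referenced in the first bullet, device-specific standardization can be implemented by fitting one scaler per device and applying it to each recording before encoding. The scikit-learn sketch below is a generic pattern under that assumption, not the exact preprocessing of Montazeri et al. (1 Dec 2025); the device names and data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_device_scalers(features_by_device):
    """Fit one StandardScaler per device on its (n_epochs, n_features) training matrix."""
    return {dev: StandardScaler().fit(X) for dev, X in features_by_device.items()}

def standardize(scalers, device, X):
    """Standardize a recording with the scaler of the device that produced it."""
    return scalers[device].transform(X)

# Illustrative use with two hypothetical devices and 41-dim epoch features.
rng = np.random.default_rng(0)
train = {"device_a": rng.normal(0.0, 1.0, (5000, 41)),
         "device_b": rng.normal(0.3, 2.0, (5000, 41))}    # different scale/offset per device
scalers = fit_device_scalers(train)
night = rng.normal(0.3, 2.0, (960, 41))                   # one recording from device_b
print(standardize(scalers, "device_b", night).std(axis=0).round(1)[:5])  # ~1.0 per feature
```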
These encoders underlie current pipelines for circadian rhythm quantification, automated sleep-wake annotation, mental health prediction, and natural-language behavioral summarization based on wearable sensor data.