Swedish Healthcare Quality Registries
- Swedish Healthcare Quality Registries are comprehensive datasets that capture large-scale clinical and demographic data essential for quality assurance, epidemiological research, and policy evaluation.
- They employ advanced imputation methods such as MICE and bigMICE, leveraging distributed frameworks to manage high missingness and scalability challenges efficiently.
- The registries enable practical healthcare improvements by supporting robust statistical modeling, data integration, and rapid evaluations exemplified by the Swedish National Diabetes Registry.
Swedish Healthcare Quality Registries constitute comprehensive, high-dimensional data resources designed to support quality assurance, epidemiological research, and policy evaluation within Swedish healthcare. These registries collect clinical and demographic data at scale, encompassing millions of patients across various disease domains and interventions. They pose unique challenges and opportunities for advanced statistical analysis due to their size, heterogeneity, and the prevalence of missing data, necessitating scalable and memory-efficient tools for reliable inference and data integration (Morvan et al., 29 Jan 2026).
1. Structural Features and Data Scope
Swedish Healthcare Quality Registries are characterized by expansive sample sizes ($n$ on the order of millions) and moderate to high numbers of variables (commonly 20–50 or more per registry). Typical registries, such as the Swedish National Diabetes Registry, encapsulate $n \times p$ data matrices at this scale. The data typically include both structured clinical measurements and patient-reported outcomes, with longitudinal updates and substantial variation in missingness patterns both between and within variables.
Variables may exhibit missingness proportions ranging widely, with some features, such as biomarker measurements, frequently missing in excess of 80%. In a documented use case from the Swedish registry, the glomerular filtration rate (GFR) variable had approximately 99% missingness; yet because the underlying population numbers in the millions, the observed stratum still comprised enough cases to facilitate robust imputation and modeling (Morvan et al., 29 Jan 2026). A plausible implication is that such registry data, even at extreme missingness levels for individual variables, retain analytic viability for complex modeling approaches.
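The arithmetic behind this observation is simple: even 1% observed data in a registry of roughly 14.6 million rows (the full-registry row count reported in the benchmark table in Section 3) leaves a six-figure analysis stratum. A minimal sketch:

```python
# Even extreme missingness leaves a usable observed stratum at registry scale.
n_rows = 14_632_799   # full registry extract (row count from the benchmark table)
missingness = 0.99    # ~99% missing, as reported for the GFR variable

observed = round(n_rows * (1 - missingness))
print(observed)       # ~146,000 observed GFR measurements
```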
2. Missing Data Methodologies in Registry Contexts
The high prevalence and structural patterns of missing data in the Swedish registries necessitate principled handling to avoid bias and inefficiency. Multiple Imputation by Chained Equations (MICE) is the prevailing method, operationalizing Rubin’s multiple imputation framework for multivariate datasets.
Consider a partially observed matrix $X$ with observed part $X_{\mathrm{obs}}$ and missing part $X_{\mathrm{mis}}$. MICE generates $M$ completed datasets through an iterative sequence over the variables $j = 1, \dots, p$:
- Initialize $X_{\mathrm{mis},j}^{(0)}$ for all $j$, e.g. by random draws from the observed values.
- For iterations $t = 1, \dots, T$ and each variable $j$:
- Sample model parameters $\phi_j^{(t)} \sim P\big(\phi_j \mid X_{\mathrm{obs},j}, X_{-j}^{(t-1)}\big)$.
- Draw $X_{\mathrm{mis},j}^{(t)} \sim P\big(X_{\mathrm{mis},j} \mid X_{-j}^{(t)}, \phi_j^{(t)}\big)$.
After $T$ iterations, the $M$ imputed datasets are pooled with Rubin's rules for final inference: $\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}^{(m)}$, with total variance $V = \bar{W} + \big(1 + \frac{1}{M}\big)B$, where $\bar{W}$ is the average within-imputation variance and $B$ the between-imputation variance of the estimates.
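The chained-equations loop can be sketched in a few lines. This is a toy, in-memory illustration in Python with a single incomplete column; it omits the Bayesian parameter-sampling step (using simple residual-noise redraws instead) and is not the R-based implementation discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two correlated columns, with ~30% of column 1 missing.
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x0, x1])
mask = rng.random(n) < 0.3
X[mask, 1] = np.nan

M, T = 5, 10                       # imputations, chained-equation iterations
estimates = []
for m in range(M):
    Xm = X.copy()
    # Initialize missing entries by random draws from the observed values.
    obs = Xm[~mask, 1]
    Xm[mask, 1] = rng.choice(obs, size=mask.sum())
    for t in range(T):
        # Regress column 1 on column 0 using the currently completed data,
        # then redraw the missing entries from the fitted conditional.
        a, b = np.polyfit(Xm[:, 0], Xm[:, 1], 1)
        resid_sd = np.std(Xm[:, 1] - (a * Xm[:, 0] + b))
        Xm[mask, 1] = a * Xm[mask, 0] + b + rng.normal(scale=resid_sd,
                                                       size=mask.sum())
    estimates.append(Xm[:, 1].mean())

# Rubin-style pooling of the per-imputation point estimates.
pooled = float(np.mean(estimates))
print(pooled)
```

Each of the $M$ passes produces one completed dataset; the per-imputation estimates are then averaged, exactly as in Rubin's pooling rule above.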
Standard R implementations, such as mice, require all data in RAM, incurring $O(np)$ memory costs and superlinear runtime growth, which is infeasible for registry-scale datasets.
3. Distributed Imputation Solutions: bigMICE
To address these scalability constraints, the bigMICE package reimplements MICE using Apache Spark DataFrames via the sparklyr interface, leveraging Spark MLlib for model operations. This distributed architecture enables:
- Partitioned model fitting/prediction via Spark’s parallel executors.
- Asynchronous execution of the $M$ independent imputations.
- Checkpointing of intermediate DataFrames to truncate Spark lineage and bound memory requirements.
The framework exploits two levels of parallelism: within-imputation model fitting and cross-imputation concurrency. Driver and executor heap memory is strictly controlled; the usable heap is approximately $M_{\mathrm{heap}} = f \cdot M_{\mathrm{driver}}$, where $M_{\mathrm{driver}}$ is the allocated driver memory and $f$ is the Spark memory fraction (spark.memory.fraction).
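As an illustration of this memory budget, the sketch below assumes Spark's default memory fraction of 0.6 and the documented ~300 MB reserved-memory deduction from Spark's unified memory model; the function name is illustrative:

```python
# Approximate usable unified memory for a Spark driver/executor JVM.
def usable_heap_gb(heap_gb: float, fraction: float = 0.6,
                   reserved_gb: float = 0.3) -> float:
    """Spark unified memory model: (heap - reserved) * spark.memory.fraction."""
    return (heap_gb - reserved_gb) * fraction

print(usable_heap_gb(8.0))   # an 8 GB driver leaves roughly 4.6 GB usable
```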
Table: Memory Usage and Runtime by Sample Size (Swedish Registry, mice vs. bigMICE (Morvan et al., 29 Jan 2026))
| n (rows) | mice RAM (GB) | bigMICE heap (GB) | mice Runtime (min) | bigMICE Runtime (min) |
|---|---|---|---|---|
| 1,000 | 0.4 | 7.9 | 0.012 | 2.22 |
| 598,253 | 2.6 | 7.6 | 10.93 | 5.45 |
| 14,632,799 | 40.7 | 11.6 | 158.09 | 36.75 |
Memory footprint for bigMICE is capped by configuration, and runtime scales sublinearly due to the use of distributed learners.
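The runtime comparison can be checked directly from the table values above (a sketch; ratios are mice runtime divided by bigMICE runtime):

```python
# Runtime ratios (mice / bigMICE) from the benchmark table, by row count.
runtimes = {598_253: (10.93, 5.45), 14_632_799: (158.09, 36.75)}
for n, (t_mice, t_big) in runtimes.items():
    print(n, round(t_mice / t_big, 1))   # 598253 2.0, then 14632799 4.3
```

Note that at n = 1,000 the ratio reverses (0.012 vs 2.22 min): Spark's fixed startup and scheduling overhead dominates at small scale, so the distributed path pays off only for large inputs.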
4. Practical Considerations for Registry-Scale Analysis
BigMICE’s architecture supports processing on ordinary workstations (e.g., 16 GB RAM, 4 cores) by tuning Spark-specific parameters and checkpointing intervals. Key guidelines include:
- Small data: a modest number of imputations (e.g. the mice default of m = 5), maxit = 5, driver-memory = 4G.
- Medium data: up to m = 10 imputations, maxit = 5–10, driver-memory = 8–12G.
- Very large data: up to m = 5 imputations, maxit = 5, driver-memory = 16–24G, checkpoint-frequency = 10.
Best practices for missing data patterns include random-sampling initialization for MCAR scenarios and MeMoMe initialization for MAR; random forests are recommended for non-linear or high-missingness variables. The Ω² diagnostic (the fraction of missing information per parameter) is monitored for imputation quality.
Performance tuning involves optimizing storage (using SSDs) and balancing cluster resources with data partitioning.
5. Empirical Findings from Swedish Registries
Empirical evaluation using the Swedish National Diabetes Registry demonstrates that bigMICE delivers near-constant memory usage, irrespective of row count, when driver memory is tuned, whereas legacy R implementations exhibit memory growth linear in the number of rows. Runtime with bigMICE is 3–4× faster on large datasets due to parallelization.
Notably, when imputing variables with up to 99% missingness in a registry comprising millions of rows, the residual mean squared error (RMSE) for key variables remains nearly flat until the most extreme missingness levels, indicating the robustness of the approach in extensive registry datasets.
6. Software Integration and Ecosystem Context
BigMICE is implemented in R (≥4.0) with Spark 4.0.0 and sparklyr (v1.9.1), dependent on Spark MLlib for model fitting and requiring checkpointing support (HDFS or local directory). Data ingestion supports formats such as CSV or Parquet via spark_read_csv or spark_read_parquet. Installation is available via GitHub with devtools. The distributed implementation encapsulates imputation, prediction, and pooling within Spark’s DAG engine, replacing in-RAM data frames with RDDs and serial regressions with scalable MLlib learners.
This infrastructure supports analysis pipelines where Swedish Healthcare Quality Registries are a central data source, enabling scalable and rigorous inferential workflows despite pronounced missingness and resource constraints (Morvan et al., 29 Jan 2026).