Privacy-Preserving Imputation via Federated Learning
- The paper introduces privacy-preserving imputation by leveraging federated learning to collaboratively train models without exposing raw data.
- It details techniques such as feature-level translation, zero-knowledge proofs, and secure aggregation to address multimodal, temporal, and adversarial challenges.
- Empirical results across healthcare, energy, and sensor networks demonstrate improved imputation accuracy, lower communication overhead, and robust privacy protection.
Privacy-preserving imputation via federated learning encompasses a family of methodologies for addressing missing data in distributed cohorts while minimizing raw data exposure. In federated learning (FL), multiple clients (often institutions or edge devices with heterogeneous datasets) jointly train imputation or downstream predictive models under the orchestration of a server, sharing only model parameters, feature representations, aggregated statistics, or encrypted data. The primary objective is dual: maximizing imputation and prediction performance while applying rigorous measures to prevent privacy leakage, even under adversarial or untrusted conditions.
1. Federated Imputation: Problem Setting and Motivation
Missing data is endemic in federated data ecosystems, particularly in healthcare, energy, and sensor networks. Factors such as variable acquisition protocols, cost/access disparities, retrospective cohorts, sensor faults, and privacy restrictions frequently result in some modalities, channels, or timepoints being absent for subsets of clients. Naïve imputation methods such as zero-filling or mean-imputation are suboptimal and may substantially degrade downstream task performance.
The federated learning paradigm enables clients to keep their raw data local, updating global model parameters collaboratively. This arrangement accommodates privacy and legal constraints, but necessitates novel methodologies for handling missingness. Three prominent scenarios are addressed:
- Multimodal federated imputation: Clients may have access to different subsets of modalities (e.g., image, text, genomics).
- Temporal imputation across distributed time series: Clients observe incomplete, irregularly sampled sequences.
- Fully distributed imputation with untrusted or malicious participants: Adversaries may attempt to reconstruct local data or corrupt model updates.
The common thread in these methodologies is the design of imputation mechanisms that avoid direct data sharing, leveraging compressed representations, statistical summaries, or cryptographically protected parameters.
2. Feature-Based Imputation in Multimodal Federated Learning
In the "Multimodal Federated Learning With Missing Modalities through Feature Imputation Network" (FIN) approach (Poudel et al., 26 May 2025), clients are modeled as holding private tuples $(x^{\mathrm{img}}, x^{\mathrm{txt}}, y)$, denoting image, text, and label respectively. The global model is orchestrated as the composition of two modality-specific encoders $E_{\mathrm{img}}$, $E_{\mathrm{txt}}$, a fusion operator $\oplus$ (concatenation), and a classifier head $h$:

$$\hat{y} = h\big(E_{\mathrm{img}}(x^{\mathrm{img}}) \oplus E_{\mathrm{txt}}(x^{\mathrm{txt}})\big).$$
To address missing modalities, FIN introduces feature-level translators

$$T_{i \to t}: z_{\mathrm{img}} \mapsto \hat{z}_{\mathrm{txt}}, \qquad T_{t \to i}: z_{\mathrm{txt}} \mapsto \hat{z}_{\mathrm{img}},$$

built as lightweight Transformer decoder stacks (6 layers, 4 attention heads, 1024-dimensional hidden size). Imputation is performed at the encoder bottleneck (latent) level, where $T_{i \to t}$ or $T_{t \to i}$ reconstructs the missing modality's features for unimodal samples.
Training optimizes a compound objective

$$\mathcal{L} = \mathcal{L}_{\mathrm{trans}} + \mathcal{L}_{\mathrm{cls}},$$

where $\mathcal{L}_{\mathrm{trans}}$ is the mean-squared error in feature space for multimodal clients, and $\mathcal{L}_{\mathrm{cls}}$ is a classification cross-entropy. Federated averaging (FedAvg) aggregates parameter updates.
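The feature-level translation idea can be sketched compactly. The following is a minimal numpy illustration, not FIN's implementation: it substitutes a single linear map (trained by gradient descent on the feature-space MSE) for the 6-layer Transformer decoder stack, and uses synthetic paired latents in place of real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                              # bottleneck dimensionality (illustrative)
Z_img = rng.normal(size=(32, d))    # image-encoder latents of a multimodal client
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
Z_txt = Z_img @ W_true              # synthetic "paired" text latents

# Linear translator T_{i->t} trained on feature-space MSE, standing in
# for FIN's Transformer decoder stack.
W = np.zeros((d, d))
lr = 0.1
for _ in range(1000):
    pred = Z_img @ W
    W -= lr * Z_img.T @ (pred - Z_txt) / len(Z_img)   # gradient of the MSE

mse = float(np.mean((Z_img @ W - Z_txt) ** 2))
# A unimodal image client can now impute missing text features as Z_img @ W
# before fusion and classification; only the low-dimensional latents and
# translator weights are ever shared.
```

The key design point survives the simplification: translation happens in the low-dimensional bottleneck space, which is what keeps communication and computation an order of magnitude cheaper than input-level generation.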
Empirical results on MIMIC-CXR, NIH Open-I, and CheXpert datasets demonstrate that FIN achieves macro-AUC of 86.2% (homogeneous) and 77.9% (heterogeneous) for unimodal image clients, outperforming zero/uniform filling and federated generative report models. Notably, the feature translator method approaches the performance of public-data-based cross-modal augmentation without requiring access to real or synthetic external data. FIN's low-dimensional bottleneck representations also reduce computational and communication overhead by an order of magnitude relative to input-level generative models.
FIN requires that at least a minority of multimodal clients be available in each round to supervise the translators, and it currently lacks formal differential privacy mechanisms or cryptographic protocols. The method is extensible to additional modalities and alternate feature translators, with plausible application beyond medical imaging into domains such as sensor networks or recommender systems.
3. Secure Imputation with Verifiable Privacy and Trust-Aware Aggregation
In industrial and energy-sector FL, the ZTFed-MAS2S framework (Li et al., 24 Aug 2025) addresses missing wind power data using a multi-headed attention-based sequence-to-sequence (MAS2S) model. ZTFed-MAS2S is distinguished by its zero-trust architecture, combining verifiable differential privacy (DP) via non-interactive zero-knowledge proofs (NIZK), and dynamic trust-aware aggregation (DTAA).
The system enforces differential privacy by clipping client model parameters and adding Gaussian noise calibrated to an $(\epsilon, \delta)$ budget:

$$\tilde{\theta}_k = \mathrm{clip}(\theta_k, C) + \mathcal{N}(0, \sigma^2 I),$$

where $\sigma$ is governed by the Gaussian mechanism. Each client additionally generates a Schnorr-style NIZK to prove correct noise addition.
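The clip-then-noise step can be made concrete. This is a generic sketch rather than ZTFed-MAS2S's exact scheme (and it omits the NIZK proof entirely), using the classical Gaussian-mechanism calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\,C/\epsilon$, which is valid for $\epsilon < 1$:

```python
import numpy as np

def gaussian_sigma(epsilon, delta, sensitivity):
    # Classical Gaussian-mechanism calibration (valid for epsilon < 1):
    # sigma = sqrt(2 ln(1.25/delta)) * sensitivity / epsilon.
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon

def dp_perturb(theta, clip_norm, epsilon, delta, rng):
    # Clip the update in L2 norm to bound its sensitivity, then add noise.
    norm = np.linalg.norm(theta)
    clipped = theta * min(1.0, clip_norm / norm)
    sigma = gaussian_sigma(epsilon, delta, clip_norm)
    return clipped + rng.normal(scale=sigma, size=theta.shape)

rng = np.random.default_rng(1)
theta = rng.normal(size=1000) * 5.0
noisy = dp_perturb(theta, clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

Clipping is what makes the sensitivity (and hence $\sigma$) independent of any single client's raw update magnitude.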
DTAA calculates pairwise cosine similarities between perturbed updates to construct a trust graph, propagating trust scores and filtering out anomalous clients via median absolute deviation. Final global model aggregation is then conducted over the trusted set.
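A minimal version of this trust-filtering idea is sketched below. The names, the mean-similarity trust score, and the 3×MAD threshold are all illustrative simplifications of DTAA's trust-graph propagation, not the paper's algorithm:

```python
import numpy as np

def trusted_set(updates, k=3.0):
    # Pairwise cosine similarities between (perturbed) client updates.
    U = np.stack(updates)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    S = Un @ Un.T
    # Simplified trust score: mean similarity to all other clients.
    scores = (S.sum(axis=1) - 1.0) / (len(U) - 1)
    # Filter anomalous clients via median absolute deviation (MAD).
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    return [i for i, s in enumerate(scores) if abs(s - med) / mad <= k]

rng = np.random.default_rng(2)
honest = [rng.normal(loc=1.0, scale=0.1, size=50) for _ in range(8)]
flipped = [-honest[0]]            # sign-flipped (adversarial) update
idx = trusted_set(honest + flipped)   # adversary (index 8) is filtered out
```

Cosine similarity makes a sign-flipped update maximally anomalous, which is why this family of defenses is effective against the sign-flipping attacks evaluated in the paper.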
The MAS2S imputation model itself is a BiLSTM encoder–decoder with multi-head attention, optimized locally via Adam for mean absolute error between true and reconstructed sequences.
Communication overhead is addressed with sparsity-driven and quantization-based compression, plus AES-CBC encryption and HMAC for confidentiality and integrity.
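The compression pipeline (minus the AES-CBC encryption and HMAC) can be sketched as top-$k$ sparsification followed by uniform quantization of the kept values; the keep fraction and level count below are illustrative, not the paper's settings:

```python
import numpy as np

def compress(update, keep_frac=0.1, levels=256):
    # Top-k sparsification: keep only the largest-magnitude entries.
    k = max(1, int(keep_frac * update.size))
    idx = np.argsort(np.abs(update))[-k:]
    vals = update[idx]
    # Uniform quantization of the kept values into `levels` buckets.
    lo, hi = vals.min(), vals.max()
    step = (hi - lo) / (levels - 1) or 1.0    # guard against all-equal values
    q = np.round((vals - lo) / step).astype(np.uint8)
    return idx, q, lo, step

def decompress(idx, q, lo, step, size):
    out = np.zeros(size)
    out[idx] = lo + q * step
    return out

rng = np.random.default_rng(3)
u = rng.normal(size=1000)
rec = decompress(*compress(u), size=u.size)
```

Only the indices, 8-bit codes, and two floats need to be transmitted (and, in ZTFed-MAS2S, encrypted), which is the source of the reported communication savings.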
Empirical validation on the NREL wind farm dataset (with up to 90% missing data) shows that ZTFed-MAS2S achieves RMSE as low as 0.0411 at extreme missingness, outperforming baselines by substantial margins and maintaining robustness under adversarial sign-flipping of updates. The DP-NIZK+CIV machinery enables verifiable privacy preservation while reducing communication costs by more than 50% compared with FHE/TSS-based alternatives. Tradeoff curves show that, at the reported privacy budget $\epsilon$, the method achieves a membership inference attack success rate of 59.2% with 82.4% utility.
A notable contribution is the integration of verifiability and trust scoring, which is critical in open, zero-trust industrial environments. ZTFed-MAS2S is effective for privacy-preserving imputation in practical, large-scale, and adversarial settings where traditional federated protocols are susceptible to privacy attacks and untrustworthy aggregation.
4. Markovian Temporal Imputation via Federated Aggregation
For time-series data with irregular sampling, as in multi-centric ICU environments, the Federated Markov Imputation (FMI) strategy (Düsing et al., 25 Sep 2025) leverages global transition models constructed via secure aggregation of local Markov transition statistics.
Concretely, each scalar feature is discretized into $B$ bins; each ICU $c$ computes local transition counts $C^{(c)}_{jk}$ between bins over its observed data. Through secure aggregation, the server constructs global aggregated counts $C_{jk} = \sum_c C^{(c)}_{jk}$ and normalizes to obtain transition probabilities $P_{jk} = C_{jk} / \sum_{k'} C_{jk'}$. Missing bins are imputed via Markov inference, either using maximum-likelihood single-step completion

$$\hat{b}_t = \arg\max_{k} P_{b_{t-1},\,k},$$

or using Viterbi-style dynamic programming for contiguous missing segments. The protocol avoids direct data exposure, since no raw time series ever leave the clients.
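The FMI pipeline admits a compact illustration. The bin sequences, bin count, and client count below are toy values, and secure aggregation is stood in for by a plain sum (the point in FMI is that the server only ever observes this aggregate, never per-client counts):

```python
import numpy as np

B = 4  # number of discretization bins (toy value)

def local_counts(seq, B):
    # Per-client transition counts over observed consecutive bin pairs;
    # None marks a missing timepoint, which contributes no transitions.
    C = np.zeros((B, B))
    for a, b in zip(seq[:-1], seq[1:]):
        if a is not None and b is not None:
            C[a, b] += 1
    return C

# Two "ICUs" with discretized, partially missing sequences.
seq1 = [0, 1, 2, None, 2, 3, 3]
seq2 = [1, 2, 2, 3, None, 0, 1]

# Plain sum standing in for secure aggregation of the count matrices.
C_global = local_counts(seq1, B) + local_counts(seq2, B)
P = C_global / np.maximum(C_global.sum(axis=1, keepdims=True), 1)

def impute_step(prev_bin, P):
    # Maximum-likelihood single-step completion: most probable successor bin.
    return int(np.argmax(P[prev_bin]))
```

Contiguous gaps would instead be filled by a Viterbi-style pass over $P$, maximizing the joint transition probability across the whole missing segment.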
In a two-phase protocol, this imputation is followed by downstream outcome prediction via FL using a 3-layer LSTM + MLP trained on the imputed data, with FedAvg aggregation.
Evaluation on the MIMIC-IV ICU dataset, under both regular and irregular temporal sampling, reveals that FMI attains AUC of 0.8878 in regular settings and 0.8629 in irregular settings, outperforming local mean- and local Markov-based imputation by 0.02–0.06 AUC. In irregular sampling regimes, FMI is operational while local Markov imputation fails on coarser grids. Transition sharing via secure aggregation enhances temporal modeling, but FMI's current limitations include its first-order Markov assumption, discretization’s loss of fine-grained clinical variation, and the current absence of formal differential privacy on transition matrices.
5. Loss Functions, Training Algorithms, and Optimization Objectives
All frameworks described above are grounded in composite loss objectives that govern both imputation fidelity and downstream task performance.
- Feature-based approaches (e.g., FIN): optimize mean-squared reconstruction error over paired bottleneck features and cross-entropy loss for task prediction. At unimodal clients, only task loss is active; multimodal clients train both translation and prediction jointly.
- Sequence-to-sequence models (e.g., MAS2S): minimize time-averaged mean absolute error across full imputation outputs; local client optimization uses Adam, with scheduled rounds for uploading perturbed parameters for federated aggregation.
- Markovian approaches (e.g., FMI): imputation uses maximum-likelihood inference based on global transition probabilities, with privacy arising from secure aggregation; downstream tasks use standard predictive loss functions (e.g., cross-entropy for LSTM classifiers).
All methods employ the FedAvg aggregation protocol unless replaced by more robust variants (e.g., DTAA).
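For reference, FedAvg itself is simply a data-size-weighted average of client parameters; a minimal sketch:

```python
import numpy as np

def fedavg(client_params, client_sizes):
    # FedAvg: average client parameter vectors weighted by local data size.
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, np.stack(client_params)))

params = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
avg = fedavg(params, client_sizes=[1, 3])   # -> [2.5, 3.5]
```

Robust variants such as DTAA replace this unweighted trust in all clients with aggregation over a filtered, trust-scored subset.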
6. Privacy Mechanisms and Security Considerations
Privacy preservation is fundamental to these federated imputation systems. Comparative privacy mechanisms are summarized below:
| Method | Main Privacy Mechanism | Exposure Mitigated |
|---|---|---|
| FIN (Feature) | Bottleneck feature sharing only | Raw data, full features |
| ZTFed-MAS2S | DP + NIZK + encryption/DTAA | Raw data, parameter inversion, manipulation |
| FMI (Markov) | Secure aggregation on counts | Raw time series, local stat leaks |
FIN's approach inherently reduces information leakage via low-dimensional shared features, but does not currently employ formal cryptographic or DP methods. ZTFed-MAS2S augments DP with stringent verifiability (NIZK) and robust aggregation to ensure that no trusted party is required. FMI deploys secure aggregation for Markov counts; formal DP on summary statistics is suggested as a future enhancement.
Limitations include situations with zero multimodal clients (FIN) or the absence of formal DP on summary statistics (FMI). For all methods, adversarial model inversion attacks remain a research concern, with DP, secure aggregation, or encryption as candidate mitigations.
7. Comparative Performance and Generality
All presented methods report strong task- and domain-specific empirical improvements compared to local or naïve imputation. High-level benchmarks in their respective evaluations are:
- FIN: Macro-AUC for unimodal clients (hetero setting) = 77.9% vs. 67.3% (R2Gen), 72.8% (zero-filling).
- ZTFed-MAS2S: At 90% missing, RMSE = 0.0411 vs. next-best 0.0712; up to 28.2% RMSE reduction under adversarial updates with DTAA vs. alternative aggregates.
- FMI: AUC = 0.8629 (irregular sampling) vs. 0.7961 (local mean).
These approaches are broadly extensible. FIN may generalize to multimodal learning tasks outside healthcare. ZTFed-MAS2S’s trust/privacy architecture suggests applicability to other critical-infrastructure FL settings. FMI’s secure Markov-chain aggregation is lightweight and conceptually applicable to other time series domains with similar privacy constraints and temporal heterogeneity.
Plausible implications are that future work will integrate additional modalities, cryptographic protocols, or formal DP, and pursue joint end-to-end imputation and predictive learning—potentially increasing both utility and privacy guarantees.