PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
PrE-Text presents a novel methodology for generating differentially private (DP) synthetic text, aiming to overcome the limitations of on-device training in federated learning (FL). The drawbacks of on-device training, including devices too resource-constrained to train large models, heavy communication and computation overhead, and deployment difficulties, are significantly mitigated by the proposed method.
The paper's approach, Private Evolution-Text (PrE-Text), builds on recent algorithmic advances in DP synthetic data generation. Across a range of privacy regimes, it outperforms the traditional approach of training small models directly on private user data on-device. Under practical privacy constraints, PrE-Text achieves substantial reductions in communication rounds (up to 9×), client computation per round (up to 6×), and communication cost (up to 100×).
Contributions and Algorithm Design
PrE-Text builds on the principles of Differential Privacy to provide a robust and efficient mechanism for generating synthetic textual datasets:
- Differentially Private (DP) Synthetic Text Generation: PrE-Text starts from an initial set of public data samples and iteratively refines them using private signal derived from user data: clients privately vote for the synthetic samples that best resemble their own text, and the selected samples are rewritten with a text-specific variation mechanism based on masked language models (see the first sketch after this list).
- Expand Phase: Crucially, after the iterative refinement the final DP synthetic data is further expanded with a large LLM such as LLaMA-2-7B, used as a few-shot generator. Because DP is closed under post-processing, this expansion incurs no additional privacy cost (second sketch below).
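The refinement loop can be pictured with a short sketch. The code below is illustrative only: the embedding model (`all-MiniLM-L6-v2`), the fill-mask model (`roberta-base`), the noise scale, and all function names are assumptions made for exposition, not the paper's exact implementation.

```python
# Illustrative sketch of one PrE-Text-style refinement round (assumed models,
# function names, and noise scale; not the paper's exact implementation).
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
filler = pipeline("fill-mask", model="roberta-base")

def dp_nearest_neighbor_histogram(synthetic, private_clients, noise_scale):
    """Each client votes for the synthetic sample nearest to each of its private
    samples; Gaussian noise on the aggregated histogram provides the DP guarantee."""
    syn_emb = embedder.encode(synthetic, normalize_embeddings=True)
    hist = np.zeros(len(synthetic))
    for client_texts in private_clients:
        priv_emb = embedder.encode(client_texts, normalize_embeddings=True)
        nearest = (priv_emb @ syn_emb.T).argmax(axis=1)  # cosine similarity
        for idx in nearest:
            hist[idx] += 1
    return hist + np.random.normal(0.0, noise_scale, size=hist.shape)

def vary(text, mask_frac=0.15):
    """Produce a nearby variation of a synthetic sample by masking a fraction of
    its tokens and letting a masked LM fill them back in (each mask is filled
    independently against the original context, for simplicity)."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if np.random.rand() < mask_frac:
            masked = " ".join(tokens[:i] + [filler.tokenizer.mask_token] + tokens[i + 1:])
            tok = filler(masked, top_k=1)[0]["token_str"].strip()
        out.append(tok)
    return " ".join(out)

def refine_round(synthetic, private_clients, keep=64, noise_scale=5.0):
    """One round: DP vote, keep the top-voted samples, add fresh variations."""
    hist = dp_nearest_neighbor_histogram(synthetic, private_clients, noise_scale)
    survivors = [synthetic[i] for i in np.argsort(hist)[-keep:]]
    return survivors + [vary(s) for s in survivors]
```

The expand phase can be sketched in the same spirit; the few-shot prompt template and sampling settings below are assumptions rather than details taken from the paper:

```python
# Illustrative sketch of the expand phase: few-shot prompt a large LLM with the
# final DP synthetic samples. As post-processing of DP output, this adds no
# privacy cost, so it can be run as aggressively as compute allows.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

def expand(dp_synthetic, num_new=1000, shots=3):
    expanded = []
    for _ in range(num_new):
        examples = random.sample(dp_synthetic, shots)
        prompt = ("Here are some example messages:\n"
                  + "\n".join(f"- {e}" for e in examples)
                  + "\n- ")
        inputs = tok(prompt, return_tensors="pt").to(lm.device)
        out = lm.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9,
                          pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        expanded.append(completion.split("\n")[0].strip())  # keep the first generated line
    return expanded
```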
Experimental Performance
The paper provides empirical evidence through extensive experimentation across various datasets—Jobs, Forums, Microblog, and Code:
- Small Models on-device: For smaller models that can be deployed on client devices (e.g., DistilGPT2), training on PrE-Text synthetic data yields higher accuracy and lower cross-entropy loss than traditional DP-FL methods such as DP-FedAvg and DP-FTRL. For example, at ϵ = 1.29, PrE-Text outperformed the other DP training methods with accuracy improvements of roughly 1.3% to 3.8% across the datasets.
- Large Models on-server: For scenarios where storing the model on-device is infeasible, large server-hosted models also benefit substantially from fine-tuning on PrE-Text synthetic data: LLaMA-2-7B shows notable gains in next-token prediction accuracy and cross-entropy loss over the non-finetuned baseline (a fine-tuning sketch follows this list).
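In both settings the training itself is ordinary server-side fine-tuning on the synthetic corpus; private data never leaves the devices after the DP selection rounds. Below is a minimal sketch with the Hugging Face `Trainer`, assuming `distilgpt2` and illustrative hyperparameters (the same recipe scales to LLaMA-2-7B with adapters such as LoRA):

```python
# Illustrative server-side fine-tuning on the DP synthetic corpus
# (assumed model name and hyperparameters).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def finetune_on_synthetic(texts, model_name="distilgpt2", out_dir="pretext-finetuned"):
    """Fine-tune a causal LM on the synthetic texts, entirely on the server."""
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token  # GPT-2-family tokenizers lack a pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tok(batch["text"], truncation=True, max_length=256),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=out_dir, num_train_epochs=3,
                               per_device_train_batch_size=8, learning_rate=5e-5),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
    return model  # small models can then be shipped to clients for on-device inference
```

Small models fine-tuned this way can be deployed back to client devices for inference, while models too large to fit on-device are simply served from the server.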
The results underscore PrE-Text's superior performance in both small and large model settings while preserving privacy. The efficiency improvements in communication and computation further highlight its practical advantages.
Implications and Future Directions
The implications of PrE-Text are twofold:
- Practical Utility in Privacy-Preserving Technologies: By substantially reducing the communication and computational burden, PrE-Text makes the deployment of privacy-preserving LLMs more feasible in real-world applications, such as mobile assistants and personalized education platforms.
- Future Development in DP Data Generation: The iterative and expansion-based approach sets the stage for future research on synthetic data generation, not limited to text but potentially adaptable to other data modalities like images and structured data. Improving the variation and expansion mechanisms further could yield even higher fidelity synthetic datasets.
Speculation on Future Developments
Future work in this area might focus on several promising directions:
- Advanced Variation Techniques: Integrating more sophisticated text generation methods, for example exploiting the full generative capabilities of modern transformer models in the variation phase rather than simple mask-and-fill.
- Power-Efficient Federated Learning: Enhancing the computational efficiency of the client devices could enable more frequent updates and dynamic adaptation of training protocols.
- Combining Synthetic Data with Real Data: Investigating hybrid approaches that combine DP synthetic datasets with carefully aggregated real data could provide even more powerful models without significant privacy trade-offs.
The PrE-Text methodology represents a significant step forward in privacy-preserving AI, and it may influence the design and deployment of next-generation user-centric applications while upholding stringent privacy guarantees.