- The paper presents a novel modular architecture that trains expert modules on disjoint, local datasets, eliminating centralized data pooling.
- It leverages domain-informed router embeddings with a negative bias to selectively activate experts, enhancing task-specific performance.
- Empirical results show a 41% average improvement over public-only models and robust data opt-out guarantees for privacy-sensitive applications.
FlexOlmo: Modular LLMs for Flexible, Data-Constrained Training and Inference
FlexOlmo introduces a modular approach to language model (LM) pretraining and inference that directly addresses the challenges of data privacy, regulatory compliance, and dynamic data access in real-world settings. The core contribution is a mixture-of-experts (MoE) architecture that enables distributed, asynchronous training of expert modules on disjoint, locally maintained datasets, with a mechanism for flexible inclusion or exclusion of these experts at inference time—without requiring joint access to all data or further retraining.
Problem Motivation and Context
Traditional LM pretraining requires centralized aggregation of all data, which is infeasible for organizations with sensitive, proprietary, or regulated datasets. Existing solutions such as federated learning (FL) offer some privacy guarantees but suffer from performance degradation, high synchronization costs, and limited practical adoption for large-scale LMs. Model merging and parameter-efficient fine-tuning methods have been explored, but they typically require joint training or are limited in expressivity and modularity.
FlexOlmo is designed to meet two critical requirements:
- Distributed, Modular Training: Each data owner trains an expert module locally, using only their own data and a shared public model as an anchor, with no need to share raw data or participate in synchronized updates.
- Data-Flexible Inference: At inference, any subset of expert modules can be included or excluded, providing strict opt-out guarantees and fine-grained control over data influence.
Architecture and Training Methodology
FlexOlmo adopts a standard MoE transformer architecture, replacing each feedforward network (FFN) in the transformer blocks with a router and a set of expert FFNs. The key innovations are:
- Independent Expert Training: Each expert is trained on its local dataset, with the public expert and shared attention layers frozen. This "coordinated training" ensures that all experts remain compatible for later merging, as they are anchored to the same public model.
- Domain-Informed Router: Instead of joint router training, each expert learns a router embedding initialized from domain-specific document embeddings (e.g., using GritLM). These embeddings are finetuned during expert training and concatenated to form the router matrix at merge time.
- Negative Bias for Selectivity: A negative bias term is added to each expert's router embedding, ensuring that the expert is only activated for highly relevant inputs, which improves multi-expert merging and reduces interference.
- Optional Router Tuning: If data owners can identify proxy samples in the public data, a lightweight router finetuning step can further improve routing quality without exposing private data.
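The routing behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, tensor shapes, and the top-k selection step are hypothetical, but it shows how a concatenated router matrix and a negative bias combine to keep each expert inactive unless the input is highly relevant to it.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_emb, router_bias, top_k=2):
    """Score each token against every expert's router embedding.

    hidden:      (n_tokens, d_model) token representations
    router_emb:  (n_experts, d_model) rows concatenated at merge time;
                 each row was initialized from averaged domain document
                 embeddings and finetuned alongside its expert
    router_bias: (n_experts,) negative bias terms that suppress an
                 expert unless its logit is high enough to survive top-k
    """
    logits = hidden @ router_emb.T + router_bias   # (n_tokens, n_experts)
    weights, expert_idx = torch.topk(logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)           # renormalize over chosen experts
    return weights, expert_idx
```

A more strongly negative bias for an expert lowers all of its logits, so it is selected only when the token's similarity to that expert's domain embedding is large, which is what reduces interference when many independently trained experts are merged.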
Implementation Considerations
- Expert Training: Each expert is trained as a two-expert MoE (public + local expert), with only the local FFN and router embedding updated. This can be implemented using standard deep learning frameworks (e.g., PyTorch) with modular checkpointing.
- Router Embedding Initialization: Domain embeddings are computed by averaging document embeddings from a pre-trained encoder over a sample of the local dataset.
- Merging: At inference, the public and all available expert modules are loaded, and their router embeddings are concatenated. Exclusion of an expert is achieved by removing its embedding and FFN from the model.
- Scalability: The approach is compatible with large-scale distributed training, as each expert can be trained independently and asynchronously.
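The initialization and merging steps above reduce to simple tensor operations. A hedged sketch, assuming each expert module ships its FFN plus a finetuned router embedding (function names and the `(embedding, ffn)` packaging are hypothetical):

```python
import torch

def init_router_embedding(doc_embeddings):
    """Initialize an expert's router embedding as the mean of document
    embeddings computed over a sample of its local corpus (e.g., with a
    pre-trained encoder such as GritLM)."""
    return doc_embeddings.mean(dim=0)

def merge_experts(public, experts, opt_out=()):
    """Assemble the merged router matrix and expert list at inference.

    public:  (router_emb, ffn) for the frozen public expert
    experts: dict mapping data owner -> (router_emb, ffn)
    opt_out: owners whose data influence must be removed; their
             embedding and FFN are simply dropped, with no retraining
    """
    kept = [(e, f) for name, (e, f) in experts.items() if name not in opt_out]
    router = torch.stack([public[0]] + [e for e, _ in kept], dim=0)
    ffns = [public[1]] + [f for _, f in kept]
    return router, ffns
```

Opt-out is thus a structural operation: removing an owner's row from the router matrix guarantees their expert can never be activated, rather than relying on the model to avoid it.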
Empirical Results
FlexOlmo is evaluated on a curated corpus (FlexMix) comprising a public dataset and seven domain-specific, closed datasets (e.g., news, code, academic, Reddit). Models with up to 37B parameters (20B active) are tested on 31 diverse downstream tasks.
Key findings:
- Performance: FlexOlmo achieves a 41% average relative improvement over the public-only model and outperforms prior model merging methods (e.g., model soup, BTM, BTX) by 10.1% on average.
- Specialization and Synergy: Individual experts excel on their domains but degrade elsewhere; FlexOlmo combines their strengths, often matching or exceeding the best individual expert on each task.
- Data Opt-Out: Excluding an expert at inference sharply reduces performance on its domain but leaves other domains unaffected, demonstrating strict opt-out guarantees.
- Scaling: Applying the FlexOlmo recipe to a strong 7B model pretrained on 4T tokens yields further improvements, especially on math and code tasks, without catastrophic forgetting.
Practical Implications
FlexOlmo provides a viable solution for organizations in regulated industries (e.g., healthcare, finance) and collaborative research settings where data cannot be pooled. It enables:
- Asynchronous, privacy-preserving collaboration: Data owners contribute model improvements without sharing data.
- Dynamic compliance: Models can be reconfigured at inference to respect data licensing, user permissions, or regulatory changes.
- Continual updates: New experts can be added as new data becomes available, without retraining the entire model.
Deployment and Resource Considerations
- Inference Cost: MoE inference is more expensive than inference with a comparable dense model, since multiple experts are activated per token. However, performance plateaus after activating four experts, allowing for efficient sparse inference.
- Data Extraction Risk: Empirical analysis shows low but nonzero risk of data extraction from expert weights. For sensitive data, differentially private training of experts is recommended, which is orthogonal to the FlexOlmo architecture.
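The inference-cost argument above is a simple active-parameter calculation. A back-of-envelope sketch (the function and all numbers are illustrative assumptions, not figures from the paper):

```python
def active_params_per_token(n_active_experts, ffn_params, shared_params):
    """Rough active-parameter count per token for a FlexOlmo-style MoE:
    shared attention/embedding parameters plus one FFN per activated
    expert. Illustrative accounting only."""
    return shared_params + n_active_experts * ffn_params
```

Because cost grows linearly with the number of activated experts while quality plateaus around four, capping top-k activation bounds inference cost with little accuracy loss.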
Theoretical and Future Directions
FlexOlmo demonstrates that modular, expert-based architectures can achieve strong performance without centralized data, challenging the necessity of joint training for high-capacity LMs. The approach opens several avenues for future research:
- Advanced Routing: Learning more sophisticated, context-aware routing strategies could further improve expert utilization and reduce interference.
- Hierarchical and Dynamic Expert Composition: Exploring hierarchical MoE structures or dynamic expert creation/destruction could enhance scalability and adaptability.
- Formal Privacy Guarantees: Integrating formal privacy-preserving mechanisms (e.g., differential privacy, secure aggregation) at the expert level.
- Federated and Cross-Organization Collaboration: Enabling large-scale, cross-institutional LM training with strict data governance.
Conclusion
FlexOlmo provides a practical, modular framework for training and deploying LMs under real-world data constraints. Its architecture and training methodology enable strong performance, strict data opt-out, and flexible collaboration, making it a compelling solution for regulated and privacy-sensitive domains. The empirical results validate the effectiveness of modular expert merging and suggest that further advances in modularity and routing could play a central role in the next generation of open, collaborative LLMs.