FlexOlmo: Modular Open Language Model
- FlexOlmo is a modular open language model framework that uses a mixture-of-experts transformer architecture to integrate diverse, domain-specific data without centralized aggregation.
- It employs a domain-informed routing mechanism that flexibly includes or excludes expert modules at inference time, ensuring regulatory compliance and privacy preservation.
- Empirical evaluations indicate FlexOlmo improves performance by an average of 41% over public-only models while outperforming traditional model merging approaches.
FlexOlmo refers to a class of open LLMs designed around flexible data usage through distributed, privacy-preserving training and parameter modularity. Developed as a solution for scenarios where central data aggregation is impractical or undesirable, FlexOlmo employs a mixture-of-experts (MoE) transformer architecture in which each expert is trained on separate (often closed or proprietary) datasets. Through a novel domain-informed routing mechanism, these experts can be flexibly combined or excluded at inference time, allowing for both regulatory compliance and collaborative model improvement. This paradigm expands the capabilities of language modeling in regulated industries and sensitive-data contexts by enabling joint utilization of separated data sources with fine-grained inclusion control, while surpassing prior methods in performance and operational versatility (Shi et al., 9 Jul 2025).
1. Architectural Principles
FlexOlmo is structured on a mixture-of-experts (MoE) transformer backbone. Unlike conventional transformers with monolithic feedforward network (FFN) blocks, each FFN in a FlexOlmo layer is replaced by a cluster of independently trained expert modules, coordinated by a learned router. A designated “public” expert is trained on open data, while additional “domain-specific” experts are trained on closed, locally maintained datasets.
The routing function is informed by domain-specific embeddings, constructed by averaging representations from a pretrained embedder across a subset of domain data. The router matrix thus comprises one row per expert:

$$R = [\,r_0;\ r_1;\ \dots;\ r_n\,] \in \mathbb{R}^{(n+1) \times d},$$

where $n$ is the number of domain-specific experts (row $r_0$ belongs to the public expert) and $d$ is the embedding size. Each row is initialized as:

$$r_i = \frac{1}{|D_i|} \sum_{x \in D_i} \mathrm{Emb}(x),$$

with $D_i$ a subset of domain $i$'s data and $\mathrm{Emb}$ an off-the-shelf embedder.

At inference, for a token embedding $h$, the layer output is computed as:

$$y = \sum_{i \,\in\, \mathrm{TopK}(\mathrm{softmax}(Rh))} \mathrm{softmax}(Rh)_i \, E_i(h),$$

where $E_i$ is the $i$-th expert's module and routing selects the top-$k$ experts per input.
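The routing computation above can be sketched in a few lines of NumPy. The helper names (`init_router_row`, `moe_forward`, the `embed` callable) are illustrative, not taken from the paper; this is a minimal sketch of domain-informed top-k routing, not the actual implementation:

```python
import numpy as np

def init_router_row(domain_texts, embed):
    """Router row for one domain: mean of off-the-shelf embeddings
    over a sample of that domain's data (embed is a hypothetical embedder)."""
    return np.mean([embed(t) for t in domain_texts], axis=0)

def moe_forward(h, R, experts, k=2):
    """Route token embedding h to the top-k experts, weighted by softmax(R @ h)."""
    logits = R @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    topk = np.argsort(probs)[-k:]  # indices of the k highest-scoring experts
    return sum(probs[i] * experts[i](h) for i in topk)
```

Because each row of `R` is just an average embedding, a new expert's router row can be computed locally by its data owner without seeing any other domain's data.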
2. Distributed Training without Data Sharing
Training in FlexOlmo explicitly avoids centralized data pooling. It consists of two decoupled phases:
- The public model is trained on a large, openly available corpus.
- For each closed dataset $D_i$, an expert FFN $E_i$ is independently trained, starting from the public model’s FFNs. During this training:
- The public expert and core shared parameters (e.g., attention weights) are frozen.
- Only the domain-specific expert FFN and its router embedding are updated.
- Training is conducted with both the public expert (frozen) and the trainable expert active, enforcing “training to coordinate” so each expert remains compatible for later merging.
No raw data is shared at any stage. Once training is complete, the experts and their router embeddings are merged via simple concatenation, and further joint training is unnecessary. This methodology enables multiple data owners to contribute to a unified model without ceding control over their sensitive or proprietary data.
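Since merging reduces to concatenating expert modules and their router embeddings, the post-training merge step can be sketched as follows. The class and method names (`FlexOlmoLayer`, `merge_expert`) are hypothetical, chosen only to illustrate that no joint retraining is involved:

```python
import numpy as np

class FlexOlmoLayer:
    """Minimal sketch of one MoE layer's expert registry (illustrative, not the real API)."""

    def __init__(self, public_expert, public_row):
        self.experts = [public_expert]      # index 0: the frozen public expert
        self.router_rows = [public_row]     # its domain-informed router embedding

    def merge_expert(self, expert_fn, router_row):
        """Merging an independently trained expert is simple concatenation."""
        self.experts.append(expert_fn)
        self.router_rows.append(router_row)

    @property
    def R(self):
        """Router matrix: one row per currently merged expert."""
        return np.stack(self.router_rows)
```

Each data owner trains its expert and router row locally, then contributes only those parameters; the merged layer's router matrix simply grows by one row per expert.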
3. Data-Flexible Inference
A central feature of FlexOlmo is its “data-flexible” inference mechanism. Each expert module and its associated router embedding correspond to a specific data domain. At inference time, users—such as data owners or downstream application operators—can include or exclude any subset of expert modules simply by adding or removing their router row from $R$. No retraining or fine-tuning is needed. This design allows:
- Opting out of certain domains due to evolving data licensing, privacy, or regulatory concerns.
- Tailoring model performance or risk profiles to specific deployment settings.
- Enabling or disabling collaboration among different data owners on demand.
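Opting a domain out at inference then amounts to dropping its router row and expert module, with no retraining. A minimal illustrative sketch (the function name is an assumption, not from the paper):

```python
import numpy as np

def exclude_experts(R, experts, excluded):
    """Return a router matrix and expert list with opted-out domains removed.

    R        : (n_experts, d) router matrix, one row per expert
    experts  : list of expert modules, aligned with rows of R
    excluded : set of expert indices to drop (e.g., domains that opted out)
    """
    keep = [i for i in range(len(experts)) if i not in excluded]
    return R[keep], [experts[i] for i in keep]
```

Because the softmax over router logits is recomputed from whatever rows remain, the reduced model routes only among the included experts.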
This mechanism distinguishes FlexOlmo from standard MoE models, which require joint training and centrally fixed expert compositions, and from prior merging approaches which often demand additional retraining for each configuration.
4. Evaluation on Diverse Downstream Tasks
FlexOlmo models were evaluated on 31 downstream tasks spanning 10 categories, including general language benchmarks, domain-specific tasks (such as mathematics, coding, news generation, and creative writing), and both open and closed data domains. Notable empirical outcomes include:
- An average relative performance improvement of 41% over the public-only baseline.
- Outperforming established model merging methods (e.g., model soup, branch-train-merge, prompt-routing) by approximately 10.1% on average.
- Near matching of an unrestricted MoE baseline (all data pooled, joint training) under comparable compute and data budgets, falling only marginally short of this full-control MoE despite never sharing the closed data.
Performance details are cataloged in the corresponding tables of the originating paper, indicating that FlexOlmo effectively integrates independently trained experts while maintaining state-of-the-art accuracy.
5. Applications in Regulated and Collaborative Settings
FlexOlmo addresses the demands of settings where data sensitivity, compliance, or organizational compartmentalization preclude traditional data aggregation:
- Healthcare, finance, and legal industries, where access to certain texts is highly regulated.
- Multi-institution collaboration, enabling each stakeholder to control their data's downstream influence.
- Incremental model composition for data owners with evolving opt-in/opt-out requirements.
Such operational flexibility not only supports regulatory compliance but also promotes broader cooperation and data utilization without sacrificing privacy or oversight. The model's explicit modularity and absence of retraining overhead for expert inclusion/exclusion have practical implications for enterprise deployments.
6. Comparison with Prior MoE and Model Merging Approaches
FlexOlmo is differentiated from prior mixture-of-experts and ensembling methods via its training and routing strategy:
- Standard MoE: Requires centralized, joint training and static domain structure.
- Model Soup and BTM: Merge models through parameter averaging or sequential retraining, often resulting in suboptimal domain coordination and requiring new training/aggregation rounds for every changed configuration.
- FlexOlmo: Independently trained experts coordinate through a domain-informed router, learn to work with the public expert in a “training to coordinate” paradigm, and allow immediate, arbitrary modular inclusion/exclusion at inference.
Empirical results show that this approach matches or exceeds the performance of more rigid or computationally intensive methods while offering significant operational advantages in flexibility and privacy.
7. Implications and Extensions
FlexOlmo establishes a new template for building modular, privacy-preserving LLMs with dynamic data participation. Its design enables further research and application in domains requiring federated learning principles, modular model governance, or variable data/deployment profiles. While the core approach is demonstrated in the context of transformer-based LMs, the domain-informed modular routing methodology has potential applicability in other machine learning systems that require distributed, composable architectures under data governance constraints.
A plausible implication is that further developments may expand on this routing architecture or extend the method to other modalities and types of expert modules, as well as optimizing for efficiency as the number of experts and deployments scale. The fundamental principle of decoupling training and inference composition around domain-informed, easily updatable routing is likely to be influential in future collaborative AI system design.