Defining expert specialization in high-dimensional continuous modalities
Determine a general framework for defining and assigning expert specialization in sparse Mixture-of-Experts architectures that handle high-dimensional, continuous modalities in multimodal generative models. Specifically, ascertain whether human motion experts should specialize by anatomical part (for example, limbs) or by motion style, and whether image experts should specialize by spatial region, object type, or level of visual abstraction, so that the resulting definition supports modularity across modalities.
References
A persistent open question is how to define expertise within high-dimensional, continuous spaces: should motion experts focus on limbs or styles? Should image experts specialize by region, object type, or visual abstraction? While some models like RAPHAEL and MoLE explore spatial and part-specific specialization, a general framework for defining modularity across modalities is still lacking.