Scaling generative approaches to more challenging settings

Develop scalable methods that extend decoder-based generative perception, in which the decoder is constrained to the interaction function class F_int and inverted via gradient-based search and generative replay, to visual settings of greater complexity and scale, while preserving the approach's compositional generalization guarantees.

Background

The paper argues that data-efficient compositional generalization in visual perception is theoretically achievable with a generative approach: constrain a decoder to the function class F_int and obtain representations by inverting this decoder via search and replay. In contrast, constraining encoders to the corresponding inverse class G_int is generally infeasible, because G_int depends on the unknown geometry of the out-of-domain data manifold.
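As a toy illustration of what "inverting the decoder via search" means, the sketch below uses a hypothetical two-slot linear decoder as a stand-in for a member of F_int (the paper's decoders are far richer) and recovers the latent representation by gradient descent on reconstruction error in latent space; the weights, names, and dimensions are all illustrative assumptions, not the paper's implementation.

```python
# Toy sketch: recover a latent code by inverting a fixed decoder with
# gradient-based search. The linear "decoder" is a hypothetical stand-in
# for a member of the interaction function class F_int.

def decode(A, z):
    """Decoder: observation x = A @ z (2x2 toy case)."""
    return [A[0][0]*z[0] + A[0][1]*z[1],
            A[1][0]*z[0] + A[1][1]*z[1]]

def invert_by_search(A, x, lr=0.2, steps=500):
    """Gradient-based search: argmin_z ||decode(A, z) - x||^2."""
    z = [0.0, 0.0]  # initial guess for the latent code
    for _ in range(steps):
        r = [d - t for d, t in zip(decode(A, z), x)]  # residual
        # Analytic gradient of the squared error: 2 * A^T @ r
        g = [2*(A[0][0]*r[0] + A[1][0]*r[1]),
             2*(A[0][1]*r[0] + A[1][1]*r[1])]
        z = [zi - lr*gi for zi, gi in zip(z, g)]
    return z

A = [[1.0, 0.5], [0.2, 1.0]]   # invertible toy decoder weights
z_true = [0.3, -0.7]
x = decode(A, z_true)           # observation generated from z_true
z_hat = invert_by_search(A, x)  # search recovers z_true from x alone
```

Note that the search procedure only ever queries the decoder in the generative direction; no encoder constrained to G_int is needed, which is the structural point the paper makes.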

Empirically, the authors demonstrate significant out-of-domain gains on photorealistic datasets using generative inversion strategies (gradient-based search and generative replay), highlighting the promise of the approach. However, they note that the studied settings and function classes are limited in scope, and they explicitly state that scaling these generative approaches to more challenging settings remains open.
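The generative replay strategy can likewise be sketched in miniature: use the frozen decoder to synthesize (latent, observation) pairs at sampled latent combinations, then distill a fast amortized encoder from that synthetic data. The linear decoder and encoder below are illustrative assumptions standing in for the paper's models, not its actual architecture.

```python
import random

def decode(A, z):
    """Toy linear decoder, a stand-in for a member of F_int."""
    return [A[0][0]*z[0] + A[0][1]*z[1],
            A[1][0]*z[0] + A[1][1]*z[1]]

def generative_replay(A, n=200, lr=0.2, steps=2000, seed=0):
    """Distill a linear encoder E from (z, x) pairs synthesized by the decoder."""
    rng = random.Random(seed)
    # Replay: sample latents, generate observations with the frozen decoder.
    zs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    xs = [decode(A, z) for z in zs]
    # Train the encoder E by gradient descent on mean ||E @ x - z||^2.
    E = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        g = [[0.0, 0.0], [0.0, 0.0]]
        for z, x in zip(zs, xs):
            pred = [E[0][0]*x[0] + E[0][1]*x[1],
                    E[1][0]*x[0] + E[1][1]*x[1]]
            r = [pred[0] - z[0], pred[1] - z[1]]
            for i in range(2):
                for j in range(2):
                    g[i][j] += 2 * r[i] * x[j] / n
        for i in range(2):
            for j in range(2):
                E[i][j] -= lr * g[i][j]
    return E

A = [[1.0, 0.5], [0.2, 1.0]]
E = generative_replay(A)  # E converges toward the decoder's inverse
# Check the distilled encoder on a latent combination not in the replay set.
z_new = [0.8, 0.4]
x_new = decode(A, z_new)
z_rec = [E[0][0]*x_new[0] + E[0][1]*x_new[1],
         E[1][0]*x_new[0] + E[1][1]*x_new[1]]
```

In this noise-free linear case the distilled encoder converges to the exact decoder inverse; the open problem is precisely whether such search-and-replay schemes remain tractable when the decoder and scenes grow far more complex.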

References

"While scaling such generative approaches to more challenging settings remains an open problem, we hope our findings will inspire renewed interest in this direction."

Generation is Required for Data-Efficient Perception (Brady et al., arXiv:2512.08854, 9 Dec 2025), Conclusion (Section 7)