Generalizing ResAdapt beyond video resizing

Establish whether two extensions of ResAdapt — (1) broadening the training mixture to jointly include image and video data, and (2) implementing alternative pre-encoding visual budget operators, particularly hard frame selection — can generalize the learned input-side allocation policy beyond continuous resizing and yield consistently efficiency-preserving performance on image-centric benchmarks.

Background

ResAdapt is instantiated and trained primarily for video tasks using continuous per-frame resizing as the pre-encoding operator. The authors observe that transfer beyond this regime is uneven: while the policy sometimes increases fidelity for specific static images (e.g., charts), it fails to deliver uniformly efficiency-preserving gains on image-centric benchmarks.

To broaden validation and address the uneven transfer, the authors explicitly identify two open directions: (1) extending the training mixture to include both image and video data, and (2) exploring alternative pre-encoding operators beyond resizing, such as hard frame selection. These directions aim to test and potentially improve the generality of input-side allocation across modalities and operators.
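To make the contrast between the two operator families concrete, the sketch below illustrates the difference between a continuous resizing operator and a discrete hard-frame-selection operator applied before encoding. The paper does not specify an implementation for either direction; the function names, the crop-based stand-in for real interpolation, and the score-based selection policy are all illustrative assumptions, not ResAdapt's method.

```python
import numpy as np

def resize_operator(frames, scales):
    """Continuous per-frame resizing (the regime ResAdapt is trained in):
    every frame is kept, but its spatial size — and hence its token
    cost — varies smoothly with a predicted per-frame scale.

    NOTE: real systems would interpolate; cropping to the scaled size
    is a stand-in to keep this sketch dependency-free."""
    out = []
    for f, s in zip(frames, scales):
        h = max(1, int(f.shape[0] * s))
        w = max(1, int(f.shape[1] * s))
        out.append(f[:h, :w])
    return out

def hard_frame_selection(frames, scores, budget):
    """Hard frame selection (the proposed alternative operator): keep the
    `budget` highest-scoring frames at full resolution and drop the rest,
    i.e. a discrete 0/1 allocation instead of a continuous one."""
    keep = np.argsort(scores)[::-1][:budget]  # top-`budget` by score
    keep = np.sort(keep)                      # preserve temporal order
    return [frames[i] for i in keep], keep

# Both operators consume the same pre-encoding interface (a frame list
# plus a per-frame signal), which is what makes them interchangeable
# test beds for the same input-side allocation policy.
video = [np.zeros((8, 8)) for _ in range(6)]
resized = resize_operator(video, scales=[0.5] * 6)
selected, kept = hard_frame_selection(
    video, scores=[0.1, 0.9, 0.3, 0.8, 0.2, 0.5], budget=3
)
```

The key design difference: resizing yields a differentiable, smoothly varying token budget, whereas hard selection is a combinatorial choice, which is part of why transfer of a policy learned under one operator to the other is an open question.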

References

Extending the training mixture to image–video data and exploring alternative operators, such as hard frame selection, remain open problems.

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning  (2603.28610 - Liao et al., 30 Mar 2026) in Limitations and Future Work, item (iii)