Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels (2403.14430v1)
Abstract: This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple valid answers. However, due to annotation costs, the labels in existing benchmarks are extremely sparse, typically one answer per question. As a result, existing methods tend to treat all unlabeled answers as negatives, which limits their ability to generalize. In this work, we introduce a simple yet effective ranking distillation framework (RADI) that mitigates this problem without additional manual annotation. RADI employs a teacher model trained with the incomplete labels to generate rankings over candidate answers; these rankings carry rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher, we further present two robust, parameter-free ranking distillation approaches: a pairwise approach that introduces adaptive soft margins to dynamically refine the optimization constraints on different pairwise rankings, and a listwise approach that adopts sampling-based partial listwise learning to resist bias in the teacher's rankings. Extensive experiments on five popular benchmarks consistently show that both the pairwise and listwise variants of RADI outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient-labeling problem.
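To make the two objectives concrete, below is a minimal PyTorch sketch of how pairwise distillation with adaptive soft margins and sampling-based partial listwise distillation might look. This is an assumption-laden illustration, not the paper's implementation: the soft margin is taken to be proportional to the teacher's score gap, the listwise term uses a ListNet-style top-1 cross-entropy over a randomly sampled sublist, and the function names and hyperparameters (`num_pairs`, `margin_scale`, `list_size`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_radi_loss(student_scores, teacher_scores, num_pairs=64, margin_scale=1.0):
    """Pairwise distillation with adaptive soft margins (illustrative).

    `student_scores` / `teacher_scores`: 1-D tensors of answer scores for one
    question. The margin for each sampled pair grows with the teacher's score
    gap, so confidently ordered pairs are constrained more tightly than
    ambiguous ones (an assumed margin design, not the paper's exact one).
    """
    n = student_scores.size(-1)
    # Sample random answer pairs and orient each pair by the teacher ranking.
    i = torch.randint(n, (num_pairs,))
    j = torch.randint(n, (num_pairs,))
    higher = torch.where(teacher_scores[i] >= teacher_scores[j], i, j)
    lower = torch.where(teacher_scores[i] >= teacher_scores[j], j, i)
    # Adaptive soft margin: larger teacher gap -> larger required margin.
    soft_margin = margin_scale * (teacher_scores[higher] - teacher_scores[lower]).detach()
    # Hinge on the student's score difference for each oriented pair.
    return F.relu(soft_margin - (student_scores[higher] - student_scores[lower])).mean()

def partial_listwise_radi_loss(student_scores, teacher_scores, list_size=16):
    """Sampling-based partial listwise distillation (ListNet-style top-1).

    Each call scores only a random subset of answers, so no single loss term
    depends on the teacher's full (possibly biased) ranking.
    """
    idx = torch.randperm(student_scores.size(-1))[:list_size]
    # Cross-entropy between teacher and student top-1 distributions
    # over the sampled partial list.
    p_teacher = F.softmax(teacher_scores[idx].detach(), dim=-1)
    log_p_student = F.log_softmax(student_scores[idx], dim=-1)
    return -(p_teacher * log_p_student).sum()

# Example: distilling teacher scores over a 1000-answer vocabulary.
student = torch.randn(1000, requires_grad=True)
teacher = torch.randn(1000)
loss = pairwise_radi_loss(student, teacher) + partial_listwise_radi_loss(student, teacher)
loss.backward()
```

Both losses are parameter-free in the sense that they introduce no learnable weights; in practice they would presumably be combined with the standard classification loss on the labeled answer.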