Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels (2403.14430v1)

Published 21 Mar 2024 in cs.CV

Abstract: This paper focuses on open-ended video question answering, which aims to find the correct answers to a video-related question from a large answer set. This is essentially a multi-label classification task, since a question may have multiple valid answers. However, due to annotation costs, the labels in existing benchmarks are severely insufficient, typically one answer per question. As a result, existing works tend to treat all unlabeled answers as negative labels, which limits their ability to generalize. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings over potential answers; these rankings carry rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach, which introduces adaptive soft margins to dynamically refine the optimization constraints on different pairwise rankings, and a listwise approach, which adopts sampling-based partial listwise learning to resist bias in the teacher ranking. Extensive experiments on five popular benchmarks consistently show that both the pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
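The page includes no code, but the two distillation objectives named in the abstract lend themselves to a short sketch. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random pair and subset sampling, and the exact formulas (a hinge loss with a teacher-score-gap margin for the pairwise case, a temperature-softened KL divergence over a sampled partial list for the listwise case) are illustrative stand-ins for the mechanisms the abstract describes.

```python
import torch
import torch.nn.functional as F

def pairwise_radi_loss(student_logits, teacher_logits, num_pairs=128):
    """Pairwise ranking distillation with adaptive soft margins (sketch).

    The margin for each answer pair is taken from the teacher's score gap,
    so pairs the teacher is unsure about impose weaker constraints.
    Both inputs have shape (batch, num_answers).
    """
    B, A = student_logits.shape
    device = student_logits.device
    # Sample random answer pairs (i, j) for each example.
    i = torch.randint(0, A, (B, num_pairs), device=device)
    j = torch.randint(0, A, (B, num_pairs), device=device)
    t_gap = teacher_logits.gather(1, i) - teacher_logits.gather(1, j)
    s_gap = student_logits.gather(1, i) - student_logits.gather(1, j)
    # Soft margin = |teacher gap|; the hinge pushes the student to respect
    # the teacher's ordering of each pair by at least that margin.
    margin = t_gap.abs().detach()
    return F.relu(margin - torch.sign(t_gap).detach() * s_gap).mean()

def listwise_radi_loss(student_logits, teacher_logits,
                       list_size=32, temperature=1.0):
    """Sampling-based partial listwise distillation (ListNet-style sketch).

    Score distributions are compared only over a random subset of answers,
    so no single biased region of the teacher's full ranking dominates.
    """
    B, A = student_logits.shape
    device = student_logits.device
    # Sample a random answer subset for each example.
    idx = torch.stack([torch.randperm(A, device=device)[:list_size]
                       for _ in range(B)])
    p_t = F.softmax(teacher_logits.gather(1, idx) / temperature, dim=1)
    log_p_s = F.log_softmax(student_logits.gather(1, idx) / temperature, dim=1)
    return F.kl_div(log_p_s, p_t.detach(), reduction="batchmean")
```

Note how both losses are parameter-free in the sense the abstract suggests: the pairwise margin is read directly off the (detached) teacher score gap rather than tuned, and the listwise loss introduces no learned weights beyond a sampling size and temperature.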
