
Meta-Personalizing Vision-Language Models to Find Named Instances in Video (2306.10169v1)

Published 16 Jun 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset.
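
The abstract's two core mechanisms, (1) parameterizing each personalized token as a mixture of shared, learned global category features plus an instance-specific residual, and (2) mining moments via vision-language similarity against transcript mentions, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration under assumed names and shapes (`InstanceToken`, `d_embed`, the softmax mixture, and `mine_moments` are all hypothetical); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceToken(nn.Module):
    """Hypothetical sketch: a new vocabulary token for a named instance,
    built from shared global category features plus a small
    instance-specific residual (shapes and names are illustrative)."""

    def __init__(self, d_embed: int, n_category_features: int = 16):
        super().__init__()
        # Shared dictionary of global category features, reusable
        # across all personalized instances of the same category.
        self.category_features = nn.Parameter(
            torch.randn(n_category_features, d_embed) * 0.02
        )
        # Per-instance mixing weights over the shared dictionary.
        self.mix_logits = nn.Parameter(torch.zeros(n_category_features))
        # Instance-specific residual capturing what the shared
        # category features cannot express.
        self.residual = nn.Parameter(torch.zeros(d_embed))

    def forward(self) -> torch.Tensor:
        weights = torch.softmax(self.mix_logits, dim=0)  # (K,)
        shared = weights @ self.category_features        # (K,) @ (K, D) -> (D,)
        return shared + self.residual                    # personalized embedding


def mine_moments(frame_feats: torch.Tensor,
                 mention_feat: torch.Tensor,
                 threshold: float = 0.3) -> torch.Tensor:
    """Hypothetical mining step: score frames near a transcript mention
    by cosine similarity to the mention's text embedding, keeping
    frames above a threshold as candidate instance moments."""
    sims = F.cosine_similarity(frame_feats, mention_feat[None, :], dim=-1)
    return (sims > threshold).nonzero(as_tuple=True)[0]


# Usage sketch: append the learned embedding to a frozen text encoder's
# vocabulary so a prompt like "a video of <biscuit> the dog" resolves
# the placeholder token to this personalized embedding.
token = InstanceToken(d_embed=512)
prompt_embedding = token()
```

Decomposing the token this way keeps category-generic appearance in the shared features, so the learned residual isolates only instance-specific detail, which is the stated goal of the personalization step.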

Authors (5)
  1. Chun-Hsiao Yeh (7 papers)
  2. Bryan Russell (36 papers)
  3. Josef Sivic (78 papers)
  4. Fabian Caba Heilbron (34 papers)
  5. Simon Jenni (25 papers)
Citations (7)
