TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification (2312.09627v1)
Abstract: Large-scale language-image pre-trained models (e.g., CLIP) have shown strong performance on many cross-modal retrieval tasks. However, transferring the knowledge learned by such models to video-based person re-identification (ReID) has barely been explored. In addition, current ReID benchmarks lack suitable text descriptions. To address these issues, we propose TF-CLIP, a novel one-stage, text-free, CLIP-based learning framework for video-based person ReID. Specifically, we extract the identity-specific sequence feature as the CLIP-Memory, which replaces the text feature. We also design a Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online. To capture temporal information, we further propose a Temporal Memory Diffusion (TMD) module, which consists of two key components: Temporal Memory Construction (TMC) and Memory Diffusion (MD). Technically, TMC lets the frame-level memories in a sequence communicate with each other and extracts temporal information from the relations within the sequence. MD then diffuses the temporal memories to each token in the original features to obtain more robust sequence features. Extensive experiments demonstrate that the proposed method outperforms other state-of-the-art methods on MARS, LS-VID and iLIDS-VID. The code is available at https://github.com/AsuradaYuci/TF-CLIP.
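To make the idea of a memory standing in for the text feature more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the CLIP-Memory for an identity is the normalized mean of that identity's sequence features, and that a sequence is scored against the memory with cosine similarity and a softmax cross-entropy, as CLIP does with text features. The function names (`build_memory`, `memory_loss`), the averaging rule, and the temperature of 0.07 are all illustrative assumptions; the SSP and TMD modules are not modeled here.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def build_memory(seq_feats, ids):
    """Assumed rule: one memory vector per identity, the normalized mean
    of that identity's normalized sequence features."""
    groups = defaultdict(list)
    for feat, pid in zip(seq_feats, ids):
        groups[pid].append(l2_normalize(feat))
    memory = {}
    for pid, feats in groups.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        memory[pid] = l2_normalize(mean)
    return memory


def memory_loss(seq_feat, pid, memory, temperature=0.07):
    """Cross-entropy over cosine similarities between one sequence feature
    and every identity memory (the memory plays the role of text features).
    The temperature value is an assumption, not taken from the paper."""
    q = l2_normalize(seq_feat)
    logits = {
        k: sum(a * b for a, b in zip(q, m)) / temperature
        for k, m in memory.items()
    }
    # Numerically stable log-sum-exp over all identities.
    max_logit = max(logits.values())
    log_z = max_logit + math.log(
        sum(math.exp(v - max_logit) for v in logits.values())
    )
    return log_z - logits[pid]


# Toy data: two identities, two sequences each, 2-D features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ids = [0, 0, 1, 1]
memory = build_memory(feats, ids)

query = [1.0, 0.05]  # resembles identity 0
loss_match = memory_loss(query, 0, memory)
loss_mismatch = memory_loss(query, 1, memory)
```

In this toy setup the loss is lower when the query is paired with the identity whose memory it resembles, which is the behavior a text-free contrastive objective of this kind would rely on.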
Authors: Chenyang Yu, Xuehu Liu, Yingquan Wang, Pingping Zhang, Huchuan Lu