UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity (2312.03441v6)
Abstract: Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders models from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. First, we construct a new dataset named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of previous datasets. In addition to standard in-domain evaluation, we also propose a special evaluation paradigm that is more representative of real scenarios. It contains a new evaluation set, named UFine3C, that crosses domains, textual granularities, and textual styles, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient algorithm designed specifically for text-based person retrieval with ultra fine-grained texts. It achieves fine-granularity mining by adopting a shared cross-modal granularity decoder and a hard negative matching mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench.
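The abstract names a hard negative matching mechanism but does not spell it out. As a rough illustration of the general idea only, the sketch below picks, for each query text, the most similar gallery image that belongs to a different person. This is a generic in-batch hard negative selection and an assumption on our part, not CFAM's actual implementation; the function name `hardest_negatives`, the toy similarity matrix, and the identity labels are all hypothetical.

```python
from typing import List, Optional


def hardest_negatives(
    sim: List[List[float]],
    text_ids: List[int],
    image_ids: List[int],
) -> List[Optional[int]]:
    """For each text, return the index of the most similar image
    whose identity differs from the text's identity (None if no such image)."""
    picks: List[Optional[int]] = []
    for t, row in enumerate(sim):
        best_idx: Optional[int] = None
        best_score = float("-inf")
        for j, score in enumerate(row):
            if image_ids[j] == text_ids[t]:
                continue  # same person: a positive pair, not a negative
            if score > best_score:
                best_idx, best_score = j, score
        picks.append(best_idx)
    return picks


# Toy data (hypothetical): sim[t][j] is the similarity of text t to image j.
sim = [
    [0.90, 0.70, 0.20],
    [0.60, 0.80, 0.75],
    [0.10, 0.30, 0.95],
]
text_ids = [0, 1, 2]   # person identity described by each text
image_ids = [0, 1, 2]  # person identity shown in each image

print(hardest_negatives(sim, text_ids, image_ids))  # [1, 2, 1]
```

In a training loop, the selected image-text pairs would typically be fed to a matching head as negative examples; how CFAM does this in practice should be checked against the paper and the released code.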
- Jialong Zuo
- Hanyu Zhou
- Ying Nie
- Feng Zhang
- Tianyu Guo
- Nong Sang
- Yunhe Wang
- Changxin Gao