
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity (2312.03441v6)

Published 6 Dec 2023 in cs.CV

Abstract: Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders models from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new dataset named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition to standard in-domain evaluation, we also propose a special evaluation paradigm more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient algorithm especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench.

Authors (8)
  1. Jialong Zuo (22 papers)
  2. Hanyu Zhou (19 papers)
  3. Ying Nie (15 papers)
  4. Feng Zhang (180 papers)
  5. Tianyu Guo (33 papers)
  6. Nong Sang (86 papers)
  7. Yunhe Wang (145 papers)
  8. Changxin Gao (76 papers)
Citations (11)

Summary

Analysis of UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

The paper "UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity" addresses a limitation of existing text-based person retrieval datasets: their textual descriptions are too coarse. The researchers introduce a new benchmark, UFineBench, built on ultra-fine-grained annotations so that models can learn to comprehend the detailed query semantics found in real-world applications.

Overview

The authors identify a gap in current datasets, whose coarse-grained annotations tend to reduce the task to attribute-based retrieval. To resolve this, they present UFine6926, a dataset containing 6,926 identities in which each image is manually annotated with two detailed textual descriptions averaging 80.8 words each, three to four times the average word count of previous datasets. The dataset draws images from diverse, unconstrained sources and relies on careful manual annotation to produce high-quality, detailed text-to-image pairs.

Furthermore, the paper introduces UFine3C, an evaluation set designed to reflect real-world conditions more closely by spanning different domains, textual granularities, and textual styles, so that models are tested against the variability found in practice. A new metric, mean Similarity Distribution (mSD), is proposed to address a weakness of existing evaluation methods, which rely on discrete rank measures; mSD instead draws on continuous similarity distributions to give a finer-grained picture of retrieval performance.
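The difference between a discrete rank measure and a continuous similarity-based score can be illustrated with a small sketch. This is not the paper's definition of mSD, which is not reproduced on this page. The function `positive_similarity_mass`, its `temperature` parameter, and the assumption of exactly one matching gallery image per query are all hypothetical choices made only to show why a continuous score can separate two systems that a rank measure treats as identical.

```python
import numpy as np


def rank_k_accuracy(sim, gt, k=1):
    """Discrete rank measure: share of queries whose match is in the top k.

    sim: (num_queries, num_gallery) text-to-image similarity matrix.
    gt:  (num_queries,) index of the single matching gallery image
         (one match per query is a simplifying assumption).
    """
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == gt[:, None]).any(axis=1)
    return float(hits.mean())


def positive_similarity_mass(sim, gt, temperature=0.1):
    """Hypothetical continuous score (NOT the paper's mSD).

    Turns each row of similarities into a softmax distribution over the
    gallery and averages the probability assigned to the true match.
    """
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return float(probs[np.arange(len(gt)), gt].mean())


# Two toy systems: both rank the true match first, but with different margins.
gt = np.array([0, 1])
sim_confident = np.array([[0.90, 0.10], [0.20, 0.80]])
sim_marginal = np.array([[0.51, 0.49], [0.49, 0.51]])

print(rank_k_accuracy(sim_confident, gt), rank_k_accuracy(sim_marginal, gt))
print(positive_similarity_mass(sim_confident, gt),
      positive_similarity_mass(sim_marginal, gt))
```

Both toy systems obtain the same Rank-1 accuracy, while the continuous score is higher for the system that separates the true match from the distractor by a wider margin. That is the kind of distinction a distribution-based metric is meant to capture.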

Methodology

The paper proposes a new framework, Cross-modal Fine-grained Aligning and Matching (CFAM), which mines fine-grained detail through a shared cross-modal granularity decoder and a hard negative match mechanism. According to the authors, CFAM achieves competitive retrieval performance across multiple datasets by improving both local and global alignment between visual and textual features.
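The general idea behind hard negative matching can be sketched as follows. This is a generic in-batch illustration, not CFAM's actual mechanism, whose details are not given on this page. The function names, the hinge-style loss, the `margin` value, and the assumption that each row and column of the batch corresponds to a distinct identity are all hypothetical.

```python
import numpy as np


def hardest_negatives(sim):
    """For each text, pick the most similar non-matching image in the batch.

    sim: (B, B) similarity matrix between B texts (rows) and B images
         (columns); entry (i, i) is the matched pair. Assumes every batch
         item is a distinct identity (a simplification).
    """
    masked = sim.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the true match
    return masked.argmax(axis=1)


def hard_negative_hinge_loss(sim, margin=0.2):
    """Hinge loss pushing each true pair above its hardest negative by `margin`."""
    idx = hardest_negatives(sim)
    pos = np.diag(sim)
    neg = sim[np.arange(sim.shape[0]), idx]
    return float(np.maximum(0.0, margin + neg - pos).mean())


sim = np.array([
    [0.90, 0.70, 0.10],
    [0.30, 0.80, 0.60],
    [0.50, 0.20, 0.95],
])
print(hardest_negatives(sim))
print(hard_negative_hinge_loss(sim, margin=0.5))
```

Training on the hardest negative rather than a random one concentrates the learning signal on the pairs the model currently confuses, which matters more when descriptions differ only in fine details.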

Empirical Evaluation

Under standard in-domain evaluation, CFAM is competitive across several datasets, most notably on the ultra-fine-grained UFine6926. On the cross-domain UFine3C evaluation set, the authors report that training on UFine6926 significantly improves generalization to real scenarios compared with training on coarser-grained datasets.

Implications and Future Directions

This research lays a foundation for text-based person retrieval with fine-grained descriptions and is relevant to other applications that require precise understanding of human-centric queries. UFineBench also shows that progress here depends on both detailed data annotation and algorithms designed to exploit that detail.

The work could inform further development of multimodal frameworks that use ultra-fine-grained text, for example in surveillance, personalized recommender systems, and human-computer interaction. Future investigations might explore larger and more diverse datasets, or other network architectures that improve retrieval accuracy and computational efficiency.

In sum, the paper contributes a dataset, an evaluation paradigm, and an algorithm that together shift text-based person retrieval toward finer granularity and more realistic evaluation.
