
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification (2403.10254v1)

Published 15 Mar 2024 in cs.CV, cs.IR, and cs.MM

Abstract: Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at https://github.com/924973292/EDITOR.
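To make the token-selection idea concrete, below is a minimal, hypothetical sketch of scoring patch tokens with a spatial cue (class-token attention) and a frequency cue (high-frequency energy of each token embedding) and keeping the top-k. The function name `sfts_select`, the FFT-based frequency measure, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual SFTS implementation; the official code is at the repository linked in the abstract.

```python
# Hypothetical sketch of spatial-frequency token scoring and top-k selection.
# Names and the FFT-based frequency cue are illustrative assumptions, not the
# authors' implementation (see https://github.com/924973292/EDITOR).
import torch


def sfts_select(patch_tokens, cls_attn, keep_ratio=0.5, alpha=0.5):
    """Select object-centric patch tokens from a ViT.

    patch_tokens: (B, N, D) patch embeddings from a shared ViT backbone.
    cls_attn:     (B, N) class-token attention over patches (spatial cue).
    keep_ratio:   fraction of tokens to keep.
    alpha:        balance between spatial and frequency cues.
    """
    B, N, D = patch_tokens.shape

    # Frequency cue: high-frequency energy of each token's feature vector,
    # approximated here with an FFT along the channel dimension (assumption).
    spec = torch.fft.rfft(patch_tokens, dim=-1).abs()      # (B, N, D//2+1)
    high_freq = spec[..., spec.shape[-1] // 2:].mean(-1)   # (B, N)

    # Normalize both cues to [0, 1] per image before mixing.
    def norm(x):
        x = x - x.amin(dim=1, keepdim=True)
        return x / (x.amax(dim=1, keepdim=True) + 1e-6)

    score = alpha * norm(cls_attn) + (1 - alpha) * norm(high_freq)

    # Keep the k highest-scoring (most object-centric) tokens.
    k = max(1, int(N * keep_ratio))
    idx = score.topk(k, dim=1).indices                     # (B, k)
    selected = torch.gather(
        patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D)
    )                                                      # (B, k, D)
    return selected, idx


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)          # e.g. ViT-B/16 patch tokens
    attn = torch.rand(2, 196).softmax(dim=-1)  # class-token attention
    kept, kept_idx = sfts_select(tokens, attn, keep_ratio=0.3)
    print(kept.shape)                          # torch.Size([2, 58, 768])
```

In a multi-modal pipeline, such a selection step would be applied per modality before cross-modal aggregation, so that background tokens are discarded before features interact.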

Authors (5)
  1. Pingping Zhang (69 papers)
  2. Yuhao Wang (144 papers)
  3. Yang Liu (2253 papers)
  4. Zhengzheng Tu (21 papers)
  5. Huchuan Lu (199 papers)
Citations (8)
