QAConv-QA: Query Adaptive Convolution for ReID
- The paper introduces a module that integrates pixel-level importance weighting with bidirectional consistency, enhancing identity matching under severe clothing changes.
- It fuses RGB and parsing-based features through multi-modal attention and uses dynamic query-adaptive convolution to generate robust, clothing-invariant representations.
- Evaluations on PRCC, LTCC, and VC-Clothes benchmarks show significant Top-1 and mAP improvements, validating the method's effectiveness in CC-ReID.
Quality-Aware Query-Adaptive Convolution (QAConv-QA) is a module designed to enhance pixel-level matching within the dual-branch QA-ReID architecture, targeting the challenges of person re-identification (ReID) under severe clothing changes. QAConv-QA introduces two critical mechanisms—pixel-level importance weighting and explicit bidirectional consistency constraints—that together facilitate robust identity correspondence even as superficial appearance varies with clothing. This approach proves essential in clothes-changing ReID (CC-ReID), a setting characterized by strong intra-person appearance shifts.
1. Role of QAConv-QA in the Dual-Branch QA-ReID Framework
QAConv-QA is embedded within the two-branch backbone of QA-ReID, which utilizes complementary cues from RGB images and clothing-invariant structural features. The RGB branch extracts feature maps from the full image using ResNet-50 up to stage 3, producing $F_{\mathrm{rgb}}$. The parsing branch applies a human-parsing network to produce a body-part mask $M$, removing the clothing regions to form the masked input $I \odot M$ and generating $F_{\mathrm{parse}}$.
A multi-modal attention fusion module computes a joint attention map $A$, blending $F_{\mathrm{rgb}}$ and $F_{\mathrm{parse}}$ into a fused feature map $F$ via:

$$F = A \odot F_{\mathrm{rgb}} + (1 - A) \odot F_{\mathrm{parse}}.$$
QAConv-QA directly operates on these fused features at the pixel level, comparing query and gallery feature maps ($F^q$, $F^g$) through a sequence of similarity calculation, weighting, and aggregation, followed by a post-processing head (bidirectional global max pooling → batch norm → MLP → sigmoid) to yield match probabilities (Wang et al., 27 Jan 2026).
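To make the fusion step concrete, here is a minimal PyTorch sketch of an attention-gated blend of the two streams; the module name, the 1×1-conv gate, and the convex-combination form are illustrative assumptions, not the paper's verified design:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical multi-modal fusion: a 1x1-conv gate predicts an attention
    map A that blends the RGB and parsing feature maps (assumed form)."""
    def __init__(self, channels: int):
        super().__init__()
        # A is predicted from the concatenated streams and squashed to [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_parse: torch.Tensor) -> torch.Tensor:
        a = self.gate(torch.cat([f_rgb, f_parse], dim=1))  # attention map A
        return a * f_rgb + (1.0 - a) * f_parse             # fused map F
```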
2. Pixel-Level Importance Weighting
Each spatial location on the feature map receives a quality score reflecting the likelihood that it lies on an identity-relevant (typically non-clothing) region. The score is computed as the fraction of the corresponding input patch covered by the body-part mask and is normalized by a spatial softmax:

$$w_i = \frac{\exp(c_i)}{\sum_j \exp(c_j)},$$

where $c_i$ is the mask-coverage fraction at location $i$. The pairwise cosine similarity between query and gallery pixel features $f^q_i$, $f^g_j$ is then re-weighted:

$$\tilde{S}_{ij} = w^q_i\, w^g_j\, \cos\!\left(f^q_i, f^g_j\right).$$

This mechanism prioritizes features that localize to identity-stable, body-based regions and suppresses the influence of clothing-related areas.
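A minimal sketch of this weighting in PyTorch, assuming the coverage scores $c_i$ are obtained by average-pooling the binary parsing mask down to the feature resolution (function names are illustrative):

```python
import torch
import torch.nn.functional as F

def pixel_weights(mask: torch.Tensor, feat_hw: tuple) -> torch.Tensor:
    """Quality score per feature-map location: fraction of the receptive
    patch covered by the body-part mask, normalized by a spatial softmax.
    `mask` is (B, 1, H_img, W_img) with 1 = body part, 0 = clothing/background."""
    b = mask.shape[0]
    coverage = F.adaptive_avg_pool2d(mask.float(), feat_hw)  # patch coverage c_i
    return torch.softmax(coverage.view(b, -1), dim=1)        # (B, HW), sums to 1

def reweighted_similarity(fq, fg, wq, wg):
    """S1[i, j] = wq[i] * wg[j] * cos(fq_i, fg_j); fq, fg are (B, C, H, W)."""
    b, c = fq.shape[:2]
    fq = F.normalize(fq.view(b, c, -1), dim=1)       # unit-norm pixel features
    fg = F.normalize(fg.view(b, c, -1), dim=1)
    s_raw = torch.einsum('bci,bcj->bij', fq, fg)     # pairwise cosine similarity
    return wq.unsqueeze(2) * wg.unsqueeze(1) * s_raw # quality reweighting
```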
3. Bidirectional Consistency Constraints
To further enhance reliability in pixel-level matching, QAConv-QA introduces explicit bidirectional consistency. Conditional softmaxes are defined over feature locations, establishing the probability that a given pixel in one sample is the best match for a pixel in the other, and vice versa:

$$P^{q\to g}_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_{j'} \exp(\tilde{S}_{ij'})}, \qquad P^{g\to q}_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_{i'} \exp(\tilde{S}_{i'j})}.$$
The bidirectional-consistent similarity takes the product:

$$S^{\mathrm{bi}}_{ij} = P^{q\to g}_{ij}\, P^{g\to q}_{ij}.$$

Aggregating over all pixel pairs with bidirectional global max pooling (Bi-GMP) yields a scalar score $s$, emphasizing only mutually top-matching, identity-consistent region pairs.
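In code, the consistency step reduces to two softmaxes over opposite axes of the reweighted similarity followed by an elementwise product; a minimal sketch:

```python
import torch

def bidirectional_consistency(s1: torch.Tensor) -> torch.Tensor:
    """Given reweighted similarity S1 of shape (B, Nq, Ng), form the two
    conditional softmaxes and multiply them, so a pair scores highly only
    when each pixel is among the other's best matches."""
    p_row = torch.softmax(s1, dim=2)  # normalized over gallery locations
    p_col = torch.softmax(s1, dim=1)  # normalized over query locations
    return p_row * p_col              # S2: mutually consistent similarity
```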
4. Query-Adaptive Convolution and Dynamic Filtering
QAConv-QA adopts a dynamic filter paradigm inspired by the original QAConv formulation (Liao & Shao, ECCV 2020), where each query pixel feature acts as a convolutional filter applied to the gallery feature map:

$$S_{ij} = \hat{f}^{q\top}_i \hat{f}^g_j,$$

where $\hat{f}$ denotes $\ell_2$-normalized pixel features and each $\hat{f}^q_i$ serves as a $1 \times 1$ kernel. This produces the full query-gallery location similarity matrix. Computation is optimized batch-wise using im2col and einsum. After this initial cosine similarity, QAConv-QA applies the pixel-level reweighting and bidirectional consistency constraints described above.
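The dynamic-filter view can be made concrete with a short PyTorch sketch for a single query-gallery pair; the helper name and shapes are illustrative, and in practice the computation is batched via einsum as noted above:

```python
import torch
import torch.nn.functional as F

def qaconv_similarity(fq: torch.Tensor, fg: torch.Tensor) -> torch.Tensor:
    """Dynamic-filter view of QAConv: each unit-normalized query pixel
    feature is used as a 1x1 convolution kernel over the gallery map,
    yielding the full query-gallery location similarity matrix."""
    c, h, w = fq.shape                                # query feature map (C, H, W)
    fq = F.normalize(fq.view(c, -1), dim=0)           # (C, HW) unit-norm pixels
    kernels = fq.t().reshape(h * w, c, 1, 1)          # HW filters of shape (C, 1, 1)
    fg = F.normalize(fg, dim=0).unsqueeze(0)          # (1, C, H, W) gallery map
    sim = F.conv2d(fg, kernels)                       # (1, HW, H, W) responses
    return sim.view(h * w, h * w)                     # (Nq, Ng) cosine similarities
```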
5. Integration with Multi-Modal Fusion and Forward Pass Workflow
The QAConv-QA module relies on multi-modal fusion of RGB and parsing-based features, providing joint representations for matching. The forward pass, as outlined in the implementation, comprises: fused feature extraction, pixel weight calculation, pairwise cosine similarity computation, quality reweighting, dual-direction softmax normalization, computation of bidirectionally consistent similarity, aggregation via Bi-GMP, and final post-processing through batch normalization, MLP, and sigmoid activation. The following summarizes the computational sequence:
```
Fq = fuse_branch(RGB_q, Parse_q)          # fused query features
Fg = fuse_branch(RGB_g, Parse_g)          # fused gallery features
Qq = compute_pixel_weights(ParseMask_q)   # spatial-softmax quality scores
Qg = compute_pixel_weights(ParseMask_g)
S_raw = cosine_similarity(Fq, Fg)         # all query-gallery pixel pairs
S1 = outer(Qq, Qg) * S_raw                # quality reweighting
P_q2g = softmax_over_query_locs(S1)
P_g2q = softmax_over_gallery_locs(S1)
S2 = P_q2g * P_g2q                        # bidirectional consistency
Sagg_q2g = mean_i( max_j S2[i, j] )       # best gallery match per query pixel (mean assumed)
Sagg_g2q = mean_j( max_i S2[i, j] )       # best query match per gallery pixel (mean assumed)
Sagg = (Sagg_q2g + Sagg_g2q) / 2
p = sigmoid(MLP(BN(Sagg)))                # match probability
```
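A hedged PyTorch rendering of the final pooling and post-processing head; the mean over the remaining axis and the hidden width are assumptions where the summary leaves the aggregation operator unspecified:

```python
import torch
import torch.nn as nn

class BiGMPHead(nn.Module):
    """Sketch of the head: pool the consistent similarity S2 in both
    directions, then BN -> MLP -> sigmoid for a match probability."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.bn = nn.BatchNorm1d(1)
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s2: torch.Tensor) -> torch.Tensor:
        # s2: (B, Nq, Ng) bidirectionally consistent similarities
        s_q2g = s2.max(dim=2).values.mean(dim=1)  # best gallery match per query pixel
        s_g2q = s2.max(dim=1).values.mean(dim=1)  # best query match per gallery pixel
        s_agg = 0.5 * (s_q2g + s_g2q)             # (B,) scalar score per pair
        x = self.bn(s_agg.unsqueeze(1))           # (B, 1)
        return torch.sigmoid(self.mlp(x))         # match probability p
```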
6. Supervision and Training Loss Composition
The QA-ReID framework integrates three types of losses:
- Identity classification loss on globally pooled features of each branch,
- Triplet loss operating over these embeddings, and
- Binary cross-entropy matching loss on pixel-level pairwise scores from QAConv-QA.
The total loss is given by:

$$\mathcal{L} = \mathcal{L}_{\mathrm{id}} + \lambda_{\mathrm{tri}}\, \mathcal{L}_{\mathrm{tri}} + \lambda_{\mathrm{bce}}\, \mathcal{L}_{\mathrm{bce}},$$

where $\lambda_{\mathrm{tri}}$ and $\lambda_{\mathrm{bce}}$ balance the triplet and matching terms. This composite objective enforces both global structural identity constraints and fine-grained local alignment under varied clothing.
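A minimal sketch of this composite objective in PyTorch; the loss weights and triplet margin are illustrative assumptions, as the summary does not specify them:

```python
import torch.nn as nn

# Standard instantiations of the three loss terms (hyperparameters assumed).
id_loss_fn = nn.CrossEntropyLoss()              # identity classification
tri_loss_fn = nn.TripletMarginLoss(margin=0.3)  # metric learning on embeddings
bce_loss_fn = nn.BCELoss()                      # pixel-level match probabilities

def total_loss(logits, labels, anchor, pos, neg, match_prob, match_label,
               lambda_tri: float = 1.0, lambda_bce: float = 1.0):
    l_id = id_loss_fn(logits, labels)           # on globally pooled features
    l_tri = tri_loss_fn(anchor, pos, neg)       # on the same embeddings
    l_bce = bce_loss_fn(match_prob, match_label)  # on QAConv-QA scores
    return l_id + lambda_tri * l_tri + lambda_bce * l_bce
```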
7. Performance in Clothes-Changing ReID
On challenging CC-ReID benchmarks—PRCC, LTCC, and VC-Clothes—QA-ReID augmented with QAConv-QA achieves state-of-the-art results under clothing-changing protocols:
| Dataset | Top-1 Gain | mAP Gain |
|---|---|---|
| PRCC | +6.9% | +3.9% |
| LTCC | +0.7% | +1.9% |
| VC-Clothes | +3.0% | +2.8% |
Ablation studies isolate the contributions of the two QAConv-QA blocks: pixel weighting alone yields +1.6% Top-1 (PRCC), bidirectional matching alone +0.7%, with the full combination providing +3.1% improvement.
Visualization of QAConv-QA attention maps demonstrates that the model attends chiefly to identity-stable regions—such as the head and limbs—rather than clothing-variant areas, confirming the intended focus on semantically stable cues (Wang et al., 27 Jan 2026).
In sum, QAConv-QA imparts quality-aware, mutual pixel-level filtering to query-adaptive convolution, crucially advancing robust ReID performance amid drastic clothing transitions through a unified mechanism of feature fusion, spatial weighting, and tightly enforced mutual consistency.