Pyramidal Shapley-Taylor Framework
- The Pyramidal Shapley-Taylor (PST) Learning Framework enables fine-grained motion-language retrieval through hierarchical token compression and pyramidal alignment.
- It leverages multi-stage contrastive learning and joint-segment alignment to improve accuracy in matching detailed human motion with natural language.
- The framework incorporates Shapley-Taylor interaction attribution, providing enhanced interpretability by quantifying specific contributions of motion and text tokens.
The Pyramidal Shapley-Taylor (PST) Learning Framework is a methodology for fine-grained motion-language retrieval that models hierarchical cross-modal alignment between human motion sequences and natural language. It departs from global-centric paradigms by adopting a pyramidal process reflecting human motion perception, leveraging Shapley-Taylor interaction attribution for interpretability and precise alignment. PST combines hierarchical token compression, contrastive learning, and Shapley-Taylor-based interaction distillation, and demonstrates superior retrieval accuracy and interpretability on standard benchmarks (Chen et al., 29 Jan 2026).
1. Hierarchical Motion and Text Decomposition
PST represents a human motion sequence as a set of frames, each comprising $J$ body joints in $d$-dimensional coordinates ($d = 3$ for 3D skeletons). Let $M = \{m_1, \ldots, m_T\}$ denote a sequence of $T$ frames. At the base, the motion is flattened into joint-stage tokens $X_{\mathrm{joint}}$. These tokens are compressed into segment tokens $X_{\mathrm{seg}}$ using a token compressor that integrates convolutional layers, self-attention, and KNN-DPC clustering, with a fixed compression ratio $r$. Segments are pooled further to generate a single global descriptor $m_g$ for the motion.
The textual description is tokenized into word tokens $W_{\mathrm{word}}$, compressed via an analogous token compressor into phrase-stage tokens $W_{\mathrm{phrase}}$, and then pooled into a global text feature $t_g$.
This multi-stage tokenization underpins the pyramidal processing of representation granularity: from joint/word-level, to segment/phrase-level, to global motion/text features.
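The three-stage pyramid can be sketched as follows, with simple mean-pooling standing in for PST's learned token compressor (the function names, dimensions, and the ratio `r=4` are illustrative, not taken from the paper):

```python
import numpy as np

def compress(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a token sequence by mean-pooling groups of `ratio` tokens.
    A stand-in for PST's learned compressor (conv + attention + KNN-DPC)."""
    n, d = tokens.shape
    pad = (-n) % ratio                      # pad so the length divides evenly
    if pad:
        tokens = np.vstack([tokens, np.zeros((pad, d))])
    return tokens.reshape(-1, ratio, d).mean(axis=1)

rng = np.random.default_rng(0)
joint_tokens = rng.standard_normal((64, 256))      # joint-stage motion tokens
segment_tokens = compress(joint_tokens, ratio=4)   # segment stage
global_desc = segment_tokens.mean(axis=0)          # pooled global descriptor

print(joint_tokens.shape, segment_tokens.shape, global_desc.shape)
# → (64, 256) (16, 256) (256,)
```

The same three-level reduction is applied on the text side (words → phrases → global feature).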
2. Shapley-Taylor Interaction Attribution
Central to PST is the quantification of fine-grained cross-modal interactions using the Shapley-Taylor interaction (STI) index [Sundararajan et al., 2020]. For each pair of motion and text tokens $(i, j)$, the second-order ($k = 2$) Shapley-Taylor interaction index is computed as the permutation-averaged contribution of jointly including token $i$ (from motion) and token $j$ (from text) to the final retrieval scoring function $v$. For $n$ tokens in total and each permutation $\pi$, let $T_\pi$ denote the set of tokens preceding both $i$ and $j$ in $\pi$; then

$$\Phi_{ij} = \frac{2}{n}\,\mathbb{E}_{\pi}\Big[v(T_\pi \cup \{i, j\}) - v(T_\pi \cup \{i\}) - v(T_\pi \cup \{j\}) + v(T_\pi)\Big].$$
Direct computation is intractable; thus, PST introduces an STI Estimation Head that approximates $\Phi_{ij}$ via Monte-Carlo sampling and is trained by minimizing a distributional distillation loss

$$\mathcal{L}_{\mathrm{STI}} = \mathrm{KL}\big(\sigma(\Phi)\,\|\,\sigma(\hat{\Phi})\big),$$

where $\sigma(\Phi)$ and $\sigma(\hat{\Phi})$ denote the softmax distributions of the true and predicted STI values, respectively, facilitating efficient end-to-end learning of interaction attributions between fine-level tokens.
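A minimal Monte-Carlo estimator of the second-order index illustrates the permutation-average definition; here `v` is a toy value function (not PST's trained retrieval score), and the player count and sample budget are illustrative:

```python
import numpy as np

def sti_pair(v, n, i, j, num_perms=2000, seed=0):
    """Monte-Carlo estimate of the second-order Shapley-Taylor interaction
    between players i and j of an n-player set function v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_perms):
        pi = rng.permutation(n)
        pos = {int(p): k for k, p in enumerate(pi)}
        cut = min(pos[i], pos[j])              # T: players preceding BOTH i and j
        T = frozenset(int(p) for p in pi[:cut])
        # discrete second-order derivative of v at T with respect to {i, j}
        total += v(T | {i, j}) - v(T | {i}) - v(T | {j}) + v(T)
    return (2.0 / n) * total / num_perms

# Toy value function: only the pair (0, 1) carries a positive interaction.
v = lambda S: 1.0 if {0, 1} <= set(S) else 0.0
print(sti_pair(v, n=4, i=0, j=1))  # → 0.5, i.e. (2/n) * 1 for n = 4
```

In PST this sampling is only used to produce supervision targets; the trained STI Head amortizes the estimate in a single forward pass.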
3. Pyramidal Multi-Level Alignment Strategy
PST's core mechanism is structured as a three-stage pyramid, each supervised with local contrastive objectives and interaction distillation:
- Joint-Wise Alignment: At the lowest pyramid level, joint-stage motion tokens and word tokens are compared. Pairwise cosine similarities are computed after projection, and an InfoNCE contrastive loss is applied. An STI distillation loss ($\mathcal{L}_{\mathrm{STI}}^{\mathrm{joint}}$) further aligns the STI Head with true Shapley-Taylor indices at this local scale.
- Segment-Wise Alignment: Tokens are compressed into segments/phrases. Segment-stage similarities are calculated, and InfoNCE loss and STI distillation are again applied. A consistency loss $\mathcal{L}_{\mathrm{KD}}$ between the similarity distributions of adjacent levels enforces knowledge distillation between levels.
- Holistic Alignment: At the top, segment tokens are pooled to global motion/text descriptors, and the global similarity enters an InfoNCE loss. STI distillation is not performed at this level.
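The fine-grained comparison at the lower levels amounts to a token-pair cosine-similarity matrix aggregated into one score per motion-text pair. A minimal sketch, using max-over-text then mean-over-motion as an illustrative aggregation (the paper's exact pooling may differ):

```python
import numpy as np

def fine_grained_score(motion_tokens, text_tokens):
    """Aggregate token-level cosine similarities into one alignment score."""
    m = motion_tokens / np.linalg.norm(motion_tokens, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    sim = m @ t.T                      # (num_motion_tokens, num_text_tokens)
    return sim.max(axis=1).mean()      # best-matching word per joint, averaged

rng = np.random.default_rng(1)
motion = rng.standard_normal((16, 64))
text = rng.standard_normal((8, 64))
aligned = fine_grained_score(motion, motion[:8])  # tokens matching themselves
random_ = fine_grained_score(motion, text)        # unrelated tokens
print(aligned > random_)
```

Scores of this form feed the level-wise InfoNCE objectives, so the contrastive gradient reaches individual joint and word tokens rather than only the global descriptors.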
At each level, token transformation and compression leverage the token compressor stack—convolutional layers, LayerNorm, multi-head self-attention, KNN-DPC clustering, and further self-attention—to efficiently represent increasing abstraction.
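A minimal density-peaks clustering sketch in the KNN-DPC style used by the compressor: density from k-nearest-neighbor distances, separation $\delta$ as the distance to the nearest denser point, and cluster centers chosen by high density × separation. All parameters and the synthetic data are illustrative:

```python
import numpy as np

def knn_dpc(tokens, k=5, n_clusters=4):
    """Density-peaks clustering: kNN-based density rho, separation delta,
    centers = top rho*delta, remaining tokens follow the nearest center."""
    n = len(tokens)
    dist = np.linalg.norm(tokens[:, None] - tokens[None, :], axis=-1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]          # k nearest, excluding self
    rho = np.exp(-knn.mean(axis=1))                  # closer neighbors => denser
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    centers = np.argsort(rho * delta)[-n_clusters:]  # dense AND well separated
    labels = centers[np.argmin(dist[:, centers], axis=1)]
    return labels, centers

# Four well-separated token blobs should recover four clusters.
rng = np.random.default_rng(2)
tokens = np.vstack([rng.normal(c, 0.1, (16, 8)) for c in (0.0, 3.0, 6.0, 9.0)])
labels, centers = knn_dpc(tokens, k=5, n_clusters=4)
print(len(np.unique(labels)))  # → 4
```

In PST the cluster assignments drive the token-merging step, so temporally or semantically coherent tokens collapse into a single segment/phrase token.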
4. Objective Function and Optimization
The total loss for PST integrates multiple objectives:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{joint}} + \lambda_2 \mathcal{L}_{\mathrm{seg}} + \lambda_3 \mathcal{L}_{\mathrm{global}} + \mu_1 \mathcal{L}_{\mathrm{STI}}^{\mathrm{joint}} + \mu_2 \mathcal{L}_{\mathrm{STI}}^{\mathrm{seg}} + \gamma \mathcal{L}_{\mathrm{KD}},$$

where $\lambda_1, \lambda_2, \lambda_3$ control the pyramid-level weights, $\mu_1, \mu_2$ weight the STI losses, and $\gamma$ the knowledge distillation. Each contrastive loss adopts the InfoNCE formulation, measuring cross-modal retrieval accuracy within a batch of $B$ pairs:

$$\mathcal{L}_{\mathrm{NCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(s(m_i, t_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(s(m_i, t_j)/\tau\big)},$$

where $s(\cdot, \cdot)$ is the cosine similarity and $\tau$ is the temperature.
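The batch InfoNCE term can be sketched as follows (motion→text direction only; the temperature value and feature dimensions are illustrative):

```python
import numpy as np

def info_nce(motion_feats, text_feats, tau=0.07):
    """Batch InfoNCE: matched motion-text pairs sit on the diagonal."""
    m = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (m @ t.T) / tau                            # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                   # motion -> text

rng = np.random.default_rng(3)
feats = rng.standard_normal((8, 32))
perfect = info_nce(feats, feats)                        # identical pairs
mismatch = info_nce(feats, rng.standard_normal((8, 32)))
print(perfect < mismatch)  # → True
```

In practice a symmetric variant (averaging the motion→text and text→motion directions) is common for retrieval training.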
5. Model Architecture Components
PST comprises distinct architectural modules:
| Module | Architecture Description |
|---|---|
| Motion Encoder | Vision Transformer (ViT) on spatio-temporal MotionPatches |
| Text Encoder | DistilBERT for contextual word embeddings |
| Projection Head | Two-layer MLP (GeLU activation) to unify token feature space |
| Token Compressor | Conv(3×1) → LayerNorm → Self-Attention → KNN-DPC clustering → (repeat) |
| STI Estimation Head | Conv(3×3) → ReLU → Self-Attention → Residual → Conv(3×3) → ReLU |
The repeated application of the token compressor enables hierarchical abstraction. The STI Head is trained to predict Shapley-Taylor indices for pairs of tokens.
6. Empirical Performance
On standard motion-language retrieval benchmarks, PST demonstrates superior performance compared to prior state-of-the-art approaches using only global alignment or limited part awareness. On the HumanML3D dataset under the "All" protocol, Text→Motion Recall@1 improves from 10.80 to 12.45, and Motion→Text Recall@1 from 71.61 to 76.15 compared to MotionPatch. On KIT-ML, PST achieves Recall@1 of 16.01 (Text→Motion) versus 14.02 and Recall@1 of 56.83 (Motion→Text) versus 53.55. These improvements are consistent across various batch protocols and across broader metrics including Recall@2, @3, @5, @10, and MedR (Chen et al., 29 Jan 2026).
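Recall@K and median rank can be computed from a query-by-gallery similarity matrix as below (a generic evaluation sketch, not the paper's code; query $i$'s ground-truth item is assumed to be gallery item $i$):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K and median rank from a (queries x gallery) similarity matrix,
    assuming query i's ground-truth item is gallery item i."""
    order = np.argsort(-sim, axis=1)                                 # best first
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1   # 1-based
    return {f"R@{k}": float((ranks <= k).mean()) for k in ks} | {
        "MedR": float(np.median(ranks))}

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.3],
                [0.7, 0.6, 0.5]])   # query 2's true item is ranked last
print(retrieval_metrics(sim, ks=(1, 2)))
```

Here two of three queries rank their match first, so R@1 = R@2 = 2/3 and MedR = 1.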
7. Interpretability via Shapley-Taylor Attribution
PST's explicit computation of Shapley-Taylor indices provides intrinsic interpretability. High interaction values indicate strong correspondence between specific joint movements and linguistic tokens (e.g., "right knee bends" and "kneels"), both at single-joint/word and segment/phrase scales. Segment-level attributions reveal the mapping between joint clusters and multi-word expressions. Visual heatmaps of the interaction scores elucidate the temporal and spatial loci of model attention, enabling fine-grained diagnostic analyses and insight into the model's reasoning. This attribution-based transparency is absent from purely global contrastive frameworks, positioning PST as both empirically strong and interpretable (Chen et al., 29 Jan 2026).