Pyramidal Shapley-Taylor Framework
- The Pyramidal Shapley-Taylor (PST) Learning Framework enables fine-grained motion-language retrieval through hierarchical token compression and pyramidal alignment.
- It leverages multi-stage contrastive learning and joint-segment alignment to improve accuracy in matching detailed human motion with natural language.
- The framework incorporates Shapley-Taylor interaction attribution, providing enhanced interpretability by quantifying specific contributions of motion and text tokens.
The Pyramidal Shapley-Taylor (PST) Learning Framework is a methodology for fine-grained motion-language retrieval that models hierarchical cross-modal alignment between human motion sequences and natural language. It departs from global-centric paradigms by adopting a pyramidal process reflecting human motion perception, leveraging Shapley-Taylor interaction attribution for interpretability and precise alignment. PST combines hierarchical token compression, contrastive learning, and Shapley-Taylor-based interaction distillation, and demonstrates superior retrieval accuracy and interpretability on standard benchmarks (Chen et al., 29 Jan 2026).
1. Hierarchical Motion and Text Decomposition
PST represents a human motion sequence as a set of frames, each comprising $J$ body joints in $d$-dimensional coordinates ($d = 3$ for 3D skeletons). Let $M = \{m_1, \ldots, m_T\}$ denote a sequence of $T$ frames. At the base, the motion is flattened into joint-stage tokens $X_{\mathrm{joint}}$. These tokens are compressed into segment tokens $X_{\mathrm{seg}}$ using a token compressor that integrates convolutional layers, self-attention, and KNN-DPC clustering, with a fixed compression ratio $r$. Segments are pooled further to generate a single global descriptor $m_g$ for the motion.
The textual description is tokenized into word tokens $W_{\mathrm{word}}$, compressed via an analogous token compressor into phrase-stage tokens $W_{\mathrm{phrase}}$, and then pooled into a global text feature $t_g$.
This multi-stage tokenization underpins the pyramidal processing of representation granularity: from joint/word-level, to segment/phrase-level, to global motion/text features.
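The three-stage pyramid can be sketched as follows, with simple mean-pooling standing in for PST's learned token compressor (the function names, dimensions, and the ratio `r=4` are illustrative, not taken from the paper):

```python
import numpy as np

def compress(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a token sequence by mean-pooling groups of `ratio` tokens.
    A stand-in for PST's learned compressor (conv + attention + KNN-DPC)."""
    n, d = tokens.shape
    pad = (-n) % ratio                      # pad so the length divides evenly
    if pad:
        tokens = np.vstack([tokens, np.zeros((pad, d))])
    return tokens.reshape(-1, ratio, d).mean(axis=1)

rng = np.random.default_rng(0)
joint_tokens = rng.standard_normal((64, 256))      # joint-stage motion tokens
segment_tokens = compress(joint_tokens, ratio=4)   # segment stage
global_desc = segment_tokens.mean(axis=0)          # pooled global descriptor

print(joint_tokens.shape, segment_tokens.shape, global_desc.shape)
# → (64, 256) (16, 256) (256,)
```

The same three-level reduction is applied on the text side (words → phrases → global feature).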
2. Shapley-Taylor Interaction Attribution
Central to PST is the quantification of fine-grained cross-modal interactions using the Shapley-Taylor interaction (STI) index [Sundararajan et al., 2020]. For each pair of motion and text tokens $(i, j)$, the second-order ($k = 2$) Shapley-Taylor interaction index is computed as the permutation-averaged contribution of jointly including token $i$ (from motion) and token $j$ (from text) to the final retrieval scoring function $v$. For $n$ tokens in total and each permutation $\pi$, let $T_\pi$ denote the set of tokens preceding both $i$ and $j$ in $\pi$; then

$$\Phi_{ij} = \frac{2}{n}\,\mathbb{E}_{\pi}\Big[v(T_\pi \cup \{i, j\}) - v(T_\pi \cup \{i\}) - v(T_\pi \cup \{j\}) + v(T_\pi)\Big].$$
Direct computation is intractable; thus, PST introduces an STI Estimation Head that approximates $\Phi_{ij}$ via Monte-Carlo sampling and is trained by minimizing a distributional distillation loss

$$\mathcal{L}_{\mathrm{STI}} = \mathrm{KL}\big(\sigma(\Phi)\,\|\,\sigma(\hat{\Phi})\big),$$

where $\sigma(\Phi)$ and $\sigma(\hat{\Phi})$ denote the softmax distributions of the true and predicted STI values, respectively, facilitating efficient end-to-end learning of interaction attributions between fine-level tokens.
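A minimal Monte-Carlo estimator of the second-order index illustrates the permutation-average definition; here `v` is a toy value function (not PST's trained retrieval score), and the player count and sample budget are illustrative:

```python
import numpy as np

def sti_pair(v, n, i, j, num_perms=2000, seed=0):
    """Monte-Carlo estimate of the second-order Shapley-Taylor interaction
    between players i and j of an n-player set function v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_perms):
        pi = rng.permutation(n)
        pos = {int(p): k for k, p in enumerate(pi)}
        cut = min(pos[i], pos[j])              # T: players preceding BOTH i and j
        T = frozenset(int(p) for p in pi[:cut])
        # discrete second-order derivative of v at T with respect to {i, j}
        total += v(T | {i, j}) - v(T | {i}) - v(T | {j}) + v(T)
    return (2.0 / n) * total / num_perms

# Toy value function: only the pair (0, 1) carries a positive interaction.
v = lambda S: 1.0 if {0, 1} <= set(S) else 0.0
print(sti_pair(v, n=4, i=0, j=1))  # → 0.5, i.e. (2/n) * 1 for n = 4
```

In PST this sampling is only used to produce supervision targets; the trained STI Head amortizes the estimate in a single forward pass.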
3. Pyramidal Multi-Level Alignment Strategy
PST's core mechanism is structured as a three-stage pyramid, each supervised with local contrastive objectives and interaction distillation:
- Joint-Wise Alignment: At the lowest pyramid level, joint-stage motion tokens and word tokens are compared. Pairwise cosine similarities are computed after projection, and an InfoNCE contrastive loss is applied. An STI distillation loss ($\mathcal{L}_{\mathrm{STI}}^{\mathrm{joint}}$) further aligns the STI Head with true Shapley-Taylor indices at this local scale.
- Segment-Wise Alignment: Tokens are compressed into segments/phrases. Segment-stage similarities are calculated, and InfoNCE loss and STI distillation are again applied. A consistency loss $\mathcal{L}_{\mathrm{KD}}$ between the similarity distributions of adjacent levels enforces knowledge distillation between levels.
- Holistic Alignment: At the top, segment tokens are pooled to global motion/text descriptors, and the global similarity enters an InfoNCE loss. STI distillation is not performed at this level.
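The fine-grained comparison at the lower levels amounts to a token-pair cosine-similarity matrix aggregated into one score per motion-text pair. A minimal sketch, using max-over-text then mean-over-motion as an illustrative aggregation (the paper's exact pooling may differ):

```python
import numpy as np

def fine_grained_score(motion_tokens, text_tokens):
    """Aggregate token-level cosine similarities into one alignment score."""
    m = motion_tokens / np.linalg.norm(motion_tokens, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    sim = m @ t.T                      # (num_motion_tokens, num_text_tokens)
    return sim.max(axis=1).mean()      # best-matching word per joint, averaged

rng = np.random.default_rng(1)
motion = rng.standard_normal((16, 64))
text = rng.standard_normal((8, 64))
aligned = fine_grained_score(motion, motion[:8])  # tokens matching themselves
random_ = fine_grained_score(motion, text)        # unrelated tokens
print(aligned > random_)
```

Scores of this form feed the level-wise InfoNCE objectives, so the contrastive gradient reaches individual joint and word tokens rather than only the global descriptors.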
At each level, token transformation and compression leverage the token compressor stack—convolutional layers, LayerNorm, multi-head self-attention, KNN-DPC clustering, and further self-attention—to efficiently represent increasing abstraction.
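A minimal density-peaks clustering sketch in the KNN-DPC style used by the compressor: density from k-nearest-neighbor distances, separation $\delta$ as the distance to the nearest denser point, and cluster centers chosen by high density × separation. All parameters and the synthetic data are illustrative:

```python
import numpy as np

def knn_dpc(tokens, k=5, n_clusters=4):
    """Density-peaks clustering: kNN-based density rho, separation delta,
    centers = top rho*delta, remaining tokens follow the nearest center."""
    n = len(tokens)
    dist = np.linalg.norm(tokens[:, None] - tokens[None, :], axis=-1)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]          # k nearest, excluding self
    rho = np.exp(-knn.mean(axis=1))                  # closer neighbors => denser
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    centers = np.argsort(rho * delta)[-n_clusters:]  # dense AND well separated
    labels = centers[np.argmin(dist[:, centers], axis=1)]
    return labels, centers

# Four well-separated token blobs should recover four clusters.
rng = np.random.default_rng(2)
tokens = np.vstack([rng.normal(c, 0.1, (16, 8)) for c in (0.0, 3.0, 6.0, 9.0)])
labels, centers = knn_dpc(tokens, k=5, n_clusters=4)
print(len(np.unique(labels)))  # → 4
```

In PST the cluster assignments drive the token-merging step, so temporally or semantically coherent tokens collapse into a single segment/phrase token.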
4. Objective Function and Optimization
The total loss for PST integrates multiple objectives:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{joint}} + \lambda_2 \mathcal{L}_{\mathrm{seg}} + \lambda_3 \mathcal{L}_{\mathrm{global}} + \mu_1 \mathcal{L}_{\mathrm{STI}}^{\mathrm{joint}} + \mu_2 \mathcal{L}_{\mathrm{STI}}^{\mathrm{seg}} + \gamma \mathcal{L}_{\mathrm{KD}},$$

where $\lambda_1, \lambda_2, \lambda_3$ control the pyramid-level weights, $\mu_1, \mu_2$ weight the STI losses, and $\gamma$ the knowledge distillation. Each contrastive loss adopts the InfoNCE formulation, measuring cross-modal retrieval accuracy within a batch of $B$ pairs:

$$\mathcal{L}_{\mathrm{NCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\!\big(s(m_i, t_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(s(m_i, t_j)/\tau\big)},$$

where $s(\cdot, \cdot)$ is the cosine similarity and $\tau$ is the temperature.
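The batch InfoNCE term can be sketched as follows (motion→text direction only; the temperature value and feature dimensions are illustrative):

```python
import numpy as np

def info_nce(motion_feats, text_feats, tau=0.07):
    """Batch InfoNCE: matched motion-text pairs sit on the diagonal."""
    m = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (m @ t.T) / tau                            # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                   # motion -> text

rng = np.random.default_rng(3)
feats = rng.standard_normal((8, 32))
perfect = info_nce(feats, feats)                        # identical pairs
mismatch = info_nce(feats, rng.standard_normal((8, 32)))
print(perfect < mismatch)  # → True
```

In practice a symmetric variant (averaging the motion→text and text→motion directions) is common for retrieval training.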
5. Model Architecture Components
PST comprises distinct architectural modules:
| Module | Architecture Description |
|---|---|
| Motion Encoder | Vision Transformer (ViT) on spatio-temporal MotionPatches |
| Text Encoder | DistilBERT for contextual word embeddings |
| Projection Head | Two-layer MLP (GeLU activation) to unify token feature space |
| Token Compressor | Conv(3×1) → LayerNorm → Self-Attention → KNN-DPC clustering → (repeat) |
| STI Estimation Head | Conv(3×3) → ReLU → Self-Attention → Residual → Conv(3×3) → ReLU |
The repeated application of the token compressor enables hierarchical abstraction. The STI Head is trained to predict Shapley-Taylor indices for pairs of tokens.
6. Empirical Performance
On standard motion-language retrieval benchmarks, PST demonstrates superior performance compared to prior state-of-the-art approaches using only global alignment or limited part awareness. On the HumanML3D dataset under the "All" protocol, Text→Motion Recall@1 improves from 10.80 to 12.45, and Motion→Text Recall@1 from 71.61 to 76.15 compared to MotionPatch. On KIT-ML, PST achieves Recall@1 of 16.01 (Text→Motion) versus 14.02 and Recall@1 of 56.83 (Motion→Text) versus 53.55. These improvements are consistent across various batch protocols and across broader metrics including Recall@2, @3, @5, @10, and MedR (Chen et al., 29 Jan 2026).
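Recall@K and median rank can be computed from a query-by-gallery similarity matrix as below (a generic evaluation sketch, not the paper's code; query $i$'s ground-truth item is assumed to be gallery item $i$):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K and median rank from a (queries x gallery) similarity matrix,
    assuming query i's ground-truth item is gallery item i."""
    order = np.argsort(-sim, axis=1)                                 # best first
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1   # 1-based
    return {f"R@{k}": float((ranks <= k).mean()) for k in ks} | {
        "MedR": float(np.median(ranks))}

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.3],
                [0.7, 0.6, 0.5]])   # query 2's true item is ranked last
print(retrieval_metrics(sim, ks=(1, 2)))
```

Here two of three queries rank their match first, so R@1 = R@2 = 2/3 and MedR = 1.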
7. Interpretability via Shapley-Taylor Attribution
PST's explicit computation of Shapley-Taylor indices provides intrinsic interpretability. High interaction values indicate strong correspondence between specific joint movements and linguistic tokens (e.g., "right knee bends" and "kneels"), both at single-joint/word and segment/phrase scales. Segment-level attributions reveal the mapping between joint clusters and multi-word expressions. Visual heatmaps of the interaction scores elucidate the temporal and spatial loci of model attention, enabling fine-grained diagnostic analyses and insight into the model's reasoning. This attribution-based transparency is absent from purely global contrastive frameworks, positioning PST as both empirically strong and interpretable (Chen et al., 29 Jan 2026).