
Pyramidal Shapley-Taylor Framework

Updated 5 February 2026
  • Pyramidal Shapley-Taylor Learning Framework is a method that enables fine-grained motion-language retrieval through hierarchical token compression and pyramidal alignment.
  • It leverages multi-stage contrastive learning and joint-segment alignment to improve accuracy in matching detailed human motion with natural language.
  • The framework incorporates Shapley-Taylor interaction attribution, providing enhanced interpretability by quantifying specific contributions of motion and text tokens.

The Pyramidal Shapley-Taylor (PST) Learning Framework is a methodology for fine-grained motion-language retrieval that models hierarchical cross-modal alignment between human motion sequences and natural language. It departs from global-centric paradigms by adopting a pyramidal process reflecting human motion perception, leveraging Shapley-Taylor interaction attribution for interpretability and precise alignment. PST combines hierarchical token compression, contrastive learning, and Shapley-Taylor-based interaction distillation, and demonstrates superior retrieval accuracy and interpretability on standard benchmarks (Chen et al., 29 Jan 2026).

1. Hierarchical Motion and Text Decomposition

PST represents a human motion sequence as a set of frames, each comprising $J$ body joints in $C$-dimensional coordinates ($C=3$ for 3D skeletons). Let $M = \{ m_t \in \mathbb{R}^{J \times C} \}_{t=1}^{L}$ denote a sequence of $L$ frames. At the base, the motion is flattened into $N_{int} = J \cdot L$ joint-stage tokens $\mathcal{J} = \{ j_k \}_{k=1}^{N_{int}}$. These tokens are compressed into $N_{sgm}$ segment tokens $\mathcal{S} = \{ s_i \}_{i=1}^{N_{sgm}}$ using a token compressor that integrates convolutional layers, self-attention, and KNN-DPC clustering, with compression ratio $p = N_{sgm} / N_{int}$ (typically $p = 0.25$). Segments are pooled further to generate a single global descriptor $g$ for the motion.

The textual description $t$ is tokenized into word tokens $T_{ont} = \{ w_i \}_i$, compressed via an analogous token compressor into phrase-stage tokens $P_{sgm} = \{ p_j \}_j$, and then pooled into a global text feature $h$.

This multi-stage tokenization underpins the pyramidal processing of representation granularity: from joint/word-level, to segment/phrase-level, to global motion/text features.
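The token-count arithmetic of this hierarchy can be illustrated with a short sketch; the joint count $J$ and frame count $L$ below are hypothetical example values, while the compression ratio $p = 0.25$ follows the text:

```python
# Illustrative token counts for the three PST pyramid stages.
# J and L are assumed example values; p = N_sgm / N_int is from the text.
J, L = 22, 196          # joints per frame, frames per sequence (hypothetical)
p = 0.25                # compression ratio of the token compressor

N_int = J * L           # joint-stage tokens after flattening
N_sgm = int(p * N_int)  # segment-stage tokens after compression
N_hlt = 1               # single pooled global descriptor g

print(N_int, N_sgm, N_hlt)  # -> 4312 1078 1
```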

2. Shapley-Taylor Interaction Attribution

Central to PST is the quantification of fine-grained cross-modal interactions using the Shapley-Taylor interaction (STI) index [Sundararajan et al., 2020]. For each pair of motion and text tokens $(e_m, e_t)$, the second-order ($r=2$) Shapley-Taylor interaction index $\phi_{(i,j)}$ is computed as the permutation-averaged contribution of jointly including tokens $i$ (from motion) and $j$ (from text) to the final retrieval scoring function $F(\cdot)$. Mathematically, for $N$ tokens in total and for each permutation $\pi$, define $S_{<}$ as the set of tokens preceding both $i$ and $j$; then,

$$\phi_{(i,j)} = \mathbb{E}_{\pi} \left[ F(S_{<} \cup \{i,j\}) - F(S_{<} \cup \{i\}) - F(S_{<} \cup \{j\}) + F(S_{<}) \right]$$

Direct computation is intractable; thus, PST introduces an STI Estimation Head $H$ that approximates $\phi_{(i,j)}$ via Monte-Carlo sampling and is trained by minimizing

$$L_{STI} = \mathrm{KL}\left[D^{m \rightarrow t} \parallel \hat{D}^{m \rightarrow t}\right] + \mathrm{KL}\left[D^{t \rightarrow m} \parallel \hat{D}^{t \rightarrow m}\right]$$

where $D$ and $\hat{D}$ denote the softmax distributions of true and predicted $\phi$ values, respectively, facilitating efficient end-to-end learning of interaction attributions between fine-level tokens.
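The permutation expectation above admits a straightforward Monte-Carlo estimator. The sketch below is illustrative only: the generic set-function interface `F`, the sample count, and the toy scoring function are assumptions, not the paper's implementation.

```python
import numpy as np

def sti_pair_mc(F, n, i, j, n_perm=300, seed=0):
    """Monte-Carlo estimate of the second-order Shapley-Taylor index
    phi_(i,j): average, over random permutations of n tokens, of
    F(S< u {i,j}) - F(S< u {i}) - F(S< u {j}) + F(S<), where S< is the
    set of tokens preceding both i and j in the permutation."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_perm):
        perm = list(rng.permutation(n))
        cut = min(perm.index(i), perm.index(j))        # position of first of i, j
        S = frozenset(int(k) for k in perm[:cut])      # tokens preceding both
        total += F(S | {i, j}) - F(S | {i}) - F(S | {j}) + F(S)
    return total / n_perm

# Toy scoring function (hypothetical): additive token values plus a fixed
# bonus when tokens 0 and 1 co-occur; the pairwise STI recovers that bonus.
vals = [0.5, 1.0, 0.2, 0.3, 0.4]
F = lambda S: sum(vals[k] for k in S) + (2.0 if {0, 1} <= S else 0.0)
phi_01 = sti_pair_mc(F, n=5, i=0, j=1)  # -> 2.0 (exact for this F)
```

For a purely additive $F$ the discrete second difference vanishes, so $\phi_{(i,j)}$ isolates exactly the non-additive interaction between the pair.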

3. Pyramidal Multi-Level Alignment Strategy

PST's core mechanism is structured as a three-stage pyramid, each supervised with local contrastive objectives and interaction distillation:

  • Joint-Wise Alignment: At the lowest pyramid level, joint-stage motion tokens and word tokens are compared. Pairwise cosine similarities $s_{ij}^{(int)}$ are computed after projection, and an InfoNCE contrastive loss $L_c^{(int)}$ is applied. STI distillation ($L_{STI}^{(int)}$) further aligns the STI Head with true Shapley-Taylor indices at this local scale.
  • Segment-Wise Alignment: Tokens are compressed into segments/phrases. Segment-stage similarities $s_{ij}^{(sgm)}$ are calculated; InfoNCE loss $L_c^{(sgm)}$ and STI distillation $L_{STI}^{(sgm)}$ are used. A consistency loss

$$L_{KD} = \mathrm{KL}\left[\mathrm{Softmax}(s^{(int)}/\tau) \parallel \mathrm{Softmax}(s^{(sgm)}/\tau)\right]$$

enforces knowledge distillation between levels.

  • Holistic Alignment: At the top, segment tokens are pooled to global motion/text descriptors; the global similarity $s^{(hlt)}$ is used in $L_c^{(hlt)}$. STI distillation is not performed at this level.

At each level, token transformation and compression leverage the token compressor stack—convolutional layers, LayerNorm, multi-head self-attention, KNN-DPC clustering, and further self-attention—to efficiently represent increasing abstraction.
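The cross-level consistency term $L_{KD}$ can be sketched as a row-wise KL divergence between temperature-scaled softmax distributions of the two similarity matrices. This numpy sketch assumes batch-by-batch $(B \times B)$ similarity matrices at the two levels and an illustrative temperature value:

```python
import numpy as np

def softmax(x, tau=1.0):
    """Row-wise softmax with temperature tau (numerically stabilized)."""
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def l_kd(s_int, s_sgm, tau=0.1):
    """Consistency loss from the text: KL[Softmax(s_int/tau) || Softmax(s_sgm/tau)],
    averaged over rows of the (B, B) similarity matrices."""
    p = softmax(s_int, tau)
    q = softmax(s_sgm, tau)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean())
```

The loss is zero when the two levels induce identical matching distributions and grows as the segment-level similarities drift from the joint-level ones.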

4. Objective Function and Optimization

The total loss for PST integrates multiple objectives:

$$L_{total} = w_{int} \left[ L_c^{(int)} + \lambda_{int} L_{STI}^{(int)} \right] + w_{sgm} \left[ L_c^{(sgm)} + \lambda_{sgm} L_{STI}^{(sgm)} + \mu L_{KD} \right] + w_{hlt} \, L_c^{(hlt)}$$

where $w_{int}$, $w_{sgm}$, and $w_{hlt}$ control the pyramid-level weights, $\lambda_{int}$ and $\lambda_{sgm}$ weight the STI losses, and $\mu$ weights the knowledge distillation. Each contrastive loss $L_c$ adopts the symmetric InfoNCE formulation, measuring cross-modal retrieval accuracy within a batch of size $B$:

$$L_c = - \sum_{i=1}^{B} \left[ \log \frac{ \exp(s(t_i, m_i)/T) }{ \sum_{j=1}^{B} \exp(s(t_i, m_j)/T) } + \log \frac{ \exp(s(t_i, m_i)/T) }{ \sum_{k=1}^{B} \exp(s(t_k, m_i)/T) } \right]$$

where $T$ is the temperature.
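The symmetric InfoNCE objective can be illustrated with a minimal numpy sketch, assuming a precomputed batch similarity matrix (this is not the paper's code, and the temperature value is illustrative):

```python
import numpy as np

def info_nce(sim, T=0.07):
    """Symmetric InfoNCE over a batch similarity matrix sim[i, j] = s(t_i, m_j).
    Sums the text->motion (row softmax) and motion->text (column softmax)
    negative log-likelihoods of the matched diagonal pairs."""
    logits = sim / T
    # numerically stable log-softmax over rows and over columns
    row = logits - logits.max(axis=1, keepdims=True)
    row_ls = row - np.log(np.exp(row).sum(axis=1, keepdims=True))
    col = logits - logits.max(axis=0, keepdims=True)
    col_ls = col - np.log(np.exp(col).sum(axis=0, keepdims=True))
    d = np.arange(sim.shape[0])
    return float(-(row_ls[d, d] + col_ls[d, d]).sum())
```

The loss approaches zero when each matched pair dominates both its row and its column of the similarity matrix, and is maximal when similarities are uninformative.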

5. Model Architecture Components

PST comprises distinct architectural modules:

  • Motion Encoder: Vision Transformer (ViT) on spatio-temporal MotionPatches
  • Text Encoder: DistilBERT for contextual word embeddings
  • Projection Head: two-layer MLP (GeLU activation) to unify the token feature space
  • Token Compressor: Conv(3×1) → LayerNorm → Self-Attention → KNN-DPC clustering → (repeat)
  • STI Estimation Head: Conv(3×3) → ReLU → Self-Attention → Residual → Conv(3×3) → ReLU

The repeated application of the token compressor enables hierarchical abstraction. The STI Head is trained to predict Shapley-Taylor indices for pairs of tokens.
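The clustering stage of the token compressor can be sketched in simplified form. The sketch below substitutes a density-peak-style selection (density from $k$-nearest-neighbor distances, densest tokens as cluster centers) for full KNN-DPC and omits the convolution and self-attention stages, so it is a structural illustration under stated assumptions, not the paper's module:

```python
import numpy as np

def compress_tokens(X, p=0.25, k=5):
    """Simplified clustering stage of the token compressor: keep the
    ceil(p*N) locally densest tokens as cluster peaks (a KNN-DPC-flavored
    stand-in), assign every token to its nearest peak, and mean-pool each
    cluster into one segment token. X: (N, dim) token features."""
    N = X.shape[0]
    n_seg = max(1, int(np.ceil(p * N)))
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # (N, N) pairwise distances
    knn = np.sort(d, axis=1)[:, 1:k + 1]                   # k nearest (excluding self)
    density = 1.0 / (knn.mean(axis=1) + 1e-8)              # local density proxy
    peaks = np.argsort(-density)[:n_seg]                   # densest tokens as peaks
    assign = peaks[np.argmin(d[:, peaks], axis=1)]         # nearest-peak assignment
    return np.stack([X[assign == pk].mean(axis=0) for pk in peaks])
```

With $p = 0.25$, a set of 16 joint-stage tokens compresses to 4 segment tokens, matching the compression ratio defined earlier.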

6. Empirical Performance

On standard motion-language retrieval benchmarks, PST demonstrates superior performance compared to prior state-of-the-art approaches using only global alignment or limited part awareness. On the HumanML3D dataset under the "All" protocol, Text→Motion Recall@1 improves from 10.80 to 12.45, and Motion→Text Recall@1 from 71.61 to 76.15 compared to MotionPatch. On KIT-ML, PST achieves Recall@1 of 16.01 (Text→Motion) versus 14.02 and Recall@1 of 56.83 (Motion→Text) versus 53.55. These improvements are consistent across various batch protocols and across broader metrics including Recall@2, @3, @5, @10, and MedR (Chen et al., 29 Jan 2026).

7. Interpretability via Shapley-Taylor Attribution

PST's explicit computation of Shapley-Taylor indices $\phi_{(i,j)}$ provides intrinsic interpretability. High $\phi$ values indicate strong correspondence between specific joint movements and linguistic tokens (e.g., "right knee bends" and "kneels"), both at single-joint/word and segment/phrase scales. Segment-level attributions reveal the mapping between joint clusters and multi-word expressions. Visual heatmaps of $\phi$ scores elucidate the temporal and spatial loci of model attention, enabling fine-grained diagnostic analyses and insights into the model's reasoning. This attribution-based transparency is absent from purely global contrastive frameworks, positioning PST as both empirically strong and interpretable (Chen et al., 29 Jan 2026).
