
CTF-Net: Convolution & Transformer for Point Clouds

Updated 24 November 2025
  • CTF-Net is a deep neural architecture that jointly extracts local and global features from unordered point clouds.
  • Its CT-block module uses bidirectional feature transmission to merge fine-grained convolutional details with holistic transformer context.
  • The network demonstrates state-of-the-art performance on ModelNet40 and ShapeNetPart, excelling in classification and segmentation tasks.

A Convolutional Transform Feature Network (CTF-Net) is a deep neural architecture for learning expressive representations from unordered point clouds by jointly extracting and fusing fine-grained local and global geometric features. Its central module, the CT-block, couples a convolution-style local branch and a transformer-style global branch through bidirectional feature transmission, enabling mutual guidance between locality and context. Stacking CT-blocks yields compact network backbones that achieve state-of-the-art effectiveness in point cloud classification and segmentation (Guo et al., 2021).

1. Motivation and Architectural Rationale

Point clouds encode 3D geometry without spatial regularity. Local feature extractors (e.g., PointNet++, graph CNNs) excel at capturing local shape but struggle to model long-range global dependencies. Conversely, transformer-based attention networks learn holistic global relationships but tend to lose detailed local priors. CTF-Net's innovation is to learn local and global features simultaneously and reciprocally, mediated by lightweight feature-transmission bridges within each CT-block, yielding per-point descriptors that carry both kinds of information.

2. CT-block: Joint Local and Global Feature Extraction

The CT-block is a dual-branch module designed for mutual enhancement of local and global point cloud features.

2.1 Convolution-branch (Local Feature Extractor)

  • Input: $F_\ell^{(i-1)} \in \mathbb{R}^{N_\ell \times C_\ell}$, representing neighbor-grouped local features.
  • Operations:

1. Sampling & grouping: Farthest Point Sampling (FPS) selects $N_\ell^\text{out}$ points; for each, the $S$ nearest neighbors are grouped into $F_1 \in \mathbb{R}^{N_\ell^\text{out} \times S \times C_\ell}$.
2. Point-wise MLP ("conv₁"): $F_2 = \mathrm{LBR}_1(F_1)$.
3. Feature transmission from the global branch ("ft₂"): $F_2' = F_2 + \mathrm{ft}_2(F_g^{(i)})$.
4. Second MLP ("conv₂"): $F_3 = \mathrm{LBR}_2(F_2')$.
5. Max-pooling over neighbors: $F_\ell^{(i)} = \max_{j=1..S} F_3[:, j, :]$.

  • Key formula:

    $$F_\ell^{(i)} = \mathrm{conv}_2\big(\mathrm{conv}_1(F_\ell^{(i-1)}) + \mathrm{ft}_2(F_g^{(i)})\big)$$
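
The convolution-branch can be sketched compactly in PyTorch. The following is a minimal sketch under assumed (B, N, C) tensor shapes; the helper names (`farthest_point_sample`, `knn_group`) and the exact LBR composition are illustrative and not taken from an official CTF-Net release.

```python
# Minimal sketch of the convolution-branch: FPS sampling, kNN grouping,
# conv1 (LBR), optional ft2 injection from the global branch, conv2 (LBR),
# and max-pooling over the S neighbors.  Helper names are illustrative.
import torch
import torch.nn as nn


def farthest_point_sample(xyz: torch.Tensor, n_out: int) -> torch.Tensor:
    """Greedy farthest point sampling.  xyz: (B, N, 3) -> indices (B, n_out)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_out, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_out):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                 # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))
        farthest = dist.argmax(-1)
    return idx


def knn_group(feat: torch.Tensor, xyz: torch.Tensor, idx: torch.Tensor, S: int) -> torch.Tensor:
    """Group the S nearest neighbors of each sampled point: returns (B, M, S, C)."""
    centers = torch.gather(xyz, 1, idx.unsqueeze(-1).expand(-1, -1, 3))  # (B, M, 3)
    nn_idx = torch.cdist(centers, xyz).topk(S, largest=False).indices    # (B, M, S)
    B, M, _ = nn_idx.shape
    C = feat.shape[-1]
    flat = nn_idx.reshape(B, M * S, 1).expand(-1, -1, C)
    return torch.gather(feat, 1, flat).reshape(B, M, S, C)


class ConvBranch(nn.Module):
    """conv1 -> (+ ft2 from the global branch) -> conv2 -> max over neighbors."""

    def __init__(self, c_in: int, c_out: int, S: int = 32):
        super().__init__()
        self.S = S
        self.conv1 = nn.Sequential(nn.Linear(c_in, c_out), nn.BatchNorm1d(c_out), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Linear(c_out, c_out), nn.BatchNorm1d(c_out), nn.ReLU())

    @staticmethod
    def _lbr(mlp: nn.Module, x: torch.Tensor) -> torch.Tensor:
        B, M, S, C = x.shape                      # BatchNorm1d expects (N, C)
        return mlp(x.reshape(B * M * S, C)).reshape(B, M, S, -1)

    def forward(self, feat, xyz, n_out, ft2_from_global=None):
        idx = farthest_point_sample(xyz, n_out)
        f1 = knn_group(feat, xyz, idx, self.S)    # sampling & grouping
        f2 = self._lbr(self.conv1, f1)            # conv1
        if ft2_from_global is not None:           # ft2: broadcast over neighbors
            f2 = f2 + ft2_from_global.unsqueeze(2)
        f3 = self._lbr(self.conv2, f2)            # conv2
        return f3.max(dim=2).values, idx          # max-pool over the S neighbors
```

Given features of shape (B, N, C) and coordinates of shape (B, N, 3), the branch returns (B, n_out, c_out) local features together with the FPS indices, which the global-to-local transmission needs to align point sets.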

2.2 Transformer-branch (Global Feature Extractor with Offset-Attention)

  • Input: $F_g^{\text{in}} \in \mathbb{R}^{N_g \times d_e}$.
  • Operations:

1. Linear projection: $[Q, K, V] = F_g^{\text{in}} [W_q, W_k, W_v]$, with $Q, K \in \mathbb{R}^{N_g \times d_a}$ and $V \in \mathbb{R}^{N_g \times d_e}$.
2. Attention scores: $\bar{A} = Q K^\top$.
3. Offset-attention (Laplacian-style) normalization: $\bar{A}$ is normalized with a softmax along one dimension and an $\ell_1$ normalization along the other ("double normalization"), yielding $A$.
4. Context aggregation: $F_a = A V$.
5. Residual block: $F_g^{\text{out}} = \mathrm{LBR}(F_a - F_g^{\text{in}}) + F_g^{\text{in}}$.

  • Key formula:

    $$F_g^{(i)} = \mathrm{trans}\big(\mathrm{ft}_1(\mathrm{conv}_1(F_\ell^{(i-1)})) + F_g^{(i-1)}\big)$$
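
The transformer-branch can likewise be sketched in PyTorch. The sketch below follows the step list above with PCT-style double normalization (softmax followed by $\ell_1$); the layer composition, and the placement of BatchNorm inside LBR, are assumptions rather than details from a released implementation. Default dimensions follow Section 5.

```python
# Sketch of the transformer-branch: Q/K/V projection, double-normalized
# attention, and the offset residual F_out = LBR(F_a - F_in) + F_in.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetAttention(nn.Module):
    def __init__(self, d_e: int = 256, d_a: int = 64):
        super().__init__()
        self.wq = nn.Linear(d_e, d_a, bias=False)
        self.wk = nn.Linear(d_e, d_a, bias=False)
        self.wv = nn.Linear(d_e, d_e, bias=False)
        # "LBR" applied to the offset: Linear + BatchNorm + ReLU (assumed composition).
        self.lbr = nn.Sequential(nn.Linear(d_e, d_e), nn.BatchNorm1d(d_e), nn.ReLU())

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # f_in: (B, N_g, d_e)
        q, k, v = self.wq(f_in), self.wk(f_in), self.wv(f_in)
        scores = q @ k.transpose(1, 2)                 # \bar{A}: (B, N_g, N_g)
        a = F.softmax(scores, dim=1)                   # softmax over one dimension
        a = a / (a.sum(dim=-1, keepdim=True) + 1e-9)   # l1-normalize over the other
        f_a = a @ v                                    # context aggregation
        B, N, C = f_in.shape
        offset = self.lbr((f_a - f_in).reshape(B * N, C)).reshape(B, N, C)
        return offset + f_in                           # residual connection


# Example: f_g = OffsetAttention()(torch.randn(2, 128, 256))  # -> (2, 128, 256)
```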

2.3 Feature Transmission Elements

Bidirectional connections ensure mutual guidance:

  • $\mathrm{ft}_1$ ("local → global"): up-samples local features to the global point set using distance-weighted interpolation, followed by an MLP and batch normalization (sketched below).
  • $\mathrm{ft}_2$ ("global → local"): down-samples global features by matching the FPS sample indices and channel width, followed by an MLP and batch normalization.
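
For the local → global direction, the distance-weighted interpolation can be sketched as PointNet++-style inverse-distance interpolation over the $k$ nearest source points; $k = 3$ and the function name are assumptions here, and the subsequent MLP + BN are omitted for brevity.

```python
# Sketch of distance-weighted feature interpolation for ft1 (local -> global):
# features at the sparse sampled points are propagated back to the denser
# global point set using inverse-distance weights over k nearest neighbors.
import torch


def interpolate_features(src_xyz: torch.Tensor, src_feat: torch.Tensor,
                         dst_xyz: torch.Tensor, k: int = 3) -> torch.Tensor:
    """src_xyz: (B, M, 3), src_feat: (B, M, C), dst_xyz: (B, N, 3) -> (B, N, C)."""
    knn_d, knn_idx = torch.cdist(dst_xyz, src_xyz).topk(k, largest=False)  # (B, N, k)
    w = 1.0 / (knn_d + 1e-8)
    w = w / w.sum(dim=-1, keepdim=True)                                    # IDW weights
    B, N, _ = knn_idx.shape
    C = src_feat.shape[-1]
    gathered = torch.gather(
        src_feat, 1, knn_idx.reshape(B, N * k, 1).expand(-1, -1, C)
    ).reshape(B, N, k, C)
    return (w.unsqueeze(-1) * gathered).sum(dim=2)


# ft2 (global -> local) runs the other way: global features are indexed with the
# convolution-branch's FPS sample indices to match point count, then an MLP + BN
# matches the channel width before the addition inside the convolution-branch.
```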

3. Network Assembly for Classification and Segmentation

CT-blocks are assembled hierarchically to form CTF-Net backbones for classification and segmentation.

3.1 Classification Backbone

  • Input: $P \in \mathbb{R}^{N \times 3}$.
  • Embedding: Local (FPS and MLP) and global (MLP) feature streams.
  • CT-block stack: $L=3$ blocks; the convolution-branch downsamples by a factor of 2 per block ($N \rightarrow N/2$) while channel width doubles.
  • Heads:
    • Local head: max-pool $F_\ell^{(L)}$, then FC layers produce class scores.
    • Global head: concatenate $F_g^{(1)}, F_g^{(2)}, F_g^{(3)}$, max-pool, then FC layers produce class scores.
    • Final output: the two score vectors are averaged (see the sketch below).
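
As a rough illustration of the two heads and their averaging, the following sketch assumes hidden widths and dropout that are not specified in this article.

```python
# Sketch of the dual classification heads: the local head pools the last
# convolution-branch features, the global head pools the concatenation of the
# three transformer-branch outputs, and the two score vectors are averaged.
# Hidden widths and dropout are illustrative assumptions.
import torch
import torch.nn as nn


class DualClassifierHead(nn.Module):
    def __init__(self, c_local: int, c_global_cat: int, num_classes: int = 40):
        super().__init__()
        self.fc_local = nn.Sequential(
            nn.Linear(c_local, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes))
        self.fc_global = nn.Sequential(
            nn.Linear(c_global_cat, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes))

    def forward(self, f_local_last: torch.Tensor, f_global_list: list) -> torch.Tensor:
        # f_local_last: (B, N_L, C_l); f_global_list: [(B, N_g, C_i)] for i = 1..3
        s_local = self.fc_local(f_local_last.max(dim=1).values)
        f_g_cat = torch.cat(f_global_list, dim=-1)        # concat F_g^(1..3) channels
        s_global = self.fc_global(f_g_cat.max(dim=1).values)
        return (s_local + s_global) / 2                   # average of the two heads
```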

3.2 Segmentation Backbone (U-Net Style)

  • Encoder: Identical to the classification stack.
  • Decoder: Up-sample local features per level back to $N$ points, with skip connections.
  • Global features: concatenate $F_g^{(1)}, F_g^{(2)}, F_g^{(3)}$ at the final stage.
  • Heads: FC layers for per-point local and global segmentation scores. Final output sums these scores.

4. Algorithmic Workflow and Pseudocode

CT-block Forward Pass

Inputs: F_l_in (N_l×C_l_in), F_g_in (N_g×C_g_in)
Outputs: F_l_out (N_l′×C_l_out), F_g_out (N_g×C_g_out)
1. sample_idx ← FPS(N_l→N_l′)
2. F_grp ← group_points(F_l_in, sample_idx, S)
3. F2 ← LBR1(F_grp)
4. F_g_proj ← ft2(F_g_in)
5. F2′ ← F2 + expand(F_g_proj to S-dim)
6. F3 ← LBR2(F2′)
7. F_l_out ← max_pool(F3, dim=neighbors)
8. F_loc_proj ← ft1(F2)
9. Q, K, V ← linear_qkv(F_g_in + F_loc_proj)
10. A ← normalize_offset_attention(Q,K)
11. F_a ← A·V
12. F_g_out ← LBR(F_a − F_g_in) + F_g_in
13. return F_l_out, F_g_out

CTFNet Forward Pass

Inputs: P ∈ℝ^(N×3)
1. F_l0 ← LocalEmbedMLP(P)
2. F_g0 ← GlobalEmbedMLP(P)
3. for i in 1…L do
4.    F_l^i, F_g^i ← CT_block_forward(F_l^{i−1}, F_g^{i−1})
5. end for
6. s_l ← fc_loc( max_pool(F_l^L) )
7. s_g ← fc_glob( max_pool(cat(F_g^1,F_g^2,F_g^3)) )
8. return (s_l + s_g) / 2

5. Hyper-parameters and Implementation Details

The network's design is characterized by:

  • Number of CT-blocks: $L=3$
  • Neighborhood size: $S=32$ in the convolution-branch (best trade-off between FLOPs and accuracy)
  • Transformer embedding dimension: $d_e=256$; attention latent dimension: $d_a=64$
  • MLP expansion: Local channel doubles per block (32→64→128)
  • Loss function: Dual cross-entropy (classification or per-point segmentation)
  • Optimizer: SGD, momentum=0.9, initial lr=1e-3, cosine annealing
  • Data augmentation: classification uses random z-rotation and point jitter ($\sigma=0.02$); segmentation uses anisotropic scaling in $[0.8, 1.25]$ and multi-scale testing (a configuration sketch follows below)
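
These settings map directly onto standard PyTorch components. The sketch below assumes an epoch count and a placeholder model, neither of which is given in this article; the augmentation helper implements the listed z-rotation and jitter.

```python
# Sketch of the listed training configuration: SGD (momentum 0.9, lr 1e-3) with
# cosine annealing, dual cross-entropy, and classification-time augmentation
# (random z-rotation + jitter, sigma = 0.02).  `num_epochs` and the stand-in
# model are assumptions, not values from this article.
import math
import random
import torch
import torch.nn as nn


def augment_classification(points: torch.Tensor, sigma: float = 0.02) -> torch.Tensor:
    """Random rotation about the z axis plus Gaussian jitter on (..., N, 3) points."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]],
                       dtype=points.dtype, device=points.device)
    return points @ rot.T + sigma * torch.randn_like(points)


model = nn.Linear(3, 40)           # placeholder for the CTF-Net backbone + heads
num_epochs = 200                   # assumption; the schedule length is not stated

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.CrossEntropyLoss()  # applied to both the local and global head scores

for epoch in range(num_epochs):
    # ... per-batch: loss = criterion(s_local, labels) + criterion(s_global, labels)
    scheduler.step()
```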

6. Empirical Results and Ablation Studies

CTF-Net achieves performance competitive with the state of the art on the ModelNet40 classification and ShapeNetPart segmentation benchmarks:

ModelNet40 Classification

| Method | OA | mAcc |
|---|---|---|
| PointNet++ | 91.9% | – |
| PCT | 93.2% | – |
| CT-block | 93.5% | 90.8% |

ShapeNetPart Segmentation

| Method | pIoU (single) | pIoU (multi) |
|---|---|---|
| PointNet++ | 85.1% | – |
| PCT | 86.4% | – |
| CT-block | 86.3% | 86.5% |

Ablation Study (ModelNet40 OA / ShapeNetPart pIoU)

| Variant | OA | pIoU |
|---|---|---|
| Convolution-branch only | 91.82% | 85.23% |
| Transformer-branch only | 91.75% | 85.51% |
| Both branches, no feature transmission | 92.59% | 85.70% |
| Full CT-block | 93.52% | 86.29% |

Hyper-parameter Trade-offs

  • Varying $S \in \{8, 16, 32, 64, 128\}$: $S \approx 32$ achieves the best accuracy/cost trade-off.
  • Transformer embedding dimension $d_e \in \{64, 128, 256, 512, 1024\}$: $d_e = 256$ is optimal before overfitting and computational cost increase.

The CTF-Net’s simultaneous integration of convolution and attention parallels developments in other joint local-global point cloud processing networks. 3DCTN (Lu et al., 2022) utilizes hierarchical sampling, multi-scale graph convolution for local feature aggregation (LFA), and transformer-based global feature learning (GFL), with offset-attention and vector subtraction operators yielding optimal trade-offs. Both CTF-Net and 3DCTN demonstrate that coupling local grouping and global attention surpasses single-paradigm baselines in accuracy and efficiency.

A plausible implication is that mutual guidance between local and global branches, when supported by effective inter-branch feature transmission, is the architectural driver for the observed performance gains on point cloud benchmarks. Key distinctions include CTF-Net’s explicit dual-branch bridge approach at every block—contrasting with 3DCTN’s module-level alternation—and the demonstration that ablations removing either joint extraction or bidirectional transmission result in substantial accuracy degradation.

Summary

The Convolutional Transform Feature Network (CTF-Net), constructed from multiple CT-blocks, enables the joint learning and fusion of convolution-local and transformer-global features for unordered point cloud data. Bidirectional transmission mechanisms interleave local and global semantic information, resulting in dense, discriminative descriptors. Empirical evidence substantiates the superiority of this dual-branch architecture in classification and segmentation tasks compared to single-method networks, establishing CTF-Net as an efficient, high-performing backbone for geometric deep learning (Guo et al., 2021).
