CTF-Net: Convolution & Transformer for Point Clouds
- CTF-Net is a deep neural architecture that jointly extracts local and global features from unordered point clouds.
- Its CT-block module uses bidirectional feature transmission to merge fine-grained convolutional details with holistic transformer context.
- The network demonstrates state-of-the-art performance on ModelNet40 and ShapeNetPart, excelling in classification and segmentation tasks.
A Convolutional Transform Feature Network (CTF-Net) is a deep neural architecture for learning expressive representations from unordered point clouds by jointly extracting and fusing fine-grained local and global geometric features. Its central module, the CT-block, couples a convolution-style local branch and a transformer-style global branch through bidirectional feature transmission, enabling mutual guidance between locality and context. Stacked CT-blocks form network backbones that achieve state-of-the-art effectiveness in point cloud classification and segmentation at compact computational cost (Guo et al., 2021).
1. Motivation and Architectural Rationale
Point clouds encode 3D geometry without spatial regularity. Local feature-extraction methods (e.g., PointNet++, graph CNNs) excel at capturing local shape but struggle to encode global contextual dependencies. Conversely, transformer-based attention networks learn holistic global relationships but tend to lose detailed local priors. CTF-Net's innovation is to learn both local and global features simultaneously and reciprocally, mediated by lightweight feature-transmission bridges within each CT-block, yielding maximally informative per-point descriptors.
2. CT-block: Joint Local and Global Feature Extraction
The CT-block is a dual-branch module designed for mutual enhancement of local and global point cloud features.
2.1 Convolution-branch (Local Feature Extractor)
- Input: $F_l^{in} \in \mathbb{R}^{N_l \times C_l^{in}}$, the per-point local features that will be neighbor-grouped.
- Operations (sketched in code below):
  1. Sampling and grouping: farthest point sampling (FPS) selects $N_l'$ points; for each, the $S$ nearest neighbors form $F_{grp} \in \mathbb{R}^{N_l' \times S \times C_l^{in}}$.
  2. Point-wise MLP ("conv₁"): $F_2 = \mathrm{LBR}_1(F_{grp})$.
  3. Feature transmission from the global branch ("ft₂"): $F_2' = F_2 + \mathrm{expand}\big(ft_2(F_g^{in})\big)$.
  4. Second MLP ("conv₂"): $F_3 = \mathrm{LBR}_2(F_2')$.
  5. Max-pooling over the $S$ neighbors: $F_l^{out} = \max_{s=1,\dots,S} F_3$.
- Key formula: $F_l^{out} = \mathrm{MaxPool}_S\big(\mathrm{LBR}_2\big(\mathrm{LBR}_1(F_{grp}) + \mathrm{expand}(ft_2(F_g^{in}))\big)\big)$
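To make the branch concrete, here is a minimal PyTorch sketch under the conventions above. It is an illustration, not the paper's implementation: the helper names (`farthest_point_sample`, `conv_branch`), the naive O(N²) FPS, and the omission of batch normalization inside the MLPs are all simplifications.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, n_sample):
    # Naive O(N^2) FPS over a single cloud xyz: (N, 3) -> (n_sample,) indices.
    N = xyz.shape[0]
    idx = torch.zeros(n_sample, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = torch.tensor(0)
    for i in range(n_sample):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(-1)
        dist = torch.minimum(dist, d)           # distance to the chosen set
        farthest = torch.argmax(dist)           # next center: farthest point
    return idx

def conv_branch(xyz, f_l, f_g_proj, mlp1, mlp2, n_out, S):
    # f_l: (N, C) per-point local features; f_g_proj: (n_out, C') global
    # features already projected onto the sampled points by ft2.
    sample_idx = farthest_point_sample(xyz, n_out)     # 1. sampling
    d = torch.cdist(xyz[sample_idx], xyz)              # (n_out, N)
    nn_idx = d.topk(S, largest=False).indices          # 1. kNN grouping
    f_grp = f_l[nn_idx]                                # (n_out, S, C)
    f2 = mlp1(f_grp)                                   # 2. shared MLP ("conv1")
    f2 = f2 + f_g_proj.unsqueeze(1)                    # 3. inject global context
    f3 = mlp2(f2)                                      # 4. shared MLP ("conv2")
    return f3.max(dim=1).values, sample_idx            # 5. max-pool over S
```

Here `mlp1` and `mlp2` stand in for LBR₁ and LBR₂, e.g. `mlp1 = nn.Sequential(nn.Linear(64, 128), nn.ReLU())` for a hypothetical 64→128 block.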
2.2 Transformer-branch (Global Feature Extractor with Offset-Attention)
- Input: $F_g^{in} \in \mathbb{R}^{N_g \times C_g^{in}}$.
- Operations (sketched in code below):
  1. Linear projections: $Q = X W_q$, $K = X W_k$, $V = X W_v$, where $X = F_g^{in} + ft_1(F_2)$ injects the local branch via $ft_1$.
  2. Attention scores: $E = Q K^{\top}$.
  3. Offset-attention (Laplacian-style) normalization: double normalization of $E$, a softmax along the first axis followed by $\ell_1$ normalization along the second, giving $A$.
  4. Context aggregation: $F_a = A V$.
  5. Offset residual block: $F_g^{out} = \mathrm{LBR}(F_a - F_g^{in}) + F_g^{in}$.
- Key formula: $F_g^{out} = \mathrm{LBR}(F_a - F_g^{in}) + F_g^{in}$, with $F_a = A V$ and $A$ the double-normalized attention map.
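The offset-attention step can be sketched as follows; the double normalization (softmax along one axis, $\ell_1$ along the other) follows the PCT-style offset-attention the section describes, while the class name and the $C/4$ latent dimension for $Q$ and $K$ are assumptions.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    # Attend, then feed the *offset* between attended features and the input
    # through an LBR block, with a residual add back onto the input.
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Linear(channels, channels // 4, bias=False)  # C/4 assumed
        self.k = nn.Linear(channels, channels // 4, bias=False)
        self.v = nn.Linear(channels, channels, bias=False)
        self.lbr = nn.Sequential(nn.Linear(channels, channels),
                                 nn.BatchNorm1d(channels), nn.ReLU())

    def forward(self, x):                           # x: (N_g, C)
        q, k, v = self.q(x), self.k(x), self.v(x)
        energy = q @ k.transpose(0, 1)              # (N_g, N_g)
        # Double normalization: softmax over axis 0, then l1 over axis 1.
        attn = torch.softmax(energy, dim=0)
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-9)
        f_a = attn @ v                              # aggregated context
        return self.lbr(f_a - x) + x                # offset + residual
```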
2.3 Feature Transmission Elements
Bidirectional connections ensure mutual guidance:
- : "local → global" up-samples local features using distance-weighted interpolation, MLP, BN.
- : "global → local" down-samples global features, matches sample indices and channels, MLP, BN.
3. Network Assembly for Classification and Segmentation
CT-blocks are assembled hierarchically to form CTF-Net backbones for classification and segmentation.
3.1 Classification Backbone
- Input: point cloud $P \in \mathbb{R}^{N \times 3}$.
- Embedding: Local (FPS and MLP) and global (MLP) feature streams.
- CT-block stack: $L = 3$ blocks; the convolution-branch downsamples by 2 per block ($N_l' = N_l/2$) while local channels double.
- Heads:
  - Local head: max-pool $F_l^L$, then FC layers produce class scores $s_l$.
  - Global head: concatenate $F_g^1, F_g^2, F_g^3$, max-pool, then FC layers produce class scores $s_g$.
- Final output: the average of the two heads, $(s_l + s_g)/2$ (sketched below).
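The dual heads reduce to a few lines; `DualHead` and the single FC layer per head are illustrative simplifications of the FC stacks described above.

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    # Averaged dual classification heads over the final local features and
    # the concatenated per-block global features.
    def __init__(self, c_local, c_global, n_classes):
        super().__init__()
        self.fc_loc = nn.Linear(c_local, n_classes)
        self.fc_glob = nn.Linear(c_global, n_classes)

    def forward(self, f_l, f_g_list):        # f_l: (N_l, C_l); list of (N_g, C_g)
        s_l = self.fc_loc(f_l.max(dim=0).values)         # local head
        f_g = torch.cat(f_g_list, dim=-1)                # concat per-block globals
        s_g = self.fc_glob(f_g.max(dim=0).values)        # global head
        return (s_l + s_g) / 2                           # average the two heads
```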
3.2 Segmentation Backbone (U-Net Style)
- Encoder: Identical to the classification stack.
- Decoder: up-sample local features level by level back to the original $N$ points, adding skip connections from the matching encoder level (one level is sketched below).
- Global features: the per-block global features are concatenated at the final stage.
- Heads: FC layers produce per-point local and global segmentation scores; the final output sums these scores.
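One decoder level can be sketched with the same inverse-distance interpolation used for $ft_1$, plus a skip connection; `decode_level` and its signature are hypothetical.

```python
import torch

def decode_level(xyz_dense, xyz_sparse, f_sparse, f_skip, mlp, k=3):
    # Interpolate features from the sparse encoder level back to the denser
    # level via inverse-distance weighting, then fuse the skip connection.
    d = torch.cdist(xyz_dense, xyz_sparse)               # (N_dense, N_sparse)
    dist, idx = d.topk(k, largest=False)                 # k nearest sparse points
    w = 1.0 / (dist + 1e-8)
    w = w / w.sum(dim=1, keepdim=True)                   # normalized weights
    interp = (f_sparse[idx] * w.unsqueeze(-1)).sum(dim=1)  # (N_dense, C)
    return mlp(torch.cat([interp, f_skip], dim=-1))      # refine fused features
```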
4. Algorithmic Workflow and Pseudocode
CT-block Forward Pass
```
Inputs:  F_l_in (N_l × C_l_in), F_g_in (N_g × C_g_in)
Outputs: F_l_out (N_l′ × C_l_out), F_g_out (N_g × C_g_out)

 1. sample_idx ← FPS(N_l → N_l′)
 2. F_grp ← group_points(F_l_in, sample_idx, S)
 3. F2 ← LBR1(F_grp)
 4. F_g_proj ← ft2(F_g_in)
 5. F2′ ← F2 + expand(F_g_proj to S-dim)
 6. F3 ← LBR2(F2′)
 7. F_l_out ← max_pool(F3, dim=neighbors)
 8. F_loc_proj ← ft1(F2)
 9. F_qkv ← linear_qkv(F_g_in + F_loc_proj)
10. A ← normalize_offset_attention(Q, K)
11. F_a ← A·V
12. F_g_out ← LBR(F_a − F_g_in) + F_g_in
13. return F_l_out, F_g_out
```
CTFNet Forward Pass
```
Inputs: P ∈ ℝ^(N×3)

1. F_l^0 ← LocalEmbedMLP(P)
2. F_g^0 ← GlobalEmbedMLP(P)
3. for i in 1…L do
4.     F_l^i, F_g^i ← CT_block_forward(F_l^{i−1}, F_g^{i−1})
5. end for
6. s_l ← fc_loc( max_pool(F_l^L) )
7. s_g ← fc_glob( max_pool(cat(F_g^1, F_g^2, F_g^3)) )
8. return (s_l + s_g) / 2
```
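The pseudocode translates directly into a compact PyTorch skeleton; CT-block modules composing the branch sketches above are assumed, and each block is assumed to return the coordinates of its sampled points alongside the two feature maps.

```python
import torch.nn as nn

class CTFNet(nn.Module):
    # Sketch of the classification backbone: two embeddings, L stacked
    # CT-blocks, and the averaged dual heads (DualHead sketched earlier).
    def __init__(self, local_embed, global_embed, blocks, head):
        super().__init__()
        self.local_embed = local_embed
        self.global_embed = global_embed
        self.blocks = nn.ModuleList(blocks)
        self.head = head

    def forward(self, xyz):                       # xyz: (N, 3)
        f_l = self.local_embed(xyz)               # F_l^0
        f_g = self.global_embed(xyz)              # F_g^0
        f_g_list = []
        for block in self.blocks:                 # L CT-blocks
            f_l, f_g, xyz = block(xyz, f_l, f_g)  # assumed block signature
            f_g_list.append(f_g)
        return self.head(f_l, f_g_list)           # (s_l + s_g) / 2
```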
5. Hyper-parameters and Implementation Details
The network's design is characterized by:
- Number of CT-blocks: $L = 3$
- Neighborhood size: $S$ nearest neighbors per sampled point in the convolution-branch, chosen as the best trade-off between FLOPs and accuracy
- Transformer embedding dimension $C_g$, with a separate latent dimension for the $Q$/$K$ projections of the offset-attention
- MLP expansion: local channels double per block (32→64→128)
- Loss function: dual cross-entropy over the two heads' class scores (classification) or per-point scores (segmentation)
- Optimizer: SGD, momentum = 0.9, initial lr = 1e-3, cosine annealing schedule (see the training sketch after this list)
- Data augmentation: classification uses random rotation about the z-axis and Gaussian point jitter; segmentation uses anisotropic scaling and multi-scale testing
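A minimal training configuration matching the listed settings; the model, data loader, and the 200-epoch horizon are placeholders, not values from the paper.

```python
import torch

# Placeholders: `model` is a CTF-Net backbone as sketched above, and
# `train_loader` yields (points, labels) batches. 200 epochs is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for points, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(points), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                 # cosine-anneal the learning rate per epoch
```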
6. Empirical Results and Ablation Studies
CTF-Net achieves state-of-the-art or near-state-of-the-art results on the ModelNet40 classification and ShapeNetPart segmentation benchmarks:
ModelNet40 Classification
| Method | OA | mAcc |
|---|---|---|
| PointNet++ | 91.9% | – |
| PCT | 93.2% | – |
| CT-block | 93.5% | 90.8% |
ShapeNetPart Segmentation
| Method | pIoU (single) | pIoU (multi) |
|---|---|---|
| PointNet++ | – | 85.1% |
| PCT | – | 86.4% |
| CT-block | 86.3% | 86.5% |
Ablation Study (ModelNet40 OA / ShapeNetPart pIoU)
| Variant | OA | pIoU |
|---|---|---|
| Convolution-branch only | 91.82% | 85.23% |
| Transformer-branch only | 91.75% | 85.51% |
| Both, no feature transmission | 92.59% | 85.70% |
| Full CT-block | 93.52% | 86.29% |
Hyper-parameter Trade-offs
- Varying the neighborhood size $S$: a moderate value achieves the best accuracy/cost trade-off.
- Transformer embedding dimension: accuracy peaks at a moderate width, beyond which overfitting and computational expense rise.
7. Comparative Analysis with Related Architectures
CTF-Net's simultaneous integration of convolution and attention parallels developments in other joint local-global point cloud processing networks. 3DCTN (Lu et al., 2022) utilizes hierarchical sampling, multi-scale graph convolution for local feature aggregation (LFA), and transformer-based global feature learning (GFL), with offset-attention and vector-subtraction operators yielding the best trade-offs. Both CTF-Net and 3DCTN demonstrate that coupling local grouping with global attention surpasses single-paradigm baselines in accuracy and efficiency.
A plausible implication is that mutual guidance between local and global branches, supported by effective inter-branch feature transmission, is the architectural driver of the observed gains on point cloud benchmarks. Key distinctions include CTF-Net's explicit dual-branch bridges at every block, in contrast with 3DCTN's module-level alternation, and the ablation evidence that removing either joint extraction or bidirectional transmission substantially degrades accuracy.
Summary
The Convolutional Transform Feature Network (CTF-Net), constructed from multiple CT-blocks, enables the joint learning and fusion of convolution-local and transformer-global features for unordered point cloud data. Bidirectional transmission mechanisms interleave local and global semantic information, resulting in dense, discriminative descriptors. Empirical evidence substantiates the superiority of this dual-branch architecture in classification and segmentation tasks compared to single-method networks, establishing CTF-Net as an efficient, high-performing backbone for geometric deep learning (Guo et al., 2021).