Explicit positional encoding for the texture transformer branch

Develop and evaluate explicit positional encoding mechanisms for the transformer-based texture branch that aggregates per-face texture pixels into a single token in semantic segmentation models for textured non-manifold 3D meshes, and assess their impact on segmentation accuracy and representation quality.

Background

The proposed architecture includes a transformer-based texture branch that summarizes all pixels associated with each mesh face into a learnable token without using explicit positional encoding. Transformers typically rely on positional information to model relationships among input elements, and the lack of such encoding may limit the branch’s ability to capture spatial structure within per-face texture patches.
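One concrete way to address this is to add a fixed 2D sinusoidal encoding of each pixel's position within its face's texture patch before the transformer aggregates the patch into a token. The sketch below is illustrative, not the paper's implementation: the `TexturePatchEncoder` module, its dimensions, and the assumption that each pixel carries a UV coordinate in [0, 1] are all hypothetical design choices.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_2d_encoding(coords, dim):
    """Fixed sinusoidal encoding of 2D (u, v) positions.

    coords: (N, 2) tensor of per-pixel positions in [0, 1].
    dim: embedding dimension, divisible by 4 (half for u, half for v).
    Returns an (N, dim) encoding.
    """
    assert dim % 4 == 0
    half = dim // 2
    # Geometric frequency ladder, as in the original transformer encoding.
    freqs = torch.exp(
        torch.arange(0, half, 2, dtype=torch.float32)
        * (-math.log(10000.0) / half)
    )
    parts = []
    for axis in range(2):  # encode u and v independently
        angles = coords[:, axis : axis + 1] * freqs  # (N, half/2)
        parts.append(torch.sin(angles))
        parts.append(torch.cos(angles))
    return torch.cat(parts, dim=-1)

class TexturePatchEncoder(nn.Module):
    """Aggregates the texture pixels of one face into a single token.

    With use_pos=True, each pixel embedding receives an explicit
    positional encoding before self-attention; with use_pos=False the
    branch is permutation-invariant over pixels, as in the paper's
    current design.
    """
    def __init__(self, dim=64, heads=4, use_pos=True):
        super().__init__()
        self.use_pos = use_pos
        self.proj = nn.Linear(3, dim)  # RGB -> embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable face token
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, pixels, coords):
        # pixels: (B, P, 3) RGB values; coords: (B, P, 2) UV positions in [0, 1]
        x = self.proj(pixels)
        if self.use_pos:
            B, P, D = x.shape
            pe = sinusoidal_2d_encoding(coords.reshape(-1, 2), D)
            x = x + pe.reshape(B, P, D)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        return self.encoder(x)[:, 0]  # aggregated face token, (B, dim)
```

Because the encoding is added per pixel rather than learned per index, it handles the variable pixel counts that different faces produce; a learned embedding over a fixed grid would be the natural alternative to compare against.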

In the conclusion, the authors identify the absence of explicit positional encoding in the texture branch as the most notable limitation and state that addressing it is left for future exploration, indicating a concrete, unresolved design and evaluation question.

References

"The most notable limitation of our method is the absence of explicit positional encoding in the texture transformer branch, which we leave for future exploration."

Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers  (2604.01836 - Heidarianbaei et al., 2 Apr 2026) in Conclusion and Future Work