- The paper introduces OTKE, a novel approach leveraging optimal transport and kernel methods to create fixed-size embeddings from variable-length feature sets.
- It demonstrates that OTKE is closely related to attention mechanisms, achieving state-of-the-art results on bioinformatics tasks and outperforming strong baselines on NLP tasks.
- The method supports both supervised and unsupervised learning, offering enhanced scalability, interpretability, and adaptability in sequence classification applications.
Overview of "A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention"
This paper presents a novel approach for learning on variable-sized sets of features, which is particularly relevant for domains such as bioinformatics and natural language processing where data includes sequences with potentially long-range dependencies and varying lengths. The proposed method, termed Optimal Transport Kernel Embedding (OTKE), constructs a parametrized representation that embeds and aggregates input feature sets through optimal transport (OT) plans between the features and a trainable reference.
Key Elements of OTKE
OTKE offers a representation that combines principles from OT theory and kernel methods through the following key components:
- Fixed-Size Representation: The proposed approach generates a fixed-size, trainable embedding for feature aggregation, enabling it to handle feature sets of diverse sizes.
- Optimal Transport Plan: OTKE uses entropic regularization to solve the OT problem efficiently, treating the resulting transport plan either as a mechanism related to attention layers in neural networks or as a scalable surrogate for a classical OT-based kernel.
- Kernel Approximation: Kernel approximation techniques keep the method scalable and provide a non-linear transformation of the input features before aggregation.
- Adaptability: The method allows for adaptive learning of the trainable reference, which can be optimized with or without labeled data, thus supporting both supervised and unsupervised learning paradigms.
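The components above can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the authors' implementation: it uses a plain linear similarity in place of the paper's kernel map, uniform marginals, and Sinkhorn iterations (the standard solver for entropy-regularized OT) to build a transport plan between a variable-size feature set and a small trainable reference, then aggregates the features into a fixed number of slots.

```python
import numpy as np

def sinkhorn(K, n_iters=100):
    """Sinkhorn iterations: rescale the positive kernel K (n x p) into a
    transport plan whose marginals are uniform: 1/n over rows, 1/p over cols."""
    n, p = K.shape
    a, b = np.full(n, 1.0 / n), np.full(p, 1.0 / p)
    u, v = np.ones(n), np.ones(p)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def otke_pool(X, Z, eps=1.0):
    """Hypothetical OTKE-style pooling: embed a variable-size feature set
    X (n x d) into a fixed-size (p x d) matrix via the OT plan between X
    and a trainable reference Z (p x d). eps is the entropic regularization."""
    K = np.exp((X @ Z.T) / eps)   # Gibbs kernel from similarities
    P = sinkhorn(K)               # n x p transport plan
    p = Z.shape[0]
    return p * (P.T @ X)          # each of the p slots is a convex combination of X

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 4))   # a set of 7 features of dimension 4
Z = rng.standard_normal((3, 4))   # trainable reference with p = 3 elements
E = otke_pool(X, Z)
print(E.shape)                    # (3, 4): fixed size regardless of n
```

Note that the output is permutation-invariant in the input set: reordering the rows of `X` reorders the transport plan's rows identically, leaving the aggregated embedding unchanged, which is exactly the property needed for pooling over unordered or variable-length feature sets.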
Experimental Validation
The researchers conducted extensive experiments validating OTKE on biological sequence classification tasks, such as protein fold recognition and detection of chromatin profiles, where it achieved state-of-the-art results, and on natural language tasks, where it outperformed strong baselines, suggesting its versatility across different types of sequence data.
Implications and Future Directions
The experimental results carry several implications for the broader field of machine learning and AI:
- Attention Mechanisms: The work positions OTKE as both a theoretical and practical complement to attention mechanisms used in models like transformers. By reducing computational complexity via shared references and offering an alternative pooling mechanism, OTKE can enhance efficiency and performance in handling lengthy and complex sequences.
- Feature Aggregation: The integration of kernel methods with OT presents a novel way to perform pooling operations in deep learning architectures, thereby expanding the toolkit available for managing variable-length inputs.
- Scalability and Interpretability: OTKE encourages the development of scalable methods with potential interpretability due to its clear connection with transport distances and kernel embeddings.
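The complexity point above can be made concrete. The following sketch (illustrative only; the function names and the softmax normalization choice are assumptions, not the paper's method) contrasts plain self-attention pooling, whose weight matrix is quadratic in the set size, with attention against a small shared reference, which scales linearly in the set size and yields a fixed-size output, the regime OTKE operates in.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(X):
    """Plain self-attention: every feature attends to every other feature,
    so the weight matrix is n x n, quadratic in the set size."""
    W = softmax(X @ X.T / np.sqrt(X.shape[1]))  # n x n weights
    return W @ X                                # n x d: still variable-size

def shared_reference_pool(X, Z):
    """Attention against a small shared reference Z (p x d): the weight
    matrix is only n x p, so the cost grows linearly with n, and the
    output has fixed size p x d regardless of the input length."""
    W = softmax(X @ Z.T / np.sqrt(X.shape[1]), axis=0)  # normalize over features
    return W.T @ X                                      # p x d summary

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))  # set of 10 features
Z = rng.standard_normal((3, 4))   # shared reference, p = 3
print(self_attention_pool(X).shape)       # (10, 4)
print(shared_reference_pool(X, Z).shape)  # (3, 4)
```

Because the reference is shared across all inputs, the per-example cost depends on the small constant p rather than on the input length, which is the efficiency argument sketched above.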
Future work could explore multi-layer extensions of OTKE, integrate it into self-supervised learning frameworks to leverage large unlabeled datasets, and continue improving the efficiency of learned transport plans and the integration with state-of-the-art sequence models.
The open-source implementation of OTKE, available in the authors' GitHub repository, serves as a resource for further research and application development in fields that rely heavily on sequence data.