Graph Mamba: Towards Learning on Graphs with State Space Models (2402.08678v2)

Published 13 Feb 2024 in cs.LG

Abstract: Graph Neural Networks (GNNs) have shown promising potential in graph representation learning. The majority of GNNs define a local message-passing mechanism, propagating information over the graph by stacking multiple layers. These methods, however, are known to suffer from two major limitations: over-squashing and poor capturing of long-range dependencies. Recently, Graph Transformers (GTs) emerged as a powerful alternative to Message-Passing Neural Networks (MPNNs). GTs, however, have quadratic computational cost, lack inductive biases on graph structures, and rely on complex Positional/Structural Encodings (SE/PE). In this paper, we show that while Transformers, complex message-passing, and SE/PE are sufficient for good performance in practice, neither is necessary. Motivated by the recent success of State Space Models (SSMs), such as Mamba, we present Graph Mamba Networks (GMNs), a general framework for a new class of GNNs based on selective SSMs. We discuss and categorize the new challenges when adapting SSMs to graph-structured data, and present four required and one optional steps to design GMNs, where we choose (1) Neighborhood Tokenization, (2) Token Ordering, (3) Architecture of Bidirectional Selective SSM Encoder, (4) Local Encoding, and dispensable (5) PE and SE. We further provide theoretical justification for the power of GMNs. Experiments demonstrate that despite much less computational cost, GMNs attain an outstanding performance in long-range, small-scale, large-scale, and heterophilic benchmark datasets.

Graph Mamba Networks (GMNs) (Behrouz & Hashemi, 13 Feb 2024) are presented as a novel class of graph neural networks (GNNs) that leverage the power of selective State Space Models (SSMs), particularly inspired by the Mamba architecture. The paper aims to address the limitations of the existing dominant paradigms in graph representation learning: Message-Passing Neural Networks (MPNNs) and Graph Transformers (GTs). MPNNs suffer from issues like over-squashing, over-smoothing, and poor long-range dependency capture, while GTs, despite their expressive power and ability to model long-range interactions via global attention, face quadratic computational costs, lack strong inductive biases, and rely heavily on complex and often expensive Positional/Structural Encodings (PE/SE).

The core idea behind GMNs is to adapt selective SSMs, which have shown remarkable efficiency and performance in sequence modeling, to the complex, non-causal structure of graphs. The paper outlines a five-step recipe for designing GMNs:

  1. Tokenization: Mapping the graph into a sequence of tokens. GMNs propose a flexible neighborhood sampling method based on random walks. For each node, a sequence of subgraphs is sampled, where each subgraph corresponds to the union of nodes visited by multiple random walks of a specific length $\hat{m}$. By varying the maximum walk length $m$ and the number of random walk sets $s$, this method can bridge node-level tokenization ($m=0$) and subgraph-level tokenization ($m \ge 1$), allowing the choice to be a tunable hyperparameter (see the sampling sketch after this list).
    • Implementation Detail: For a node $v$, sample $M$ random walks of length $\hat{m}$ for $\hat{m} = 0, \dots, m$. Repeat this $s$ times. The tokens for node $v$ are induced subgraphs on the union of nodes visited by these walks for each $(\hat{m}, \text{repetition})$ pair. This results in a sequence of $m \times s + 1$ tokens (including the node itself for $\hat{m}=0$).
  2. (Optional) PE/SE: Incorporating structural and positional information. Similar to GTs, GMNs can augment initial node features with PE (e.g., Laplacian eigenvectors) or SE (e.g., Random-walk structural encodings). The paper shows that GMNs can achieve strong performance even without these, mitigating a major bottleneck for scalability compared to GTs.
  3. Local Encoding: Vectorizing the sampled subgraph tokens. Each subgraph token is encoded into a feature vector using an encoder $\phi(\cdot)$. This encoder can be an MPNN (like Gated-GCN) or a random walk-based feature encoder (like RWF).
  4. Token Ordering: Arranging the sequence of tokens for consumption by the sequential SSM encoder. When using subgraph tokens ($m \ge 1$), the inherent hierarchy of $k$-hop neighborhoods provides an implicit order. The paper suggests reversing this order (from largest to smallest hop) to allow inner subgraphs access to information from outer, more global, subgraphs. When using node tokenization ($m=0$), an explicit ordering (e.g., based on node degrees) is required to provide a sequence structure to the SSM.
    • Implementation Detail: For subgraph tokens of node $v$ from walks of lengths $0, 1, \dots, m$, repeated $s$ times, the sequence order is generally from the largest hop length ($m$) down to the node itself ($\hat{m}=0$). For node tokens, sorting nodes by degree or other centrality measures provides the sequence order.
  5. Bidirectional Selective SSM Encoder: The core architectural component, a modified Mamba block designed for non-causal graph data. It uses two recurrent scan modules, one processing the token sequence in the forward direction and the other in reverse. This allows information to flow in both directions, making the model more robust to the specific ordering of tokens, which is crucial for graph data. The selective mechanism allows the model to filter irrelevant information and focus on important tokens.
    • Implementation Detail: A Bidirectional Mamba block takes a sequence of token embeddings. It applies two Mamba-like selective SSMs, one on the input sequence and one on the reversed sequence. The outputs from the two directions are combined (e.g., summed and projected) to produce the final output sequence of embeddings (see the bidirectional block sketch after this list).
    • Architecture: The overall GMN architecture stacks these bidirectional Mamba blocks. The initial layers process the sequence of neighborhood tokens for each node individually. The final layers treat the nodes themselves (represented by the output corresponding to the $\hat{m}=0$ token from the previous layers) as tokens and pass them through another bidirectional Mamba block (potentially augmented by an optional MPNN layer) to capture global dependencies across the graph.
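
The snippet below is a minimal sketch of the random-walk tokenization and token ordering described in steps 1 and 4. It assumes an undirected graph given as a plain adjacency list; the function names (`random_walk`, `node_tokens`) and the number of walks per token are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of random-walk neighborhood tokenization (steps 1 and 4),
# assuming an undirected graph stored as an adjacency list.
import random
from typing import Dict, List, Set


def random_walk(adj: Dict[int, List[int]], start: int, length: int) -> List[int]:
    """Uniform random walk of `length` steps starting at `start`."""
    walk = [start]
    for _ in range(length):
        neighbors = adj[walk[-1]]
        if not neighbors:          # dead end: stop early
            break
        walk.append(random.choice(neighbors))
    return walk


def node_tokens(adj: Dict[int, List[int]], v: int,
                m: int, s: int, num_walks: int = 4) -> List[Set[int]]:
    """Return the token sequence for node v.

    For each repetition r = 1..s and each walk length m_hat = m..1, the token
    is the union of nodes visited by `num_walks` random walks of length m_hat.
    Tokens are ordered from the largest walk length down to the node itself
    (m_hat = 0), so inner subgraphs see the more global context first.
    """
    tokens: List[Set[int]] = []
    for _ in range(s):
        for m_hat in range(m, 0, -1):            # largest walk length first
            visited: Set[int] = set()
            for _ in range(num_walks):
                visited.update(random_walk(adj, v, m_hat))
            tokens.append(visited)
    tokens.append({v})                           # the node itself (m_hat = 0) last
    return tokens


# Toy usage on a 5-node path graph.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(node_tokens(adj, v=2, m=2, s=2))           # m*s + 1 = 5 tokens
```

In a full pipeline, each returned node set would be turned into an induced subgraph and passed to the local encoder of step 3.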
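The next sketch illustrates the bidirectional encoder of step 5 under stated assumptions: a selective SSM (e.g., a Mamba layer) would serve as the two directional encoders in the real model, while a GRU stands in here only to keep the snippet self-contained and runnable; the sum-then-project merge and the residual/LayerNorm wrapper are illustrative choices.

```python
# Hedged sketch of a bidirectional sequence block (step 5): encode the token
# sequence forward and reversed, then combine the two directions.
import torch
import torch.nn as nn


class BidirectionalSSMBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Stand-ins: replace with selective SSM (Mamba) layers in the real model.
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings for each node's token sequence
        h_fwd, _ = self.fwd(x)
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])        # re-align to the original order
        out = self.proj(h_fwd + h_bwd)             # sum the two directions, then project
        return self.norm(out + x)                  # residual connection (assumption)


# Toy usage: 8 nodes, 5 tokens each, 32-dimensional embeddings.
block = BidirectionalSSMBlock(d_model=32)
tokens = torch.randn(8, 5, 32)
node_repr = block(tokens)[:, -1, :]                # last token (m_hat = 0) as the node embedding
print(node_repr.shape)                             # torch.Size([8, 32])
```

Stacking such blocks, first over each node's token sequence and then over the resulting node embeddings, mirrors the layer structure described in the architecture note above.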

The paper highlights key challenges when adapting SSMs to graphs compared to Transformers: the need for token ordering due to the sequential nature of SSMs versus the permutation equivariance of Transformers; the opportunity to leverage SSMs' ability to handle long sequences for more context (longer token sequences); and the fact that SSMs' linear cost makes the computation of complex PE/SE a relative bottleneck.

GMNs offer theoretical guarantees. They are shown to be universal approximators of permutation-equivariant functions on graphs when equipped with positional encoding. Furthermore, with appropriate PE, they are provably more powerful than any Weisfeiler-Lehman (WL) isomorphism test, matching the expressive power of GTs. Notably, the paper provides theoretical justification that GMNs using the RWF encoder without PE/SE have unbounded expressive power, potentially distinguishing graphs that no k-WL test can.

Empirical evaluations demonstrate that GMNs achieve state-of-the-art or competitive performance across diverse benchmarks, including long-range graph datasets (LRGB), general GNN benchmarks (small and large scale), and heterophilic datasets. The results show that GMNs can outperform complex GTs and MPNNs while exhibiting significantly better memory efficiency and linear scalability compared to standard quadratic-cost GTs, as shown on large datasets like OGBN-Arxiv and MalNet-Tiny. An ablation study confirms the importance of each component, particularly the bidirectional Mamba block.

In summary, Graph Mamba Networks present a promising direction for graph representation learning by effectively adapting selective state space models. Their five-step framework, flexible tokenization, bidirectional SSM architecture, and strong empirical performance across various tasks and datasets, coupled with theoretical guarantees, establish them as a powerful, flexible, and scalable alternative to existing GNNs and GTs. The code is publicly available, facilitating implementation and further research.

Authors (2)
  1. Ali Behrouz (17 papers)
  2. Farnoosh Hashemi (10 papers)
Citations (42)