
Impact of extended Levi graph tokenization on GLM behavior

Determine how the removal of whitespace between concepts and edges, a side effect of tokenizing each node of the Levi graph individually in the Graph Language Model's graph preprocessing pipeline, affects the resulting token sequences and the representations produced when encoding Graphs of Triplets.


Background

To make LLMs operate natively on graphs, the paper converts Graphs of Triplets into extended Levi graphs and tokenizes each node individually. Per-node tokenization ensures that a concept shared across triplets is tokenized identically everywhere it appears, but as a side effect it removes the whitespace that would otherwise separate concept and relation tokens in a plain linearization.
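The effect can be illustrated with a toy tokenizer. This is a minimal sketch, not the paper's code: it mimics SentencePiece-style word-boundary markers (the "▁" prefix used by T5-family tokenizers, on which GLM builds), and the triplet ("black poodle", "is a", "dog") is a hypothetical example. When each Levi-graph node is tokenized on its own, the tokenizer never sees the spaces between nodes, so the boundary markers at node transitions disappear:

```python
def toy_tokenize(text):
    """Toy SentencePiece-style tokenizer: one token per word; a word that
    is preceded by a space inside the given string gets a '▁' prefix."""
    return [("▁" if i > 0 else "") + word
            for i, word in enumerate(text.split(" "))]

# Hypothetical triplet: (concept, relation, concept)
triplet = ("black poodle", "is a", "dog")

# (a) Plain linearization: one string, so inter-node spaces are visible
# to the tokenizer and every non-initial word keeps its '▁' marker.
joint_tokens = toy_tokenize(" ".join(triplet))

# (b) GLM-style preprocessing: each Levi-graph node is tokenized
# individually, so the spaces *between* nodes are never seen and the
# markers at node boundaries ('▁is', '▁dog') are lost.
per_node_tokens = [tok for node in triplet for tok in toy_tokenize(node)]

print(joint_tokens)     # boundary markers on every non-initial word
print(per_node_tokens)  # markers missing at node transitions
```

Under this toy model the two sequences differ exactly at the node boundaries, which is the discrepancy relative to a straightforward linearization that the footnote flags; how a real pretrained tokenizer and the downstream LM weights react to such shifted token sequences is the open question.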

The authors note that this modification may change the token sequence relative to a straightforward linearization of the triplets, potentially affecting how pretrained LM weights interact with graph-structured inputs. They explicitly defer assessing the consequences of this change to future work.

References

This removes whitespace between concepts and edges, which impacts tokenization. We leave investigation of the impact of this effect to future work.

Graph Language Models (2401.07105 - Plenz et al., 13 Jan 2024) in Section 4, Graph Language Model — Graph preprocessing (footnote)