- The paper introduces GAGA, which selects only about 1% of nodes for annotation to reduce costs and time without sacrificing performance.
- It applies a two-level graph alignment with contrastive learning to integrate semantic and structural information effectively.
- Experiments on datasets like PubMed demonstrate up to 100x efficiency gains with classification accuracy reaching 94.61%.
Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment
The paper, authored by Huanyi Xie and collaborators, presents GAGA (Graph Alignment Guided Annotation), a framework for efficient learning on text-attributed graphs (TAGs). In TAGs, nodes carry textual data, and combining Graph Neural Networks (GNNs) with large language models (LLMs) has improved node representation. However, methods that apply LLMs to every node are inefficient in both time and cost because of their extensive annotation requirements. GAGA addresses these inefficiencies through selective annotation and structural alignment, significantly reducing the resources needed while matching or surpassing the performance of state-of-the-art methods.
Methodology
GAGA is based on three core stages, each designed to streamline the TAG learning process:
- Selective Annotation: Instead of annotating every node, GAGA selects a representative subset, roughly 1% of nodes or edges, for annotation. Nodes are chosen by "information density," which ensures they reflect key features of the data distribution. This step cuts annotation cost and time, on the premise that a small, well-chosen subset can adequately summarize the graph's structure.
- Graph Alignment: Once the subset is selected, GAGA constructs an annotation graph that captures both the semantic and structural information of the annotated nodes. A two-level alignment mechanism then aligns sub-annotation graphs with the original TAG via contrastive learning, combining subgraph-level and prototype-level alignment so that the derived node embeddings integrate both topological and semantic information.
- Model Deployment: For downstream classification or prediction tasks, GAGA reuses the embeddings produced by the alignment step. Only the GNN is fine-tuned while the LLM components stay frozen, further improving computational efficiency.
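The selective-annotation step can be sketched as a scoring-and-ranking routine. The paper's exact "information density" function is not reproduced in this summary, so the sketch below uses an assumed proxy: a node's average similarity to its k nearest neighbors in embedding space. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def select_annotation_nodes(embeddings: np.ndarray,
                            budget_ratio: float = 0.01,
                            k: int = 10) -> np.ndarray:
    """Pick a small, representative subset of nodes to send to the LLM.

    "Information density" is approximated here as the mean cosine
    similarity to the k nearest neighbors -- an assumed proxy for the
    paper's scoring function.
    """
    # Cosine-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-similarity
    # Density score: mean similarity to the k most similar other nodes.
    topk = np.sort(sims, axis=1)[:, -k:]
    density = topk.mean(axis=1)
    # Keep only the densest ~1% of nodes within the annotation budget.
    budget = max(1, int(budget_ratio * len(embeddings)))
    return np.argsort(density)[-budget:]
```

Only the returned indices would be annotated by the LLM; the rest of the graph is never queried, which is where the cost saving comes from.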
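The contrastive alignment objective can be illustrated with a standard InfoNCE-style loss: row i of one view (e.g. a sub-annotation-graph embedding) should match row i of the other view (the corresponding embedding from the original TAG), with all other rows serving as negatives. GAGA's actual loss may differ; this is the generic contrastive form, shown as a minimal sketch.

```python
import numpy as np

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             temperature: float = 0.1) -> float:
    """Contrastive (InfoNCE) loss aligning two embedding views.

    Matched pairs sit on the diagonal of the similarity matrix;
    every off-diagonal entry acts as a negative.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # -log p(correct match)
```

Minimizing this loss pulls each sub-annotation-graph embedding toward its counterpart in the original TAG while pushing it away from unrelated nodes, which is how the semantic (text-derived) and structural (graph-derived) views get fused.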
Numerical Results
Extensive experiments on multiple large datasets, including ogbn-arxiv and PubMed, demonstrate GAGA's efficacy. With annotations for just 1% of the data, the framework achieves classification accuracy comparable to or exceeding leading models, yielding reported efficiency gains of up to 100x. On the PubMed dataset, for instance, GAGA reaches a classification accuracy of 94.61% with substantially reduced annotation time and financial cost.
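The relationship between the 1% annotation budget and the up-to-100x efficiency figure follows from simple arithmetic, assuming annotation cost scales roughly linearly with the number of LLM calls. The node count and per-call cost below are illustrative placeholders, not figures from the paper.

```python
# Back-of-envelope: annotating 1% of nodes instead of all of them.
# total_nodes and cost_per_annotation are hypothetical, for illustration only.
total_nodes = 100_000
cost_per_annotation = 0.01                 # hypothetical cost per LLM call

full_cost = total_nodes * cost_per_annotation
gaga_cost = int(total_nodes * 0.01) * cost_per_annotation
saving_factor = full_cost / gaga_cost      # 1% budget -> 100x fewer calls
```

Under this linear-cost assumption, a 1% budget translates directly into a 100x reduction in annotation calls, matching the order of magnitude the paper reports.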
Implications and Future Developments
The implications of GAGA are substantial, offering a robust and cost-efficient alternative for learning in TAGs. By showcasing that accurate node representations can be achieved through minimal annotations, GAGA paves the way for more sustainable AI applications in various domains including text classification, recommendation systems, and network analysis.
Theoretically, GAGA contributes to the optimization of resource allocation in LLM-augmented GNN models. Practically, it allows for the scalable deployment of graph learning models where data annotations are sparse or costly, without compromising on accuracy.
Looking ahead, future research could extend GAGA's approach to pre-training models for cross-dataset applications, creating a versatile tool that adapts seamlessly to diverse graph datasets. As more capable LLMs emerge, future work may also integrate advanced LLM architectures to further deepen the semantic understanding of textual graphs.
This paper stands as a crucial step toward efficient and scalable text-attributed graph learning, demonstrating that selective annotation combined with strategic model alignment can profoundly enhance performance and utility in graph-based AI applications.