LLMs generate structurally realistic social networks but overestimate political homophily (2408.16629v1)

Published 29 Aug 2024 in cs.CY, cs.AI, and cs.SI

Abstract: Generating social networks is essential for many applications, such as epidemic modeling and social simulations. Prior approaches either involve deep learning models, which require many observed networks for training, or stylized models, which are limited in their realism and flexibility. In contrast, LLMs offer the potential for zero-shot and flexible network generation. However, two key questions are: (1) are LLM's generated networks realistic, and (2) what are risks of bias, given the importance of demographics in forming social ties? To answer these questions, we develop three prompting methods for network generation and compare the generated networks to real social networks. We find that more realistic networks are generated with "local" methods, where the LLM constructs relations for one persona at a time, compared to "global" methods that construct the entire network at once. We also find that the generated networks match real networks on many characteristics, including density, clustering, community structure, and degree. However, we find that LLMs emphasize political homophily over all other types of homophily and overestimate political homophily relative to real-world measures.


Understanding LLM-Generated Social Networks and Political Homophily

The paper "LLMs generate structurally realistic social networks but overestimate political homophily" addresses the potential and limitations of using LLMs for generating synthetic social networks. This research is situated at the intersection of social network analysis, computational social science, and NLP, exploring the realism of LLM-generated networks and their biases regarding demographic features, especially political affiliation.

Methodology Overview

The authors propose a methodology for generating social networks with LLMs, organized around three research questions:

Can LLM-generated networks match real-world social networks on structural characteristics?
Can LLMs capture demographic homophily?
How does incorporating interests affect LLM-generated networks?

To address these questions, the paper develops three prompting methods for network generation:

Global: Constructs the entire network at once.
Local: Constructs relations one persona at a time.
Sequential: Similar to Local but includes additional information about the network constructed so far.
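
To make the difference between the three methods concrete, here is a minimal sketch of how such prompts might be assembled. The prompt wording, the persona fields, and the function names (describe, global_prompt, local_prompt, sequential_prompt) are all hypothetical and are not taken from the paper; only the overall shape follows the descriptions above (whole network at once, one persona at a time, one persona at a time plus the network built so far).

```python
from typing import Dict, List, Tuple

Persona = Dict[str, str]
Edge = Tuple[int, int]


def describe(idx: int, persona: Persona) -> str:
    """Render one persona as a single roster line."""
    traits = ", ".join(f"{key}: {value}" for key, value in persona.items())
    return f"{idx}. {traits}"


def global_prompt(personas: List[Persona]) -> str:
    """Global: a single request that covers the whole network."""
    roster = "\n".join(describe(i, p) for i, p in enumerate(personas))
    return (
        "Here is a list of people:\n"
        f"{roster}\n"
        "List every pair of people who are friends, one 'i, j' pair per line."
    )


def local_prompt(personas: List[Persona], ego: int) -> str:
    """Local: one request per persona, choosing only that persona's ties."""
    others = "\n".join(
        describe(i, p) for i, p in enumerate(personas) if i != ego
    )
    return (
        f"You are this person: {describe(ego, personas[ego])}\n"
        "Here are the other people:\n"
        f"{others}\n"
        "Which of them are your friends? Answer with their numbers."
    )


def sequential_prompt(
    personas: List[Persona], ego: int, edges_so_far: List[Edge]
) -> str:
    """Sequential: the Local prompt plus the ties chosen so far."""
    if edges_so_far:
        existing = "\n".join(f"{a} - {b}" for a, b in edges_so_far)
    else:
        existing = "(none yet)"
    return (
        local_prompt(personas, ego)
        + "\nFriendships formed so far:\n"
        + existing
    )


# Hypothetical personas, for illustration only.
personas = [
    {"age": "34", "party": "Democrat"},
    {"age": "61", "party": "Republican"},
    {"age": "27", "party": "Independent"},
]
print(sequential_prompt(personas, ego=0, edges_so_far=[(1, 2)]))
```

In this sketch the Global prompt grows with the whole roster and asks for every tie in one response, while the Local and Sequential prompts ask for one persona's ties per call. That is one plausible reading of why the per-persona methods are easier for the model to handle.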

Findings on Realism and Homophily

Structural Characteristics:

Local and Sequential Prompts: These methods outperform the Global method regarding the realism of network structure. Specifically, the Local and Sequential methods produce networks that align closely with real-world densities, clustering coefficients, connectivity, and degree distributions. For example, the Sequential method managed to replicate long-tailed degree distributions observed in real social networks, a hallmark of heterogeneous connectivity.
Global Prompt: This method resulted in networks that deviated significantly from real-world structures, primarily due to low density and poor clustering. This is attributed to the overwhelming amount of information that the LLM needs to manage simultaneously.
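
The structural characteristics named above can be computed directly from an edge list. The following is a small standard-library sketch, not the paper's evaluation code. It uses the usual textbook definitions of density, average local clustering coefficient (nodes with fewer than two neighbors count as zero), and degree histogram for an undirected simple graph. The helper names and the toy graph are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(n: int, edges: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Undirected adjacency sets for nodes 0..n-1, ignoring self-loops."""
    adj: Dict[int, Set[int]] = {node: set() for node in range(n)}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def density(adj: Dict[int, Set[int]]) -> float:
    """Share of possible node pairs that are connected."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)


def average_clustering(adj: Dict[int, Set[int]]) -> float:
    """Mean local clustering coefficient over all nodes."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0: no neighbor pair can be linked
        linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += linked / (k * (k - 1) / 2)
    return total / len(adj)


def degree_histogram(adj: Dict[int, Set[int]]) -> Counter:
    """Map each degree value to the number of nodes with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


# Toy graph: a triangle (0, 1, 2), a pendant node 3, and an isolated node 4.
adj = build_adjacency(5, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(density(adj))             # 4 edges out of 10 possible pairs
print(average_clustering(adj))  # (1 + 1 + 1/3 + 0 + 0) / 5
print(degree_histogram(adj))
```

Comparing these numbers between a generated network and a reference network is the kind of check the findings above describe. A Global-style output with low density and poor clustering would show up as small values for the first two metrics.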

Demographic Homophily:

Political Homophily: Across all experiments, the LLMs demonstrated clear homophily in political affiliation, with a marked tendency to overestimate this homophily. For instance, same-party relations in LLM-generated networks were found to be 85% more frequent than expected under the Local method, and 68% under the Sequential method.
Comparative Analysis: Compared with real-world data (e.g., Halberstam and Knight, 2016), the political homophily in LLM-generated networks was markedly exaggerated. For context, same-party relations in empirical social networks were about 40.4% more frequent than expected, well below the levels the LLMs produced.
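
One way to read figures such as "85% more frequent than expected" is as an observed-to-expected ratio of same-group ties, where a ratio of 1.85 means 85% more frequent. The sketch below assumes that interpretation and assumes the expected share comes from uniform random pairing of nodes; the paper's exact estimator may differ. The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import comb
from typing import List, Sequence, Tuple


def same_group_ratio(labels: Sequence[str], edges: List[Tuple[int, int]]) -> float:
    """Observed share of same-group edges divided by the share expected
    if every pair of nodes were equally likely to be connected."""
    n = len(labels)
    if n < 2 or not edges:
        raise ValueError("need at least two nodes and one edge")

    # Expected: fraction of all node pairs that share a label.
    same_pairs = sum(comb(count, 2) for count in Counter(labels).values())
    expected = same_pairs / comb(n, 2)
    if expected == 0:
        raise ValueError("no two nodes share a label")

    # Observed: fraction of actual edges that join same-label nodes.
    observed = sum(labels[a] == labels[b] for a, b in edges) / len(edges)
    return observed / expected


# Toy data: three nodes per party, four same-party ties and one cross-party tie.
labels = ["D", "D", "D", "R", "R", "R"]
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]

ratio = same_group_ratio(labels, edges)
print(f"same-party ties are {100 * (ratio - 1):.0f}% more frequent than expected")
```

In the toy data, 6 of the 15 possible pairs are same-party (expected share 0.4) and 4 of the 5 edges are same-party (observed share 0.8), so the ratio is 2.0. A ratio of 1.0 would indicate no homophily under this definition.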

Additional Demographic Features:

The paper also discusses the impact of including additional persona interests in the network generation process. However, even with augmented interests, the overarching political homophily persisted, with interests themselves encoding political stereotypes.

Implications and Future Directions

Bias in LLM-Generated Networks:

The overestimation of political homophily by LLMs reveals inherent biases that need addressing, especially given LLMs' potential applications in scenarios requiring synthetic but realistic social networks (e.g., epidemic modeling, simulation of social phenomena). Such biases could yield unrealistic models of social behavior and of how misinformation spreads, particularly in politically polarized contexts.

Real-World Applications:

Despite these biases, the ability of LLMs to generate structurally realistic networks in a zero-shot and flexible manner holds significant promise. Machine learning approaches requiring substantial observed networks for training are frequently constrained by data availability and generalization issues. In contrast, LLMs provide a scalable alternative.

Theoretical Advances:

Examining why LLMs place disproportionate emphasis on political affiliation could advance our understanding of how computational models absorb socio-political biases. This calls for future research into LLM training datasets and the socio-political contexts they encode, and possibly for revising pre-training processes to mitigate such biases.

Conclusion

The paper substantially contributes to the dialogue on using LLMs for social network generation, demonstrating their capabilities and highlighting critical areas of improvement. Moving forward, research should focus on developing methods that effectively balance the realism and cost of network generation, address the embedded biases in LLM outputs, and enhance the diversity and variance of generated networks.

These findings underscore the nuanced understanding required to deploy LLMs in socially impactful domains, advocating for more robust, fair, and theoretically grounded approaches in computational social science.