Paths of A Million People: Extracting Life Trajectories from Wikipedia (2406.00032v2)

Published 25 May 2024 in cs.CL, cs.AI, and cs.IR

Abstract: The life trajectories of notable people have been studied to pinpoint the times and places of significant events such as birth, death, education, marriage, competition, work, speeches, scientific discoveries, artistic achievements, and battles. Understanding how these individuals interact with others provides valuable insights for broader research into human dynamics. However, the scarcity of trajectory data in terms of volume, density, and inter-person interactions, limits relevant studies from being comprehensive and interactive. We mine millions of biography pages from Wikipedia and tackle the generalization problem stemming from the variety and heterogeneity of the trajectory descriptions. Our ensemble model COSMOS, which combines the idea of semi-supervised learning and contrastive learning, achieves an F1 score of 85.95%. For this task, we also create a hand-curated dataset, WikiLifeTrajectory, consisting of 8,852 (person, time, location) triplets as ground truth. Besides, we perform an empirical analysis on the trajectories of 8,272 historians to demonstrate the validity of the extracted results. To facilitate the research on trajectory extractions and help the analytical studies to construct grand narratives, we make our code, the million-level extracted trajectories, and the WikiLifeTrajectory dataset publicly available.

Summary

The paper introduces the WikiLifeTrajectory dataset comprising 8,852 trajectory triplets serving as a ground truth for future research.
The COSMOS model integrates CNN, BERT, supervised contrastive learning, and semi-supervised learning to achieve an 85.95% F1 score.
Empirical analysis of 8,272 historians’ trajectories validates the model’s accuracy and highlights its potential for mobility and sociopolitical studies.

Extracting Life Trajectories from Wikipedia

In their paper, Zhang et al. propose an approach to extracting the life trajectories of notable individuals using Wikipedia as a data source. The scarcity of comprehensive trajectory data has historically limited studies on human dynamics: existing datasets often lack the volume, density, and inter-person interactions necessary for detailed analysis. This research addresses these limitations by mining biography pages from Wikipedia and building the COSMOS model, which integrates semi-supervised learning and contrastive learning to improve extraction accuracy.

Key Contributions

Creation of the Dataset:
- A significant contribution of this work is the introduction of the WikiLifeTrajectory dataset, a hand-curated collection of 8,852 trajectory triplets. This dataset serves as a ground truth, facilitating future research in trajectory extraction tasks.
COSMOS Model:
- The COSMOS (COntrastive learning and Semi-supervised learning MOdel for extracting Spatio-temporal life trajectory) model combines supervised contrastive learning with semi-supervised learning. This dual approach improves the model's generalizability and accuracy. COSMOS achieved an F1 score of 85.95%, outperforming several baselines, including CNN, Bi-LSTM, BERT, and RoBERTa.
Empirical Analysis:
- The authors also perform an empirical analysis of 8,272 historians' trajectories extracted using COSMOS, demonstrating the practical utility and validity of their method.

Methodology

Data Extraction and Annotation

Wikipedia biography pages serve as the primary data source. The researchers use Named Entity Recognition (NER), implemented with spaCy, to extract entities, and then apply a parse-tree-based distance metric to assemble candidate (person, time, location) triplets. This step is meant to cover as many of the trajectories mentioned in Wikipedia as possible; preliminary tests reported an 85% inclusion rate.
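As a rough illustration of the pairing step, the sketch below measures the distance between two entities as the number of edges separating them in a dependency tree, then attaches each location to its nearest time expression. The toy sentence, the hand-written `heads` array, the helper names, and the nearest-time rule are all assumptions made for this example; the paper's exact metric and its spaCy pipeline are not reproduced here.

```python
def path_to_root(heads, i):
    """Return token indices from i up to the root (the root is its own head)."""
    path = [i]
    while heads[i] != i:
        i = heads[i]
        path.append(i)
    return path


def tree_distance(heads, a, b):
    """Count the edges on the path between tokens a and b in the tree."""
    depth_from_b = {node: d for d, node in enumerate(path_to_root(heads, b))}
    for depth_from_a, node in enumerate(path_to_root(heads, a)):
        if node in depth_from_b:  # first shared ancestor on the way up from a
            return depth_from_a + depth_from_b[node]
    raise ValueError("tokens are not in the same tree")


def candidate_triplets(person, tokens, heads, times, locations):
    """Pair each location with the time entity closest to it in the tree."""
    triplets = []
    for loc in locations:
        nearest = min(times, key=lambda t: tree_distance(heads, t, loc))
        triplets.append((person, tokens[nearest], tokens[loc]))
    return triplets


# Hypothetical sentence with a hand-written dependency tree (not parser output).
tokens = ["She", "studied", "in", "Paris", "in", "1921",
          "and", "moved", "to", "Vienna", "in", "1925"]
heads = [1, 1, 1, 2, 1, 4, 1, 1, 7, 8, 7, 10]  # heads[i] = parent of token i
times = [5, 11]      # token indices an NER step might tag as dates
locations = [3, 9]   # token indices an NER step might tag as places

result = candidate_triplets("Example Person", tokens, heads, times, locations)
print(result)
# [('Example Person', '1921', 'Paris'), ('Example Person', '1925', 'Vienna')]
```

In a real pipeline the `heads` array and entity indices would come from a parser and NER model rather than being written by hand, and the resulting candidates would still need to be classified as valid or invalid trajectories.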

For annotation, the authors combine human annotators with GPT-3.5. Human annotators handle the bulk of the work, while GPT-3.5 is used for larger, more verbose entries, suggesting that a balance of manual and automated effort can yield high-quality data.

Model Architecture

The COSMOS framework uses a hybrid structure:

CNN and BERT: The model learns representations through parallel CNN and BERT architectures. While CNN effectively captures local semantics, BERT excels at understanding contextual nuances, especially in longer texts.
Supervised Contrastive Learning: To improve generalization, supervised contrastive learning helps the model distinguish between similar and dissimilar examples.
Semi-supervised Learning: Pseudo-labeling allows the model to utilize a significant amount of unlabeled data, enhancing its robustness against diverse contexts.
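The two training ideas above can be sketched in a few lines. The first function follows the commonly used supervised contrastive (SupCon) loss, which pulls same-label embeddings together and pushes different-label embeddings apart; the second keeps only confident predictions on unlabeled data as pseudo-labels. This is a minimal sketch in plain Python, not the authors' implementation: the function names, the temperature, the 0.9 threshold, and the binary valid/invalid labeling are assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def select_pseudo_labels(probabilities, threshold=0.9):
    """Keep unlabeled examples whose predicted label is confident enough.

    `probabilities` holds the model's P(valid trajectory) for each example.
    Returns (index, label) pairs; uncertain examples are left out.
    """
    selected = []
    for index, p_valid in enumerate(probabilities):
        if p_valid >= threshold:
            selected.append((index, 1))
        elif p_valid <= 1 - threshold:
            selected.append((index, 0))
    return selected


# Toy embeddings: the first two point one way, the last two another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(embeddings, [1, 1, 0, 0])    # labels match the geometry
scrambled = supcon_loss(embeddings, [1, 0, 1, 0])  # labels cut across it
print(aligned < scrambled)  # True

print(select_pseudo_labels([0.97, 0.55, 0.03, 0.91]))
# [(0, 1), (2, 0), (3, 1)]
```

The loss is small when examples with the same label already sit close together and large when they do not, which is the signal that encourages label-consistent representations. The pseudo-labeled examples would then be added to the training set for another round of training.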

Experimental Results

The authors compared COSMOS against several baselines: CNN, Bi-LSTM, BERT, RoBERTa, and GPT-3.5. COSMOS outperformed all of them, achieving an F1 score of 85.95% on the "Representative" test set. Notably, while GPT-3.5 achieved high recall, its low precision (56.53%) introduced substantial noise, suggesting that general-purpose LLMs may not yet be well suited to this specialized task.
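To make the precision and recall trade-off concrete, the sketch below scores a set of predicted triplets against a gold set. The triplets are invented for illustration and exact set matching is an assumption; the paper's own evaluation protocol may differ.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted triplets against gold triplets by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted (person, time, location) triplets.
gold = [("Example Person", "1921", "Paris"),
        ("Example Person", "1925", "Vienna")]
predicted = gold + [("Example Person", "1921", "Vienna"),
                    ("Example Person", "1925", "Paris")]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=1.00 f1=0.67
```

Here the extractor finds every true triplet (perfect recall) but half of its output is wrong, so F1 drops to about 0.67. This is the same pattern described for GPT-3.5 above: high recall cannot compensate for noisy predictions.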

Ablation Studies

To gauge the importance of each component, ablation studies were conducted. Removing semi-supervised learning and contrastive learning significantly degraded performance, confirming that both are critical for achieving high accuracy and generalization.

Implications and Future Work

Practical Implications

The presented work has substantial practical implications. Beyond cultural and historical analysis, the large-scale, fine-grained dataset can aid studies in mobility analysis, sociopolitical dynamics, and interaction networks. As the trajectories encompass various life events, the dataset could potentially enable new insights into the mechanisms of human mobility and social interactions.

Theoretical Implications

From a theoretical standpoint, combining semi-supervised and contrastive learning in the COSMOS model is a useful step for NLP extraction tasks. The method's success underscores the value of these techniques in tackling the generalization issues inherent in diverse text corpora.

Future Directions

Future work could expand the framework to non-English Wikipedia pages, addressing the identified limitation of potential bias towards the English-speaking world. Additionally, incorporating more sophisticated extraction algorithms or extending the task to include identifying the types of trajectories (e.g., educational, professional) could further refine the dataset's utility.

Conclusion

Zhang et al.'s work is a meaningful contribution to both NLP and the study of human dynamics. By leveraging the dense and diverse data in Wikipedia and integrating state-of-the-art learning techniques, they present a robust tool for extracting and analyzing life trajectories. This research advances the methodological toolkit and lays the groundwork for future studies on human mobility and interaction.
