
A Novel Method for News Article Event-Based Embedding (2405.13071v2)

Published 20 May 2024 in cs.CL, cs.AI, and cs.SI

Abstract: Embedding news articles is a crucial tool for multiple fields, such as media bias detection, identifying fake news, and making news recommendations. However, existing news embedding methods are not optimized to capture the latent context of news events. Most embedding methods rely on full-text information and neglect time-relevant embedding generation. In this paper, we propose a novel lightweight method that optimizes news embedding generation by focusing on entities and themes mentioned in articles and their historical connections to specific events. We suggest a method composed of three stages. First, we process and extract events, entities, and themes from the given news articles. Second, we generate periodic time embeddings for themes and entities by training time-separated GloVe models on current and historical data. Lastly, we concatenate the news embeddings generated by two distinct approaches: Smooth Inverse Frequency (SIF) for article-level vectors and Siamese Neural Networks for embeddings with nuanced event-related information. We leveraged over 850,000 news articles and 1,000,000 events from the GDELT project to test and evaluate our method. We conducted a comparative analysis of different news embedding generation methods for validation. Our experiments demonstrate that our approach can both improve and outperform state-of-the-art methods on shared event detection tasks.
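The abstract names Smooth Inverse Frequency (SIF) for article-level vectors and a final concatenation step. The following is a minimal sketch of those two pieces only, not the authors' implementation. It assumes tokenized articles and a plain dictionary of pre-trained word vectors. The function names, the smoothing constant `a = 1e-3`, and the random array standing in for the Siamese-network output are all illustrative assumptions.

```python
import numpy as np
from collections import Counter


def sif_embeddings(docs, word_vecs, a=1e-3):
    """SIF sketch: frequency-weighted average of word vectors,
    followed by removal of the first principal component.

    docs      -- list of token lists (hypothetical input format)
    word_vecs -- dict mapping token -> 1-D numpy array (e.g. GloVe vectors)
    a         -- smoothing constant (assumed value, not from the paper)
    """
    # Estimate unigram probabilities from the corpus itself.
    counts = Counter(t for doc in docs for t in doc if t in word_vecs)
    total = sum(counts.values())
    dim = len(next(iter(word_vecs.values())))

    emb = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        toks = [t for t in doc if t in word_vecs]
        if not toks:
            continue  # out-of-vocabulary article stays a zero vector
        # Rare words get weights close to 1, frequent words are down-weighted.
        weights = np.array([a / (a + counts[t] / total) for t in toks])
        vecs = np.array([word_vecs[t] for t in toks])
        emb[i] = weights @ vecs / len(toks)

    # Remove the projection onto the first right singular vector.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)


def concat_embeddings(sif_part, event_part):
    """Join article-level and event-level vectors row by row."""
    return np.concatenate([sif_part, event_part], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = ["election", "vote", "flood", "rescue"]
    word_vecs = {w: rng.normal(size=4) for w in vocab}
    docs = [["election", "vote"], ["flood", "rescue", "vote"], ["flood", "unknown"]]

    sif = sif_embeddings(docs, word_vecs)
    # Placeholder for the Siamese-network embeddings (assumption: 3 dimensions).
    event_part = rng.normal(size=(len(docs), 3))
    print(concat_embeddings(sif, event_part).shape)  # (3, 7)
```

Concatenation keeps the two representations separable downstream: the first block of columns carries the SIF article vector and the remaining columns carry the event-related embedding.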

Authors (4)
  1. Koren Ishlach (1 paper)
  2. Itzhak Ben-David (1 paper)
  3. Michael Fire (37 papers)
  4. Lior Rokach (63 papers)