The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics (2312.04576v1)

Published 29 Nov 2023 in cs.DL, cs.CL, cs.LG, and hep-ex

Abstract: As the Open Science approach becomes increasingly important for research, the move towards open scientific-paper reviews is making an impact on the scientific community. However, publicly available resources for research on this subject are lacking, as only a limited number of journals and conferences currently open their review process to interested parties. In this paper, we introduce the new comprehensive Open Review-Based dataset (ORB); it includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions. We gather this information from two sources: the OpenReview.net and SciPost.org websites. Given the volatile nature of this domain, the software infrastructure that we introduce to supplement the ORB dataset is designed to accommodate additional resources in the future. The ORB deliverables include (1) Python code (interfaces and implementations) to translate document data and metadata into a structured, high-level representation, (2) an ETL (Extract, Transform, Load) process to facilitate automatic updates from defined sources, and (3) data files representing the structured data. The paper presents our data architecture and an overview of the collected data along with relevant statistics. For illustration purposes, we also discuss preliminary Natural-Language-Processing-based experiments that aim to predict (1) papers' acceptance based on their textual embeddings, and (2) grading statistics, likewise inferred from embeddings. We believe ORB provides a valuable resource for researchers interested in open science and review, with our implementation easing the use of this data for further analysis and experimentation. We plan to update ORB as the field matures and to introduce new resources better suited to dedicated scientific domains such as High-Energy Physics.
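
The abstract describes Python interfaces that translate source data into a structured representation, plus an ETL process that feeds data files. The following is a minimal sketch of how such a pipeline could be organized; it is not the ORB code. Every name in it (`Review`, `Paper`, `Source`, `InMemorySource`, `run_etl`) and every field is a hypothetical choice made for illustration, and the in-memory source stands in for a real platform such as OpenReview.net or SciPost.org.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Review:
    # Hypothetical structured form of one review.
    text: str
    score: Optional[float] = None


@dataclass
class Paper:
    # Hypothetical structured form of one paper with its reviews and decision.
    paper_id: str
    title: str
    source: str
    decision: Optional[str] = None
    reviews: List[Review] = field(default_factory=list)


class Source(ABC):
    """Hypothetical interface: one subclass per review platform."""

    name: str

    @abstractmethod
    def extract(self) -> Iterable[dict]:
        """Yield raw records as the platform provides them."""

    @abstractmethod
    def transform(self, raw: dict) -> Paper:
        """Map one raw record to the shared structured representation."""


class InMemorySource(Source):
    """Toy source backed by a list, standing in for a real platform."""

    name = "demo"

    def __init__(self, records: Iterable[dict]) -> None:
        self._records = list(records)

    def extract(self) -> Iterable[dict]:
        return iter(self._records)

    def transform(self, raw: dict) -> Paper:
        return Paper(
            paper_id=str(raw["id"]),
            title=raw["title"].strip(),
            source=self.name,
            decision=raw.get("decision"),
            reviews=[
                Review(text=r["text"], score=r.get("score"))
                for r in raw.get("reviews", [])
            ],
        )


def run_etl(sources: Iterable[Source]) -> List[str]:
    """Load step: serialize each structured paper as one JSON line."""
    lines = []
    for source in sources:
        for raw in source.extract():
            lines.append(json.dumps(asdict(source.transform(raw))))
    return lines


demo = InMemorySource(
    [
        {
            "id": 1,
            "title": " A toy paper ",
            "decision": "accept",
            "reviews": [{"text": "Clear.", "score": 7}],
        }
    ]
)
rows = run_etl([demo])
```

Under this layout, supporting an additional platform would mean writing one more `Source` subclass while `run_etl` and the data classes stay unchanged, which is one way to read the abstract's goal of accommodating additional resources in the future.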

