Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations
Abstract: Scientists across disciplines often use data from the internet to conduct research, generating valuable insights about human behavior. However, as generative AI relying on massive text corpora becomes increasingly valuable, platforms have greatly restricted access to data through official channels. As a result, researchers will likely turn to web scraping more often to collect data, which introduces new challenges and concerns. This paper proposes a comprehensive framework for web scraping in social science research for U.S.-based researchers, examining the legal, ethical, institutional, and scientific factors that they should consider when scraping the web. We present an overview of the current regulatory environment affecting when and how researchers can access, collect, store, and share data via scraping. We then offer recommendations for conducting scraping in a scientifically legitimate and ethical manner. We aim to equip researchers with the information they need to mitigate risks and maximize the impact of their research amid this evolving data access landscape.