Web Scraping Basics for Social Work Researchers: Methods and Applications
Many websites offer Application Programming Interfaces (APIs) that provide structured access to their data. Using an API is often more straightforward and aligns with the data provider’s terms of service. However, not all websites offer APIs, and those that do may restrict the data available, which can make direct web scraping necessary. Web scraping is a tool that enables the collection of large-scale data from online sources.
Understanding Web Scraping
Web scraping is the automated extraction of information from websites. To scrape effectively, you need to understand the following elements:
- HTML Structure: Web pages are built using HTML, which organizes content into elements like headers, paragraphs, and tables. Understanding this structure is essential for extracting relevant data.
- Static vs. Dynamic Pages: Static pages are straightforward to scrape using the BeautifulSoup module, while dynamic pages (powered by JavaScript) may require the Selenium module to simulate user interactions.
- BeautifulSoup: For parsing HTML.
- Selenium: For interacting with dynamic websites.
- Scrapy: For building large-scale scraping projects.
- Special Cases: Handling login requirements and navigating CAPTCHA systems.
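To make the HTML structure point concrete, here is a minimal sketch of pulling rows out of a table on a static page. It uses Python's built-in html.parser module rather than BeautifulSoup so that it runs without extra installs; BeautifulSoup offers a more convenient interface for the same task. The HTML snippet, the clinic names, and the helper names (TableParser, parse_table) are hypothetical and only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical static page: a tiny provider directory (not a real website).
SAMPLE_HTML = """
<html><body>
  <h1>Provider Directory</h1>
  <table id="providers">
    <tr><th>Name</th><th>Service</th><th>County</th></tr>
    <tr><td>Example Clinic A</td><td>Counseling</td><td>North</td></tr>
    <tr><td>Example Clinic B</td><td>Food assistance</td><td>South</td></tr>
  </table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Trim surrounding whitespace once the cell is complete.
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def parse_table(html):
    """Return table rows as dictionaries keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
for record in records:
    print(record)
```

On a real site, the HTML would first be downloaded and the tag names would depend on how that page is built, so inspecting the page source in a browser is usually the first step.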
Application
Here, I share examples of studies that use web scraping for research on service access, public opinion, and the non-profit sector and workforce. These studies provide insights that traditional data collection methods may not capture.
Applications in Social/Health Services Research
Web scraping can be employed to gather information on the availability and distribution of social/health services. By extracting data from the websites of service providers, researchers can analyze factors such as geographic coverage, service types, and accessibility. For instance, scraping data from provider directories can reveal disparities in service availability across different regions, informing policy decisions and resource allocation. Further, some directories include review data that researchers can use to study client/patient satisfaction.
- Hu, D., Liu, C. M. H., Hamdy, R., Cziner, M., Fung, M., Dobbs, S., … & Broniatowski, D. A. (2021). Questioning the Yelp Effect: mixed methods analysis of web-based reviews of urgent cares. Journal of Medical Internet Research, 23(10), e29406. – Scraped Google Reviews on Urgent Care Facilities
- Chandrasekaran, R., Bapat, P., Jeripity Venkata, P., & Moustakas, E. (2023). Do Patients Assess Physicians Differently in Video Visits as Compared with In-Person Visits? Insights from Text-Mining Online Physician Reviews. Telemedicine and e-Health, 29(10), 1557-1565. – Scraped Zocdoc Reviews on Physicians
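Once directory data has been scraped, a regional comparison can start with simple counts. The sketch below is an illustration only, not the method of the studies cited above: the records, field names, and helper functions are hypothetical, and a real analysis would also need population denominators and careful geocoding.

```python
from collections import Counter, defaultdict

# Hypothetical records, shaped like rows scraped from a provider directory.
providers = [
    {"name": "Example Clinic A", "service": "Counseling", "county": "North"},
    {"name": "Example Clinic B", "service": "Counseling", "county": "North"},
    {"name": "Example Shelter C", "service": "Housing", "county": "North"},
    {"name": "Example Clinic D", "service": "Counseling", "county": "South"},
]


def count_by(records, key):
    """Count how many records share each value of the given field."""
    return Counter(record[key] for record in records)


def services_by_county(records):
    """Map each county to the set of service types listed there."""
    coverage = defaultdict(set)
    for record in records:
        coverage[record["county"]].add(record["service"])
    return dict(coverage)


def counties_missing(records, service):
    """List counties (sorted) with no provider of the given service type."""
    coverage = services_by_county(records)
    return sorted(c for c, services in coverage.items() if service not in services)


print(count_by(providers, "county"))
print(counties_missing(providers, "Housing"))
```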
Applications in Public Opinion Research
In public opinion research, web scraping allows for the collection of user-generated content from social media platforms, forums, and blogs. Analyzing this data can provide insights into public perceptions and attitudes toward social issues. For example, scraping comments from online discussions about mental health can help identify prevailing sentiments and stigma, guiding public health interventions.
- Qureshi, S. P., Judson, E., Cummins, C., Gadoud, A., Sanders, K., & Doherty, M. (2024). Resisting the (re-)medicalisation of dying and grief in the post-digital age: Natural language processing and qualitative analysis of data from internet support forums. Social Science & Medicine (1982), 348(116517), 116517.
- Ramachandran, S., Brown, L., & Ring, D. (2022). Tones and themes in Reddits posts discussing the opioid epidemic. Journal of Addictive Diseases, 40(4), 552–558.
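As a first pass over scraped discussion text, researchers sometimes count how many posts mention terms of interest before moving to more advanced methods. The sketch below shows that idea under simple assumptions: the comments and the keyword list are invented for illustration, matching is exact (so "helped" does not count as "help"), and this is far simpler than the natural language processing used in the studies cited above.

```python
import re
from collections import Counter

# Hypothetical comments, standing in for text scraped from a public forum.
comments = [
    "Therapy helped me, but I was afraid people would judge me.",
    "People still judge anyone who asks for help.",
    "Asking for help is a sign of strength.",
]

# Hypothetical keyword list chosen by the researcher.
KEYWORDS = {"judge", "afraid", "help", "strength"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_document_frequency(texts, keywords):
    """Count how many texts mention each keyword at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)) & keywords)
    return counts


frequency = keyword_document_frequency(comments, KEYWORDS)
for word, n in sorted(frequency.items()):
    print(word, n)
```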
Applications in Non-profit/Workforce Research
By scraping job boards and company career pages, researchers can collect data on job titles, descriptions, required qualifications, and locations. This information helps identify in-demand skills, emerging job roles, and regional employment patterns. Such analyses are crucial for workforce development and policy-making.
- Krasna, H., Czabanowska, K., Beck, A., Cushman, L. F., & Leider, J. P. (2021). Labour market competition for public health graduates in the United States: A comparison of workforce taxonomies with job postings before and during the COVID‐19 pandemic. The International Journal of Health Planning and Management, 36(S1), 151-167. – utilizing job posting data
- Krasna, H. (2024). Employer demand and desired skills for Public Health graduates: Evidence from job postings. American Journal of Public Health, 114(12), 1388–1393. https://doi.org/10.2105/AJPH.2024.307834
Scraping data from crowdfunding platforms like GoFundMe allows researchers to examine fundraising trends, campaign success factors, and community support dynamics. This approach aids in understanding how individuals and groups mobilize resources for various causes, informing strategies for effective fundraising and community engagement.
- Igra, M., Kenworthy, N., Luchsinger, C., & Jung, J.-K. (2021). Crowdfunding as a response to COVID-19: Increasing inequities at a time of crisis. Social Science & Medicine (1982), 282(114105), 114105. https://doi.org/10.1016/j.socscimed.2021.114105
- Silver, E. R., Truong, H. Q., Ostvar, S., Hur, C., & Tatonetti, N. P. (2020). Association of Neighborhood Deprivation Index With Success in Cancer Care Crowdfunding. JAMA Network Open, 3(12), e2026946.
Ethical Considerations
When engaging in web scraping, it’s essential to consider the legal and ethical implications.
- Reviewing a website’s terms of service and robots.txt file can provide guidance on permissible data extraction.
- Additionally, understanding the fair use doctrine is important, especially when scraping copyrighted content. Fair use considerations include the purpose of use, the nature of the work, the amount used, and the effect on the market value of the original work. For instance, using data for educational or research purposes may fall under fair use, but it’s advisable to consult legal experts/librarians when in doubt.
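A robots.txt check can be automated with Python's standard urllib.robotparser module. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the "research-bot" user agent name, and the example.org URLs are placeholders. For a live site, RobotFileParser can instead be pointed at the site's robots.txt URL with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real website).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Create a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return build_parser(robots_txt).can_fetch(user_agent, url)


rules = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/directory/page1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "research-bot", "https://example.org/private/records"))
# Seconds the site asks crawlers to wait between requests, if it states any.
print(rules.crawl_delay("research-bot"))
```

Passing this check does not replace reading the terms of service; robots.txt only expresses the site's crawling preferences.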
The University of Michigan provides a copyright guide on web scraping: https://ai.umich.edu/blog-posts/grabbing-data-from-the-web-our-copyright-guide-outlines-what-you-need-to-know-about-web-scraping-web-crawling-and-apis/
The Fair Use Checklist by Columbia University can also help you determine whether your web scraping qualifies as fair use, based on 1) purpose, 2) nature, 3) amount, and 4) effect: https://copyright.columbia.edu/basics/fair-use/fair-use-checklist.html
In general, research is considered a nonprofit purpose (as long as you don’t have any conflict of interest), and scraping for research is more likely to qualify as fair use when 1) the data are publicly available, 2) you scrape only a small portion of the data, and 3) your work will not affect the market for the original.
For a summary of legal, ethical, institutional, and scientific considerations for web scraping in research, see the article by Brown et al. (2024): https://arxiv.org/abs/2410.23432
Web-scraping Learning Resources
To develop web scraping skills, consider the following courses from Udemy:
- The Ultimate Web Scraping with Python Bootcamp: This course covers the fundamentals of web scraping using Python, including handling various challenges encountered during the process.
- Web Scraping Course in Python (BS4, Selenium, and Scrapy): This course offers comprehensive training on using BeautifulSoup, Selenium, and Scrapy for web scraping tasks.