Web Scraping Basics for Social Work Researchers: Methods and Applications

Understanding the use of APIs for data collection

Many websites offer Application Programming Interfaces (APIs) that provide structured access to their data. Using an API is often more straightforward than scraping and aligns with the data provider’s terms of service. However, not all websites offer APIs, and those that do may limit which data are available, making direct web scraping necessary. Web scraping is a tool that enables the collection of large-scale data from online sources.
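
As a minimal sketch, the snippet below queries a hypothetical JSON API with Python's requests library; the endpoint URL and parameter names are placeholders, and a real API will document its own endpoints, parameters, and authentication requirements.

    import requests

    # Hypothetical endpoint and parameters: replace with the values documented
    # by the API you are using (many APIs also require an API key or token).
    BASE_URL = "https://api.example.org/v1/providers"
    params = {"state": "MI", "page": 1}

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()      # stop early if the request failed

    data = response.json()           # most APIs return structured JSON
    for record in data.get("results", []):
        print(record)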

Understanding Web Scraping

Web Scraping & Browser Control with BeautifulSoup and Selenium

Understanding the use of web scraping for digital data collection

Web scraping is the automated extraction of information from websites. To scrape effectively, you need to understand the following elements:
  • HTML Structure: Web pages are built using HTML, which organizes content into elements like headers, paragraphs, and tables. Understanding this structure is essential for extracting relevant data.
  • Static vs. Dynamic Pages: Static pages are straightforward to scrape with the BeautifulSoup library, while dynamic pages (rendered with JavaScript) may require the Selenium library to simulate user interactions (see the sketches after this list).
    • BeautifulSoup: For parsing HTML.
    • Selenium: For interacting with dynamic websites.
    • Scrapy: For building large-scale scraping projects.
  • Special Cases: Handling login requirements and navigating CAPTCHA systems.
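
The sketches below illustrate both cases. The URLs and element selectors are placeholders that you would replace with the structure of the actual page you are studying. The first sketch uses requests and BeautifulSoup to parse a static page:

    import requests
    from bs4 import BeautifulSoup

    # Fetch a (hypothetical) static page and parse its HTML.
    response = requests.get("https://www.example.org/services", timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract the text of every <h2> heading and every table cell.
    for heading in soup.find_all("h2"):
        print(heading.get_text(strip=True))
    for cell in soup.select("table td"):
        print(cell.get_text(strip=True))

The second sketch uses Selenium (version 4 or later) to load a JavaScript-rendered page in a real browser before extracting its content:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Launch a browser session; recent Selenium versions manage the driver automatically.
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.example.org/dynamic-listings")
        # Collect the text of elements matching a (hypothetical) CSS class.
        for item in driver.find_elements(By.CSS_SELECTOR, ".listing"):
            print(item.text)
    finally:
        driver.quit()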

Applications

Here, I share examples of using web scraping in studies of service access, public opinion, and non-profit/workforce research; these applications can provide insights that traditional data collection methods may not capture.

Applications in Social/Health Services Research

Web scraping can be employed to gather information on the availability and distribution of social/health services. By extracting data from the websites of service providers, researchers can analyze factors such as geographic coverage, service types, and accessibility. For instance, scraping data from provider directories can reveal disparities in service availability across different regions, informing policy decisions and resource allocation. Further, some directories include review data that researchers can use to study client and patient satisfaction.
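
As one illustration, if a provider directory publishes its listings in an HTML table, pandas can read that table directly into a data frame for analysis. The URL and column name below are hypothetical, and pandas.read_html requires an HTML parser such as lxml to be installed.

    from io import StringIO

    import pandas as pd
    import requests

    # Hypothetical directory page that lists providers in an HTML table.
    url = "https://www.example-directory.org/providers?state=MI"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # read_html() returns one DataFrame per table found on the page.
    tables = pd.read_html(StringIO(response.text))
    providers = tables[0]

    # Example analysis: count providers by county to look for coverage gaps.
    print(providers["County"].value_counts())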

Applications in Public Opinion Research

In public opinion research, web scraping allows for the collection of user-generated content from social media platforms, forums, and blogs. Analyzing this data can provide insights into public perceptions and attitudes toward social issues. For example, scraping comments from online discussions about mental health can help identify prevailing sentiments and stigma, guiding public health interventions.
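
A minimal sketch of that kind of collection is shown below, assuming a hypothetical public forum whose discussion pages mark each comment with a "comment" CSS class. Real platforms differ, and many restrict automated collection in their terms of service, so check those terms (and prefer an official API when one exists) before scraping.

    import csv
    import time

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical paginated discussion thread about mental health.
    BASE_URL = "https://forum.example.org/t/mental-health?page={page}"
    comments = []

    for page in range(1, 4):                 # first three pages only
        response = requests.get(BASE_URL.format(page=page), timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for node in soup.select(".comment"): # hypothetical class name
            comments.append(node.get_text(strip=True))
        time.sleep(2)                        # pause politely between requests

    # Save the collected comments for later qualitative or sentiment analysis.
    with open("mental_health_comments.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["comment_text"])
        writer.writerows([c] for c in comments)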

Applications in Non-profit/Workforce Research

By scraping job boards and company career pages, researchers can collect data on job titles, descriptions, required qualifications, and locations. This information helps identify in-demand skills, emerging job roles, and regional employment patterns. Such analyses are crucial for workforce development and policy-making.
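
The sketch below shows how such fields might be pulled into a structured dataset from a hypothetical job board; the URL and CSS selectors are assumptions and would need to match the markup of the actual site.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical job board search page for social work positions.
    url = "https://jobs.example.org/search?q=social+worker"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    postings = []
    for card in soup.select(".job-card"):    # hypothetical listing element
        postings.append({
            "title": card.select_one(".job-title").get_text(strip=True),
            "location": card.select_one(".job-location").get_text(strip=True),
            "description": card.select_one(".job-description").get_text(strip=True),
        })

    # Save as a tidy dataset for later coding and analysis.
    pd.DataFrame(postings).to_csv("job_postings.csv", index=False)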

Scraping data from crowdfunding platforms like GoFundMe allows researchers to examine fundraising trends, campaign success factors, and community support dynamics. This approach aids in understanding how individuals and groups mobilize resources for various causes, informing strategies for effective fundraising and community engagement.

Ethical Considerations

When engaging in web scraping, it’s essential to consider the legal and ethical implications.

  • Reviewing a website’s terms of service and robots.txt file can provide guidance on permissible data extraction (a minimal robots.txt check is sketched after this list).
  • Additionally, understanding the fair use doctrine is important, especially when scraping copyrighted content. Fair use considerations include the purpose of use, the nature of the work, the amount used, and the effect on the market value of the original work. For instance, using data for educational or research purposes may fall under fair use, but it’s advisable to consult legal experts/librarians when in doubt.
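
Python's standard library includes a robots.txt parser that can check whether a given path may be fetched. The sketch below assumes a hypothetical site and user-agent string.

    from urllib.robotparser import RobotFileParser

    # Load the site's robots.txt file (hypothetical site shown here).
    parser = RobotFileParser()
    parser.set_url("https://www.example.org/robots.txt")
    parser.read()

    # Check whether our scraper (identified by its user-agent) may fetch a page.
    user_agent = "MyResearchScraper"         # hypothetical user-agent string
    page_url = "https://www.example.org/providers/page1.html"
    print(parser.can_fetch(user_agent, page_url))   # True if allowed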

The University of Michigan provides a copyright guide on web scraping: https://ai.umich.edu/blog-posts/grabbing-data-from-the-web-our-copyright-guide-outlines-what-you-need-to-know-about-web-scraping-web-crawling-and-apis/

The Fair Use Checklist by Columbia University is also useful for determining whether your web scraping qualifies as fair use, based on 1) purpose, 2) nature, 3) amount, and 4) effect: https://copyright.columbia.edu/basics/fair-use/fair-use-checklist.html

In general, conducting research is considered a nonprofit purpose (as long as you have no conflict of interest), and web scraping is more likely to qualify as fair use when 1) the data are publicly available, 2) you scrape only a small portion of the data, and 3) your work will not affect the market for the original.

See this article by Brown et al. (2024) for a summary of the legal, ethical, institutional, and scientific considerations of web scraping for research: https://arxiv.org/abs/2410.23432

Web Scraping Learning Resources

To develop web scraping skills, consider taking a web scraping course on Udemy.
