Web-Scraping with Python for Social Work Research

Web scraping is a powerful tool for social work researchers, enabling the collection of large-scale data from online sources. This method is particularly useful for studies on service access, public opinion, and non-profit/workforce research, providing insights that traditional data collection methods may not capture.

Applications

Applications in Social/Health Services Research

Web scraping can be employed to gather information on the availability and distribution of social/health services. By extracting data from the websites of service providers, researchers can analyze factors such as geographic coverage, service types, and accessibility. For instance, scraping data from provider directories can reveal disparities in service availability across different regions, informing policy decisions and resource allocation. Further, some directories include review data that researchers can use to study client/patient satisfaction.
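As a minimal sketch, the following pulls provider names and regions from a hypothetical directory page using requests and BeautifulSoup. The URL and the div/span classes are placeholders; inspect the real page’s HTML to find the right selectors.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical directory URL and CSS classes -- adjust to the real page structure.
    URL = "https://example.org/provider-directory"

    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    providers = []
    # Assume each provider sits in a <div class="provider"> with a name and a region field.
    for entry in soup.find_all("div", class_="provider"):
        name = entry.find("h3").get_text(strip=True)
        region = entry.find("span", class_="region").get_text(strip=True)
        providers.append({"name": name, "region": region})

    print(f"Collected {len(providers)} providers")

Records collected this way can then be joined with census or administrative data to map coverage gaps across regions.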

Applications in Public Opinion Research

In public opinion research, web scraping allows for the collection of user-generated content from social media platforms, forums, and blogs. Analyzing this data can provide insights into public perceptions and attitudes toward social issues. For example, scraping comments from online discussions about mental health can help identify prevailing sentiments and stigma, guiding public health interventions.
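As a rough sketch, assuming a hypothetical forum thread where each comment sits in a div with class "comment-body", comments can be collected and tallied for stigma-related keywords as a crude first pass before formal sentiment analysis:

    import requests
    from bs4 import BeautifulSoup
    from collections import Counter

    # Hypothetical thread URL and comment markup -- placeholders for illustration.
    URL = "https://example-forum.org/threads/mental-health"

    soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")
    comments = [c.get_text(strip=True) for c in soup.find_all("div", class_="comment-body")]

    # Count how many comments mention each keyword (case-insensitive).
    keywords = ["stigma", "support", "therapy", "shame"]
    counts = Counter()
    for text in comments:
        for word in keywords:
            if word in text.lower():
                counts[word] += 1
    print(counts)

Note that most social media platforms restrict automated collection in their terms of service, so check the platform’s policy (and any official API) first.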

Applications in Non-profit/Workforce Research

By scraping job boards and company career pages, researchers can collect data on job titles, descriptions, required qualifications, and locations. This information helps identify in-demand skills, emerging job roles, and regional employment patterns. Such analyses are crucial for workforce development and policy-making.
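A sketch of this workflow, assuming a hypothetical job board with numbered result pages and the CSS classes shown, collects titles and locations into a CSV file:

    import csv
    import time
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical job board with numbered result pages -- the structure is assumed.
    BASE = "https://example-jobs.org/search?q=social+worker&page={page}"

    rows = []
    for page in range(1, 4):  # first three result pages
        html = requests.get(BASE.format(page=page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for posting in soup.find_all("div", class_="job-card"):
            rows.append({
                "title": posting.find("h2").get_text(strip=True),
                "location": posting.find("span", class_="location").get_text(strip=True),
            })
        time.sleep(2)  # be polite: pause between requests

    with open("postings.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "location"])
        writer.writeheader()
        writer.writerows(rows)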

Scraping data from crowdfunding platforms like GoFundMe allows researchers to examine fundraising trends, campaign success factors, and community support dynamics. This approach aids in understanding how individuals and groups mobilize resources for various causes, informing strategies for effective fundraising and community engagement.

Understanding Web Scraping

Web Scraping & Browser Control with BeautifulSoup and Selenium

Web scraping is the automated extraction of information from websites. To scrape effectively, you need to understand the following elements:

  • HTML Structure: Web pages are built using HTML, which organizes content into elements like headers, paragraphs, and tables. Understanding this structure is essential for extracting relevant data.
  • Static vs. Dynamic Pages: Static pages are straightforward to scrape with the BeautifulSoup library, while dynamic pages (rendered by JavaScript) may require Selenium to simulate user interactions; see the Selenium sketch after this list. Common tools include:
    • BeautifulSoup: For parsing HTML.
    • Selenium: For interacting with dynamic websites.
    • Scrapy: For building large-scale scraping projects.
  • Special Cases: Handling login requirements and navigating CAPTCHA systems.
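
For dynamic pages, a minimal Selenium sketch (the URL and the .listing selector are placeholders) launches a browser, waits for JavaScript-rendered content to appear, and then extracts it:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # Selenium 4.6+ manages the driver binary automatically
    try:
        driver.get("https://example.org/dynamic-listings")
        # Wait up to 10 seconds for the JavaScript-rendered results to load.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".listing"))
        )
        for item in driver.find_elements(By.CSS_SELECTOR, ".listing"):
            print(item.text)
    finally:
        driver.quit()

The explicit wait is the key difference from static scraping: with BeautifulSoup alone, the initial HTML often does not yet contain the content rendered by JavaScript.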

APIs vs. Non-API Web Scraping

Many websites offer Application Programming Interfaces (APIs) that provide structured access to their data. Utilizing APIs is often more straightforward and aligns with the data provider’s terms of service. However, not all websites offer APIs, or they may have limitations that necessitate direct web scraping. In such cases, it’s crucial to approach data extraction responsibly and ethically.
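When an API is available, a single HTTP request returns structured JSON with no HTML parsing. The endpoint, parameters, response fields, and token below are all hypothetical; consult the provider’s API documentation for the real ones:

    import requests

    response = requests.get(
        "https://api.example.org/v1/services",                # hypothetical endpoint
        params={"state": "MI", "category": "mental-health"},  # hypothetical filters
        headers={"Authorization": "Bearer YOUR_API_KEY"},     # placeholder credential
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    print(len(data["results"]))  # assumes the response wraps records in "results"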

Ethical Considerations

When engaging in web scraping, it’s essential to consider the legal and ethical implications.

  • Reviewing a website’s terms of service and robots.txt file can provide guidance on permissible data extraction (a programmatic check is sketched after this list).
  • Additionally, understanding the fair use doctrine is important, especially when scraping copyrighted content. Fair use considerations include the purpose of use, the nature of the work, the amount used, and the effect on the market value of the original work. For instance, using data for educational or research purposes may fall under fair use, but it’s advisable to consult legal experts/librarians when in doubt.
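
The robots.txt check mentioned in the first point can be automated with Python’s standard library (the URLs below are placeholders):

    from urllib.robotparser import RobotFileParser

    # Parse the site's robots.txt and ask whether a given page may be fetched.
    rp = RobotFileParser()
    rp.set_url("https://example.org/robots.txt")
    rp.read()

    target = "https://example.org/provider-directory"
    print("allowed" if rp.can_fetch("*", target) else "disallowed", "->", target)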

The University of Michigan provides a copyright guide on web scraping: https://ai.umich.edu/blog-posts/grabbing-data-from-the-web-our-copyright-guide-outlines-what-you-need-to-know-about-web-scraping-web-crawling-and-apis/

Columbia University’s Fair Use Checklist is also useful for determining whether your web scraping qualifies as fair use, weighing 1) purpose, 2) nature, 3) amount, and 4) effect: https://copyright.columbia.edu/basics/fair-use/fair-use-checklist.html

In general, conducting research is considered a nonprofit purpose (as long as you don’t have any conflict of interest), and scraping is more likely to qualify as fair use when 1) the data is publicly available, 2) you scrape only a small portion of the data, and 3) your work will not affect the market for the original.

Web-scraping Learning Resources

To develop web scraping skills, consider online courses (for example, on Udemy) covering BeautifulSoup, Selenium, and Scrapy.
