[Python] How to run selenium in Google Colab

Selenium and BeautifulSoup

Selenium and BeautifulSoup are essential tools for web scraping in Python, and which one you need depends on the type of web page you want to scrape. Selenium drives a real browser, so it can handle dynamic pages that require loading more data or clicking specific buttons, while BeautifulSoup parses the HTML of static pages.

| Feature | Selenium | BeautifulSoup |
| --- | --- | --- |
| Purpose | Web browser automation tool designed for automated testing | Python library built explicitly for parsing structured HTML and XML data |
| Best suited for | Complex projects, such as interacting with web pages like a user would | Smaller projects, like parsing HTML and XML documents |
| Dynamic content handling | Yes | No |
| Limitations | Set-up is complex; uses more resources than BeautifulSoup; can become slow when scaling up an application | Can't interact with web pages like a human user; you will need a different module to scrape JavaScript-rendered web pages, since BeautifulSoup only lets you navigate through HTML or XML files |
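
For a static page, BeautifulSoup alone is enough. Here is a minimal sketch of the usual requests-plus-BeautifulSoup pattern; the URL and tag name are placeholders, not part of the original post:

Python
import requests
from bs4 import BeautifulSoup

# fetch a static page (example.com is a placeholder URL)
html = requests.get("https://example.com").text

# parse the HTML and print the text of every <h1> tag
soup = BeautifulSoup(html, "html.parser")
for h1 in soup.find_all("h1"):
    print(h1.get_text(strip=True))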

However, Selenium does not work directly in Google Colab, a popular platform for running Python code online; you need an extra block of code to load Selenium on Colab. In this tutorial, I will show you how to do that. You don't need this extra step to use BeautifulSoup in Google Colab.

How to run selenium in Google Colab

Here is the setup code for running Selenium in Google Colab. You don't need to run it if you work in a Jupyter Notebook or another local Python environment.

This is copied from a solution by GitHub user goljavi in this thread (https://github.com/googlecolab/colabtools/issues/3347).

Python
# Set up for running Selenium in Google Colab
## You don't need to run this code in a Jupyter Notebook or other local Python setting
%%shell
# refresh the package index and install helper utilities
sudo apt -y update
sudo apt install -y wget curl unzip
# install libu2f-udev, a dependency of the Chrome .deb package
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
# download and install the latest stable Google Chrome
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
# fetch the matching ChromeDriver and put it on the PATH
# (note: this legacy endpoint only serves ChromeDriver up to version 114;
# newer Chrome versions distribute drivers via Chrome for Testing)
CHROME_DRIVER_VERSION=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
# install the Selenium Python bindings
pip install selenium
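
Optionally, you can confirm that Chrome and ChromeDriver landed correctly before going further. This quick check is my addition, not part of the original solution:

Python
# optional sanity check: print the installed Chrome and ChromeDriver versions
!google-chrome --version
!chromedriver --version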

Now you can install chromedriver-autoinstaller, import the libraries, and set up the Chrome options for Selenium.

Python
!pip install chromedriver-autoinstaller

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller

# set up Chrome options for running in Colab
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# install (if needed) a ChromeDriver that matches the installed Chrome version
chromedriver_autoinstaller.install()

# set the target URL
url = "put-url-here-to-scrape"

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)
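
A quick note on those options: --headless runs Chrome without a visible window (Colab has no display), --no-sandbox is needed because Colab notebooks run as root, and --disable-dev-shm-usage keeps Chrome from crashing due to the small /dev/shm partition in containerized environments.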

Then import the Selenium helpers for waiting on page elements 🙂

Python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Now you can run your scraping code with the Selenium driver. Please don't forget to quit the driver at the end of the code.
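
As an illustration, here is a minimal sketch of what that scraping code might look like. The tag name and the 10-second timeout are placeholder choices you would adapt to your own page:

Python
# load the target page (replace url with a real address)
driver.get(url)

# wait up to 10 seconds until at least one <h1> element is present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)

# hand the rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# collect the results in a pandas DataFrame
df = pd.DataFrame({"heading": headings})
print(df)

When you are finished, close the browser: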

Python
# quit the driver
driver.quit()
June 21, 2023