Summary and Setup
This is a new lesson built with The Carpentries Workbench.
In this workshop, you’ll learn how to extract data from websites using Python — a process known as web scraping.
Episode 1 begins with an introduction to how websites are structured
using HTML. You’ll learn how to explore this structure using your
browser and how to extract information from it using the
BeautifulSoup package.
In Episode 2, you’ll learn how to retrieve the HTML of a webpage
using the requests package and continue practicing how to
parse and extract specific content with BeautifulSoup.
Toward the end of the workshop, in Episode 3, we’ll explore the
difference between static and dynamic webpages, and how to scrape
dynamic content using Selenium.
This workshop is intended for learners who already have a basic understanding of Python. In particular, you should be comfortable with:
- Installing and importing packages and modules
- Using lists and dictionaries
- Using conditional statements (if, else, elif)
- Using for loops
- Calling functions and understanding parameters/arguments and return values
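As a quick self-check, the short sketch below touches each of the prerequisites listed above. The function name describe_length and the example words are made up for illustration and are not part of the lesson material. If you can predict what it prints before running it, you are likely ready for the workshop.

```python
import math  # importing a module from the standard library


def describe_length(word, threshold=5):
    """Return 'long', 'medium' or 'short' depending on the length of word."""
    if len(word) > threshold:
        return "long"
    elif len(word) == threshold:
        return "medium"
    else:
        return "short"


words = ["html", "python", "table"]  # a list of strings
labels = {}                          # an empty dictionary

# Loop over the list and store the function's return value for each word
for word in words:
    labels[word] = describe_length(word)

print(labels)         # {'html': 'short', 'python': 'long', 'table': 'medium'}
print(math.sqrt(16))  # 4.0
```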
Software Setup
Python
Python is a popular language for research computing, and great for general-purpose programming as well. Installing all of its research packages individually can be a bit difficult, so we recommend Miniforge, an installer provided by the conda-forge community.
Regardless of how you choose to install it, please make sure you install a Python version >= 3.9 (e.g. 3.11 is fine, 3.6 is not).
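If you are unsure which version you have, one way to check is from within Python itself. The sketch below uses only the standard library; the helper name python_is_supported is hypothetical and simply compares the running interpreter against the 3.9 minimum stated above.

```python
import sys

# Minimum version required for this workshop (major, minor)
MIN_VERSION = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info is at least minimum."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
if python_is_supported():
    print("This version is recent enough for the workshop.")
else:
    print("Please install Python 3.9 or newer.")
```

For example, python_is_supported((3, 11, 0)) returns True, while python_is_supported((3, 6, 9)) returns False.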
We will teach Python using the Jupyter Notebook, a programming environment that runs in a web browser (JupyterLab, which includes the notebook interface, is installed in the steps below). For this to work you will need a reasonably up-to-date browser. The current versions of the Chrome, Safari and Firefox browsers are all supported (some older browsers, including Internet Explorer version 9 and below, are not).
Steps:
1. Install Miniforge. If you already have Anaconda, JupyterLab or Jupyter Notebook installed on your computer, skip to step 2.
2. Activate the base conda environment by running:
conda activate
3. Install the necessary packages by running:
pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab
4. Start JupyterLab by running:
jupyter lab
5. In a new Jupyter Notebook, run the following code in a cell to check that the necessary libraries can be loaded:
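A minimal sketch of such a check is shown here. It assumes the import names that correspond to the packages installed with pip above (for example, beautifulsoup4 is imported as bs4 and webdriver-manager as webdriver_manager). The helper name check_packages is hypothetical and not part of the lesson; it reports which packages Python can locate without actually importing them.

```python
from importlib.util import find_spec

# Top-level import names for the packages installed in the setup steps.
# Some import names differ from the names used with pip.
REQUIRED = ["requests", "bs4", "selenium", "webdriver_manager", "pandas", "tqdm"]


def check_packages(names):
    """Return a dict mapping each top-level import name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

If any package is reported as MISSING, rerun the pip install command from the earlier step in the same environment that you use to start Jupyter.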
Data
The HTML file used in Episode 1 and the Jupyter notebooks containing the lesson code can be downloaded. The notebooks are intended for reference; ideally, you will create your own notebooks as you follow along during the course.
Additional resources
- Mitchell, R. (2024). Web Scraping with Python: Data Extraction from the Modern Web (3rd ed.). O’Reilly Media.
- Chapagain, A. (2023). Hands-On Web Scraping with Python: Extract Quality Data from the Web Using Effective Python Techniques (2nd ed.). Packt Publishing.