Content from What is Web Scraping?
Last updated on 2025-12-09
Overview
Questions
- What is web scraping and why is it useful?
- What are typical use cases for web scraping?
Objectives
After completing this episode, participants should be able to…
- Navigate around a website and understand the concept of structured data
- Discuss how data can be extracted from web pages
What is web scraping?
Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task.
Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.
Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online.
It is closely related to the practice of web indexing, which is what search engines like Google do when mass-analysing the Web to build their indices. But unlike web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages it visits.
For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping” in which personal information like email addresses or phone numbers is collected for marketing purposes.
Web scraping is also increasingly being used by scholars to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.
Before you get started
As useful as scraping is, there might be better options for the task. Choose the right (i.e. the easiest) tool for the job.
- Check whether or not you can easily copy and paste data from a site into Excel or Google Sheets. This might be quicker than scraping.
- Check if the site or service already provides an API for extracting structured data. If it does, that will be a much more efficient and effective pathway; a brief sketch of an API request follows below. Good examples are the Facebook API, the X APIs or the YouTube comments API.
- For much larger needs, Freedom of Information Act (FOIA) requests can be useful. Be specific about the formats required for the data you want.
Note that the ‘Before you get started’ items will be covered in more detail later in this lesson.
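To illustrate why an API is usually the easier pathway, here is a minimal sketch of an API request in Python using the requests library. The endpoint and field names are hypothetical (invented for this example); real APIs differ in URLs, parameters and authentication, but the pattern of receiving ready-made structured data is the same.
PYTHON
import requests

# Hypothetical endpoint and parameters, for illustration only;
# real APIs differ in URLs, parameters and authentication.
response = requests.get(
    "https://api.example.org/v1/members",
    params={"province": "Alberta", "format": "json"},
)
response.raise_for_status()

# The API returns structured data directly: no HTML parsing required.
for member in response.json():
    print(member["name"], "-", member["party"])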
Example: Scraping parliamentary websites for contact information
In this workshop, we will learn how to extract information from various web pages. Different web pages can have widely differing formats, which will affect our decision about which scraping method is appropriate.
Before we can make such decisions, we need some understanding of the makeup of a web page. Let’s start by looking at the list of members of the Canadian parliament, which is available on the Parliament of Canada website.
This is how this page appears in December 2025:
[Screenshot: the Parliament of Canada members list, with the search, refine and download features circled]
There are several features (circled in the screenshot above) that make the data on this page easier to work with. The search, reorder and refine features and the display modes hint that the data is actually stored in a (structured) database before being displayed on this page. Visitors can readily download the data either as a comma-separated values (.csv) file or as XML for re-use in their own database, spreadsheet or computer program.
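Once the .csv file is downloaded, a few lines of Python suffice to load it for analysis, for example with the pandas library. The file name here is an assumption for illustration; check the actual file and column headers you downloaded.
PYTHON
import pandas as pd

# "members.csv" stands in for the file exported from the site;
# the actual file and column names may differ.
members = pd.read_csv("members.csv")
print(members.head())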
Even though the information displayed in the view above is not labelled, anyone visiting this site with some knowledge of Canadian geography and politics can see what information pertains to the politicians’ names, the geographical area they come from and the political party they represent. This is because human beings are good at using context and prior knowledge to quickly categorise information.
Computers, on the other hand, cannot do this unless we provide them with more information. If we examine the source HTML code of this page, we can see that the information displayed has a consistent structure:
HTML
(...)
<tr role="row" id="mp-list-id-25446">
  <td data-sort="Allison Dean" class="sorting_1">
    <a href="/members/en/dean-allison(25446)">
      Allison, Dean
    </a>
  </td>
  <td data-sort="Conservative">Conservative</td>
  <td data-sort="Niagara West">
    <a href="/members/en/constituencies/niagara-west(1124)">Niagara West</a>
  </td>
  <td data-sort="Ontario">Ontario</td>
</tr>
(...)
Using this structure, we may be able to instruct a computer to look for all parliamentarians from Alberta and list their names and caucus information.
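As a preview, here is a minimal sketch of that instruction in Python, using the requests and BeautifulSoup libraries introduced later in this lesson. The URL and the four-cell row layout are assumptions based on the excerpt above, and the page structure may change over time.
PYTHON
import requests
from bs4 import BeautifulSoup

# Assumed URL of the members list; the page layout may change over time.
url = "https://www.ourcommons.ca/members/en/search"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Each member is a <tr role="row"> with four <td> cells:
# name, caucus, constituency and province, as in the excerpt above.
for row in soup.find_all("tr", attrs={"role": "row"}):
    cells = row.find_all("td")
    if len(cells) == 4 and cells[3].get_text(strip=True) == "Alberta":
        print(cells[0].get_text(strip=True), "-", cells[1].get_text(strip=True))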
Structured vs unstructured data
When presented with information, human beings are good at quickly categorizing it and extracting the data they are interested in. For example, when we look at a magazine rack (provided the titles are written in a script we can read), we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, and so on. We can probably also easily organize them by topic, recognize those aimed at children, or even tell whether they lean toward a particular end of the political spectrum. Computers have a much harder time making sense of such unstructured data unless we specifically tell them which elements the data is made of, for example by adding labels such as “this is the title of this magazine” or “this is a magazine about food”. Data in which individual elements are separated and labelled is said to be structured.
Let’s look now at the current list of members for the UK House of Commons.
[Screenshot: the UK House of Commons members list]
This page also displays a list of names with political and geographical affiliations. There is a search box and a filter option, but no obvious way to download this information and reuse it.
Here is the code for this page:
HTML
(...)
<a class="card card-member" href="/member/172/contact">
  <div class="card-inner">
    <div class="content">
      <div class="image-outer">
        <div class="image"
             aria-label="Image of Ms Diane Abbott"
             style="background-image: url(https://members-api.parliament.uk/api/Members/172/Thumbnail); border-color: #909090;"></div>
      </div>
      <div class="primary-info">
        Ms Diane Abbott
      </div>
      <div class="secondary-info">
        Independent
      </div>
    </div>
    <div class="info">
      <div class="info-inner">
        <div class="indicators-left">
          <div class="indicator indicator-label">
            Hackney North and Stoke Newington
          </div>
        </div>
        <div class="clearfix"></div>
      </div>
    </div>
  </div>
</a>
(...)
We see that this data has been structured for display purposes (each member’s information is arranged in a repeating card layout), but the different elements of information are not clearly labelled.
What if we wanted to download this dataset and, for example, compare it with the Canadian list of MPs to analyze gender representation, or the representation of political forces in the two groups? We could try copy-pasting the entire table into a spreadsheet or even manually copy-pasting the names and parties in another document, but this can quickly become impractical when faced with a large set of data. What if we wanted to collect this information for every country that has a parliamentary system?
Fortunately, there are tools to automate at least part of the process. This technique is called web scraping.
“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.” (Source: Wikipedia)
Web scraping typically targets one web site at a time to extract unstructured information and put it in a structured form for reuse.
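As a preview of what that looks like in practice, here is a minimal sketch that reads the card markup shown above and writes it out in structured form as a .csv file. The URL and the class names (card-member, primary-info, secondary-info, indicator-label) are assumptions taken from the excerpt, not a guaranteed interface.
PYTHON
import csv
import requests
from bs4 import BeautifulSoup

# Assumed URL of the UK members list; the class names below are taken
# from the excerpt above and may change when the site is redesigned.
url = "https://members.parliament.uk/members/commons"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Turn the unlabelled card layout into labelled, structured rows.
with open("uk_mps.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "party", "constituency"])
    for card in soup.select("a.card-member"):
        writer.writerow([
            card.select_one(".primary-info").get_text(strip=True),
            card.select_one(".secondary-info").get_text(strip=True),
            card.select_one(".indicator-label").get_text(strip=True),
        ])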
In this lesson, we will continue exploring the examples above and try different techniques to extract the information they contain. But before we launch into web scraping proper, we need to look a bit more closely at how information is organized within an HTML document and how to build queries to access a specific subset of that information.
- Humans are good at categorizing information, computers not so much.
- Often, data on a web site is not properly structured, making its extraction difficult.
- Web scraping is the process of automating the extraction of data from web sites.
Content from Anatomy of a web page
Last updated on 2025-12-09
Overview
Questions
- What’s behind a website, and how can I extract information from it?
- How can I find the code for a specific element on a web page?
Objectives
After completing this episode, participants should be able to…
- Identify the structure and key components of an HTML document
- Explain how to use the browser developer tools to view the underlying HTML content of a web page
- Use the browser developer tools to find the HTML code for specific items on a web page
Introduction
Here, we’ll revisit some of the core ideas from the previous episode to build a more hands-on understanding of how content and data are structured on the web. We’ll start by exploring what HTML (Hypertext Markup Language) is and how it uses tags to organize and format content. Then, we’ll introduce the BeautifulSoup library to parse HTML and make it easier to search for and extract specific elements from a webpage.
We’ll begin with simple examples and gradually move on to scraping more complex, real-world websites.
HTML quick overview
All websites have a Hypertext Markup Language (HTML) document behind them. Below is an example of HTML for a very simple webpage that contains just three sentences. As you look through it, try to imagine how the website would appear in a browser.
HTML
<!DOCTYPE html>
<html>
  <head>
    <title>Sample web page</title>
  </head>
  <body>
    <h1>h1 Header #1</h1>
    <p>This is a paragraph tag</p>
    <h2>h2 Sub-header</h2>
    <p>A new paragraph, now in the <b>sub-header</b></p>
    <h1>h1 Header #2</h1>
    <p>
      This other paragraph has two hyperlinks,
      one to <a href="https://carpentries.org/">The Carpentries homepage</a>,
      and another to the
      <a href="https://carpentries.org/workshops/past-workshops/">past workshops</a> page.
    </p>
  </body>
</html>
If you save that text in a file with a .html extension (using a simple text editor like Notepad on Windows or TextEdit on macOS) and open it in your web browser, the browser will interpret the markup language and display a nicely formatted web page.
When you open an HTML file in your browser, what it’s really doing is reading a structured document made up of elements, each marked by tags inside angle brackets (< and >). For instance, the HTML root element, which delimits the beginning and end of an HTML document, is identified by the <html> tag.
Most elements have both an opening tag and a closing tag, which define the start and end of that element. For example, in the simple website we looked at earlier, the head element begins with <head> and ends with </head>.
Because elements can be nested inside one another, an HTML document forms a tree structure, where each element is a node that can contain child nodes.
Finally, we can define or modify the behavior, appearance, or functionality of an element using attributes. Attributes appear inside the opening tag and consist of a name and a value, formatted like name="value". For example, in the simple website, we added a hyperlink using the <a>...</a> tags. To specify the destination URL, we used the href attribute inside the opening <a> tag like this: <a href="https://carpentries.org/workshops/past-workshops/">past workshops</a>.
Here is a non-exhaustive list of common HTML elements and their purposes:
- <html>...</html>: The root element that contains the entire document.
- <head>...</head>: Contains metadata such as the page title that the browser displays.
- <body>...</body>: Contains the content that will be shown on the webpage.
- <h1>...</h1>, <h2>...</h2>, <h3>...</h3>: Define headers of levels 1, 2, 3, and so on.
- <p>...</p>: Represents a paragraph.
- <a href="">...</a>: Creates a hyperlink; the destination URL is set with the href attribute.
- <img src="" alt="">: Embeds an image, with the image source specified by src and alternative text provided by alt. It doesn’t have a closing tag.
- <table>...</table>, <th>...</th>, <tr>...</tr>, <td>...</td>: Define a table structure, with headers (<th>), rows (<tr>), and cells (<td>).
- <div>...</div>: Groups sections of HTML content together.
- <script>...</script>: Embeds or links to JavaScript code.
In the list above, we mentioned some attributes specific to the hyperlink (<a>) and image (<img>) elements, but there are also several global attributes that most HTML elements can have. These are especially useful for identifying elements when web scraping:
- id="": Assigns a unique identifier to an element; this ID must be unique within the entire HTML document.
- title="": Provides extra information about the element, shown as a tooltip when the user hovers over it.
- class="": Applies a common styling or grouping to multiple elements at once.
To summarize: elements are identified by tags, and attributes let us assign properties or identifiers to those elements. Understanding this structure will make it much easier to extract specific data from a website.
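To make that summary concrete, here is a minimal sketch using the BeautifulSoup library mentioned in the introduction. It parses a small snippet (invented for this example) and locates elements by tag name, by id and by class.
PYTHON
from bs4 import BeautifulSoup

# A small invented snippet, for illustration only.
html = """
<html>
  <body>
    <h1 id="page-title">Members list</h1>
    <p class="member">Allison, Dean</p>
    <p class="member">Ms Diane Abbott</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate elements by tag name, by unique id, and by class.
print(soup.find("h1").get_text())              # Members list
print(soup.find(id="page-title").get_text())   # Members list
for p in soup.find_all("p", class_="member"):
    print(p.get_text(strip=True))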
- Every website is built on an HTML document that structures its content.
- An HTML document is composed of elements, usually defined by an opening tag (<element>) and a closing tag (</element>).
- Elements can have attributes that define their properties, written as name="value".
Content from Manually scrape data using browser extensions
Last updated on 2025-12-04
Overview
Questions
- How can I get started scraping data off the web?
- How do I assess the most appropriate method to scrape data?
Objectives
After completing this episode, participants should be able to…
- Understand the different tools for accessing web page data
- Use the WebScraper tool to extract data from a web page
- Assess the appropriate method for gathering the required data
Introduction
This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.
What you need to know is that there are three sections required for a valid Carpentries lesson:
- questions are displayed at the beginning of the episode to prime the learner for the content.
- objectives are the learning objectives for an episode displayed with the questions.
- keypoints are displayed at the end of the episode to reinforce the objectives.
Inline instructor notes can help inform instructors of timing challenges associated with the lessons. They appear in the “Instructor View”
Challenge 1: Can you do it?
What is the output of this command?
R
paste("This", "new", "lesson", "looks", "good")
OUTPUT
[1] "This new lesson looks good"
Challenge 2: How do you nest solutions within challenge blocks?
You can add a line with at least three colons and a solution tag.
Figures
You can use standard markdown for static figures with the following syntax:
![figure caption](path/to/figure.png){alt='alt text for accessibility purposes'}
Callout sections can highlight information.
They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.
Math
One of our episodes contains \(\LaTeX\) equations when describing how to create dynamic reports with {knitr}, so we now use mathjax to describe this:
$\alpha = \dfrac{1}{(1 - \beta)^2}$ becomes: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)
Cool, right?
- Data that is relatively well structured (in a table) is relatively easy to scrape.
- Tools may be available on a web page which enable data to be downloaded directly.
Content from Ethics and Legality of Web Scraping
Last updated on 2025-12-04
Overview
Questions
- When is web scraping OK and when is it not?
- Is web scraping legal? Can I get into trouble?
- What are some ethical considerations to make?
- What can I do with the data that I’ve scraped?
Objectives
After completing this episode, participants should be able to…
- Discuss the legal and ethical implications of web scraping
- Establish a code of conduct
- Web scraping is, in general, legal and won’t get you into trouble.
- There are a few things to be careful about; notably, don’t overwhelm a web server and don’t steal content.
- Be nice. When in doubt, ask.