Instructor Notes
Hello-Scraping
Instructor Note
Slide: Reminder of what web scraping is
Remember - there may be APIs or tools to download the data instead
Slide: Overview of html - ask what people think html does - html uses tags to organise and format content
Slide: showing structure - a structured doc, elements marked by tags - attributes modify behaviour, appearance or functionality
Slide: list of tags - also in the notes
Instructor Note
Jupyter Lab:
Right click on cell -> Open Variable Inspector (Jupyter Lab, not Notebook)
Shift+Enter - run cell
Instructor Note
- Parse html using BeautifulSoup
- Creates object with nested structure
- Show without prettify first
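A minimal sketch of this step, assuming the example html from the episode is held in a string called example_html (the name is illustrative):
PYTHON
from bs4 import BeautifulSoup

# example_html is an illustrative name for the string holding the sample html
soup = BeautifulSoup(example_html, 'html.parser')

# show the object as-is first, then the nicely indented version
print(soup)
print(soup.prettify())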
Instructor Note
For demonstration, leave out numbers and execute one at a time
Instructor Note
- use .get() to access attribute value
- attribute name as parameter
- .get('href')
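A short sketch, assuming the soup object from the previous step:
PYTHON
# first hyperlink element in the document
first_a = soup.find('a')

# read the value of its href attribute
print(first_a.get('href'))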
Instructor Note
Alternative if using the links list created in the previous section:
PYTHON
first_link = {"element": str(links[0]),
              "url": links[0].get('href'),
              "text": links[0].get_text()}
NB: str() is needed on the element, otherwise you get the BeautifulSoup object rather than the actual string. It looks fine at this stage, but doesn't load into the data frame correctly if it is not a string.
Instructor Note
- Wrap up BeautifulSoup intro
- Code to:
- Extract all hyperlink elements in structured way
- tag, url & display text
- Use links list already created
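One possible version of this code, assuming links was created earlier with soup.find_all('a'):
PYTHON
link_info_list = []
for link in links:
    link_info_list.append({"element": str(link),      # str() to store the tag as text
                           "url": link.get('href'),
                           "text": link.get_text()})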
Instructor Note
- Create DataFrame column titles:
- links_df = pd.DataFrame(link_info_list, columns=['element', 'url', 'text'])
- index - Write row names, default True
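A sketch of this step; the index note above refers to the index parameter when writing the DataFrame out to a file (the filename is illustrative):
PYTHON
import pandas as pd

links_df = pd.DataFrame(link_info_list, columns=['element', 'url', 'text'])

# index=False stops pandas writing the row names as an extra column
links_df.to_csv('links.csv', index=False)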
Instructor Note
- Much more functionality than shown here
- Can traverse the tree e.g. parent and sibling functions
- Also useful for extracting information from any html document e.g. OCR output
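A brief illustration of traversing the tree, assuming the soup object from earlier:
PYTHON
first_a = soup.find('a')

# move up to the element that contains the link
print(first_a.parent.name)

# move sideways to the next element at the same level (None if there isn't one)
print(first_a.find_next_sibling())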
Scraping a real website
Instructor Note
- Using 'requests' package
- Load package
- Get url: .get(url)
- Get html content: .text
- tqdm is a Progress Bar library
- regex: \s* means 0 or more spaces
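A minimal sketch of these pieces; the URL is a placeholder, the regex line just demonstrates \s*, and tqdm comes in later when looping over many pages:
PYTHON
import requests
import re

url = 'https://example.org/'          # placeholder URL
response = requests.get(url)          # fetch the page
html = response.text                  # the raw html as a string
print(html[:500])

# \s* matches zero or more whitespace characters, so this splits on a comma
# followed by any amount of space
print(re.split(r',\s*', 'online, Australia,UK'))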
Instructor Note
Output truncated so it is not too long
Point out the meta, link and script tags
Look at the upcoming workshops webpage
Look at the source code
Instructor Note
- Difficult to find things
- Search for "Upcoming workshops"
- Take a look - difficult to work out the structure
- Show 'Inspect'
Instructor Note
Inspect first workshop location
Show the a tag
Show the surrounding tags - contained in an h3
Show expanding the div tags
Locate next location link
Use BeautifulSoup to parse html
Instructor Note
- Use find_all to get h3 tags
- The enumerate() function adds a counter to each item in a list (or any other iterable), returning the index position and the element for each item
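A possible sketch, assuming the parsed page is in a BeautifulSoup object called soup:
PYTHON
h3_tags = soup.find_all('h3')

# enumerate pairs each tag with its position in the list
for i, tag in enumerate(h3_tags):
    print(i, tag.get_text(strip=True))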
Instructor Note
- Sometimes useful to search by class
- May allow more specific selection
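A sketch of searching by class; the class name here is illustrative, not taken from the page:
PYTHON
# class_ (with a trailing underscore) is used because class is a Python keyword;
# 'some-class-name' is an illustrative value
results = soup.find_all('h3', class_='some-class-name')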
Instructor Note
Get students to look for parent of first h3 tag
Demonstrate hovering over elements
Demonstrate collapsing elements
Important: understanding the tree structure
The parent div has the class attribute 'p-8 mb-5 border'
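A sketch matching the structure described above:
PYTHON
first_h3 = soup.find('h3')

# the enclosing element - here the div with class "p-8 mb-5 border"
div_firsth3 = first_h3.parent
print(div_firsth3.get('class'))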
Instructor Note
- Show code below with print(str(div_firsth3)) first
- Then prettify
Instructor Note
Examine output
h3 gives link to workshop website
Also get extra info - date, format, country etc
Can start to extract more information
Instructor Note
Ask - what if we want to find the information for all of the workshops?
- Reuse code for dict_workshop
- change div_firsth3 to item
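A hedged sketch of the loop, assuming pandas is already imported as pd; the fields inside dict_workshop are illustrative since they depend on the code built up in the previous step:
PYTHON
workshop_list = []

# the class string is the one identified above for the parent div
for item in soup.find_all('div', class_='p-8 mb-5 border'):
    # same extraction as before, with div_firsth3 replaced by item;
    # the keys shown here are illustrative
    dict_workshop = {"title": item.find('h3').get_text(strip=True),
                     "link": item.find('a').get('href')}
    workshop_list.append(dict_workshop)

upcomingworkshops_df = pd.DataFrame(workshop_list)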
Instructor Note
All data so far taken from single page
Information might be across several pages
May need to follow hyperlinks
Can loop using requests for each link & parse with BeautifulSoup
Beware of sending too many requests & overloading the web server
Use sleep() function to wait
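A sketch of a polite loop over several pages; the list of URLs is a placeholder:
PYTHON
from time import sleep
from tqdm import tqdm        # progress bar around the loop
import requests
from bs4 import BeautifulSoup

pages = {}
for url in tqdm(list_of_urls):     # list_of_urls is a placeholder
    response = requests.get(url)
    pages[url] = BeautifulSoup(response.text, 'html.parser')
    sleep(1)                       # pause so we don't overload the web server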
Instructor Note
Now extract information from each workshop's website
Use the links in the 'upcomingworkshops_df' DataFrame
Use the link to the first workshop
df.loc - access a group of rows and cols by label or boolean array
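A small sketch of .loc; the 'link' and 'country' column names are assumptions:
PYTHON
# single row and column by label -> one value
first_url = upcomingworkshops_df.loc[0, 'link']

# boolean array across rows -> all matching rows
online = upcomingworkshops_df.loc[upcomingworkshops_df['country'] == 'Online']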
Instructor Note
- Follow link to first page
- View source
- Information in the 'head' is not displayed directly on the page
- 'meta' tag - metadata - used by search engines
- contains useful information
- Extract metadata for first 5 websites
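One possible sketch of pulling the meta tags out of the head for the first five workshop pages; the 'link' column name is an assumption:
PYTHON
from time import sleep
import requests
from bs4 import BeautifulSoup

meta_list = []
for url in upcomingworkshops_df['link'].head(5):
    response = requests.get(url)
    page = BeautifulSoup(response.text, 'html.parser')
    # collect name/content pairs from every meta tag that has a name attribute
    meta = {tag.get('name'): tag.get('content')
            for tag in page.find_all('meta') if tag.get('name')}
    meta['url'] = url
    meta_list.append(meta)
    sleep(1)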
Instructor Note
- May get an error if a web page is unavailable
- requests returns a status code
- 404 if unavailable
- 200 if found
- Show a test for the status code
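A sketch of checking the status code before trying to parse the page:
PYTHON
response = requests.get(url)

# 200 means the page was found, 404 means it is unavailable
if response.status_code == 200:
    page = BeautifulSoup(response.text, 'html.parser')
else:
    print(f"Skipping {url}: status code {response.status_code}")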
Dynamic websites
Instructor Note
Go to the films website
Select 2015
View the page source
Can you find 'Spotlight'?
JavaScript is used to load the content dynamically
Use Inspect to look at Spotlight
requests only accesses the original html
Need the Selenium package
Instructor Note
- Selenium
- for web browser automation
- behaves like a real user, interacting with the web page in a browser
- renders the web page, loading dynamic content
- access the full html after the JavaScript has executed
- can also simulate interactions - e.g. clicking buttons
Instructor Note
- webdriver - launch, simulate and interact with a web browser through code
- By - how to locate elements
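A minimal sketch of launching a browser with Selenium; the URL and id value are placeholders:
PYTHON
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                       # opens a real Chrome window
driver.get('https://example.org/')                # placeholder URL
element = driver.find_element(By.ID, 'some-id')   # By specifies how to locate elements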
Instructor Note
- Selenium works with other browsers too
- A Chrome window will open - don't close it
- may need to open it from the bottom toolbar
- Split the window to see both
Instructor Note
- Slide: find instructions
- Inspect to find the 2015 tag
- the id should be unique, so use that
Instructor Note
- use .click() method to interact with button
- pause using sleep to give time to load table
- get page source
- close
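A sketch of this sequence; the id value for the 2015 button is an assumption based on inspecting the page:
PYTHON
from time import sleep

button_2015 = driver.find_element(By.ID, '2015')   # id value is an assumption
button_2015.click()

sleep(3)                          # give the table time to load
html_2015 = driver.page_source    # full html after the JavaScript has run
driver.quit()                     # close the browser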
Instructor Note
html_2015 contains the dynamically loaded html
- could use Selenium's find_element()
- switch back to BeautifulSoup
Show the Inspect view of Spotlight to locate the tags
To get the film title:
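A possible sketch, assuming the titles sit in td elements with class 'film-title' (the class name is an assumption from the Inspect step):
PYTHON
from bs4 import BeautifulSoup

soup_2015 = BeautifulSoup(html_2015, 'html.parser')

# 'film-title' is an assumed class name taken from inspecting the table
titles = [td.get_text(strip=True)
          for td in soup_2015.find_all('td', class_='film-title')]
print(titles[:5])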
Instructor Note
Headless mode - the browser runs in the background without opening a window
Need to set up an option for headless mode
Open the webpage
Click to load the 2015 data
- Extract information from the table one column at a time
- Each column has a unique class attribute
Use list comprehensions to extract the data
For Best Picture, need to check if the element is there
Add to a DataFrame
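A hedged end-to-end sketch of the headless run; the URL, element id and class names are assumptions:
PYTHON
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('--headless')            # run the browser without opening a window
driver = webdriver.Chrome(options=options)

driver.get('https://example.org/films')       # placeholder URL
driver.find_element(By.ID, '2015').click()    # id value is an assumption
sleep(3)
html_2015 = driver.page_source
driver.quit()

soup = BeautifulSoup(html_2015, 'html.parser')

# one list comprehension per column; the class names are assumptions
titles = [td.get_text(strip=True) for td in soup.find_all('td', class_='film-title')]
awards = [td.get_text(strip=True) for td in soup.find_all('td', class_='film-awards')]

# Best Picture cells only contain a child element when the film won,
# so check whether anything is inside the cell
best_picture = [td.find() is not None
                for td in soup.find_all('td', class_='film-best-picture')]

films_df = pd.DataFrame({'title': titles, 'awards': awards, 'best_picture': best_picture})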
Instructor Note
Slide to show pipeline steps