Instructor Notes
Hello-Scraping
Instructor Note
Slide: Reminder of what web scraping is
Remember - there may be APIs or tools to download the data instead
Slide: Overview of html - ask what people think html does - html uses tags to organise and format content
Slide: showing structure - a structured doc, elements marked by tags - attributes modify behaviour, appearance or functionality
Slide: list of tags - also in the notes
Instructor Note
Jupyter Lab:
Right click on cell -> Open Variable Inspector (Jupyter Lab, not Notebook)
Shift+Enter - run cell
Instructor Note
- Parse html using BeautifulSoup
- Creates object with nested structure
- Show without prettify first
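A minimal sketch of this step, assuming the example html from the episode is held in a string called example_html (the name is illustrative):
PYTHON
from bs4 import BeautifulSoup

# example_html is an illustrative name for the string holding the sample html
soup = BeautifulSoup(example_html, 'html.parser')

# show the object as-is first, then the nicely indented version
print(soup)
print(soup.prettify())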
Instructor Note
For demonstration, leave out numbers and execute one at a time
Instructor Note
- use .get() to access attribute value
- attribute name as parameter
- .get('href')
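A short sketch, assuming the soup object from the previous step:
PYTHON
# first hyperlink element in the document
first_a = soup.find('a')

# read the value of its href attribute
print(first_a.get('href'))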
Instructor Note
Alternative if using the links list created in the previous section:
PYTHON
first_link = {"element": str(links[0]),
              "url": links[0].get('href'),
              "text": links[0].get_text()}
NB: str() is needed on the element, otherwise you get the BeautifulSoup object rather than the actual string. It looks fine at this stage, but doesn't load into the data frame correctly if it is not a string.
Instructor Note
- Wrap up BeautifulSoup intro
- Code to:
- Extract all hyperlink elements in structured way
- tag, url & display text
- Use links list already created
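One possible version of this code, assuming links was created earlier with soup.find_all('a'):
PYTHON
link_info_list = []
for link in links:
    link_info_list.append({"element": str(link),      # str() to store the tag as text
                           "url": link.get('href'),
                           "text": link.get_text()})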
Instructor Note
- Create DataFrame column titles:
- links_df = pd.DataFrame(link_info_list, columns=['element', 'url', 'text'])
- index - Write row names, default True
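A sketch of this step; the index note above refers to the index parameter when writing the DataFrame out to a file (the filename is illustrative):
PYTHON
import pandas as pd

links_df = pd.DataFrame(link_info_list, columns=['element', 'url', 'text'])

# index=False stops pandas writing the row names as an extra column
links_df.to_csv('links.csv', index=False)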
Instructor Note
- Much more functionality than shown here
- Can traverse the tree e.g. parent and sibling functions
- Also useful for extracting information from any html document e.g. OCR output
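A brief illustration of traversing the tree, assuming the soup object from earlier:
PYTHON
first_a = soup.find('a')

# move up to the element that contains the link
print(first_a.parent.name)

# move sideways to the next element at the same level (None if there isn't one)
print(first_a.find_next_sibling())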
Scraping a real website
Instructor Note
- Using 'requests' package
- Load package
- Get url: .get(url)
- Get html content: .text
- tqdm is a Progress Bar library
- regex: \s* means 0 or more spaces
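A minimal sketch of these pieces; the URL is a placeholder, the regex line just demonstrates \s*, and tqdm comes in later when looping over many pages:
PYTHON
import requests
import re

url = 'https://example.org/'          # placeholder URL
response = requests.get(url)          # fetch the page
html = response.text                  # the raw html as a string
print(html[:500])

# \s* matches zero or more whitespace characters, so this splits on a comma
# followed by any amount of space
print(re.split(r',\s*', 'online, Australia,UK'))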
Instructor Note
Output truncated so it is not too long
Point out the meta, link and script tags
Look at the upcoming workshops webpage
Look at the source code
Instructor Note
- Difficult to find things
- Search for "Upcoming workshops"
- Take a look - difficult to work out the structure
- Show 'Inspect'
Instructor Note
Inspect first workshop location
Show the a tag
Show the surrounding tags - contained in an h3
Show expanding the div tags
Locate next location link
Use BeautifulSoup to parse html
Instructor Note
- Use find_all to get h3 tags
- The enumerate() function adds a counter to each item in a list (or any other iterable), returning the index position and the element for each item
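A possible sketch, assuming the parsed page is in a BeautifulSoup object called soup:
PYTHON
h3_tags = soup.find_all('h3')

# enumerate pairs each tag with its position in the list
for i, tag in enumerate(h3_tags):
    print(i, tag.get_text(strip=True))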
Instructor Note
- Sometimes useful to search by class
- May allow more specific selection
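A sketch of searching by class; the class name here is illustrative, not taken from the page:
PYTHON
# class_ (with a trailing underscore) is used because class is a Python keyword;
# 'some-class-name' is an illustrative value
results = soup.find_all('h3', class_='some-class-name')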
Instructor Note
Get students to look for parent of first h3 tag
Demonstrate hovering over elements
Demonstrate collapsing elements
Important: understanding the tree structure
The parent div has the class attribute 'p-8 mb-5 border'
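A sketch matching the structure described above:
PYTHON
first_h3 = soup.find('h3')

# the enclosing element - here the div with class "p-8 mb-5 border"
div_firsth3 = first_h3.parent
print(div_firsth3.get('class'))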
Instructor Note
- Show code below with print(str(div_firsth3)) first
- Then prettify
Instructor Note
Examine output
h3 gives link to workshop website
Also get extra info - date, format, country etc
Can start to extract more information
Instructor Note
Ask - what if we want to find the information for all of the workshops?
- Reuse code for dict_workshop
- change div_firsth3 to item
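A hedged sketch of the loop, assuming pandas is already imported as pd; the fields inside dict_workshop are illustrative since they depend on the code built up in the previous step:
PYTHON
workshop_list = []

# the class string is the one identified above for the parent div
for item in soup.find_all('div', class_='p-8 mb-5 border'):
    # same extraction as before, with div_firsth3 replaced by item;
    # the keys shown here are illustrative
    dict_workshop = {"title": item.find('h3').get_text(strip=True),
                     "link": item.find('a').get('href')}
    workshop_list.append(dict_workshop)

upcomingworkshops_df = pd.DataFrame(workshop_list)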
Instructor Note
All data so far taken from single page
Information might be across several pages
May need to follow hyperlinks
Can loop using requests for each link & parse with BeautifulSoup
Beware of sending too many requests & overloading the web server
Use sleep() function to wait
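A sketch of a polite loop over several pages; the list of URLs is a placeholder:
PYTHON
from time import sleep
from tqdm import tqdm        # progress bar around the loop
import requests
from bs4 import BeautifulSoup

pages = {}
for url in tqdm(list_of_urls):     # list_of_urls is a placeholder
    response = requests.get(url)
    pages[url] = BeautifulSoup(response.text, 'html.parser')
    sleep(1)                       # pause so we don't overload the web server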
Instructor Note
Now extract information from each workshop's website
Use the links in the 'upcomingworkshops_df' DataFrame
Use the link to the first workshop
df.loc - access a group of rows and cols by label or boolean array
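A small sketch of .loc; the 'link' and 'country' column names are assumptions:
PYTHON
# single row and column by label -> one value
first_url = upcomingworkshops_df.loc[0, 'link']

# boolean array across rows -> all matching rows
online = upcomingworkshops_df.loc[upcomingworkshops_df['country'] == 'Online']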
Instructor Note
- Follow link to first page
- View source
- Information in the 'head' is not displayed directly on the page
- 'meta' tag - metadata - used by search engines
- contains useful information
- Extract metadata for first 5 websites
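One possible sketch of pulling the meta tags out of the head for the first five workshop pages; the 'link' column name is an assumption:
PYTHON
from time import sleep
import requests
from bs4 import BeautifulSoup

meta_list = []
for url in upcomingworkshops_df['link'].head(5):
    response = requests.get(url)
    page = BeautifulSoup(response.text, 'html.parser')
    # collect name/content pairs from every meta tag that has a name attribute
    meta = {tag.get('name'): tag.get('content')
            for tag in page.find_all('meta') if tag.get('name')}
    meta['url'] = url
    meta_list.append(meta)
    sleep(1)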
Instructor Note
- May get an error if a web page is unavailable
- requests returns a status code
- 404 if unavailable
- 200 if found
- Show a test for the status code
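A sketch of checking the status code before trying to parse the page:
PYTHON
response = requests.get(url)

# 200 means the page was found, 404 means it is unavailable
if response.status_code == 200:
    page = BeautifulSoup(response.text, 'html.parser')
else:
    print(f"Skipping {url}: status code {response.status_code}")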
Dynamic websites
Instructor Note
Go to the films website
Select 2015
View the page source
Can you find 'Spotlight'?
JavaScript is used to load the content dynamically
Use Inspect to look at Spotlight
requests only accesses the original html
Need the Selenium package
Instructor Note
- Selenium
- for web browser automation
- behaves like a real user, interacting with the web page in a browser
- renders the web page, loading dynamic content
- access the full html after the JavaScript has executed
- can also simulate interactions - e.g. clicking buttons
Instructor Note
- webdriver - launch, simulate and interact with a web browser through code
- By - how to locate elements
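A minimal sketch of launching a browser with Selenium; the URL and id value are placeholders:
PYTHON
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                       # opens a real Chrome window
driver.get('https://example.org/')                # placeholder URL
element = driver.find_element(By.ID, 'some-id')   # By specifies how to locate elements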
Instructor Note
- Selenium works with other browsers too
- A Chrome window will open - don't close it
- may need to open it from the bottom toolbar
- Split the window to see both
Instructor Note
- Slide: find instructions
- Inspect to find the 2015 tag
- the id should be unique, so use that
Instructor Note
- use .click() method to interact with button
- pause using sleep to give time to load table
- get page source
- close
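A sketch of this sequence; the id value for the 2015 button is an assumption based on inspecting the page:
PYTHON
from time import sleep

button_2015 = driver.find_element(By.ID, '2015')   # id value is an assumption
button_2015.click()

sleep(3)                          # give the table time to load
html_2015 = driver.page_source    # full html after the JavaScript has run
driver.quit()                     # close the browser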
Instructor Note
html_2015 contains the dynamically loaded html
- could use Selenium's find_element()
- switch back to BeautifulSoup
Show the Inspect view of Spotlight to locate the tags
To get the film title:
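A possible sketch, assuming the titles sit in td elements with class 'film-title' (the class name is an assumption from the Inspect step):
PYTHON
from bs4 import BeautifulSoup

soup_2015 = BeautifulSoup(html_2015, 'html.parser')

# 'film-title' is an assumed class name taken from inspecting the table
titles = [td.get_text(strip=True)
          for td in soup_2015.find_all('td', class_='film-title')]
print(titles[:5])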
Instructor Note
Headless mode - the browser runs in the background without opening a window
Need to set up an option for headless mode
Open the webpage
Click to load the 2015 data
- Extract information from the table one column at a time
- Each column has a unique class attribute
Use list comprehensions to extract the data
For Best Picture, need to check if the element is there
Add to a DataFrame
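A hedged end-to-end sketch of the headless run; the URL, element id and class names are assumptions:
PYTHON
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('--headless')            # run the browser without opening a window
driver = webdriver.Chrome(options=options)

driver.get('https://example.org/films')       # placeholder URL
driver.find_element(By.ID, '2015').click()    # id value is an assumption
sleep(3)
html_2015 = driver.page_source
driver.quit()

soup = BeautifulSoup(html_2015, 'html.parser')

# one list comprehension per column; the class names are assumptions
titles = [td.get_text(strip=True) for td in soup.find_all('td', class_='film-title')]
awards = [td.get_text(strip=True) for td in soup.find_all('td', class_='film-awards')]

# Best Picture cells only contain a child element when the film won,
# so check whether anything is inside the cell
best_picture = [td.find() is not None
                for td in soup.find_all('td', class_='film-best-picture')]

films_df = pd.DataFrame({'title': titles, 'awards': awards, 'best_picture': best_picture})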
Instructor Note
Slide to show pipeline steps