Instructor Notes


Hello-Scraping


Instructor Note

Slide: Reminder of what web scraping is

Remember - there may be APIs or tools available to download the data directly

Slide: Overview of HTML - ask what people think it does - HTML uses tags to organise and format content

Slide: showing structure - a structured document, with elements marked by tags; attributes modify behaviour, appearance or functionality

Slide: list of tags - also in the notes



Instructor Note

Jupyter Lab:

Right click on cell -> Open Variable Inspector (Jupyter Lab, not Notebook)

Shift+Enter - run cell



Instructor Note

  • Parse html using BeautifulSoup
  • Creates object with nested structure
  • Show without prettify first
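The parsing step above can be sketched on a small inline snippet (the HTML string here is a made-up example, not the lesson's document):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for the lesson's example document
html = ("<html><body><h1>Title</h1>"
        "<p>Some <a href='https://example.org'>link</a> text.</p></body></html>")

soup = BeautifulSoup(html, "html.parser")  # parse into a nested object

print(soup)             # compact, as-parsed representation
print(soup.prettify())  # indented view showing the nested structure
```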


Instructor Note

For demonstration, leave out numbers and execute one at a time



Instructor Note

  • use .get() to access attribute value
  • attribute name as parameter
  • .get('href')
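The .get() pattern can be sketched on a made-up anchor tag:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor tag for illustration
soup = BeautifulSoup('<a href="https://example.org" class="external">Example</a>',
                     "html.parser")
link = soup.a

print(link.get("href"))   # the attribute name is passed as the parameter
print(link.get("class"))  # multi-valued attributes come back as a list
```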


Instructor Note

Alternative, if using the links list created in the previous section:

PYTHON

first_link = {"element":str(links[0]),
              "url":links[0].get('href'),
              "text":links[0].get_text()}

NB: str() is needed on the element, otherwise you get the BeautifulSoup object rather than the actual string. It looks OK at this stage, but doesn't load into a DataFrame correctly if it isn't a string.



Instructor Note

  • Wrap up BeautifulSoup intro
  • Code to:
    • Extract all hyperlink elements in structured way
    • tag, url & display text
  • Use links list already created


Instructor Note

  • Create DataFrame column titles:
    • links_df = pd.DataFrame(link_info_list, columns=['element', 'url', 'text'])
  • index - whether to write row names; default True
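A minimal sketch of the DataFrame step, assuming the index note refers to the index parameter of to_csv (the link records here are hypothetical, shaped like the ones built in this episode):

```python
import io

import pandas as pd

# Hypothetical link records matching the structure built earlier
link_info_list = [{"element": "<a href='https://example.org'>Example</a>",
                   "url": "https://example.org",
                   "text": "Example"}]

links_df = pd.DataFrame(link_info_list, columns=["element", "url", "text"])

# index=False leaves the row names out of the output; the default is True
buffer = io.StringIO()
links_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```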


Instructor Note

  • Much more functionality than shown here
  • Can traverse the tree e.g. parent and sibling functions
  • Also useful for extracting information from any html document e.g. OCR output


Scraping a real website


Instructor Note

  • Using ‘requests’ package
  • Load package
  • Get url: .get(url)
  • Get html content: .text
  • tqdm is a Progress Bar library
  • regex: \s* means 0 or more spaces
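The requests and regex points above can be sketched as follows; `fetch_html` is a hypothetical helper (network access needed only when it is called), and the example string is made up:

```python
import re


def fetch_html(url):
    """Sketch: download a page and return its HTML as text."""
    # imported here so the sketch can be defined offline
    import requests

    response = requests.get(url)
    return response.text


# The regex \s* matches zero or more whitespace characters, e.g. for
# splitting scraped text like "Online,   Germany" on a comma plus spaces
cleaned = re.split(r",\s*", "Online,   Germany")
print(cleaned)
```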


Instructor Note

  • Truncated so not too long

  • Point out meta, link and script tags

  • Look at Upcoming webpage

  • Look at source code



Instructor Note

  • Difficult to find things
  • Search for “Upcoming workshops”
  • Take a look - difficult to work out
  • Show ‘Inspect’


Instructor Note

  • Inspect first workshop location

  • Show a tag

  • Show surrounding tags - contained in h3

  • Show expand div tags

  • Locate next location link

  • Use BeautifulSoup to parse html



Instructor Note

  • Use find_all to get h3 tags
  • the enumerate() function adds a counter to each item of an iterable, returning (index, element) pairs
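A minimal sketch of find_all plus enumerate, on a made-up page fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two h3 headings
html = "<h3>Workshop in Oslo</h3><h3>Workshop in Lima</h3>"
soup = BeautifulSoup(html, "html.parser")

# enumerate() pairs each h3 element with its index position
indexed = list(enumerate(soup.find_all("h3")))
for i, h3 in indexed:
    print(i, h3.get_text())
```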


Instructor Note

  • Sometimes useful to search by class
  • May allow more specific selection
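Searching by class can be sketched on a canned fragment (the class names here are hypothetical); note the trailing underscore in class_, since class is a Python keyword:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment where a class attribute distinguishes elements
html = '<div class="workshop"><h3 class="title">Oslo</h3></div><h3>Other</h3>'
soup = BeautifulSoup(html, "html.parser")

# Selecting only the h3 tags that carry the "title" class
titles = soup.find_all("h3", class_="title")
print(titles)
```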


Instructor Note

  • Get students to look for parent of first h3 tag

  • Demonstrate hovering over elements

  • Demonstrate collapsing elements

  • Important: understanding the tree structure

  • parent div has class attribute p-8 mb-5 border



Instructor Note

  • Show code below with print(str(div_firsth3)) first
  • Then prettify


Instructor Note

  • Examine output

  • h3 gives link to workshop website

  • Also get extra info - date, format, country etc

  • Can start to extract more information



Instructor Note

  • Ask - what if we want to find the information for all of the workshops?

  • Reuse code for dict_workshop

    • change div_firsth3 to item


Instructor Note

  • All data so far taken from single page

  • Information might be across several pages

  • May need to follow hyperlinks

  • Can loop using request for each link & parse with BeautifulSoup

  • Beware of sending too many requests & overloading web server

  • Use sleep() function to wait
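The looping pattern above can be sketched as a function; `fetch_all` and its arguments are hypothetical names, and it performs network requests only when called:

```python
from time import sleep


def fetch_all(urls, delay=1):
    """Sketch: request each page in turn and parse it, pausing between
    requests so we don't overload the web server."""
    # imported here so the sketch can be defined without network access
    import requests
    from bs4 import BeautifulSoup

    pages = []
    for url in urls:
        html = requests.get(url).text
        pages.append(BeautifulSoup(html, "html.parser"))
        sleep(delay)  # be polite: wait before sending the next request
    return pages
```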



Instructor Note

  • Now extract information from each workshop’s website

  • Use links in the ‘upcomingworkshops_df’ DataFrame

  • Use link to first workshop

  • df.loc - access group of rows and cols by label or boolean array
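The .loc step can be sketched with a stand-in DataFrame (the column name and URLs are made up to mimic `upcomingworkshops_df`):

```python
import pandas as pd

# Hypothetical stand-in for the upcomingworkshops_df DataFrame
upcomingworkshops_df = pd.DataFrame(
    {"link": ["https://example.org/ws1", "https://example.org/ws2"]})

# .loc selects rows and columns by label
first_link = upcomingworkshops_df.loc[0, "link"]
print(first_link)
```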



Instructor Note

  • Follow link to first page
  • View source
  • Information in ‘head’ not displayed directly on page
  • ‘meta’ tag - metadata - used by search engines
    • contains useful information
  • Extract metadata for first 5 websites
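Extracting the meta tags can be sketched on a canned head section (the attribute values here are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical head section with metadata of the kind used by search engines
html = ('<head><meta name="description" content="A data science workshop">'
        '<meta name="keywords" content="python, scraping"></head>')
soup = BeautifulSoup(html, "html.parser")

# Map each meta tag's name attribute to its content attribute
meta_info = {m.get("name"): m.get("content") for m in soup.find_all("meta")}
print(meta_info)
```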


Instructor Note

  • May get error if web page is unavailable
  • requests returns a status code
    • 404 if the page is unavailable
    • 200 if it was found
  • Show test for the status code
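The status-code check can be sketched as a small helper; `safe_get` is a hypothetical name, and the request happens only when it is called:

```python
def safe_get(url):
    """Sketch: return the page text only when the request succeeded."""
    # imported here so the sketch can be defined offline
    import requests

    response = requests.get(url)
    if response.status_code == 200:  # 200: page found
        return response.text
    return None  # e.g. 404: page unavailable
```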


Dynamic websites


Instructor Note

  • Go to films website

  • Select 2015

  • View page source

  • Can you find ‘Spotlight’?

  • JavaScript is used to load content dynamically

  • Use Inspect to look at Spotlight

  • Requests only accesses original html

  • Need Selenium package



Instructor Note

  • Selenium
    • for web browser automation
    • behaves like real user, interacting with web page in browser
    • renders web page, loading dynamic content
    • access full html after JavaScript executed
    • can also simulate interactions - e.g. clicking buttons


Instructor Note

  • webdriver - launch, simulate and interact with a web browser through code
  • By - how locate elements


Instructor Note

  • Selenium works with other browsers too
  • A Chrome window will open - don't close it
    • may need to open it from the bottom toolbar
    • split the windows to see both


Instructor Note

  • Slide find instructions
  • Inspect to find 2015 tag
  • id should be unique so use that


Instructor Note

  • use .click() method to interact with button
  • pause using sleep to give time to load table
  • get page source
  • close
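The click / wait / page_source sequence can be sketched as one function; the URL is passed in, the button id '2015' is taken from the lesson's inspect step, and the browser is launched only when the function is called:

```python
from time import sleep


def get_2015_html(url):
    """Sketch: open the page in Chrome, click the 2015 button, wait for
    the table to load, grab the rendered HTML, then close the browser."""
    # imported here so the sketch can be defined without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(url)
    driver.find_element(By.ID, "2015").click()  # ids should be unique
    sleep(3)                   # give the JavaScript time to load the table
    html = driver.page_source  # full html after the JavaScript has run
    driver.quit()
    return html
```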


Instructor Note

  • html_2015 contains dynamically loaded html

  • could use Selenium's find_element()

    • but here we switch back to BeautifulSoup
  • Show Inspect of Spotlight to locate tags

  • To get film title:

PYTHON

title = soup.find(class_='film').find(class_='film-title').get_text()
print(title)


Instructor Note

  • Headless mode - browser runs in background without opening window

  • Need to set up option for headless mode

  • Open webpage

  • Click to load 2015 data

  • Extract information from table one column at a time

    • Each column has unique class attribute
  • Use list comprehensions to extract data

  • For Best Picture, need to check if the element is present

  • Add to dataframe
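The headless setup and the column-at-a-time extraction can be sketched as follows; `make_headless_driver` is a hypothetical helper (the driver is created only when it is called), and the extraction runs on a canned fragment using the lesson's film-table class names:

```python
from bs4 import BeautifulSoup


def make_headless_driver():
    """Sketch: a Chrome driver that runs in the background, no window."""
    # imported here so the sketch can be defined without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    return webdriver.Chrome(options=options)


# Extracting one column at a time via its class attribute, shown here on
# a hypothetical fragment shaped like the films table
html = ('<div class="film"><span class="film-title">Spotlight</span></div>'
        '<div class="film"><span class="film-title">Carol</span></div>')
soup = BeautifulSoup(html, "html.parser")

# List comprehension pulls out every value in the film-title "column"
titles = [t.get_text() for t in soup.find_all(class_="film-title")]
print(titles)
```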



Instructor Note

Slide to show pipeline steps