Instructor Notes

What is Web Scraping?


Instructor Note

  • Sign in
  • Setup
  • Introduce self
  • Code of conduct
  • Intro questions
      1. Do you have a specific web scraping application in mind?
      2. Is there anything in particular which you would like to learn in this session?
  • What is web scraping?
    • Extracting information from websites
    • Can be done manually - faster to automate
    • Collect data in usable format, e.g. .csv
    • Similar to web indexing - more targeted
  • Example - Need to understand structure of a webpage
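
Optional demo for the "usable format" point: a minimal sketch of writing collected records out as CSV text with Python's standard csv module. The records and the helper name to_csv are made up for illustration; they are not taken from any real page.

```python
import csv
import io

# Hypothetical records, standing in for data collected from a web page.
records = [
    {"name": "Member A", "constituency": "Riding 1", "party": "Party X"},
    {"name": "Member B", "constituency": "Riding 2", "party": "Party Y"},
]


def to_csv(rows):
    """Return the rows as CSV text, with a header line taken from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

The same writer can target a file opened with open("members.csv", "w", newline="") instead of the in-memory buffer.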



Instructor Note

  • Point out:
    • Looks quite well ordered
    • Refine search
    • Export options
    • Human can easily work out what data represents
    • Computer needs more information
  • Slide - html
    • Can see it is reasonably well structured
    • Looks like table of data on webpage - quite different in the code
    • Could get computer to pick out specific information


Instructor Note

  • Canadian MPs - well structured
  • What if data isn’t organised in such an obvious way?
    • Unstructured
  • British MPs
  • Similar data - no way to download
  • Slide -> UK MPs html
    • Data structured for viewing - table of cards
    • Less easy to see how data would be gathered
  • Process automated by web scraping
  • Slide -> definition


Instructor Note

  • Show UK MP tools
    • Make sure web scraping is really necessary


Anatomy of a web page


Instructor Note

  • html - uses tags to organise and format content.
  • Slide - ask what people think it will do
  • structured doc, elements marked by tags
  • attributes - modify behaviour, appearance or functionality
  • list of tags in notes
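
Optional demo for tags and attributes: a small sketch, using Python's standard html.parser module, that lists each opening tag with its attributes. The sample page and the class name TagLister are invented for this note.

```python
from html.parser import HTMLParser

# Made-up page for illustration only.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main-title">A heading</h1>
    <p class="intro">Some text with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)
for tag, attributes in lister.found:
    print(tag, attributes)
```

Useful for showing that id, class and href are attributes attached to a tag, not part of the visible content.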


Instructor Note

  • CSS gives separation between display format and content
  • uses rules applied to html elements by selectors
  • can be useful for targeting elements when scraping

  • Slide - CSS added to head tag
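
Optional demo for targeting elements by the same hooks CSS uses: a sketch with Python's standard xml.etree.ElementTree, which supports a limited XPath subset, not CSS selectors themselves. The snippet and the class names member-name and party are assumptions, and the parser only accepts well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; real pages are often messier than this.
SAMPLE = """
<html>
  <head>
    <style>.member-name { font-weight: bold; }</style>
  </head>
  <body>
    <p class="member-name">Member A</p>
    <p class="party">Party X</p>
    <p class="member-name">Member B</p>
    <p class="party">Party Y</p>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of the CSS selector p.member-name.
# This is an exact match on the class attribute, so it would
# miss an element written as class="member-name extra".
names = [p.text for p in root.findall(".//p[@class='member-name']")]
print(names)
```

The class that the style rule uses for display is the same one used here to pick out the data.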



Instructor Note

  • Web pages tend to be generated by other programs
    • Adds to the complexity
  • Source of this page initially seems complex
    • hunt for the tags
    • Ctrl+f to search


Instructor Note

  • Back to Canadian MPs page
  • May need to expand elements
  • Member data stored in a table - table, tr and td tags
  • Hover over element in console to highlight on page
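
Optional demo for the table, tr and td structure: a sketch that collects cell text row by row using Python's standard html.parser. The table below is invented and much simpler than the real page; TableReader is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Made-up table; the real page has more columns and extra markup.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Constituency</th></tr>
  <tr><td>Member A</td><td>Riding 1</td></tr>
  <tr><td>Member B</td><td>Riding 2</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Collect the text of each td cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag == "td" and self.current_row is not None:
            self.in_cell = True
            self.current_row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            if self.current_row:  # rows with only th cells are skipped
                self.rows.append([cell.strip() for cell in self.current_row])
            self.current_row = None


reader = TableReader()
reader.feed(SAMPLE_TABLE)
print(reader.rows)
```

Each tr becomes one list and each td one item in it, which mirrors what learners see when expanding elements in the inspector.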


Instructor Note

  • View Page Source shows html from when the page is loaded
  • Selecting 2015 triggers the script
    • Inspect then displays the rendered html


Manually scrape data using browser extensions

Using the Web Scraper Chrome extension


Instructor Note

  • Web Scraper works with a Sitemap

  • Should see sitemap generated by previous exercise

  • May need to spend time inspecting code to decide on best selectors to use

  • Create new sitemap

    • Can just select pages 1 and 2 OR
    • Use search to select Lib Dem so that only 4 pages
  • Add new selector

    • Show different selectors available


Instructor Note

  • Information is on selected page
    • Click on selector - breadcrumb changes


Instructor Note

  • Slide - code for Diane Abbott email address
  • Email address is in the 2nd contact-line
  • Only one contact-line for Jack Abbott
  • On inspection can search for ‘Email Address’
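
Optional demo of why position is unreliable: a sketch that finds the email by its ‘Email Address’ label and not by taking the 2nd contact-line. The markup, the member class, the ids and the addresses are all hypothetical; only the contact-line class name and the label text come from the notes above. It uses Python's standard xml.etree.ElementTree, so it assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: layout, ids and addresses are assumptions, not the real page.
SAMPLE = """
<div>
  <div class="member" id="a">
    <div class="contact-line"><span>Phone</span><span>000 0000</span></div>
    <div class="contact-line"><span>Email Address</span><span>member.a@example.org</span></div>
  </div>
  <div class="member" id="b">
    <div class="contact-line"><span>Email Address</span><span>member.b@example.org</span></div>
  </div>
</div>
"""


def find_email(member):
    """Return the value from the contact line labelled 'Email Address', or None."""
    for line in member.findall("./div[@class='contact-line']"):
        spans = line.findall("span")
        if len(spans) == 2 and spans[0].text == "Email Address":
            return spans[1].text
    return None


root = ET.fromstring(SAMPLE)
emails = {
    member.get("id"): find_email(member)
    for member in root.findall("./div[@class='member']")
}
print(emails)
```

Taking the 2nd contact-line would work for member a but find nothing for member b, which has only one.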