Instructor Notes
What is Web Scraping?
Instructor Note
- Sign in
- Setup
- Introduce self
- Code of conduct
- Intro questions
- Do you have a specific web scraping application in mind?
- Is there anything in particular which you would like to learn in this session?
- What is web scraping?
- Extracting information from websites
- Can be done manually - faster to automate
- Collect data in usable format, e.g. .csv
- Similar to web indexing - more targeted
Example - Need to understand structure of a webpage
Instructor Note
- Point out:
- Looks quite well ordered
- Refine search
- Export options
- Human can easily work out what data represents
- Computer needs more information
- Slide - html
- Can see reasonably well structured
- Looks like table of data on webpage - quite different in the code
- Could get computer to pick out specific information
Instructor Note
- Canadian MPs - well structured
- What if data isn’t organised in such an obvious way?
- Unstructured
- British MPs
- Similar data - no way to download
- Slide -> UK MPs html
- Data structured for viewing - table of cards
- Less easy to see how data would be gathered
- Process automated by web scraping
- Slide -> definition
Instructor Note
- Show UK MP tools
- Make sure really necessary to web scrape
Anatomy of a web page
Instructor Note
- html - uses tags to organise and format content.
- Slide - ask what people think it will do
- structured doc, elements marked by tags
- attributes - modify behaviour, appearance or functionality
- list of tags in notes
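Optional demo - a minimal sketch, using only Python's standard-library html.parser, of how a program sees tags and attributes. The SAMPLE markup, the TagLister class and the list_tags helper are made up for illustration and are not part of the lesson material.

```python
from html.parser import HTMLParser

# Hypothetical snippet, invented for this demo only.
SAMPLE = """
<html>
  <body>
    <h1>Members</h1>
    <p class="intro">A <a href="https://example.org/list">link</a> to the list.</p>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, dict(attrs)))


def list_tags(html_text):
    """Return a list of (tag, attributes) in document order."""
    parser = TagLister()
    parser.feed(html_text)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    for tag, attrs in list_tags(SAMPLE):
        print(tag, attrs)
```

Worth pointing out that the attributes (class, href) travel with the tag - this is what a scraper later uses to tell elements apart.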
Instructor Note
- CSS gives separation between display format and content
- uses rules applied to html elements by selectors
- can be useful for targeting elements when scraping
Slide - CSS added to head tag
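Optional demo - a rough sketch of why selectors help when scraping: the same class name a CSS rule uses for styling can be used to pick out elements. This is a hand-rolled stand-in for a ".name" class selector built on the standard-library html.parser, not a real CSS selector engine. The SAMPLE markup and the names ClassTextCollector and select_by_class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the rule in <head> styles every element with class "name".
SAMPLE = """
<html>
  <head>
    <style>.name { font-weight: bold; }</style>
  </head>
  <body>
    <p class="name">First Person</p>
    <p class="party">Party A</p>
    <p class="name highlight">Second Person</p>
  </body>
</html>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given class.

    Assumes well-formed markup with no unclosed void tags (e.g. <br>)
    inside the matched elements.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select_by_class(html_text, wanted_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html_text)
    parser.close()
    return [text.strip() for text in parser.results]


if __name__ == "__main__":
    print(select_by_class(SAMPLE, "name"))
```

Note that an element with several classes ("name highlight") still matches, just as it would for the CSS rule.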
Instructor Note
- Web pages tend to be generated by other programs
- Adds to the complexity
- Source of this page initially seems complex
- hunt for the tags
- Ctrl+f to search
Instructor Note
- Back to Canadian MPs page
- May need to expand elements
- Member data stored in a table - table, tr and td tags
- Hover over element in console to highlight on page
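Optional demo - a minimal sketch of how the table, tr and td tags map onto rows and columns, and how the result could be written out as .csv. It uses only the standard library. The SAMPLE table is invented and does not reproduce the real page; TableParser, table_to_rows and rows_to_csv are hypothetical helper names.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table, invented for this demo (not the real MPs page).
SAMPLE = """
<table>
  <tr><th>Name</th><th>Constituency</th></tr>
  <tr><td>Person One</td><td>Riding A</td></tr>
  <tr><td>Person Two</td><td>Riding B</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Turn tr/th/td tags into a list of rows, each a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # each tr starts a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # each td/th starts a new cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def table_to_rows(html_text):
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def rows_to_csv(rows):
    """Serialise rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(table_to_rows(SAMPLE)))
```

This sketch assumes a single, non-nested table; real pages may need more care.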
Instructor Note
- View Page Source shows html from when the page is loaded
- Selecting 2015 triggers the script
- Inspect then displays the rendered html
Manually scrape data using browser extensions
Using the Web Scraper Chrome extension
Instructor Note
- Web Scraper works with a Sitemap
- Should see sitemap generated by previous exercise
- May need to spend time inspecting code to decide on best selectors to use
- Create new sitemap
- Can just select pages 1 and 2 OR
- Use search to select Lib Dem so that only 4 pages
- Add new selector
- Show different selectors available
Instructor Note
- Information is on selected page
- Click on selector - breadcrumb changes
Instructor Note
- Slide - code for Diane Abbott email address
- 2nd contact-line
- only one for Jack Abbott
- on inspection can search for ‘Email Address’
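Optional demo - a sketch of why searching for the 'Email Address' label is safer than picking the 2nd contact-line by position. The contact-line class name comes from the slide; everything else (the card markup, the placeholder addresses, ContactLineCollector and find_emails) is hypothetical and only approximates the real page.

```python
from html.parser import HTMLParser

LABEL = "Email Address"

# Hypothetical cards: the first has two contact lines, the second only one,
# so "take the 2nd contact-line" would fail on the second card.
SAMPLE = """
<div class="card">
  <div class="contact-line"><span>Phone Number</span> 000 0000</div>
  <div class="contact-line"><span>Email Address</span>
    <a href="mailto:member.a@example.org">member.a@example.org</a></div>
</div>
<div class="card">
  <div class="contact-line"><span>Email Address</span>
    <a href="mailto:member.b@example.org">member.b@example.org</a></div>
</div>
"""


class ContactLineCollector(HTMLParser):
    """Collect the full text of every element with class "contact-line"."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a contact-line element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif "contact-line" in classes:
            self.depth = 1
            self.lines.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.lines[-1] += data


def find_emails(html_text):
    """Return the value of each contact line that starts with the label."""
    parser = ContactLineCollector()
    parser.feed(html_text)
    parser.close()
    emails = []
    for line in parser.lines:
        text = " ".join(line.split())  # collapse whitespace and newlines
        if text.startswith(LABEL):
            emails.append(text[len(LABEL):].strip(" :"))
    return emails


if __name__ == "__main__":
    print(find_emails(SAMPLE))
```

Matching on the label finds the address in both cards, regardless of how many contact lines each one has.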