What is Web Scraping?
- Humans are good at categorizing information; computers are much less so.
- Often, data on a web site is not properly structured, making its extraction difficult.
- Web scraping is the process of automating the extraction of data from web sites.
- Some web pages offer tools that let you download their data directly.
Anatomy of a web page
- Every website is built on an HTML document that structures its content.
- An HTML document is composed of elements, usually defined by an opening `<tag>` and a closing `</tag>`.
- Elements can have attributes that define their properties, written as `<tag attribute_name="value">`.
- CSS may be used to control the appearance of the rendered webpage.
- Dynamic webpages may have content that isn't loaded until the user selects an item.
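To make the element and attribute structure concrete, here is a minimal Python sketch that walks a small HTML snippet and lists each opening tag with its attributes. It uses only the standard library's `html.parser`. The sample page, the `ElementLister` class, and the `list_elements` helper are hypothetical names invented for this illustration and are not part of the lesson.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Members</h1>
    <p>See the <a href="https://example.org/list">full list</a>.</p>
  </body>
</html>
"""


class ElementLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


def list_elements(html):
    """Return a list of (tag, attributes) tuples in document order."""
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    for tag, attributes in list_elements(SAMPLE_HTML):
        print(tag, attributes)
```

Each tuple pairs an element name with a dictionary of its attributes, which is the same information a scraping tool relies on to locate content in a page.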
Manually scrape data using browser extensions: using the Web Scraper Chrome extension
- Data that is relatively well structured (in a table) is relatively easy to scrape.
- More often than not, web scraping tools need to be told what to scrape.
- jQuery selectors can be used to define more precisely what information is to be scraped.
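The idea of telling a tool exactly what to scrape can also be sketched outside the browser. The Python example below is a stand-in for the extension, not a description of how it works: it uses the standard library's `xml.etree.ElementTree`, whose limited path syntax plays a role similar to a selector such as `table.members tr`. The sample page and the `scrape_members` function are hypothetical, and the approach assumes the markup is well-formed XML, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page invented for this illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <table class="members">
      <tr><th>Name</th><th>Constituency</th></tr>
      <tr><td>A. Example</td><td>North</td></tr>
      <tr><td>B. Sample</td><td>South</td></tr>
    </table>
    <table class="footer">
      <tr><td>Contact us</td></tr>
    </table>
  </body>
</html>
"""


def scrape_members(page):
    """Return the data rows of the table whose class is 'members'."""
    root = ET.fromstring(page.strip())
    # The path plays the role of a selector: only rows of the
    # 'members' table are matched, the footer table is ignored.
    rows = root.findall(".//table[@class='members']/tr")
    records = []
    for row in rows:
        cells = [cell.text for cell in row.findall("td")]
        if cells:  # the header row holds only <th> cells, so skip it
            records.append(cells)
    return records


if __name__ == "__main__":
    for record in scrape_members(SAMPLE_PAGE):
        print(record)
```

Because the data sits in a table with a distinguishing class, a single path expression is enough to isolate it. Less structured pages need more specific selectors.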
Ethics and Legality of Web Scraping
- Web scraping is, in general, legal and won’t get you into trouble.
- Always review and respect a website’s Terms of Service (TOS) before scraping its content.
- Be careful about a few things: notably, don’t overwhelm a web server with requests and don’t steal content.
- Be nice. When in doubt, ask.
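One way to act on the advice about not overwhelming a server is to check a site's `robots.txt` rules and pause between requests. The Python sketch below uses the standard library's `urllib.robotparser` for this. The sample rules, the `example-bot` user agent, the URLs, and the helper functions are all hypothetical, and `fetch` is a placeholder for whatever download function you use. Note that `robots.txt` is separate from a site's Terms of Service, which still need to be read by a person.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that the rules permit for this user agent."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_visit(urls, delay_seconds, fetch):
    """Call fetch(url) for each URL, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # give the server a rest
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    rules = build_rules(SAMPLE_ROBOTS_TXT)
    candidates = [
        "https://example.org/members",
        "https://example.org/private/notes",
    ]
    permitted = allowed_urls(rules, "example-bot", candidates)
    # Use the site's requested delay if it states one, else one second.
    delay = rules.crawl_delay("example-bot") or 1
    # The lambda stands in for a real download function.
    print(polite_visit(permitted, delay, lambda url: "would fetch " + url))
```

Passing the download function in as a parameter keeps the pacing logic separate from the network code, so the delay can be tested without contacting any server.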