Introduction to Web Scraping: Key Points

Introduction to Web Scraping

What is Web Scraping?

Humans are good at categorizing information; computers, not so much.
Often, data on a web site is not properly structured, making its extraction difficult.
Web scraping is the process of automating the extraction of data from web sites.
Tools may be available on a web page which enable data to be downloaded directly.

Anatomy of a web page

Every website is built on an HTML document that structures its content.
An HTML document is composed of elements, usually defined by an opening <tag> and a closing </tag>.
Elements can have attributes that define their properties, written as <tag attribute="value">.
CSS may be used to control the appearance of the rendered webpage.
Dynamic webpages may have content which isn’t loaded until the item is selected.
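
To make the element and attribute vocabulary above concrete, here is a minimal sketch that walks a small HTML document and lists every opening tag, its attributes, and every closing tag. It uses Python's standard html.parser module. The sample page, the ElementLister class, and the list_elements helper are all invented for illustration and are not part of the lesson.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Staff directory</h1>
    <p id="intro">Contact details for <a href="/people/ada">Ada</a>.</p>
  </body>
</html>
"""


class ElementLister(HTMLParser):
    """Record each opening tag with its attributes, and each closing tag."""

    def __init__(self):
        super().__init__()
        self.opened = []  # list of (tag, {attribute: value}) pairs
        self.closed = []  # list of tag names, in closing order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        self.opened.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.closed.append(tag)


def list_elements(html):
    """Return (opened, closed) for the given HTML text."""
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return parser.opened, parser.closed


opened, closed = list_elements(SAMPLE_HTML)
for tag, attributes in opened:
    print(tag, attributes)
```

Note that a parser like this only sees the HTML it is given. Content that a dynamic page loads later, after a user interaction, will not be in the document unless it has already been fetched.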

Manually scrape data using browser extensions: Using the Web Scraper Chrome extension

Data that is relatively well structured (in a table) is relatively easy to scrape.
More often than not, web scraping tools need to be told what to scrape.
jQuery selectors can be used to define more precisely what information is to be scraped.
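
As a rough illustration of why tabular data is the easy case, the sketch below pulls the rows out of an HTML table using only Python's standard library. It is not how the Web Scraper extension works, and it matches on tag names (tr, th, td) rather than on jQuery selectors. The sample table, the TableScraper class, and the scrape_table helper are hypothetical. The scraper still has to be told what to keep: here, only the text inside table cells.

```python
from html.parser import HTMLParser

# Hypothetical sample data, invented for illustration only.
SAMPLE_TABLE = """
<table id="members">
  <tr><th>Name</th><th>Constituency</th></tr>
  <tr><td>A. Example</td><td>North</td></tr>
  <tr><td>B. Sample</td><td>South</td></tr>
</table>
<p>Text outside the table is ignored.</p>
"""


class TableScraper(HTMLParser):
    """Collect the text of every th/td cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in html as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_TABLE)
header, records = rows[0], rows[1:]
for record in records:
    print(dict(zip(header, record)))
```

On less regular pages, tag names alone are rarely enough, which is where more precise selectors become useful.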

Ethics and Legality of Web Scraping

Web scraping is, in general, legal and won’t get you into trouble.
Always review and respect a website’s Terms of Service (TOS) before scraping its content.
There are a few things to be careful about: notably, don't overwhelm a web server and don't steal content.
Be nice. When in doubt, ask.
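
One way to put "don't overwhelm a web server" into practice is sketched below, under stated assumptions. The lesson does not mention robots.txt; it is used here as an assumed, commonly published signal of what a site allows, and it does not replace reading the Terms of Service. The robots.txt content, the URLs, the "example-scraper" user agent, and the helper functions are all hypothetical. The actual download step is passed in as a function so the sketch makes no network requests.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_urls(rules, urls, user_agent="example-scraper"):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # spread requests out over time
        results.append(fetch(url))
    return results


rules = build_rules(ROBOTS_TXT)
candidates = [
    "https://example.org/public/index.html",
    "https://example.org/private/report.html",
]
allowed = polite_urls(rules, candidates)

# Use the site's requested delay if it states one, else an assumed 1 second.
delay = rules.crawl_delay("example-scraper") or 1.0

# A stand-in for a real download function; it only echoes the URL.
pages = fetch_politely(allowed, lambda url: "fetched " + url, delay_seconds=delay)
print(pages)
```

The delay and the filtering are kept separate so that either can be adjusted to match what a site's Terms of Service or its maintainers actually ask for.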