What is Web Scraping?
- Humans are good at categorizing information, computers not so much.
- Often, data on a web site is not properly structured, making its extraction difficult.
- Web scraping is the process of automating the extraction of data from web sites.
Anatomy of a web page
- Every website is built on an HTML document that structures its content.
- An HTML document is composed of elements, usually defined by an opening <tag> and a closing </tag>.
- Elements can have attributes that define their properties, written as <tag attribute_name="value">.
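To make the anatomy concrete, here is a minimal sketch that walks a small HTML document and lists each element with its attributes. It uses only Python's standard-library `html.parser` module. The sample markup, the `ElementLister` class name, and the example.org URLs are made up for illustration and do not come from any particular site.

```python
from html.parser import HTMLParser

# Hypothetical sample document, invented for this illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h1 id="title">Members</h1>
    <a href="https://example.org/a" class="profile">Alice</a>
    <a href="https://example.org/b" class="profile">Bob</a>
  </body>
</html>
"""


class ElementLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


parser = ElementLister()
parser.feed(SAMPLE_HTML)

for tag, attrs in parser.elements:
    print(tag, attrs)
```

Each printed line pairs an element name with a dictionary of its attributes, which is the same opening-tag structure described above.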
Manually scrape data using browser extensions
- Data that is already well structured (for example, in a table) is relatively easy to scrape.
- Tools may be available on a web page which enable data to be downloaded directly.
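The reason tabular data is easy to scrape is that rows and cells are already marked up explicitly. As a rough sketch of what a browser extension does behind the scenes, the following code pulls the rows out of an HTML table using only the standard library. The table contents and the `TableExtractor` name are hypothetical, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser

# Hypothetical table, invented for this illustration.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Constituency</th></tr>
  <tr><td>Alice Example</td><td>North</td></tr>
  <tr><td>Bob Example</td><td>South</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


extractor = TableExtractor()
extractor.feed(SAMPLE_TABLE)

for row in extractor.rows:
    print(row)
```

The first row holds the column headers and the remaining rows hold the data, so the result can be written straight to a CSV file or a spreadsheet.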
Ethics and Legality of Web Scraping
- Web scraping is, in general, legal and won’t get you into trouble.
- There are a few things to be careful about, notably: don’t overwhelm a web server and don’t steal content.
- Be nice. When in doubt, ask.
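One way to avoid overwhelming a server is to respect the site's robots.txt rules and to pause between requests. The sketch below shows that idea with the standard-library `urllib.robotparser` module. The robots.txt content, the `example-scraper` user agent, the `fetch_politely` helper, and the example.org URLs are all assumptions made for illustration; a real scraper would load the rules from the site itself and pass in a function that actually downloads pages.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def fetch_politely(urls, fetch, user_agent="example-scraper",
                   default_delay=1.0, sleep=time.sleep):
    """Fetch only the URLs that robots.txt allows, pausing after each one.

    `fetch` is any function that takes a URL and returns its content.
    """
    # Use the site's requested delay if it states one, else a fallback.
    delay = robots.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # the site asked scrapers to stay out of this path
        results[url] = fetch(url)
        sleep(delay)
    return results


urls = [
    "https://example.org/members.html",
    "https://example.org/private/data.html",
]
for url in urls:
    print(url, robots.can_fetch("example-scraper", url))
```

Following robots.txt and rate-limiting requests addresses server load only. Whether the content may be reused is a separate question, and asking the site owner remains the safest route.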