What is Web Scraping?
- Humans are good at categorizing information; computers are much less so.
- Often, data on a web site is not properly structured, making its extraction difficult.
- Web scraping is the process of automating the extraction of data from web sites.
- Tools may be available on a web page which enable data to be downloaded directly.
Anatomy of a web page
- Every website is built on an HTML document that structures its content.
- An HTML document is composed of elements, usually defined by an opening <tag> and a closing </tag>.
- Elements can have attributes that define their properties, written as <tag attribute_name="value">.
- CSS may be used to control the appearance of the rendered webpage.
- Dynamic webpages may contain content that isn’t loaded until the user interacts with the page, for example by selecting an item.
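
The element and attribute structure described above can be inspected programmatically. The following is a minimal sketch using Python's standard-library `html.parser`; the sample markup, the `ElementLister` class and the `list_elements` helper are assumptions made up for illustration, not part of any particular scraping tool.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="title">Members</h1>
    <a href="/members/1" id="first">Alice</a>
    <a href="/members/2">Bob</a>
  </body>
</html>
"""


class ElementLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag, {attribute_name: value})

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


def list_elements(html):
    """Return the opening tags found in html, in document order."""
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return parser.elements


for tag, attributes in list_elements(SAMPLE_HTML):
    print(tag, attributes)
```

Each printed line pairs a tag name with a dictionary of its attributes, which is the same information a browser's "inspect element" view exposes. Content that a dynamic page loads later through scripts would not appear in markup fetched this way.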
Manually scrape data using browser extensions
Using the Web Scraper Chrome extension
- Data that is relatively well structured (in a table) is relatively easy to scrape.
- More often than not, web scraping tools need to be told what to scrape.
- jQuery selectors can be used to define more precisely what information is to be scraped.
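
To illustrate what "telling a tool what to scrape" means, here is a rough sketch in Python. It does not use jQuery or the Web Scraper extension: it swaps in the limited XPath subset offered by the standard-library `xml.etree.ElementTree`, which plays a similar role of naming exactly which elements to extract. The sample markup, the class names and the `scrape_rows` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment. Real pages are often messier and
# usually need a forgiving HTML parser rather than an XML one.
SAMPLE_TABLE = """
<div>
  <table class="nav"><tr><td>Home</td><td>About</td></tr></table>
  <table class="members">
    <tr><th>Name</th><th>Constituency</th></tr>
    <tr><td>Alice Example</td><td>North</td></tr>
    <tr><td>Bob Example</td><td>South</td></tr>
  </table>
</div>
"""


def scrape_rows(markup, table_class):
    """Return the data rows of the table carrying the given class."""
    root = ET.fromstring(markup.strip())
    rows = []
    # The selector picks the table by its class attribute, then its rows.
    for tr in root.findall(f".//table[@class='{table_class}']/tr"):
        cells = [(td.text or "").strip() for td in tr.findall("td")]
        if cells:  # header rows contain only <th>, so they are skipped
            rows.append(cells)
    return rows


print(scrape_rows(SAMPLE_TABLE, "members"))
```

Without the class filter, the navigation table would be scraped as well; the selector is what restricts the result to the intended data.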
Ethics and Legality of Web Scraping
- When web scraping you need to consider copyright, database rights, data protection and website terms.
- A UK exception allows non-commercial text and data mining only with lawful access and proper acknowledgement.
- Commercial scraping requires following the website's terms of service and its robots.txt file.
- For all web scraping you need to avoid any circumvention of technical barriers.
- Key risks include collecting personal data, overwhelming servers, and inadvertently infringing rights; using APIs or asking data owners is often safer.
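
Two of the points above, respecting robots.txt and not overwhelming servers, can be partly automated. The sketch below uses Python's standard-library `urllib.robotparser`; the robots.txt content, the URLs and the `example-research-bot` user agent are invented for illustration, and no network request is made. Passing this check says nothing about copyright, data protection or terms of service, which still need to be considered separately.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and user agent, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""
USER_AGENT = "example-research-bot"


def build_parser(robots_txt):
    """Parse robots.txt text that has already been obtained."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(SAMPLE_ROBOTS_TXT)
urls = [
    "https://example.org/members/",
    "https://example.org/private/contacts",
]

# Use the site's requested delay if one is given, otherwise wait 1 second.
delay = parser.crawl_delay(USER_AGENT) or 1

for url in allowed_urls(parser, urls):
    print("would fetch:", url)  # a real scraper would request the page here
    time.sleep(delay)  # pause between requests to limit server load
```

Only the first URL is reported, because the second falls under the disallowed `/private/` path.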