Ethics and Legality of Web Scraping

Last updated on 2026-01-14

Overview

Questions

  • When is web scraping OK and when is it not?
  • Is web scraping legal? Can I get into trouble?
  • What are some ethical considerations to make?
  • What can I do with the data that I’ve scraped?

Objectives

After completing this episode, participants should be able to…

  • Discuss the legal and ethical implications of web scraping
  • Establish a code of conduct

Now that we have seen several different ways to scrape data from websites and are ready to start working on potentially larger projects, we may ask ourselves whether there are any legal implications of writing a piece of computer code that downloads information from the Internet.

In this section, we will be discussing some of the issues to be aware of when scraping websites, and we will establish a code of conduct (below) to guide our web scraping projects.

The internet isn’t as open as it once was. What used to be a vast, freely accessible source of information has become a valuable reservoir of data, especially for training machine learning and generative AI models. In response, many social media platforms and website owners have either started monetizing access to their data or taken steps to protect their resources from being overwhelmed by automated bots.

As a result, it’s increasingly common for websites to include explicit prohibitions against web scraping in their Terms of Service (TOS). To avoid legal or ethical issues, it’s essential to check both the TOS and the site’s robots.txt file before scraping.

You can usually find a site’s robots.txt file by appending /robots.txt to the root of the domain—for example: https://facebook.com/robots.txt (not https://facebook.com/user/robots.txt). Both the TOS and robots.txt will help you understand what is allowed and what isn’t, so it’s important to review them carefully before proceeding.
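If you want to check a robots.txt file programmatically rather than by eye, Python's standard library includes a parser for exactly this. The sketch below parses a small, made-up robots.txt (the rules are hypothetical, for illustration only) and asks whether a given path may be fetched:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything is allowed except /private/
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether a given user agent may request a given path
print(parser.can_fetch("*", "/public/page"))   # True: allowed
print(parser.can_fetch("*", "/private/data"))  # False: disallowed
```

For a live site, you would instead call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, which fetches and parses the file in one step.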

Discussion

Challenge

Visit Facebook’s Terms of Service and its robots.txt file. What do they say about web scraping or collecting data using automated means? Compare it to Reddit’s TOS and Reddit’s robots.txt.

Don’t break the web: Denial of Service attacks


The first and most important thing to be careful about when writing a web scraper is that it typically involves querying a website repeatedly and accessing a potentially large number of pages. For each of these pages, a request is sent to the web server hosting the site, and the server has to process it and send a response back to the computer running our code. Each request consumes resources on the server, time during which it cannot do anything else, such as responding to other people trying to access the same site.

If we send too many such requests over a short span of time, we can prevent other “normal” users from accessing the site during that time, or even cause the server to run out of resources and crash.

In fact, this is such an effective way to disrupt a website that attackers often do it on purpose. This is called a Denial of Service (DoS) attack.

Since DoS attacks are unfortunately a common occurrence on the Internet, modern web servers include measures to ward off such illegitimate use of their resources. They watch for large numbers of requests appearing to come from a single computer or IP address, and their first line of defense is often to refuse any further requests from that IP address.

A web scraper, even one with legitimate purposes and no intent to bring a website down, can exhibit similar behaviour and, if we are not careful, get our computer banned from accessing the website.

The good news is that a good web scraper, such as the WebScraper extension used in this lesson, recognizes that this is a risk and includes measures to prevent our code from appearing to launch a DoS attack on a website. This is mostly done by inserting a random delay between individual requests, which gives the target server enough time to handle requests from other users between ours.
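If you write your own scraping code instead of using an extension, the same idea applies. Below is a minimal sketch of the random-delay strategy, assuming a scraping loop over some list of pages (the page names and the delay bounds are made up for the demo; the actual fetching code is omitted):

```python
import random
import time

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep for a random interval so consecutive requests are spaced out,
    giving the server time to handle other users' requests between ours."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Hypothetical scraping loop: fetch each page, then pause before the next.
for page in ["page1", "page2", "page3"]:
    # ... fetch and parse `page` here ...
    waited = polite_delay(0.1, 0.2)  # short bounds just for this demo
    print(f"waited {waited:.2f}s before the next request")
```

Randomizing the delay, rather than using a fixed interval, also makes the traffic pattern look less machine-like, which some servers treat as a signal of abuse.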


It is important to recognize that in certain circumstances web scraping can be illegal. If the terms and conditions of the web site we are scraping specifically prohibit downloading and copying its content, then we could be in trouble for scraping it.

In practice, however, web scraping is a tolerated practice, provided reasonable care is taken not to disrupt the “regular” use of a web site, as we have seen above.

In a sense, web scraping is no different from using a web browser to visit a web page: both amount to using computer software (a browser vs. a scraper) to access data that is publicly available on the web.

In general, if data is publicly available (the content being scraped is not behind a password-protected authentication system), then it is OK to scrape it, provided we don’t break the website doing so. What is potentially problematic is sharing the scraped data further. For example, downloading content from one website and posting it on another (as our own), unless explicitly permitted, would constitute copyright violation and be illegal.

However, most copyright legislation recognizes cases in which reusing some, possibly copyrighted, information in an aggregate or derivative format is considered “fair use”. In general, unless the intent is to pass the data off as our own, copy it word for word, or make money from it, reusing publicly available content scraped off the internet is OK.

Better safe than sorry

Be aware that copyright and data privacy legislation typically differs from country to country. Be sure to check the laws that apply in your context. For example, in Australia, it can be illegal to scrape and store personal information such as names, phone numbers and email addresses, even if they are publicly available.

If you are looking to scrape data for your own personal use, then the above guidelines should probably be all that you need to worry about. However, if you plan to start harvesting a large amount of data for research or commercial purposes, you should probably seek legal advice first.

If you work in a university, chances are it has a copyright office that will help you sort out the legal aspects of your project. The university library is often the best place to start looking for help on copyright.

Be nice: ask and share


Depending on the scope of your project, it might be worthwhile to ask the owners or curators of the data you are planning to scrape whether they already have it available in a structured format that could suit your project. If your aim is to use their data for research, or in a way that could potentially interest them, not only could this save you the trouble of writing a web scraper, but it could also clarify straight away what you can and cannot do with the data.

On the other hand, when you publish your own data, as part of a research project, documentation, or a public website, you might want to consider whether someone could be interested in getting your data for their own project. If you can, provide others with a way to download your raw data in a structured format, and save them the trouble of trying to scrape your pages!

Web scraping Code of Conduct


To conclude, here is a brief code of conduct you should keep in mind when doing web scraping:

  1. Ask nicely whether you can access the data in another way. If your project relies on data from a particular organization, consider reaching out to them directly or checking whether they provide an API. With a bit of luck they might offer the data you need in a structured format, saving you time and effort.

  2. Don’t download content that’s clearly not public. For example, academic journal publishers often impose strict usage restrictions on their databases. Mass-downloading PDFs can violate these rules and may get you (or your university librarian) into trouble.

    If you need local copies for a legitimate reason (e.g., text mining), special agreements may be possible. Your university library is a good place to start exploring those options.

  3. Check your local legislation. Many countries have laws protecting personal information, such as email addresses or phone numbers. Even if this data is visible on a website, scraping it could be illegal depending on your jurisdiction (e.g., in Australia).

  4. Don’t share scraped content illegally. Scraping for personal use is often considered fair use, even when it involves copyrighted material. But sharing that data, especially if you don’t have the rights to distribute it, can be illegal.

  5. Share what you can. If the scraped data is public domain or you’ve been granted permission to share it, consider publishing it for others to reuse (e.g., on datahub.io). Also, if you wrote a scraper to access it, sharing your code (e.g., on GitHub) can help others learn from and build on your work.

  6. Publish your own data in a reusable way. Make it easier for others by offering your data in open, software-agnostic formats like CSV, JSON, or XML. Include metadata that describes the content, origin, and intended use of the data. Ensure it’s accessible and searchable by search engines.

  7. Don’t break the Internet. Some websites can’t handle high volumes of requests. If your scraper is recursive (i.e., it follows links), test it first on a small subset.

    Be respectful by setting delays between requests and limiting the rate of access. You’ll learn more about how to do this in the next episode.

Following these guidelines helps ensure that your scraping is ethical, legal, and considerate of the broader web ecosystem.

Other notes


  • Boundaries for using/modifying user-agents
  • Obey robots.txt
  • Respect rate limits
  • If unsure, contact the site administrator to seek permission to scrape
  • Copyright laws: you can legally scrape publicly available data, but republishing or reusing it may require permission

Key Points

  • Web scraping of publicly available data is generally tolerated, but it can be illegal in some circumstances, so take care.
  • Always review and respect a website’s Terms of Service (TOS) before scraping its content.
  • There are a few things to be careful about, notably don’t overwhelm a web server and don’t steal content.
  • Be nice. When in doubt, ask.