I got a great response from Aaron to my “Ethics of Web Crawling” post from yesterday and decided to find out where I stood legally on this issue. Luckily, my brother is a lawyer who happens to specialize in intellectual property law.
I’d like to stress that this is NOT legal advice, but just the opinion of one lawyer who hasn’t actually seen any of the relevant information. Please do NOT act on this advice. If you have a real issue, consult a lawyer.
That said, here are the legal issues.
1. Is this fair use?
Answer: Kind of. Fair use refers to the tenet in copyright law that allows someone to use copyrighted material without the permission of the copyright holder for such things as reviews and academic pursuits. In this case, my use of the data I crawled would be for academic purposes and so would fall under fair use.
However, my brother actually thinks that the data do not fall under any copyright protection at all. He explained that you cannot copyright data, only the presentation of that data. So for example, had I taken a screenshot of a website and published it, I would be in the realm of copyright. Would that be fair use? It depends on a few things that I won’t get into now.
In the case of web crawling, I am collecting data and presenting it in a very different (and actually aggregate) manner from the one in which it was published. Because I’m not actually reprinting their exact material, I’m in the clear.
The analogy my brother used was that of writing a review for a book. Under fair use, I can quote that book in my review and be fine…but I would still fall under copyright laws (I would just be in the clear because of fair use). In contrast, if I reported that the book had 347 pages numbered 1, 2, 3…, 347 I wouldn’t be regulated by any copyright whatsoever (fair use or otherwise) since I’m reporting data, not content.
2. Did I violate their Terms of Service?
Answer: What TOS? Their TOS says that I can’t “copy, modify, publish, transmit, distribute, perform, display, or sell” any information from their site. Clearly, by taking their data and publishing an analysis of it I would be both modifying and publishing. Fair use aside, the problem with their TOS is that it does not require me to accept it. It is buried on a separate page that I only found after actively searching. Had I had to agree to the TOS upon arriving at the site, then maybe they would have a case (though the argument above would probably still hold). Because I never accepted the TOS, it does not actually apply.
Conclusion: I’m legally in the clear. Ethically, that’s a different question…one I’m still grappling with.

The bit about the basic data not being copyrighted is interesting. I know that copyright only protects the form of expression and not the content itself. But I’m wondering where the limits are. What if someone has a table of data and you reformat it a bit, is that ok? Or what if you turn the table into a graph?
In terms of the TOS, I guess a lawyer could argue that way. On the other hand the risk is that the website doesn’t see things the same way, and wants to go to court to find out the answer, which could be expensive.
As far as ethics go, if they really didn’t want their site crawled, they should have put something in robots.txt. If not, then Google’s already got all their data and done any number of things to it. And Google does no evil, right?
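For what it's worth, checking robots.txt before crawling is easy to do. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, the `example.com` URLs, and the `allowed` helper are all made up for illustration; a real crawler would fetch the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical crawler name and URLs, for illustration only.
print(allowed(ROBOTS_TXT, "my-crawler", "http://example.com/private/data.html"))
print(allowed(ROBOTS_TXT, "my-crawler", "http://example.com/index.html"))
```

With a rule like the one above, the first URL is disallowed and the second is permitted. This only tells you what the site owner has asked of crawlers; it says nothing about copyright or the TOS.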