The Ethics of Web Crawling

As a behavioral researcher, I often find it interesting to look at real-world data to supplement my experimentally derived conclusions. Not only does this lend credibility to any findings, but it also makes for a far more interesting and memorable story.

Recently I came across a website (we’ll leave it unnamed for the time being) that, because of the nature of the service it provides, has had a natural experiment running for the last few years, one that would test my hypothesis perfectly. I realized that access to these data would be incredible. If my theory could be borne out in the real world, I would be thrilled!

However, I also realized that, due to the nature of the business, this website would likely be very reluctant to hand over its data. So I did what any reasonable person with a programming background would do: I wrote a web crawler and systematically collected all the data that were available to the general public. It turns out that my hypothesis was, in fact, supported by these real-world data, so my effort was not in vain. But was I right to do this?

I’ve been struggling with the ethics of this issue for a few days now and genuinely don’t have a good answer. On the one hand, the data are freely available to anyone who wants to view them. There is no registration required, and the system uses a simple indexing method that makes crawling trivial. On the other hand, if I publish a paper with these data, I am going to reveal some information about this firm that it may not want out there. This company’s data are proprietary, and by posting any of them on the web the company is implicitly placing some level of trust in its viewers.

Does the benefit to science outweigh the breach of trust?
Honestly, I don’t know. I plan on contacting the firm and telling them what I did. My hope is that they will be excited by my conclusions and try to incorporate this new knowledge into their business. Unfortunately, I suspect that instead they will get defensive and demand that I not publish my results.

So what should I do? What do you think is the right course of action?


3 Responses to “The Ethics of Web Crawling”

    1. Aaron Schiff March 24th, 2008 at 6:23 pm

      Many commercial websites have terms of use that prohibit web crawling to collect data. For example, airline sites often will not let you collect their price data in bulk. You should check whether the site you crawled has such a policy. Also check the server’s robots.txt file to see if they’ve prevented indexing by search engines. If neither of these things applies, I’d say it’s safe to assume that the site does not mind its data being crawled (it will have been crawled by Google etc. already anyway).

    2. Anonymous Prof March 24th, 2008 at 8:37 pm

      Hey Aaron,

      That’s a great point. I checked for a robots.txt file and there was none. As for their TOS, I’m not sure how to interpret the following:

      Under a section titled: “Proprietary Rights; Confidentiality”
      “The Website contains the copyrighted material, trademarks and other proprietary information of Kiva and its licensors. Except for that information that is in the public domain or for which you have been given express written permission, you may not copy, modify, publish, transmit, distribute, perform, display, or sell any such proprietary information.”

      I’m not sure whether that covers web crawling or not. Also, because my goal for the data is academic and not for profit, would my crawling fall under fair use? Perhaps I should consult a lawyer (luckily, my brother is an expert in this area…looks like he’ll be getting a call tomorrow).

      Thanks!
      -AP

    3. Anonymous Prof » Blog Archive » The Legality of Web Crawling March 25th, 2008 at 9:40 am

      [...] got a great response from Aaron to my “Ethics of Web Crawling” post from yesterday and decided to find out where I stood legally on this issue. Luckily, my [...]
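
The robots.txt check suggested in the first response can be sketched in a few lines of Python using the standard library's urllib.robotparser. This is only an illustration under stated assumptions: the function name, the user-agent string, the sample rules, and the example.com URLs are all hypothetical and are not taken from the site discussed in the post. The rules are passed in as text so the sketch runs without any network access.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; it only interprets robots.txt rules
    and says nothing about a site's terms of use.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so the rules can come from any source.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every crawler is asked to stay out of /private/.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    agent = "research-crawler"  # hypothetical user-agent name
    for page in ("https://example.com/public/page",
                 "https://example.com/private/data"):
        print(page, is_crawl_allowed(SAMPLE_RULES, agent, page))
```

An empty rule set, which is roughly the situation when a site publishes no robots.txt at all, places no restriction on any path through this mechanism. As the second response notes, that still leaves the terms of service to be read separately.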
