Archive for the ‘Science’ Category

Remembering Arthur C. Clarke

I didn’t grow up with Mysterious World playing in the background. Nor did I grow up reading his stories. But now, as an adult who has a love of technology, exploration, and imagination, I find myself genuinely sad at the loss of a great man, Sir Arthur C. Clarke.

At the age of 90, Clarke passed away in his Sri Lankan home. He will be missed by many and honored by more.

Not long before his death Clarke recorded this video where he made some jokes, told some stories, and made three wishes. He wished for evidence of extraterrestrial life, a move to clean energy, and peace in Sri Lanka. With all three of those, I agree wholeheartedly. If you have 10 minutes to spare, I suggest you watch this moving final message to the world from a great man.

The Legality of Web Crawling

I got a great response from Aaron to my “Ethics of Web Crawling” post from yesterday and decided to find out where I stood legally on this issue. Luckily, my brother is a lawyer who happens to specialize in intellectual property law.

I’d like to stress that this is NOT legal advice, but just the opinion of one lawyer who hasn’t actually seen any of the relevant information. Please do NOT act on this advice. If you have a real issue, consult a lawyer.

That said, here are the legal issues.

1. Is this fair use?

Answer: Kind of. Fair use refers to the tenet in copyright law that allows someone to use copyrighted material without permission of the copyright holder for such things as reviews and academic pursuits. In this case, my use of the data I crawled would be for academic purposes and so would fall under fair use.

However, my brother actually thinks that the data do not fall under any copyright protection at all. He explained that you cannot copyright data, only the presentation of that data. So for example, had I taken a screenshot of a website and published it, I would be in the realm of copyright. Would that be fair use? It depends on a few things that I won’t get into now.

In the case of web crawling, I am collecting data and presenting it in a very different (and actually aggregate) manner from the one in which it was published. Because I’m not actually reprinting their exact material, I’m in the clear.

The analogy my brother used was that of writing a review for a book. Under fair use, I can quote that book in my review and be fine…but I would still fall under copyright laws (I would just be in the clear because of fair use). In contrast, if I reported that the book had 347 pages numbered 1, 2, 3…, 347 I wouldn’t be regulated by any copyright whatsoever (fair use or otherwise) since I’m reporting data, not content.

2. Did I violate their Terms of Service?

Answer: What TOS? Their TOS says that I can’t “copy, modify, publish, transmit, distribute, perform, display, or sell” any information from their site. Clearly, by taking their data and publishing an analysis of it I would be both modifying and publishing. Fair use aside, the problem with their TOS is that it does not require me to accept it. It is buried on a separate page that I only found after actively searching. Had I had to agree to the TOS upon arriving at the site, then maybe they would have a case (though the argument above would probably still hold). Because I never accepted the TOS, it does not actually apply.

Conclusion: I’m legally in the clear. Ethically, that’s a different question…one I’m still grappling with.


The Ethics of Web Crawling

As a behavioral researcher I often find it interesting to look at real world data in order to supplement my experimentally derived conclusions. Not only does this lend a sense of credibility to any findings, but it also makes for a far more interesting and memorable story.

Recently I came across a website (we’ll leave it unnamed for the time being) that, because of the nature of the service they were providing, had a natural experiment running for the last few years that would test my hypothesis perfectly. I realized that access to these data would be incredible. If my theory could be borne out in the real world, I would be thrilled!

However, I also realized that due to the nature of the business, this website would likely be very reluctant to hand over their data. So I did what any reasonable person with a programming background would do: I wrote a web crawler and systematically collected all the data that was available to the general public. It turns out that my hypothesis was, in fact, supported by these real world data, so my effort was not in vain. But was I right to do this?

I’ve been struggling with the ethicality of this issue for a few days now and genuinely don’t have a good answer. On the one hand, the data are freely available to anyone who wants to view them. There is no registration required and the system uses a simple indexing method that allows for trivial crawling. On the other hand, if I publish a paper with these data I am going to reveal some information about this firm that they may not want out there. This company’s data are proprietary to them and by posting any of it on the web they are implicitly assuming some level of trust from their viewers.

Does the benefit to science outweigh the breach of trust?
Honestly, I don’t know. I plan on contacting the firm and telling them what I did. My hope is that they will be excited by my conclusions and try to incorporate this new knowledge into their business. Unfortunately, I suspect that instead they will get defensive and demand that I not publish my results.

So what should I do? What do you think is the right course of action?


A Modern Take on Change Blindness

DOTHETEST is a public service announcement in the UK designed to inform motorists about watching out for bicycles. Actually, it’s a modern take on an old classic, change blindness. You can head over to the Visual Cognition Lab at the University of Illinois for more great examples.

I won’t spoil the fun by giving anything away. I suggest you just go and watch the video.


How To Lie With Statistics - Part 3

Continuing on in my How To Lie With Statistics series, let’s talk about “The Little Figures That Are Not There.”

We’ll subdivide this part into two sections: sample size and range vs. average.

Sample Size

Very often we see claims about success rates for all sorts of things—“80% of doctors agree that this medication will work better than this other one”—“90% of all consumers prefer detergent X to detergent Y”—and so on. These claims are often used to promote a product or an idea, but can often be very misleading. If 5 doctors are sampled in the first example, is it really fair to say that 80% of doctors agree about anything? The point is that an average is important, but a sample size is equally important.
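To make the sample size point concrete, here is a minimal Python sketch that puts a 95% confidence interval around an observed 80% at a few different sample sizes. The Wilson score interval is my choice for illustration (it behaves reasonably with small samples), and the sample sizes of 50 and 500 are made up for comparison; only the 5-doctor case comes from the example above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# The same 80% agreement at three sample sizes (50 and 500 are hypothetical).
for successes, n in [(4, 5), (40, 50), (400, 500)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: plausible range {low:.0%} to {high:.0%}")
```

With 4 out of 5, the plausible range runs from under 40% to over 95%, which is why "80% of doctors" means very little without the sample size next to it.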

Let’s take a recent example. I was listening to The Naked Scientists podcast the other day and heard an interesting story about Megan Sykes, a doctor and researcher at Harvard who found a new way to prevent rejection of organ transplants. She found that 80% of the patients she tried the procedure with were able to stop using immunosuppressive drugs much sooner than normal and rejection rates were amazingly low (20% actually). This is great news!

Too bad the sample was so small. If we look up the paper that this was published in, we find that her sample size was a whopping 5 patients. Now don’t get me wrong, I think her research is incredible and could lead to wonderful things, but I do think it’s a bit early to be praising this new technique. A larger sample size would be required for that.

This also brings up the idea of statistical significance. Had she reported the appropriate statistical test (chi-square in this case) we would have learned that the p-value was just .18, which means that even if the procedure made no real difference, a result at least this lopsided would turn up about 18% of the time by chance alone. For someone looking to get a transplant, that’s a pretty large number. Generally, a p-value less than .05 is considered “good” (this is a debate for another post). The point here is that we should all be wary of statistics that don’t include a sample size!
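For anyone who wants to check the .18 figure, here is a rough Python sketch of a chi-square goodness-of-fit test on 4 successes out of 5 patients. The null hypothesis of a 50/50 split is my assumption (it is the one that reproduces the .18 above), and the function name is just for illustration. Keep in mind that with only 5 patients the expected counts fall below the usual rule of thumb for this test, so the approximation is a loose one.

```python
import math

def chi_square_two_outcomes(successes, n, p_null=0.5):
    """Chi-square goodness-of-fit test for success/failure counts (1 degree of freedom).

    p_null is the success rate assumed under the null hypothesis
    (0.5 here is an assumption, not something stated in the paper).
    """
    failures = n - successes
    expected_s = n * p_null
    expected_f = n * (1 - p_null)
    chi2 = ((successes - expected_s) ** 2 / expected_s
            + (failures - expected_f) ** 2 / expected_f)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_square_two_outcomes(4, 5)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2f}")  # chi-square = 1.80, p = 0.18
```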

Range vs. Average

According to the 2000 US Census, the average US household size is 2.59 persons. Last I checked, I don’t know what a .59 person looks like. Obviously the Census is reporting an average, but wouldn’t a range be much more useful?

If I’m a housing developer and I see this figure, I start building homes for 2.59 people. What I miss out on are all the single person households (about 26% in the US) and the multi-family households (about 4%). This is similar to the issue of a distribution I talked about in the previous post. Again, the point here is that while averages are quick and simple, ranges are far more informative.
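Here is a small Python sketch of the same idea. The household sizes are entirely made up for illustration (they are not Census data, though I set single-person households to 26% to echo the figure above); the point is only to show how much an average hides compared with a range and a breakdown.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of 100 households (NOT Census data).
household_sizes = [1] * 26 + [2] * 33 + [3] * 17 + [4] * 14 + [5] * 7 + [6] * 3

def summarize(sizes):
    """Return the average, the range, and the share of households at each size."""
    counts = Counter(sizes)
    shares = {size: counts[size] / len(sizes) for size in sorted(counts)}
    return {"mean": mean(sizes), "min": min(sizes), "max": max(sizes), "shares": shares}

summary = summarize(household_sizes)
print(f"Average household size: {summary['mean']:.2f}")
print(f"Range: {summary['min']} to {summary['max']} people")
for size, share in summary["shares"].items():
    print(f"  {size}-person households: {share:.0%}")
```

No household in this made-up sample actually has the average number of people in it, which is exactly the developer’s problem.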

Part 1 - The Sample with the Built-in Bias
Part 2 - The Well-Chosen Average
Part 4 - Much Ado About Practically Nothing

