Archive for February, 2008

Protestants Verging on Becoming Minorities. I Don’t Think So

Quite a few news outlets are covering the recent Pew Forum on Religion & Public Life report detailing the changing nature of the religious landscape in this country. I actually read through a good part of the report looking for statistical faux pas to use in my “How To Lie With Statistics Series,” but found only minor ones…good job.

What I’d like to highlight, however, is the way in which the media is, to say the least, stretching the results.

Here are the headlines and choice quotes from a few “reputable” sources:

1. US News & World Report. Headline: Protestants Verging on Becoming Minorities - US News and World Report.
2. AFP. Headline: Protestants on verge of becoming minority in US: study.
3. Boston Globe. Choice quote: “A sweeping new study of religious affiliation in the United States finds a country in which Protestants are becoming a minority”.
4. The Australian. Headline: US protestants verge on minority.

Really, minorities? The report shows that Protestant Americans have dropped in numbers to 51% and US News & World Report seems to think that if they drop below 50% they will be a minority. Technically they would be correct, but they would be more correct in saying that Protestants would still be a plurality as they would be, by far, the largest religious group in the country. Something tells me that Protestants won’t be treated like other “minorities” in this country.


How To Lie With Statistics – Part 1

Sorry for the lack of posts, but I have been away at a conference. Now that I’m back, I’m ready to get right back into it with a series of posts dedicated to a wonderful book: “How to Lie With Statistics” by Darrell Huff. This introduction to the many ways that statistics can be, and are, used to manipulate reality was first published in 1954 and is still as useful as ever.

How to Lie With Statistics

The book is broken down into the following 10 chapters:
1. The Sample with the Built-in Bias
2. The Well-Chosen Average
3. The Little Figures That Are Not There
4. Much Ado about Practically Nothing
5. The Gee-Whiz Graph
6. The One-Dimensional Picture
7. The Semi-attached Figure
8. Post Hoc Rides Again
9. How to Statisticulate
10. How to Talk Back to a Statistic

As great as the book is, the examples, as you can imagine, are a bit dated. My goal is to go through each of these topics and freshen them up with new examples. I also plan to include some suggestions for how to avoid the problems from the point of view of the compiler and the consumer of statistics.

This post will be dedicated only to part 1 since it is so critically important to statistics.

1. The Sample with the Built-in Bias

Sampling is wonderful. Rather than ask everyone how much they, for example, like something, we ask a small group and, assuming we did our job right, infer the population’s attitudes based on the sample’s responses. This saves time and money. But what happens when that sample is biased? In other words, what happens when the group of people we select from the population doesn’t really represent the population at large?

This problem is quite apparent in any type of consumer satisfaction (or opinion) research. Let’s say, for example, that Apple is interested in how satisfied their customers are with their iPhones. Rather than asking every iPhone user, they could sample a subset of them and derive their conclusions based on this group. Seems simple, right? Wrong.

Let’s think through the logistics for a second. There are several million iPhone users (yours truly is one of them). Are all of these users equally likely to respond to a survey about product satisfaction? Probably not. For example, the executive who barely has time to check his e-mail probably won’t respond to such a survey. In contrast, the Apple Fan Boys who rave and rant about Apple products might be more willing to do so. If this is true, then any sampling technique that relies on voluntary responses will necessarily under-represent the opinions of business people and over-represent the opinions of fan boys.
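To make the non-response problem concrete, here is a minimal simulation sketch. Every number in it (the group shares, satisfaction rates, and response rates) is made up for illustration and is not taken from any real survey; the group names "fan" and "busy" are hypothetical labels.

```python
import random

def simulate_survey(n_customers=100_000, seed=0):
    """Compare true satisfaction with what a voluntary survey would report."""
    rng = random.Random(seed)
    # (share of population, P(satisfied), P(responds to survey)) -- all assumed
    groups = {
        "fan": (0.20, 0.95, 0.60),
        "busy": (0.80, 0.60, 0.05),
    }
    satisfied_all = total_all = 0
    satisfied_resp = total_resp = 0
    for _ in range(n_customers):
        # Assign each customer to a group according to the assumed shares.
        name = "fan" if rng.random() < groups["fan"][0] else "busy"
        _, p_satisfied, p_responds = groups[name]
        satisfied = rng.random() < p_satisfied
        total_all += 1
        satisfied_all += satisfied
        # Only some customers answer the survey, and fans answer more often.
        if rng.random() < p_responds:
            total_resp += 1
            satisfied_resp += satisfied
    return satisfied_all / total_all, satisfied_resp / total_resp

true_rate, surveyed_rate = simulate_survey()
print(f"true satisfaction:     {true_rate:.1%}")
print(f"surveyed satisfaction: {surveyed_rate:.1%}")
```

Under these assumed rates the survey figure comes out well above the true one, simply because the happiest group is the one most likely to answer.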

Let’s look at a real example. In late 2007, ChangeWave released their results of a customer satisfaction survey of 3,654 consumers. They found that 82% of iPhone respondents reported that they were “very satisfied with their current cell phone” compared with only 51% of RIM (Blackberry) users. On the face of it, these results are compelling. But we should think before we preach the greatness of the iPhone. Let’s take this one step at a time.

Let’s start by looking at the sample size: 3,654. That’s pretty impressive. But wait, in the chart they list 9 different manufacturers. Which means that the number of iPhone users must be quite a bit less than 3,654. In fact, based on the article it seems like they only have 73 iPhone users (2% share * sample size). Do 73 people speak for all iPhone users? I doubt it. If Apple fan boys are overrepresented, then satisfaction may be inflated. I’m not trying to say that ChangeWave intentionally manipulated their results, but they certainly didn’t do much to make them transparent.
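As a rough check on how much 73 respondents can tell us, here is a small sketch of the usual normal-approximation margin of error for a proportion. It assumes a simple random sample, which is exactly what this post doubts, so it only captures sampling noise and says nothing about bias. The function name is my own.

```python
import math

def proportion_margin(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion
    (normal approximation, assumes a simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n)

n_iphone = round(0.02 * 3654)            # 2% share of the full sample
margin = proportion_margin(0.82, n_iphone)
print(f"n = {n_iphone}, 82% +/- {margin:.1%}")
```

With these inputs the margin is roughly 9 percentage points either way, before any bias from who chose to respond.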

As a quick aside (this is a topic of discussion for a later post, but worth mentioning here), we should be very wary of the comparisons that are being made. Because the chart above represents manufacturers, we can’t even really make a true comparison between the iPhone and the other products. RIM produces several models of the Blackberry and who knows how many models LG, Motorola or Sony/Ericsson have. The chart is comparing a single phone, the iPhone, with the average of all phones from other manufacturers. I’m sure some of the respondents who use RIM products are using dated Blackberries and might not be so happy. Likewise, some respondents may be using the latest RIM product and be just as satisfied as iPhone users. Because the data are averaged, we’ll never know.

So what should market researchers do to avoid such biases? Unfortunately, as they (hopefully) know, there is no easy fix. Obtaining a representative sample from any population is nearly impossible. The best anyone can do is attempt to identify the different types of respondents that exist in the universe in question and sample equally from each group. This approach, called stratified sampling, has its pros and cons. The biggest pro is that if the stratification is done correctly (BIG ‘if’), then some of the issues I mentioned before can be avoided. However, compared to simple random sampling, this could result in a bias if the stratification is incorrect, again under- and over-representing different parts of the population. In short, there is no perfect way to sample, but striving for perfection is a must.
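Here is a minimal sketch of stratified sampling with an equal number of draws per group, as described above. The customer list and the "fan"/"busy" segments are hypothetical, and in practice the hard part is deciding what the strata are, which no code can do for you.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Draw the same number of members from each stratum.

    population  -- list of items
    stratum_of  -- function mapping an item to its stratum label
    per_stratum -- how many items to draw from each stratum
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Never ask for more items than the stratum contains.
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customer list: (id, segment), with 20% fans and 80% busy users.
customers = [(i, "fan" if i % 5 == 0 else "busy") for i in range(1000)]
sample = stratified_sample(customers, lambda c: c[1], per_stratum=50)
print(len(sample))
```

Note that with equal draws per group, any population-wide figure has to be weighted by each group's real share of the population; otherwise the small group is over-represented by design.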

As for the consumer of statistics, it’s critically important to always ask questions like: “who are the respondents in this research?” and “are all members of the population accurately represented?” If you can’t answer those two questions with any sense of conviction, make sure you read the statistics with a big grain of salt.

One final note on sampling: there is a large debate regarding the US Census and sampling. On the one hand, Article I, Section 2 of the US Constitution requires that citizens be counted to determine the number of Representatives to the House of Representatives from a given state. A strict interpretation of the Constitution (and one backed by the Supreme Court) suggests that sampling cannot be used to determine the US population. On the other hand, conducting a census (i.e. counting everyone) has some major problems. I won’t go into all the details, but the short version is that many groups of citizens, and especially minorities, are massively under-represented by census taking. For an excellent discussion as to why, I’ll direct you to this article by Ivars Peterson in Science Magazine. If you have a few minutes, it’s a worthwhile read.

Next time I’ll discuss the pitfalls of means, medians, and modes (chapter 2) and the need for more information about statistical figures (chapter 3).

Part 2 - The Well-Chosen Average

Part 3 - The Little Figures That Are Not There

Part 4 - Much Ado About Practically Nothing


What’s The Value of a Digg?

For those of you who are impatient, here’s the punch line. On average, a single digg increases traffic by 0.10%. So a story that gets 3,000 diggs results in an increase in total traffic to the referring site by 300%. Now for those of you who want to know the entire story, read on.

All too often you see websites going down due to the “digg effect.” There’s even a Wikipedia page for it (though for some reason it’s titled “Slashdot effect”). You also often see comments on digg like “only 84 diggs and it’s already down!” promptly dugg down because, as other observant commenters point out, “you’re an idiot, 1 digg <> 1 click.” This is, of course, due to the fact that you get plenty of free riders on digg (yours truly included) who read tons of stories but never digg them up.

This led me to the obvious question: what then, is the value of a single digg? In other words, how much traffic is generated from someone digging a site? With the help of the digg API and the ALEXA data service provided by Amazon I decided to answer this question.

Here’s what I did:

1. I picked some random date in January (the 12th if you’re curious) and worked back in time (to Oct 6, 2007…again for no particular reason), collecting basic info on 5,794 stories that were “popular” (made it to the front page).
2. ALEXA only provides pageview data for the top 100,000 sites on the internet (as per them), so I checked the list of sites I collected against their database and came back with 1,999 stories that I could follow.
3. I then collected all the digg histories for these stories as well as the pageviews (from ALEXA) for each website for the 7 days before and the 7 days after a digg submission.

The idea was to use the ALEXA data as a proxy for total pageviews and compare that to the digg rate per story.

First, some basic info:

As per ALEXA, a pageview reflects 1 user per million. So we don’t have the exact number of people visiting a site since we don’t know how many users are out there, but we can easily track % changes. It wouldn’t be fair to group all the websites together since the impact of a digg for a large site (say mozilla.com…currently ranked the 44th most visited site on the net) is likely smaller than for a small site (say anonymousprof.com…currently far from being ranked) due to their large difference in existing traffic. To combat this problem, I created buckets for website size based off of the distribution of my sample (see below).

digg_histogrampagiews.gif

This resulted in the following groupings:
Small Size: 5 or fewer pageviews/million (n = 664)
Mid Size: more than 5 and up to 100 pageviews/million (n = 831)
Large Size: more than 100 pageviews/million (n = 504)

It’s also reasonable to assume that stories with different overall digg amounts should have different impacts on traffic. Following the same procedure I made 3 more cuts at the data.

digg_histogrampdiggs.gif

Small # of Diggs: 700 or fewer (n = 714)
Mid # of Diggs: more than 700 and up to 1,500 (n = 819)
Large # of Diggs: more than 1,500 (n = 466)
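The two sets of buckets can be written as a pair of small functions. This is only a sketch using the cut-offs listed above; the function names are my own.

```python
def site_size_bucket(pageviews_per_million):
    """Classify a site by its baseline ALEXA pageviews per million."""
    if pageviews_per_million <= 5:
        return "small"
    if pageviews_per_million <= 100:
        return "mid"
    return "large"

def digg_count_bucket(total_diggs):
    """Classify a story by the total number of diggs it received."""
    if total_diggs <= 700:
        return "small"
    if total_diggs <= 1500:
        return "mid"
    return "large"

print(site_size_bucket(3), digg_count_bucket(3000))
```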

Now to the fun stuff. For all of these analyses I will show three different sets of data, one for each of the website sizes previously specified. I tried pooling everything, but the charts become incomprehensible.

Let’s start with the number of pageviews for the 15-day period running from 7 days before a story was submitted to 7 days after.

Each line represents a bucket of digg sizes. For example, the black line represents all stories that had a maximum of 700 diggs throughout their lifetime.

digg_large.gif

We see that small websites clearly benefit from being dugg and, not surprisingly, the bigger the story on digg (the more total diggs) the greater the increase in traffic. This is less true for the medium sized sites, and, apparently, not true at all for the big guys. None of this is surprising since we would expect smaller sites to benefit the most. The big guys are already getting lots of the traffic that would have come from digg.

The problem with the previous charts is that we don’t really get to see the value to the sites because we don’t know what a pageview/million really is. Instead, let’s look at the same data but as % gain per day. In order to do this we need a reference point and so we will average the pageviews for the 7 days prior to the story appearing on digg and assume that that is baseline traffic. We then compute the gain per day relative to this baseline and look at the % difference. I’ll also include the total # of diggs per day on the secondary axis (dashed lines) so you can see exactly where the traffic is coming from.
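The baseline calculation can be sketched as follows. The input is assumed to be a list of 15 daily pageview figures (7 days before the story, the day it appeared, and 7 days after); the example numbers are invented.

```python
def percent_gain(pageviews, baseline_days=7):
    """Return the % gain over baseline for each day from submission onward.

    pageviews     -- daily values, oldest first; the first `baseline_days`
                     entries are the days before the story appeared on digg
    """
    baseline = sum(pageviews[:baseline_days]) / baseline_days
    if baseline == 0:
        raise ValueError("baseline traffic is zero; % gain is undefined")
    return [100.0 * (v - baseline) / baseline for v in pageviews[baseline_days:]]

# Invented example: flat traffic of 10, then a spike that decays.
daily = [10] * 7 + [30, 20, 15, 12, 10, 10, 10, 10]
gains = percent_gain(daily)
print(gains)
```

In this toy series the site triples its traffic on the day of the story (a 200% gain) and is back to baseline four days later.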

digg_smallgain.gif

digg_midgain.gif

digg_largegain.gif

Here we can clearly see that the benefit to smaller sites is greater than for larger sites. Interestingly, large sites seem to suffer from stories that don’t get too many diggs. I suspect this is just noise, but worth noting nonetheless.

So far, I have yet to answer my initial question: what’s the value of a single digg? The simple computation would be to sum the total gain in pageviews and divide that by the number of diggs (as I did in the opening paragraph). However, that would be misleading because the effect should be different for different sized sites. Also, because the frequency of digging goes down with time, each additional digg likely reflects a larger increase in traffic. So let’s plot the gain in traffic/digg by size of site across time.
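To show the difference between the simple pooled figure and the per-day figure, here is a small sketch. The gains and digg counts are invented, and the function names are my own.

```python
def pooled_gain_per_digg(daily_gain_pct, daily_diggs):
    """Total % gain divided by total diggs (the simple computation)."""
    return sum(daily_gain_pct) / sum(daily_diggs)

def daily_gain_per_digg(daily_gain_pct, daily_diggs):
    """% gain per digg for each day; None on days with no diggs."""
    return [g / d if d > 0 else None
            for g, d in zip(daily_gain_pct, daily_diggs)]

# Invented example: traffic gain decays, but diggs fall off even faster.
gain_pct = [200.0, 100.0, 50.0]
diggs = [1000, 200, 20]
print(pooled_gain_per_digg(gain_pct, diggs))
print(daily_gain_per_digg(gain_pct, diggs))
```

In this example the per-day value rises from 0.2 to 2.5 even though traffic is falling, because the number of new diggs drops much faster than the traffic does.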

digg_smallvalue.gif

digg_midvalue.gif

digg_largevalue.gif

Pretty cool! For small guys, the value increases with time since the denominator (# of diggs) falls quickly. This makes sense if we look back a few charts and notice that overall traffic dies down by this point. So for every digg late in the life of a story we see a large % gain. The mid size guys are all over the place, but we can still see that, on the whole, each digg helps a little bit. What’s really surprising though, is the value of a digg for large sites. It’s negative! I’m not sure I have a good hypothesis for this one, but I’ll be glad to hear some in the comments.

Finally, we can ignore the time difference and just collapse the value of the digg by website size and digg size. (error bars reflect standard errors)
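Collapsing across time amounts to grouping the per-story values by bucket and taking a mean and standard error for each group. A minimal sketch, assuming each record is a (site bucket, digg bucket, value) tuple; the records below are invented.

```python
import math
import statistics
from collections import defaultdict

def mean_and_stderr(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr

def collapse(records):
    """Group (site_bucket, digg_bucket, value) records and summarize each group.

    Groups with fewer than two values are skipped, since a standard error
    cannot be computed for them.
    """
    groups = defaultdict(list)
    for site_bucket, digg_bucket, value in records:
        groups[(site_bucket, digg_bucket)].append(value)
    return {key: mean_and_stderr(vals)
            for key, vals in groups.items() if len(vals) > 1}

# Invented records, for illustration only.
records = [("small", "mid", v) for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
records.append(("large", "mid", 0.1))
summary = collapse(records)
print(summary)
```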

digg_allcomparison.gif

We see here that, as predicted, the effect of a single digg is greater for smaller sites than for larger ones. In fact, the overall benefit for large sites is pretty much non-existent. Sorry Gawker. And, like I said at the outset, the overall effect is 0.10% increase in traffic per digg.

What’s also interesting is that it doesn’t appear that getting a tremendous amount of diggs helps that much more than just getting a lot of diggs. It looks like hitting the front page is all that matters. Once you’re there, there’s little difference between having 800 and 5,000 diggs.

Of course, there are some clear limitations to this analysis:

1. Because Alexa only has data for the top 100,000 sites I can’t see the effect of digging a really tiny site like mine. Though if this story hits the front page, I can give you an even more detailed analysis using my own data.
2. I’m only looking at stories that hit the front page. What about all those stories that never make it? Clearly they lead to traffic, but how much? That’ll have to be the topic of another discussion.
3. I don’t know what a pageview/million really is, so it’s hard to say something like: “1 digg = X clicks.” The best I can do is calculate a % gain.

If you have thoughts on how I can improve this analysis, please leave me a comment. I spent a lot of time working on this, but I’m sure it can be improved.


Moving Images to Flickr

This shouldn’t really affect any of you, but I’m moving most of the images on my site to Flickr. I wasn’t very smart in managing the size of my images and have already racked up over 1.5 GB of bandwidth on my server. I figure moving the images offsite will help a lot.


Music Preference Dispersion Through the Last.fm Network: Data Collection Approach

First, an update on my attempt to visualize more of the data I collected last week. Despite my best efforts, I just can’t get any visualization programs to run all of the data I have (now over 1,000,000 friendships). It seems like when I go past about 25,000-50,000 users, the time to process the data goes way beyond anything my system can handle. Too bad. If someone has a supercomputer somewhere running social network visualization software, let me know and I’ll send you my data.

However, I’m far from ready to call it quits. Instead, I’m taking a new approach. Here’s the plan:

1. Rather than collecting data on all users, I will start with one or two “seed” users and perform a snowballing data collection process out to 2 degrees of separation (friends and friends of friends). This should yield about 5,000 friendships per seed (in fact I did this for one seed already and it came to 4,844 friendships).
2. I will also collect data on users who are not part of that particular social network to use as a control.
3. Then I will collect the music listening history for each user. This, from my experience, will take a while.
4. Then I will attempt to map the diffusion of new song introductions to this network.
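The snowball step in the plan above can be sketched as a breadth-first walk that stops two hops from the seed. The `get_friends` argument is a hypothetical callable standing in for whatever actually fetches a user's friend list; it is not a real Last.fm API call, and the toy network is invented.

```python
from collections import deque

def snowball(seed, get_friends, max_depth=2):
    """Collect users and friendships within `max_depth` hops of `seed`.

    get_friends -- hypothetical callable: user -> iterable of friend names
    Returns (set of users seen, set of friendships as frozensets).
    """
    seen = {seed}
    edges = set()
    queue = deque([(seed, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand users at the outer edge of the snowball
        for friend in get_friends(user):
            edges.add(frozenset((user, friend)))
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return seen, edges

# Toy network for illustration only.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
users, friendships = snowball("a", lambda u: toy.get(u, []))
print(sorted(users), len(friendships))
```

Users at the outer edge are recorded but not expanded, so friendships among them that do not pass through someone closer to the seed are missed.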

The idea is that as a new song emerges it has to start somewhere. If social networks help spread new songs/artists, then I would expect to see a spreading of preferences throughout the network. So the probability of a “friend” (or a friend of a friend) choosing to listen to a song should be greater than the probability of a non-friend doing so.

This should let me do some interesting statistical analyses as well as produce some cool visualization movies. Imagine a mesh with each node representing a user. Because I have a time series of listening history I can have each node “light up” when that user listens to a song. If there is no network effect, then the lighting up should be random. However, if there is, then I should observe a spreading of “lights” throughout the network.
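One possible way to test "random versus spreading" is a permutation test: count the friendships where both users listened to the song, then see how often a random reshuffling of who listened gives a count at least that high. This is only a sketch of one option, not the analysis I have settled on; it ignores the timing of listens, and the toy network below is invented.

```python
import random

def both_listen_edges(edges, listeners):
    """Number of friendships in which both users listened to the song."""
    return sum(1 for a, b in edges if a in listeners and b in listeners)

def permutation_p_value(edges, users, listeners, n_perm=2000, seed=0):
    """Share of random listener assignments that cluster at least as much
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = both_listen_edges(edges, listeners)
    users = list(users)
    hits = 0
    for _ in range(n_perm):
        # Keep the number of listeners fixed, but pick them at random.
        shuffled = set(rng.sample(users, len(listeners)))
        if both_listen_edges(edges, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy network: two tight groups of 5 joined by a single friendship.
group1 = [f"u{i}" for i in range(5)]
group2 = [f"v{i}" for i in range(5)]
edges = [(a, b) for g in (group1, group2)
         for i, a in enumerate(g) for b in g[i + 1:]]
edges.append(("u0", "v0"))
observed, p_value = permutation_p_value(edges, group1 + group2, set(group1))
print(observed, p_value)
```

A small p-value says the listeners are more clustered among friends than chance would produce, though on its own it cannot separate a song spreading between friends from friends simply having similar tastes to begin with.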

I’m leaving for a conference tomorrow so my blogging rate will slow down a bit (back next Monday), but I’ll have my data-collecting programs running in the meantime. Hopefully, when I return I’ll have more data than I’ll know what to do with!

