Researchers in Japan conducted the first (according to them) experiment that demonstrates how shockwave traffic jams emerge.
Check it out:
Or read the full story over at The New Scientist.
The folks over at The Situationist posted some great excerpts from a recent Edge video of Nicholas Christakis’s discussion on the social science of social networks.
Nicholas has been studying the psychology behind social networks for some time now, and he thinks that one of the shifts in thinking he brings to the table is the way he fundamentally looks at social networks. Rather than thinking about them as static, he sees them as a combination of typology and change. This ever-shifting, dynamic nature of social networks is what makes them so fascinating and so powerful.
You can watch the entire video and read the whole transcript here or read the great summary by The Situationist here.
I’ve realized that I have way too much to say on each topic to combine them within posts. As such, from now on I’ll focus on a single topic per post.
Continuing from the previous post, let’s get right into it.
2. The Well-Chosen Average
Huff talks about three different ways of representing an average: mean, median, and mode. I’ve never heard someone refer to the mode as an average, but who knows, maybe we’ve all gotten a bit smarter since the ‘50s. The mode, or the number most represented in a distribution, is certainly relevant, but for all practical purposes, the mean and the median are the two statistics that I’d like to talk about.
First the definitions.
Mean: the arithmetic average of a series of numbers.
Median: the middle value in a series of numbers.
We should care about these because people can manipulate data by reporting a mean when they really should be reporting a median.
First, it’s important to note that means are used so much more than medians because many popular statistical tests work best with data that are normally distributed, and when this is the case, the mean equals the median…so why complicate things? Well, the reason is that data are often not normal.
Imagine that at a company 95% of the employees earn $25,000/year (we’ll call them assembly line workers) and the remaining 5% (we’ll call them management) earn $5,000,000/year. In that case (we’ll assume they have 100 employees), the mean salary is $273,750 while the median is $25,000. Clearly, a big difference. If someone told you that you should work for this company because their average salary is north of a quarter million dollars/year you’d be excited. You’d be less excited when you learned the reality.
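For anyone who wants to check the arithmetic, here is a quick sketch in Python using the numbers from the example above. The variable names are mine, and the 100-employee headcount is the same assumption made in the text.

```python
from statistics import mean, median

# Salaries from the example: 95 assembly line workers and 5 managers.
salaries = [25_000] * 95 + [5_000_000] * 5

mean_salary = mean(salaries)      # pulled upward by the five large salaries
median_salary = median(salaries)  # the middle value, unaffected by them

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

Five salaries out of a hundred are enough to drag the mean more than ten times above what a typical employee actually earns, while the median doesn't move at all.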
Sometimes, however, neither of these statistics is all that useful. Imagine that you have a distribution like the one below.
A bimodal distribution like this one makes means and medians somewhat useless since the two represent central tendencies. With a bimodal (or multimodal) distribution, the central tendency isn’t that interesting. Both values will give you something that is in between the two masses. You often find such distributions in count data where there is a normal distribution around some mean and another around 0 or 1 (often just one of the two).
Imagine if we looked at the number of times per month that a person from a community visits a drug store. There will likely be some people who go semi-regularly (say 4-5 times/month) and they will generally fall into a nice normal distribution. However, there will also likely be a large group of people who never go and will all mass on 0. In this case, means and medians are meaningless since they combine two distinct groups.
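Here is a rough simulation of that drug store scenario. Everything in it is made up for illustration: the 40/60 split between people who never go and people who go regularly, the normal distribution centered on 4.5 visits, and the community size of 1,000.

```python
import random
from collections import Counter
from statistics import mean, median

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical community of 1,000 people.
never = [0] * 400  # people who never visit
# Regular visitors: roughly normal around 4.5 visits/month, at least 1 visit.
regulars = [max(1, round(random.gauss(4.5, 1.0))) for _ in range(600)]
visits = never + regulars

overall_mean = mean(visits)
overall_median = median(visits)
counts = Counter(visits)

print(f"mean visits:   {overall_mean:.2f}")
print(f"median visits: {overall_median}")
print(f"mean among regular visitors only: {mean(regulars):.2f}")
for n_visits in sorted(counts):
    print(f"{n_visits} visits: {counts[n_visits]} people")
```

The overall mean lands somewhere between the pile at 0 and the cluster of regulars, so it describes neither group. The printed counts, which are just a crude histogram, tell the real story.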
The lesson here is simple: make sure you know what the statistics you are looking at represent. If you can, always examine the entire distribution. If you see a skew in the data, dig a bit deeper.
Part 1 - The Sample with the Built-in Bias
Part 3 - The Little Figures That Are Not There
Part 4 - Much Ado About Practically Nothing
I’ve been waiting patiently for the launch of the Encyclopedia of Life and it has finally come. For those of you unfamiliar with the EOL, here’s a blurb from their about page:
The Encyclopedia of Life (EOL) is an ambitious, even audacious project to organize and make available via the Internet virtually all information about life present on Earth. At its heart lies a series of Web sites—one for each of the approximately 1.8 million known species—that provide the entry points to this vast array of knowledge. The entry-point for each site is a species page suitable for the general public, but with several linked pages aimed at more specialized users. The sites sparkle with text and images that are enticing to everyone, as well as providing deep links to specific data.
As of today they have 30,000 pages (species) up and it’s amazing. I played around a bit and can’t seem to stop. For no apparent reason, I particularly enjoyed the page on the Anolis carolinensis (Green Anole).
Sorry for the lack of posts, but I have been away at a conference. Now that I’m back, I’m ready to get right back into it with a series of posts dedicated to a wonderful book: “How to Lie With Statistics” by Darrell Huff. This introduction to the many ways that statistics can be, and are, used to manipulate reality was first published in 1954 and is still as useful as ever.
The book is broken down into the following 10 chapters:
1. The Sample with the Built-in Bias
2. The Well-Chosen Average
3. The Little Figures That Are Not There
4. Much Ado about Practically Nothing
5. The Gee-Whiz Graph
6. The One-Dimensional Picture
7. The Semi-attached Figure
8. Post Hoc Rides Again
9. How to Statisticulate
10. How to Talk Back to a Statistic
As great as the book is, the examples, as you can imagine, are a bit dated. My goal is to go through each of these topics and freshen them up with new examples. I also plan to include some suggestions for how to avoid the problems from the point of view of the compiler and the consumer of statistics.
This post will be dedicated only to part 1 since it is so critically important to statistics.
1. The Sample with the Built-in Bias
Sampling is wonderful. Rather than ask everyone how much they, for example, like something, we ask a small group and, assuming we did our job right, infer the population’s attitudes based on the sample’s responses. This saves time and money. But what happens when that sample is biased? In other words, what happens when the group of people we select from the population doesn’t really represent the population at large?
This problem is quite apparent in any type of consumer satisfaction (or opinion) research. Let’s say, for example, that Apple is interested in how satisfied their customers are with their iPhones. Rather than asking every iPhone user, they could sample a subset of them and derive their conclusions based on this group. Seems simple, right? Wrong.
Let’s think through the logistics for a second. There are several million iPhone users (yours truly is one of them). Are all of these users equally likely to respond to a survey about product satisfaction? Probably not. For example, the executive who barely has time to check his e-mail probably won’t respond to such a survey. In contrast, the Apple Fan Boys who rave and rant about Apple products might be more willing to do so. If this is true, then any sampling technique that depends on people choosing to respond will necessarily under-represent the opinions of business people and over-represent the opinions of fan boys.
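To see how much damage uneven response rates can do, here is a small back-of-the-envelope calculation. All of the numbers are hypothetical: I invented the three groups, their shares of the user base, how likely each is to answer a survey, and how satisfied each is.

```python
# Hypothetical groups: (share of users, response rate, share "very satisfied").
groups = {
    "busy executives": (0.30, 0.05, 0.60),
    "fan boys": (0.20, 0.50, 0.95),
    "everyone else": (0.50, 0.15, 0.80),
}

# Satisfaction across all users, if everyone answered.
true_satisfaction = sum(share * sat for share, _, sat in groups.values())

# Each group's share of the people who actually respond.
responders = {name: share * rate for name, (share, rate, _) in groups.items()}
total_responders = sum(responders.values())

# Satisfaction as the survey would report it.
observed_satisfaction = sum(
    responders[name] * sat for name, (_, _, sat) in groups.items()
) / total_responders

print(f"true satisfaction:     {true_satisfaction:.1%}")
print(f"observed satisfaction: {observed_satisfaction:.1%}")
```

With these made-up inputs the survey overstates satisfaction by roughly nine percentage points, and collecting more responses wouldn't help, because the bias comes from who answers rather than from how many do.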
Let’s look at a real example. In late 2007, ChangeWave released their results of a customer satisfaction survey of 3,654 consumers. They found that 82% of iPhone respondents reported that they were “very satisfied with their current cell phone” compared with only 51% of RIM (Blackberry) users. On the face of it, these results are compelling. But we should think before we preach the greatness of the iPhone. Let’s take this one step at a time.
Let’s start by looking at the sample size: 3,654. That’s pretty impressive. But wait, in the chart they list 9 different manufacturers, which means that the number of iPhone users must be quite a bit less than 3,654. In fact, based on the article it seems like they only have 73 iPhone users (2% share * sample size). Do 73 people speak for all iPhone users? I doubt it. If Apple fan boys are overrepresented, then satisfaction may be inflated. I’m not trying to say that ChangeWave intentionally manipulated their results, but they certainly didn’t do much to make them transparent.
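To put a rough number on how little 73 respondents can tell us, here is a sketch of the standard margin-of-error calculation. It uses the textbook normal approximation at 95% confidence, and it assumes the 73 iPhone users were a simple random sample, which is exactly the assumption I’m doubting. The function name is mine; the 2% share, the 3,654 total and the 82% figure come from the survey as described above.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)


total_sample = 3654
iphone_share = 0.02
iphone_n = round(total_sample * iphone_share)  # about 73 respondents

moe = margin_of_error(0.82, iphone_n)

print(f"iPhone respondents: {iphone_n}")
print(f"82% very satisfied, plus or minus {moe:.1%}")
```

Even under that generous assumption, the 82% figure carries an uncertainty of nearly nine percentage points in either direction. And the margin of error only covers random sampling noise; it says nothing about the bias that comes from who chose to respond.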
As a quick aside (this is a topic of discussion for a later post, but worth mentioning here), we should be very wary of the comparisons that are being made. Because the chart above represents manufacturers, we can’t even really make a true comparison between the iPhone and the other products. RIM produces several models of the Blackberry and who knows how many models LG, Motorola or Sony/Ericsson have. The chart is comparing a single phone, the iPhone, with the average of all phones from other manufacturers. I’m sure some of the respondents who use RIM products are using dated Blackberries and might not be so happy. Likewise, some respondents may be using the latest RIM product and be just as satisfied as iPhone users. Because the data are averaged, we’ll never know.
So what should market researchers do to avoid such biases? Unfortunately, as they (hopefully) know, there is no easy fix. Obtaining a representative sample from any population is nearly impossible. The best anyone can do is attempt to identify the different types of respondents that exist in the universe in question and sample equally from each group. This approach, called stratified sampling, has its pros and cons. The biggest pro is that if the stratification is done correctly (BIG ‘if’), then some of the issues I mentioned before can be avoided. However, if the stratification is incorrect, it can introduce a bias that simple random sampling would not, again under- and over-representing different parts of the population. In short, there is no perfect way to sample, but striving for perfection is a must.
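A minimal sketch of the stratified idea follows. The strata, their population shares and their satisfaction rates are all hypothetical, and the sketch assumes we somehow know the true shares (that’s the BIG ‘if’). It draws the same number of respondents from every stratum and then weights each stratum’s result by its share of the population.

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable

# Hypothetical strata: (share of population, probability of "very satisfied").
strata = {
    "fan boys": (0.20, 0.95),
    "business users": (0.30, 0.60),
    "casual users": (0.50, 0.80),
}


def sample_stratum(p_satisfied, n):
    """Simulate n respondents from one stratum; return the fraction satisfied."""
    return sum(random.random() < p_satisfied for _ in range(n)) / n


# Same number of respondents from every stratum, weighted by population share.
per_stratum = 500
estimate = sum(
    share * sample_stratum(p, per_stratum) for share, p in strata.values()
)

# What we would get if we could ask everyone.
true_rate = sum(share * p for share, p in strata.values())

print(f"stratified estimate: {estimate:.1%}")
print(f"true rate:           {true_rate:.1%}")
```

The weighting step matters. If the assumed shares are wrong, the estimate inherits that error, which is the downside of stratification mentioned above.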
As for the consumer of statistics, it’s critically important to always ask questions like: “who are the respondents in this research?” and “are all members of the population accurately represented?” If you can’t answer those two questions with any sense of conviction, make sure you read the statistics with a big grain of salt.
One final note on sampling: there is a large debate regarding the US Census and sampling. On the one hand, Article I, Section 2 of the US Constitution requires that citizens be counted to determine the number of Representatives to the House of Representatives from a given state. A strict interpretation of the Constitution (and one backed by the Supreme Court) suggests that sampling cannot be used to determine the US population. On the other hand, conducting a census (i.e. counting everyone) has some major problems. I won’t go into all the details, but the short version is that many groups of citizens, and especially minorities, are massively under-represented by census taking. For an excellent discussion as to why, I’ll direct you to this article by Ivars Peterson in Science Magazine. If you have a few minutes, it’s a worthwhile read.
Next time I’ll discuss the pitfalls of means, medians, and modes (chapter 2) and the need for more information about statistical figures (chapter 3).
Part 2 - The Well-Chosen Average
Part 3 - The Little Figures That Are Not There
Part 4 - Much Ado About Practically Nothing