Archive for February, 2008

How To Lie With Statistics – Part 2


I’ve realized that I have way too much to say on each topic to combine them within posts. As such, from now on I’ll focus on a single topic per post.

Continuing from the previous post, let’s get right into it.

2. The Well-Chosen Average

Huff talks about three different ways of representing an average: mean, median, and mode. I’ve never heard someone refer to the mode as an average, but who knows, maybe we’ve all gotten a bit smarter since the ’50s. The mode, or the value that occurs most often in a distribution, is certainly relevant, but for all practical purposes, the mean and the median are the two statistics that I’d like to talk about.

First the definitions.

Mean: the arithmetic average of a series of numbers.
Median: the middle value in a series of numbers.

We should care about these because people can manipulate data by reporting a mean when they really should be reporting a median.

First, it’s important to note why means are used so much more than medians: many popular statistical tests work best with data that are normally distributed, and in a normal distribution the mean equals the median…so why complicate things? Well, the reason is that data are often not normally distributed.

Imagine that at a company 95% of the employees earn $25,000/year (we’ll call them assembly line workers) and the remaining 5% (we’ll call them management) earn $5,000,000/year. In that case (we’ll assume they have 100 employees), the mean salary is $273,750 while the median is $25,000. Clearly, a big difference. If someone told you that you should work for this company because their average salary is north of a quarter million dollars/year you’d be excited. You’d be less excited when you learned the reality.
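To check the arithmetic, here is a minimal Python sketch of the example above. It assumes exactly 100 employees, as stated, and uses only the standard library; the variable names are mine.

```python
from statistics import mean, median

# 95 assembly line workers at $25,000 and 5 managers at $5,000,000,
# matching the hypothetical 100-person company described above.
salaries = [25_000] * 95 + [5_000_000] * 5

mean_salary = mean(salaries)      # pulled far upward by the 5 large salaries
median_salary = median(salaries)  # the middle value, unaffected by the top 5

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

The five management salaries account for more than 90% of the payroll, which is why the mean lands so far from what the typical employee earns.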

Sometimes, however, neither of these statistics is all that useful. Imagine that you have a distribution with two separate peaks.

A bimodal distribution like this makes means and medians somewhat useless, since both are meant to describe a single central tendency. With a bimodal (or multimodal) distribution, the central tendency isn’t that interesting: the mean, and often the median, lands somewhere between the two masses, where few observations actually fall. You often find such distributions in count data where there is a normal distribution around some mean and another mass at 0 or 1 (often just one of the two).

Imagine if we looked at the number of times per month that a person from a community visits a drug store. There will likely be some people who go semi-regularly (say 4-5 times/month) and they will generally fall into a nice normal distribution. However, there will also likely be a large group of people who never go and will all mass on 0. In this case, means and medians are meaningless since they combine two distinct groups.
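As a rough illustration of the drug store example, the sketch below uses made-up visit counts (60 people who never go and 40 who go 3 to 6 times a month; these numbers are my assumption, not real data) and compares the overall mean and median with a description of each group on its own.

```python
from statistics import mean, median

# Hypothetical monthly visit counts for 100 people (invented for illustration):
# 60 never visit; 40 visit between 3 and 6 times.
visits = [0] * 60 + [3] * 5 + [4] * 15 + [5] * 15 + [6] * 5

overall_mean = mean(visits)
overall_median = median(visits)

# Describe the two groups separately instead of blending them.
share_never = visits.count(0) / len(visits)
visitor_mean = mean(v for v in visits if v > 0)

print(f"overall mean:        {overall_mean}")
print(f"overall median:      {overall_median}")
print(f"share who never go:  {share_never:.0%}")
print(f"mean among visitors: {visitor_mean}")
```

With these numbers the overall mean is 1.8 visits, a value nobody in the data actually has, and the median is 0, which ignores the regular visitors entirely. Reporting the share of non-visitors alongside the mean among visitors (4.5) says far more about what is going on.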

The lesson here is simple: make sure you know what the statistics you are looking at represent. If you can, always examine the entire distribution. If you see a skew in the data, dig a bit deeper.

Part 1 - The Sample with the Built-in Bias

Part 3 - The Little Figures That Are Not There
Part 4 - Much Ado About Practically Nothing


Encyclopedia of Life Launched

I’ve been waiting patiently for the launch of the Encyclopedia of Life and it has finally come. For those of you unfamiliar with the EOL, here’s a blurb from their about page:

The Encyclopedia of Life (EOL) is an ambitious, even audacious project to organize and make available via the Internet virtually all information about life present on Earth. At its heart lies a series of Web sites—one for each of the approximately 1.8 million known species—that provide the entry points to this vast array of knowledge. The entry-point for each site is a species page suitable for the general public, but with several linked pages aimed at more specialized users. The sites sparkle with text and images that are enticing to everyone, as well as providing deep links to specific data.

As of today they have 30,000 pages (species) up and it’s amazing. I played around a bit and can’t seem to stop. For no apparent reason, I particularly enjoyed the page on Anolis carolinensis (Green Anole).



Congrats To Matt


A recent article at the MIT Technology Review featured some of Matthew Hurst’s visualizations of the blogosphere. They are quite stunning. Congrats!


Songs Expressed with Graphs

Andrew over at Information Aesthetics recently posted about a very funny flickr set of graphs that represent song lyrics. Definitely worth a look.


On The Audacity of Data

Noam Scheiber of The New Republic recently published a brilliant article, The Audacity of Data, summarizing Senator Barack Obama’s approach to actionable economic strategies. Scheiber draws a distinction between earlier strategies built around deduction (as used by former President Clinton) and those built around induction. He points out that while the former model favors starting with generalities and moving towards specifics, the latter does the opposite…and that is precisely why I support Senator Obama.

Scheiber points out that Senator Obama is a student of the Dick Thaler school of thought. Much as Thaler advanced economics by introducing behavioral theory, Obama believes in rethinking established ideas and “changing the game.”

I would be doing Scheiber an injustice by summarizing this article any further, so I will simply urge anyone interested in understanding why so many of us think Obama will change this country for the better to read the article.

