
What Does The Last.fm Friends Network Look Like?

A big interest of mine is social networks. I have a research project right now looking at the network structure of blogs (data still being collected…will be for another 2-3 months) and recently I got interested in the Last.fm friends network.

One of the ideas behind Last.fm is that they can provide music recommendations based on what people like you enjoy. On top of this they built a social aspect to their system where a user can have friends and see what they are listening to. Presumably this friends network also propagates their recommendations. An obvious question then is whether there actually is a site-wide user network or whether friendships are localized to small groups with little contact with anyone else.

To answer this question I took advantage of how VERY open Last.fm is and, with the help of their API, began collecting data on relationships between users. I first crawled the main last.fm site and pulled out 364,499 user names. Then I wrote a php script that systematically pulled all the friend information for each user. Respecting the TOS and limiting myself to 1 query per second, I’m still collecting data (6 days, 17 hours, and 34 minutes in) with about 8 more days to go. As of now, I have data for 166,332 users (99,030 of whom have any friends). That results in 843,514 friends, an average of 8.52 friends / user among the users who have at least one friend.
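A minimal sketch of that kind of throttled collection loop, written in Python rather than the original php. The name `fetch_friends` is a hypothetical stand-in for whatever request actually returns a user's friend list; it is not a real Last.fm API function, and the real script is not shown here.

```python
import time

def crawl_friends(usernames, fetch_friends, delay=1.0, sleep=time.sleep):
    """Collect friend lists one user at a time, pausing between requests.

    fetch_friends is a hypothetical callable (username -> list of friend
    names) standing in for the real API request.
    """
    friends_by_user = {}
    for i, name in enumerate(usernames):
        if i > 0:
            sleep(delay)  # throttle: wait `delay` seconds between requests
        friends_by_user[name] = fetch_friends(name)
    return friends_by_user

def average_friends(friends_by_user):
    """Mean friend count over users who have at least one friend."""
    counts = [len(friends) for friends in friends_by_user.values() if friends]
    return sum(counts) / len(counts) if counts else 0.0
```

Averaging only over users with at least one friend matches the figure quoted above: 843,514 / 99,030 ≈ 8.52.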

Conveniently, another one of my interests is visualizing large data sets. Unfortunately, I have no computer science training, only the tidbits of programming (php, flash/actionscript, perl) that I picked up as needed, and so I’m stuck with using pre-built visualization software. This wouldn’t normally be a problem, but given that I’m dealing with this much data, a custom-built app would be very nice.

I’ve played around with the Tulip visualization package a bit before and so I figured I’d give it another go. Unfortunately, when I dumped all my data into Tulip, I got a big fat crash. Turns out there’s no way it (or my system) can handle this much. After playing around with a few other software packages (Pajek, UCINET, JUNG, and SoNIA) I decided to just look at a sample of my data instead.

So I picked the first 25,000 relationships in my data set (technically a random sample since I collected the data in a random manner), and after some tweaking (needed to get rid of duplicate reverse relationships) came up with 2,310 seed users with 19,008 friends. This resulted in 24,036 relationships at an average of 10.41 friends / seed user (min = 1, max = 159…see distribution histogram below). I dumped that into Tulip and got some neat results. (If anyone knows how to increase the resolution of the saved images in Tulip, please let me know).
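For anyone curious about the "duplicate reverse relationships" tweak, here is one way it could be done. This is only a Python sketch under the assumption that each relationship is stored as a (user, friend) pair; the function name is made up for illustration and this is not the script that produced the numbers above.

```python
def dedupe_undirected(pairs):
    """Treat (a, b) and (b, a) as the same friendship and keep one copy."""
    seen = set()
    unique = []
    for a, b in pairs:
        # Order each pair consistently so both directions map to the same key.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Tiny illustrative sample (invented user names).
sample = [("alice", "bob"), ("bob", "alice"), ("alice", "carol")]
print(dedupe_undirected(sample))

# The average quoted above is relationships divided by seed users.
print(round(24036 / 2310, 2))
```

The order of the first occurrence is kept, so the de-duplicated list can still be read as "the first N relationships" in collection order.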

Each red square represents a user and each line represents a relationship
(Click on each image for the full version).

tn_toast.jpg

It’s pretty clear from this image that the user network is quite strong. There are certainly clusters of close friends, but also plenty of cluster inter-connectivity. Let’s zoom in a bit and see more detail:

tn_toast.jpg

Here we can see that there are more prominent users (size of square = # of friends) who have friends “orbiting” around them. This alone is pretty neat, but we can further see that the orbitals are also connected with other users, suggesting that there is a large amount of interconnectivity. Another zoom:

tn_toast.jpg

Again we see a similar pattern: seeds and orbitals.

But are all users connected? Well, not really. I cropped a part of the original image which had the following:

tn_toast.jpg

What we see here are all the users who are not connected to the “main network.” Now this can result from two things: 1) these users might just not have any connections to the main network, 2) because I’m looking at a sample of the data, I am missing the connections between these users and the main network. I suspect it’s a combination of both. Let’s zoom in a bit and see exactly what’s going on with these folks:

tn_toast.jpg

What we see is that even for these outliers, there appear to be some networks. We can see a big one right in the middle with a couple of dozen users and smaller 2-user “networks” throughout. Pretty neat!
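The pictures come from Tulip's layout, but the same "main network vs. outliers" split can also be checked numerically by finding connected components. A minimal Python sketch using a breadth-first search, assuming the relationships are available as de-duplicated (user, friend) pairs; the function name is hypothetical.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the components (sets of users) of an undirected graph, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            user = queue.popleft()
            component.add(user)
            for other in neighbours[user]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

The first component returned would be the "main network"; everything after it corresponds to the disconnected groups, including the 2-user pairs. Because this is a sample, a small component here may still be attached to the main network in the full data.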

Finally, you may be interested in knowing what the distribution of # of friends looks like. Well, here you go.

Clearly there is a long tail: the majority of users have only a few friends, while a small number have many.
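A distribution like this is just a tally of how many users have each friend count. The sketch below shows one way to build it in Python and print a rough text histogram; it assumes a dict mapping each user to a friend count, and the function names are invented for illustration.

```python
from collections import Counter

def friend_count_distribution(friends_per_user):
    """Map each friend count to the number of users who have that many friends."""
    return Counter(friends_per_user.values())

def text_histogram(distribution, width=40):
    """Render the distribution as rows of '#', scaled to the most common count."""
    if not distribution:
        return []
    peak = max(distribution.values())
    rows = []
    for n_friends in sorted(distribution):
        users = distribution[n_friends]
        bar = "#" * max(1, round(width * users / peak))
        rows.append(f"{n_friends:>4} | {bar} {users}")
    return rows

# Invented toy data, only to show the shape of the output.
toy = {"u1": 1, "u2": 1, "u3": 1, "u4": 2, "u5": 2, "u6": 7}
for row in text_histogram(friend_count_distribution(toy), width=12):
    print(row)
```

With a long-tailed distribution, plotting the counts on a logarithmic axis usually makes the tail easier to read than a linear bar chart.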

What’s next?

I would love to figure out a way to visualize all the data. If someone out there has the technical skills to do this (and presumably the computing power) I’d be more than happy to collaborate.

I’m also beginning to collect the listening history of users (again thanks to the wonderful API) and hope to examine music listening patterns as they relate to the network. That’ll be a much bigger problem because of the volume of data. I collected a small amount just to see what it would look like and for the 183 users that I checked, I already have 1,179,480 track plays. Scaling up to ~300k users is a bit much. Regardless, I may use the friends data to identify a sub-network of friends and track their listening patterns to see how they influence one another.
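To put a number on "a bit much", here is a back-of-the-envelope extrapolation in Python. It assumes the 183 users checked are representative of everyone, which may well not be true, so treat the result as an order of magnitude only.

```python
def estimate_total_plays(sample_plays, sample_users, target_users):
    """Linear extrapolation; assumes the sample is representative of all users."""
    plays_per_user = sample_plays / sample_users
    return plays_per_user, plays_per_user * target_users

per_user, total = estimate_total_plays(1_179_480, 183, 300_000)
print(f"{per_user:,.0f} plays per user, about {total / 1e9:.1f} billion plays in total")
```

That works out to roughly 6,400 track plays per user, or on the order of 1.9 billion plays for ~300k users, which is why restricting the analysis to a sub-network of friends looks more practical.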

If you like what you see here, drop me a comment.

