Via the Berkman Center: news of a Facebook dataset now available to the general public. I haven’t yet written the research statement required to access the data, but the publicly available codebook provides insight into the set. According to the codebook, the data has been scrubbed of personally identifying information.
The “non-identifiability” of such a dataset is up for debate. A friend network can be thought of as a fingerprint: it is unlikely that any two networks will be exactly alike, so individuals could be identified in the dataset post hoc (for friend-network verification, see Zinman & Donath, 2007). Further, the authors of the dataset plan to release student “Favorite” data in 2011, which will add information that may lead to identification. According to the authors, the collection of the dataset was approved by the IRB, Facebook and the individual college. The dissemination of the dataset appears to have been approved by the IRB.
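To make the fingerprint idea concrete, here is a minimal sketch in Python. It assumes a made-up edge list of pseudonymous IDs, not the actual format of the released dataset, and the function names are my own. Each person's "signature" is their friend count plus the sorted friend counts of their friends. Anyone whose signature is unique could, in principle, be picked out by someone who already knows that much about their network.

```python
from collections import Counter, defaultdict


def neighbor_degree_signatures(edges):
    """Map each node to (own degree, sorted degrees of its friends)."""
    friends = defaultdict(set)
    for a, b in edges:
        # Friendship is symmetric, so record the tie in both directions.
        friends[a].add(b)
        friends[b].add(a)
    return {
        node: (len(nbrs), tuple(sorted(len(friends[n]) for n in nbrs)))
        for node, nbrs in friends.items()
    }


def unique_nodes(edges):
    """Return the nodes whose signature is shared with no other node."""
    sigs = neighbor_degree_signatures(edges)
    counts = Counter(sigs.values())
    return sorted(node for node, sig in sigs.items() if counts[sig] == 1)


# Toy, hypothetical network with scrubbed IDs (not drawn from the real data).
toy_edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"), ("u2", "u3"), ("u4", "u5")]
print(unique_nodes(toy_edges))  # prints ['u1', 'u4', 'u5']
```

In this toy network only u2 and u3 are structurally interchangeable; the other three stand out even with names removed. Real friend networks are far larger and richer, so a signature this crude would likely single out a much larger share of people.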
In other news:
danah boyd recently gave a talk, “Understanding Socio-Technical Phenomena in a Web2.0 Era,” at the opening symposium for MSR New England. The video is available as a WMV stream.
Via W.H., Iron – a version of Google Chrome with all Google reporting stripped out. In theory, this also disables the auto-update functionality, a feature I was never comfortable with.
Citation:
Zinman, A. and Donath, J. (2007). Is Britney Spears spam? In Fourth Conference on Email and Anti-Spam, Mountain View, CA.
Fred Stutzman is a doctoral student, researcher and teaching fellow at the University of North Carolina at Chapel Hill's School of Information and Library Science. He studies how people use social media.





Hi Fred -
Do you have a view on the merits of the Dataverse Network Project?
Regards
Andrew
The non-identifiability of this dataset is indeed up for debate:
I find it hard to imagine that anonymity wouldn’t be breached for at least some of the participants in the sample. For one thing, some nationalities are represented by only one person. Another issue is that the particular list of majors makes it quite easy to guess which specific school was used to draw the sample. Put those two pieces of information together and I can imagine all sorts of identities becoming rather obvious to at least some people.
Eszter is right on both counts. I’ve already figured out the school based on the majors listed (click the link on my name to the left). It would have taken the researchers little effort to make the majors generic.
Do you know if the IRB records are publicly available? I’m very interested in how the research was presented and how the IRB ruled on issues like this.