Facebook Dataset Identified

It now appears that the Tastes, Ties and Time dataset has been identified.  According to privacy scholar Michael Zimmer, the dataset of Facebook profiles is from Harvard College.  In my original post on the matter, I discussed how “fingerprints” of friend networks could be used to identify the dataset.  Identification did not require such complicated measures.  Using the codebook and statements from the researchers, Dr. Zimmer was able to target and ultimately identify the source of the dataset.  Importantly, now that the source is known, it would be trivial to run a network comparison and produce probability estimates for the identities of individuals in the anonymized set.

In an article to be published in Social Networks (Lewis et al., 2008), the authors provide more insight into the set.  This information seems to support the Harvard hypothesis, providing demographic information on the sample that could be correlated with statistics from the registrar.  This information, once semi-private, is now completely public.  It is only a matter of time before a grad student or assistant prof, seeking a publication and a little press, identifies the set (and no, it won’t be me).

In the discussion between Zimmer and the PI, a number of common themes emerge.  They include the notion that Facebook users have no right to privacy and that, by sharing, users actually intend for the information to be public.  This is a straw man argument, one that assumes an intentionality on the part of the individual that simply does not exist.  Even in a semi-public space like Facebook, our expected audience is quite small (a recent survey found that most users expected their profiles to be viewed primarily by their close group of friends).

This episode is an important example for IRBs*, which have widely differing interpretations of social network research.  The goal of the IRB is to protect subjects from harm that arises from the research process.  I am in agreement that subjects who post public profiles are open to research, as long as the research data isn’t personally identifiable and the subjects are properly protected.  This is clearly a different case, in which data sourced for acceptable research purposes was repurposed, and its released form now clearly poses a risk to the subjects.  I want to be clear about this point, though.  The original research mission (to collect and analyze a set with proper safeguards) was within bounds; the follow-up distribution is the element that clearly poses risk.

The researchers should have convened a panel with a privacy expert (like Dr. Zimmer) to assess the risks of data disclosure to the human subjects.  Had such a panel taken place, I am confident that the PIs would have assessed the risks of disclosure in a different light.  Perhaps that is the takeaway from this situation.  Research that pushes the boundaries of technology and privacy provides IRBs with unique challenges.  Some IRBs respond conservatively, stifling research and innovation.  Finding the balance that encourages innovative research while protecting subjects is a challenge, and perhaps the right place for an expert mediator.  Should Schools of Information prepare information ethicists for this role?

* IRB = Institutional Review Board, a panel of local experts in research ethics and methodology that oversees institutional research, both in industry and academia.

Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., and Christakis, N.  (2008).  Tastes, ties, and time: A new social network dataset using Facebook.com.  Social Networks, In Press, Accepted Manuscript. http://www.sciencedirect.com/science/article/B6VD1-4T3M686-1/1/9c1b6aafad0f69c524f7c5f982eb2268

8 comments

  1. “Should Schools of Information prepare information ethicists for this role?” — YES!

  2. For some reason I thought you might agree with that!

  3. “The original research mission (to collect and analyze a set with proper safeguards) was within bounds; the follow-up distribution is the element that clearly poses risk.”

    Very true. One of the key challenges here is that the NSF required the release of the data as a condition of funding. But with such a small dataset, and all the contextual clues, the attempt to maintain anonymity failed.

    So, do we need to work on better methods for anonymization of small datasets, or force granting bodies to rethink the requirement to make *all data* publicly available?

    Likely, both….

  4. Interesting case, and I strongly agree with the idea of revising/updating ethical research guidelines (also mentioned by Michael Zimmer in his post). Since I’m coordinating a similar discussion group within the “German Association of Internet Research” (DGOF), I’d be interested in a citation for your remark: “a recent survey found that most users expected their profiles to be viewed primarily by their close group of friends”
    Could you point me to the resource?

  5. [...] Fred Stutzman, I became aware of an example concerning the problems of releasing anonymized [...]

  6. I’ve just come across this post. I’d be interested in more specific details of what sorts of risks Facebook users might encounter as a result of this information being released.
