Thoughts


23
Aug 10

Next Steps

I’m pleased to report that I have accepted an offer to join Carnegie Mellon University’s Heinz College as a post-doctoral fellow.  At Carnegie Mellon, I will be working with Alessandro Acquisti.  I have been following Alessandro’s excellent work on privacy and technology for many years, so I am thrilled to join his team and have him as a mentor.

Alessandro’s team has extensive experience studying privacy in online social networks.  Alessandro and Ralph Gross wrote one of the earliest (and most cited) Facebook privacy papers: Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook. Last summer, the team published a truly head-turning study, showing that information gleaned from social network profiles could be used to predict social security numbers.  Most recently, Alessandro’s work was featured in Jeffrey Rosen’s New York Times Magazine article The Web Means the End of Forgetting.

I look forward to building on my current areas of research – privacy, identity and support in social networks – while being exposed to new opportunities and new challenges at CMU.  Speaking of challenges, the next challenge is a dissertation defense (later this fall) and then a move to Pittsburgh.  It has been a while since I’ve been to Pittsburgh, so I’m open to advice!


30
Jun 10

Smaller, better, slower

On the O’Reilly Radar Blog, Linda Stone posted an interesting expansion on comments in the recent Economist article featuring Freedom.  Stone had been bearish on the general idea of Freedom and its ilk:

Ms Stone says Freedom and other such programs are “a first step”, since anyone who installs and uses one of them is admitting that there is a problem, and “something needs to shift”. But the next step is to go beyond a software crutch, Ms Stone says, and to learn to change one’s behaviour without the need for full-screen modes and internet-disabling utilities.

In the blog post, she expands on the general concept:

I’m not opposed to using technologies to support us in reclaiming our attention. But I prefer passive, ambient, non-invasive technologies over parental ones. Consider the Toyota Prius. The Prius doesn’t stop in the middle of a highway and say, “Listen to me, Mr. Irresponsible Driver, you’re using too much gas and this car isn’t going to move another inch until you commit to fix that.” Instead, a display engages us in a playful way and our body implicitly learns to shift to use less gas.

With technologies like Freedom, we re-assign the role of tyrant to the technology. The technology dictates to the mind. The mind dictates to the body. Meanwhile, the body that senses and feels, that turns out to offer more wisdom than the finest mind could even imagine, is ignored.

I’d suggest reading the whole post – it’s good and very thought-provoking – but I take issue with the central premise of Stone’s argument, that it’s just a matter of time until we “create personal technologies that are prosthetics for our beings.”

Here’s my argument:  There’s no question that Freedom is a tyrant, but Freedom doesn’t control you; it controls technology.  And I have to believe that to many industry insiders, this is an uncomfortable direction for technology to take.

It is not controversial to claim that the dominant ideology of computing in the modern era has been “bigger, better, faster.”  In fact, this ideology – the connection between technological progress and advancement as a civilization – has structured the way we think about ourselves and other societies for hundreds of years.  In the epilogue to his excellent book Machines as the Measure of Men, Michael Adas writes:

The long-standing assumption that technological innovation was essential to progressive social development came to be viewed in terms of a necessary association between mechanization and modernity.  As Richard Wilson has argued, in American thinking, the “machine and all of its manifestations – as an object, a process, and ultimately a symbol – became the fundamental fact of modernism.”

Since the origins of the computing industry, Ruth Schwartz Cowan argues in A Social History of American Technology, the focus has been on squeezing productivity out of machines and operators.  This logic of practice was inscribed in the industry “because the government [the dominant early contractor of the computing industry], fighting the protracted cold war with the Soviet Union, believed that it would need better and better computation facilities…”

This constant drive towards efficiency has many rewards: transistors that are orders of magnitude cheaper than ones produced just years prior, terabyte disks that sit on desktops, and the iDevices that I so covet.  My argument does not downplay the value of such advances; to do so would be foolish.

Rather, I argue that the drive towards bigger, better, faster has left us with devices that are out of sync with our work patterns.  To address the growing divergence between our devices and work practice, we’ve constructed and attempted to empiricize the concept of multi-tasking.  Multi-tasking, as we now know, has decreasing marginal effectiveness as task complexity increases.  Multi-tasking fails most those who need it most.

Flipping through the last ten years of CHI, CSCW, and GROUP proceedings, we see an array of systems built to support multi-tasking, to facilitate remote work, to prostheticise our beings.  In these technologies we see the march towards progress, efficiency: bigger, better, faster.

Freedom joins these technologies in the march towards progress and efficiency, but with a different value set: smaller, better, slower.

In the past five or ten years, the devices we use for work have exploded in complexity.  No longer a word processor or spreadsheet, our computers are now televisions, game machines, and – most importantly – a portal to an always-on channel of social exchange.  Yet because these changes have been realized in code as opposed to form, we think of the device as static.  A computer is just a computer.  Rather, I see devices that are increasingly beginning to fail the market, with disastrous consequences for productivity, progress, and self-worth.

Freedom has always been about control.  It was first designed to reclaim space – to restore the pre-internet state of a coffee shop that has suddenly gone wi-fi.  Only through extensive use have I realized that Freedom is about pushing back at the device itself, a device that has failed the work market in a drive toward progress.

In closing, Linda Stone asks “What tools, technologies, and techniques will it take for personal technologies to become prosthetics of our full human potential?”  First, we must understand that we, humans, are not the problem.  Second, we must reconsider our relationships with our devices, and examine with open minds where our devices have failed us.  Third, we must change the ideology of the productivity industry, moving away from bigger, better and faster and towards smaller, better, and slower.

Of course, this is easier said than done.  And it will almost certainly come from outside industry, which is constrained by its dominant logic of practice.  But I can’t help but think that we’re at the beginning of something big.


3
May 10

On Twitter and Ethnicity

A few days ago, I stumbled upon a post from the blog Business Insider that asked “Why Is Twitter More Popular With Black People Than White People?” Drawing on data from Edison Research, the writer proposed a number of explanations for why “black people represent 25% of Twitter users, roughly twice their share of the population in general.”  This factoid has now been reported by the New York Times, the San Francisco Chronicle, The Atlantic, as well as a number of prominent blogs.  It’s also going viral in the Twittersphere.

I’m loath to trust bloggers to get survey data right, so I requested a copy of the report from Edison Research (available here).  At first glance, the data look good: the research was conducted by Arbitron and employs a landline/mobile random digit dialing (RDD) frame, with about 1,750 people age 12 and older interviewed.  “National probability” studies of this sort are generally considered valid for population estimates.

Without getting into too much detail, a study’s validity is dependent on the sampling method and sample size (among many other things).  In terms of method, RDD is not a true equal-probability-of-selection method, but both industry and academia consider it “good enough” when the sample is weighted to known totals.  As for size, a sample of 1750 people allows us to make claims about a large population with a margin of error of about plus or minus 3 percent.

Let’s cut to the chase: Where did the Edison Research interpretation go wrong?  In the report, Tom Webster states:

The percentage of Twitter users who are African-American currently stands at roughly 25%, which is approximately double the percentage of African-Americans in the current U.S. population. Indeed, many of the “trending topics” on Twitter on a typical day are reflective of African-American culture, memes and topics.

From this, we are to believe that of all Twitter users, 25% are African-American.  This is surprising not only given current population estimates, but also because Twitter is a global service.  Let’s explore how Edison got to this 25 percent number (conveniently rounded up from 24 percent).

In the phone interview, Edison asked all respondents 12+ (n=1750) if they “currently ever use[d] Twitter.”  7% of respondents said yes, approximately 123 people.  Edison then asked those 123 how often they used Twitter.  85% of those respondents (105 people) indicated they used Twitter at least once a month, and were thus recoded as “Monthly Twitter Users.”  Herein lies the problem: It was on these 105 individuals (not the 1750 total respondents) that Edison based its estimates of Twitter use.
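The shrinking base is easy to reproduce.  Here is a minimal sketch in Python using only the percentages quoted above; rounding to whole respondents is my assumption, since the raw counts are not given here.

```python
import math

def round_half_up(x):
    """Round to the nearest whole person, with .5 rounding up (an assumption)."""
    return math.floor(x + 0.5)

total_respondents = 1750  # everyone interviewed, age 12 and older

# 7% said they "currently ever use" Twitter
ever_users = round_half_up(total_respondents * 7 / 100)

# 85% of those use Twitter at least once a month
monthly_users = round_half_up(ever_users * 85 / 100)

# The base for every "Monthly Twitter Users" estimate, as a share of the sample
share_of_sample = monthly_users / total_respondents

print(ever_users, monthly_users, f"{share_of_sample:.1%}")
```

In other words, any claim about the makeup of monthly Twitter users rests on roughly 6% of the people interviewed.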

Let’s return to sampling error.  Because random samples are asymptotically efficient, a sample of 1750 can speak to a population of hundreds of millions almost as well as a sample of 2000, 3000, or even 5000.  But a sample of 105 people speaking to the very large userbase (self-reported at 100 million) of Twitter? Not so efficient.  The margins of error are approximately +/- 10% at an alpha of .05, and +/- 12.5% at an alpha of .01.  And these margins assume true equal probability of selection, and no nonresponse bias.  With weighting for proportionality, it is almost certain these margins increase substantially (1).
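For anyone who wants to check these margins, here is a rough sketch of the textbook calculation.  It assumes a simple random sample, the worst-case proportion of 0.5, and the normal approximation; the function name is mine, and none of this accounts for weighting or nonresponse.

```python
import math
from statistics import NormalDist

def margin_of_error(n, alpha=0.05, p=0.5):
    """Half-width of a normal-approximation interval for a proportion.

    Assumes simple random sampling; p=0.5 gives the widest (worst-case) margin.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return z * math.sqrt(p * (1 - p) / n)

for n in (105, 1750, 5000):
    print(n, f"{margin_of_error(n):.1%}", f"{margin_of_error(n, alpha=0.01):.1%}")
```

Under these assumptions the margin for 105 people comes out near 9.6% at an alpha of .05 and near 12.6% at an alpha of .01, while going from 1750 to 5000 respondents only trims it from about 2.3% to about 1.4%.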

Let’s explore what this means practically.  First, Edison Research can’t speak to all Twitter users, because all Twitter users weren’t potentially included in the sample.  Edison can, however, speak to USA Twitter use, from its sample of 105 monthly users.  If we assume that only 5 million Twitter users in the USA use the service every month, Edison is still using 105 people to speak about these 5 million people (the margins of error don’t change).  Unfortunately, this is highly unreliable.

The American Community Survey finds that approximately 13.1% of the US population self identifies as Black or African American.  At an alpha of .05, the range of potentially true estimates of African-American Twitter use in the US is actually anywhere from 14% to 34%.  At an alpha of .01, this estimate ranges anywhere from 11% to almost 38%, a range that includes the population figure, so we cannot reject the hypothesis that the difference is attributable to sampling error or random effects.  If we then include weights in our estimates of error (likely the case because Edison’s sample over-represents people under 24), the growth in error causes us to fail to reject the null hypothesis at the .05 level as well.  We just can’t trust that the demographics of Twitter actually do vary from current population estimates.
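The comparison can be sketched as a simple interval check.  This uses the unrounded 24% estimate and the approximate margins quoted earlier, so the bounds differ slightly from the rounded ones in the text; the variable and function names are mine.

```python
estimate = 0.24           # share of monthly users reported as African-American (unrounded)
population_share = 0.131  # American Community Survey figure

# Approximate unweighted margins for n = 105, keyed by alpha
margins = {0.05: 0.10, 0.01: 0.125}

def interval(center, margin):
    """Lower and upper bounds of a symmetric interval around the estimate."""
    return (center - margin, center + margin)

def covers(bounds, value):
    """True if the interval includes the value, i.e. we cannot rule it out."""
    low, high = bounds
    return low <= value <= high

results = {a: covers(interval(estimate, m), population_share) for a, m in margins.items()}

# How wide the margin must be before the interval reaches the population share
required_margin = estimate - population_share

print(results, round(required_margin, 3))
```

At an alpha of .01 the interval already includes 13.1%.  At .05 it misses by less than a percentage point, so even modest growth in the margin from weighting (to about 11 points) would be enough to cover it.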

Is Twitter “disproportionately” African American, White, Hispanic, or Green?  The simple fact is that from this data, we can’t say so with confidence.  If Edison had been a little more forthcoming with their sample sizes, it might be more likely that the blogger/journalist who reported these data would have sensed something wrong.  But I wouldn’t bank on it, because it seems like Edison Research was pushing this spin from the get-go.

A final note: as I was researching/considering this piece, it was interesting to see the “spin” being placed on this “fact” around the blogosphere.  Of course, you had your standard racist comments/tweets of the “there goes the neighborhood” variety, but there also appeared to be a large swath of users who were heralding this as a point of pride.  Before you examine my subconscious racist motives for examining this question, please just know I like getting surveys right.  And if Edison wanted to get this right, they could start by giving us a topline cross-tab of ethnicity, Twitter use, and the respective margins of error.

Ugh, footnotes on a blog!

1. Research consistently demonstrates a negative relationship between age and nonresponse; younger people are less likely to respond, increasing the odds that they are weighted up to match the population (and increasing their margins of error).  Research is mixed on the relationship between ethnicity and nonresponse.


22
Apr 10

Social Technology and Teenage Discussion Networks

On Tuesday, the Pew Internet and American Life Project released a new, must-read report on Teens and Mobile Phones.  The project was a collaboration between Pew and the University of Michigan’s Communication Studies department, and it involves some of the top researchers of teens and technology (Amanda Lenhart, Richard Ling, Scott Campbell and Kristen Purcell).

In addition to releasing the great report, Pew did something new by simultaneously releasing the data sets used in the report (if I’m not mistaken, they’re usually embargoed a few months).  As someone who pays very close attention to Pew’s research, I was very pleased to see this – if I have questions or want to explore something further, I can go right to the data.

One of the questions in the Pew report was a modification of the General Social Survey’s (GSS) “discussion networks” question.  Questions of this sort ask individuals to list the people with whom they can discuss personal matters, which seems to be a good proxy for one’s close, supportive network.  Using the GSS data, Peter Marsden found in 1987 that Americans, on average, have three discussants.  Replicating the analysis in 2006, McPherson and colleagues found that discussion networks had shrunk to an average of two.  There’s been plenty of criticism of the measure (my favorite being Peter Bearman’s Headless frogs.. paper, see also Fischer, 2009).  Most recently, Hampton and colleagues explored the effect of technology on discussion networks in a great Pew report entitled Social Isolation and New Technology.

One of the great promises of “social technologies” is that they connect us to important others.  By participating in a social network site, for example, we’re able to keep in touch with a broader range of diverse contacts.  Critics are quick to point out that all those ties may be meaningless; in research, we distinguish between strong and weak ties.  Ellison and colleagues have demonstrated that use of Facebook among undergraduates increases a form of bridging (weak-tie) social capital.  The “important matters” question, on the other hand, is more reflective of bonding (strong-tie) relations.  Therefore we can use Pew’s new data to explore the relationship between use (and intensity of use) of social technologies and a teenager’s strong-tie supportive network.

First, some important notes.  From here on I am going to be talking about novel data analysis.  This is a blog post, so I am going to keep the reporting informal.  If you wish to explore my analysis, or re-run it, I have included a zip file that contains the questionnaire, data, reasonably commented do-file and output log.  Sorry, R fans, Stata wins for survey analysis; these files are compatible with Stata 11.  The analysis I’ll be talking about is weighted (individuals as PSU, using PSRAI’s omnibus weight).  The dependent variable is an overdispersed count (mean=~5, variance=~10), so the proper regression is negative binomial (confirmed with an LR test on the alpha).  Finally, the question explored in this analysis is not a direct match to the GSS question; it is actually quite different (GSS is a name generator).  Therefore, the results are not directly comparable, but they are likely informative.  See the Pew report methodology section for a full description of the sample.
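A quick way to see why a Poisson model is a poor fit for this variable: Poisson assumes the variance equals the mean.  The sketch below is a back-of-envelope method-of-moments check, not the LR test from my do-file, and it assumes the common NB2 parameterization (variance = mean + alpha × mean²).  The function names are mine.

```python
def dispersion_ratio(mean, variance):
    """Variance-to-mean ratio; a Poisson count would give roughly 1."""
    return variance / mean

def nb2_alpha_moments(mean, variance):
    """Method-of-moments alpha under NB2: variance = mean + alpha * mean**2."""
    return (variance - mean) / mean ** 2

# Approximate summary statistics for the discussion-network count
mean, variance = 5.0, 10.0

ratio = dispersion_ratio(mean, variance)
alpha = nb2_alpha_moments(mean, variance)
print(ratio, alpha)
```

With a variance about twice the mean, the implied alpha is around 0.2, comfortably away from the zero that would reduce the model to Poisson.  The formal check is still the LR test on alpha.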

Teenage Discussion Networks

For the Teens and Mobile Technology study, interviewers spoke to 800 teens age 12-17, asking a range of questions about technology use.  Included in the questionnaire was the question about discussion networks.  In this question, interviewers asked how many people the individual “feel[s] very close to and with whom you are frequently in contact to discuss various things, including your personal issues and feelings.”  The mean response was a little over 5, with a standard deviation of three.  The density plot is included at right.

First, I explored if demographic and socio-economic factors were associated with the size of teenage discussion networks.  Pew collected data on age, gender, family income, parent’s ethnicity, and total number of kids in the household.  These variables could impact the teen’s ability to form discussion networks for a variety of reasons, so it is worthwhile to retain them as control variables.  I found only one variable significant: being of “black, non-hispanic” parentage.  Compared to teens of “white, non-hispanic” parentage, teens of “black, non-hispanic” parentage have a lower incidence rate of reported discussants (IRR=.8041, p=0.011, Model1.pdf).
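For readers not used to incidence rate ratios, here is a small sketch of how to read one.  An IRR multiplies the expected count, so values below 1 mean fewer reported discussants.  The baseline of 5 discussants is a hypothetical round number for illustration, not a model estimate.

```python
import math

def percent_change(irr):
    """Percent difference in the expected count relative to the reference group."""
    return (irr - 1.0) * 100.0

def coefficient(irr):
    """The underlying regression coefficient; an IRR is exp(coefficient)."""
    return math.log(irr)

irr = 0.8041     # "black, non-hispanic" vs. "white, non-hispanic" parentage
baseline = 5.0   # hypothetical expected count for the reference group

change = percent_change(irr)
expected = baseline * irr
print(f"{change:.1f}%", f"{expected:.2f}", f"{coefficient(irr):.3f}")
```

Read this way, an IRR of .8041 corresponds to roughly 20% fewer reported discussants, holding the other variables in the model constant.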

Next, I wanted to explore the effects of internet use, social network site use, and mobile phone ownership on the size of teenage discussion networks, controlling for demographic factors.  I found that use of the internet, use of social network sites, and ownership of a mobile phone were all positively and significantly (p<.05) associated with the size of the support network (Model2.pdf).  Importantly, ethnicity remained negative and significant, indicating that teens of “black, non-hispanic” parentage do not make up the gap in the support network size due to technology use.

Of course, most teens do not use technology in isolation.  In fact, Pew’s report indicates that most teens use the internet, SNS, and mobile phones in combination.  Therefore, we should explore the effects of these technologies simultaneously to identify the robust contribution to the size of the discussion network.  When we evaluate these simultaneously controlling for demographic factors, we find that internet use and mobile phone use no longer significantly contribute to the size of a teen’s discussion network.  Use of social network sites, however, remains significant (IRR=1.142, p=.028, Model3.pdf).  It appears that teens who use social network sites are more likely to report larger discussion networks.  This is pretty impressive!

Before we get too excited about the promise of social network sites, let’s consider what we know about them.  For most teens, the social network site represents an online space for interacting with offline friends.  If use of the social network site really adds people to the core discussion network, where are they coming from?  Couldn’t an alternate explanation be that individuals who are more social offline are also more social online?  Pew also asked about frequency of offline socialization, and we can enter this measure as a control in our model.  When we do, we see that none of the technologies remain significant, and offline interaction emerges as a significant predictor (IRR=1.074, p=.010, Model3.pdf).  It turns out that teens who are more active with their friends have larger discussion networks, controlling for demographics and social technology use.

Some Discussion

It should be noted that Pew’s report did contain a number of “technology intensity” or “differential technology use” variables (e.g. how often do you…).  I included these in my exploratory analysis and none were significant, so I focused on use effects.  In the study of “social impact of technology”, there is a long history of attribution error regarding the “effects of technology.”  My goals in this analysis were twofold: first, to explore a recurring question that is addressable with Pew’s data (is technology use robustly associated with larger discussion networks), and second, to explore some alternate hypotheses to the findings (a common theme in “discussion networks” research).

What I see in this data is a manifestation of the ubiquity of technology in teenage life.  If our technology is used to connect to those around us, the effects of the technology will be constrained within the social setting.  What we may be seeing here is that teens who are already outgoing are more likely to use social technologies.  That is, the use of the network is built into the everyday processes that would be associated with the growth of a discussion/support network.  This finding is mundane, but it raises the question – how might we leverage technologies to enable less outgoing teenagers to expand their support networks?

Finally, please treat this post as a rough draft, a work in progress.  The fact I feel it is acceptable to write a blog post like this is evidence I’ve been in grad school too long, so it is time to get back to my dissertation.

Ugh, Citations on a blog!

  • Bearman, P. and Parigi, P.  (2004).  Cloning Headless Frogs and Other Important Matters: Conversation Topics and Network Structure. Social Forces, 83(2), 535–557.
  • Ellison, N. B., Steinfield, C., and Lampe, C.  (2007).  The Benefits of Facebook “Friends:” Social Capital and College Students’ Use of Online Social Network Sites.  Journal of Computer-Mediated Communication, 12(4).
  • Fischer, C. S.  (2009).  The 2004 GSS Finding of Shrunken Social Networks: An Artifact?  American Sociological Review, 74(4), 657–669.
  • Hampton, K., Sessions, L., Her, E. J., and Rainie, L.  (November 4, 2009).  Social Isolation and New Technology.  Pew Internet and American Life Project.  Retrieved November 4, 2009 from http://www.pewinternet.org/Reports/2009/18--Social-Isolation-and-New-Technology.aspx.
  • Marsden, P. V.  (1987).  Core Discussion Networks of Americans.  American Sociological Review, 52(1), 122-131.
  • McPherson, M., Smith-Lovin, L., and Brashears, M.  (2006).  Social Isolation in America: Changes in Core Discussion Networks over Two Decades.  American Sociological Review, 71(3), 353-375.

20
Apr 10

Announcing Freedom for Windows

I’m very pleased to announce that Freedom, my internet-blocking productivity software, is now available for Windows!

Over the past two years, countless people have written to me, asking if there is a version of Freedom for Windows.  I hated telling people that they couldn’t have Freedom.  I’m happy to report that if you’ve got a Windows XP, Vista, or 7 computer, you too can now experience Freedom.

Want to know a little more about Freedom?  Read about it in the New York Times Magazine, Salon.com, USA Today, Chronicle of Higher Education, LifeHacker, and others.  I’m also quite partial to the recent article on Freedom in the Guardian that starts: “With the help of a lovely man called Fred, I’m no longer in thrall to SamCam’s cape and Guido Fawkes.”

Let me know what you think!


19
Apr 10

Privacy in Social Software

Last week, I wrote a number of essays critical of Twitter’s decision to provide a collection of public Tweets to the Library of Congress for permanent archiving.  I argued that by taking user data and putting it into a public archive, Twitter had meaningfully restricted the privacy rights of users.  Some of you agreed with my position, many didn’t; but all who commented or wrote to me helped shape my thinking.  In this post, I want to provide a little more context on the nature of privacy in systems like Twitter.

Last week, I gave a talk on the dynamics of privacy in Facebook.  In the research, we modeled a behavior that is increasingly pervasive in Facebook: having a friends-only profile.  I want to draw attention to one slide from the talk:

In this slide, the two slopes you see are the growth of Facebook, and the proportion of UNC undergraduates with friends-only profiles.  Now, the data are on different axes, and Excel is fitting the lines, but the trend is meaningful.  With growth in the service we see a correlated turn towards privacy.

While the pattern I observe generalizes only to Facebook at UNC, other researchers have observed similar patterns of privacy behavior in other social software.  For example, as Friendster scaled,

[S]o too did the diversity of the social networks represented. A growing portion of participants found themselves simultaneously negotiating multiple social groups—social and professional circles, side interests, and so on.  (boyd, 2007)

With the increasing complexity of diverse audiences, individuals turned to a range of strategies to manage their privacy: multiple accounts, limiting disclosure, or simply dropping out of the service.  Regarding Myspace, Caverlee and Webb (2008) reported (bold is mine):

Overall, the fraction of private profiles is increasing with time, indicating that new adopters of social networks may be more attuned to the inherent privacy risks of adopting a public Web presence. We find that women favor private profiles 2-to-1 over men, and that (perhaps, counter-intuitively) younger users are more likely to adopt a private profile than older users. We also find that the more connected a user is in the social network, the more likely she is to adopt a private profile.

And now in Facebook, our research finds a similar movement towards privacy as the service grows and networks diversify.  One can only suspect that Facebook’s recent “privacy upgrades” and changes to the terms of service prohibiting privacy of certain information have something to do with this normative shift.

Looking at the data across systems, I’d like to speculate that there’s a general property at work.  In a social software system, as the system grows and diversity of networks increases, so does utilization of privacy.  Here’s a graph I’ve constructed illustrating the trend (larger version):

The slope is purposefully convex. In the early stages of adoption, network use is sparse, so individuals are incentivized to lower privacy, to increase the odds of finding others. As time passes and the service grows, individuals form dense, small-world clusters. At this stage, individuals are mainly connected to one another within one context, and there are minimal bridges between contexts. Therefore, individuals can afford to keep privacy low, due to minimal risk of inadvertent sharing across contexts. As the system expands, however, we see a turn back towards privacy as an increasing number of bridges across contexts are created. In this moment of context collapse, individuals erect barriers of privacy to facilitate continued disclosure.  Here’s a closer look at the (simulated) networks:

By linking privacy to context collapse, I argue that mobilization towards privacy is largely a function of perceived audiences (and harms).  This distinction is important because it holds privacy attitudes constant.  Research, both mine and by others, has demonstrated that privacy attitudes do not necessarily predict privacy behaviors.  Awareness of privacy-in-context is actually the key variable causing the dynamic shift towards privacy in social software systems.

Let’s return our attention to Twitter.  What does your Twitter network look like?  If you’re an average user, your network probably contains a few offline friends (many, many fewer than Facebook or Myspace) and some celebrities (your definition may vary).  There may also be a few friends you’ve made on Twitter, who you don’t know offline.  Chances are, the average Twitter user’s network looks like the sparse “Early Adopter” or “Small World” network.

We see evidence in cultural practice that users have sparse networks in Twitter.  Going back to my notes on Alice Marwick’s AOIR ’09 talk, the culture of celebrity serves a very functional purpose for Twitterers with sparse networks, who wish to connect out of  limiting contexts.  “Talking” to celebrities (and finding others who talk to the celebs you talk to) is a way of escaping one’s sparse world, finding new people to follow in a known context.  Hashtag culture provides further evidence that individuals are trying to talk “across” or “out” of limited contexts.  If your network is sparse, turning to site-level anchors like hashtags and celebs provides a reliable stream of conversation in networks where conversation is lacking due to structural impediments.

I wonder how long these practices will need to continue.  Just the other day, Twitter announced that 100 million people had created accounts.  You can’t turn the news on without hearing about Twitter.  A large group of people, primed on social software by Facebook, are waiting to join Twitter.  And over the next year or two, they will, raising issues of context collapse, and prompting a turn toward increased privacy among early adopters.

My major problem with the Twitter/LoC agreement is that the people who will be confronted with context collapse and a growing need for privacy have lost meaningful recourse.  As I argued in my last post, it becomes impossible to take back what you’ve shared, a real and useful privacy strategy.  You’ll still be able to make your account private, but it seems there’s little you can do about the Tweets you sent that were archived permanently in the Library of Congress.

Why is this bad?  Let’s consider a hypothetical.  In 2007, Myspace had 100 million users.  Myspace was growing fast, with many users signing on for the first time.  Myspace users had two options for privacy: public or friends-only.  And a lot more people had public profiles in 2007 than they do today.  How would we feel, now, if Myspace had given all of its public profiles to the Library of Congress for permanent archive in 2007? I can only guess that a bunch of people who had public profiles in 2007 might feel a little uncomfortable about it (cue the “it’s their own damn fault” chorus).

I guess I should feel relief that if Twitter is going to do this to users, at least they are partnering with the LoC (an admirable entity).  But, in reading what LoC staff is saying about this effort, I’m not comforted.  Of the dataset, LoC Blogger Matt Raymond writes “I’m certain we’ll learn things that none of us now can even possibly conceive.” National Archivist David Ferriero writes “What will historians be able to glean from our tweets?  We can’t be sure, but it will probably be very interesting” (while also stating “Twitter is not for everyone. If you are anything like me, you don’t really care what someone had for breakfast.”)  It strikes me that the Twitter archive is being treated like a novelty, promising to be an amazing treasure trove when new research methods are developed.

Maybe it’s all these years of running t-tests (developed 1908), but I’m skeptical that these Tweets are going to tell us something that we can’t quite imagine.  Robust methods develop slowly, and are validated over time.  We’ll probably still be doing text mining, linguistic and sentiment analysis, and content analysis 50 years from now.  One area that is improving rapidly, however, is the identification of individuals in large data sets.  The Netflix dataset was de-anonymized by Narayanan and Shmatikov.  Acquisti and Gross demonstrated they were able to guess people’s social security numbers from public data.  And old-fashioned detective work by Michael Zimmer identified the T3 Facebook dataset.  Of the future, we know this: It will be easier to connect you to your archived Twitter identity.

So here’s the thing.  Why won’t Twitter make the archiving a simple, opt-in process?  Or at least allow people to opt out?  Twitter obviously knows that giving user data to a permanent archive is different from sharing an API or allowing a Google spider – they wouldn’t have approached the LoC if this wasn’t the case.  I may be the only voice shouting about this, but this is a big, watershed moment regarding user privacy.  EFF, EPIC, Facebook watchdogs – where are you?  Let’s work with Twitter and make this right.


16
Apr 10

Is it time to cancel your Twitter account?

I was pleased to see that my last post on Twitter and the LoC generated excellent discussion both here in the comments and over in Twitter.   I’ve seen some great defenses of the deal, but unfortunately I’m not buyin’ quite yet.  I thought I’d use this post to quickly raise a few more questions and concerns.

First, a quick review of some of the conversation about the deal.  Zimmer is all over it, raising a number of great open questions, and exploring how private tweets just might end up in the LoC’s archive.  The Atlantic has rounded up opinions, particularly an interesting conversation going on at The Big Money.  Also notable is a BBC interview with Twitter’s general counsel, though it skips over privacy issues.  Now that I think of it, skipping over privacy issues might be the theme of this essay.

One of the central problems with this deal is the set of assumptions around public Tweets.  Particularly, because the Tweets are “already public”, individuals lose all rights to the content.  In my last post, I explored some ways in which content shared in public actually wasn’t public content.  For example, practically obscure public content that is meant for a select audience.  In this post, I want to challenge another assumption that people make about public content: that it lives forever.

If there’s one thing that social media has taught us, it is that if you post anything to the web, it stays there forever.  Of course, this is empirically false.  Companies go out of business, databases corrupt, servers crash, indexes get expunged, identifiers get mixed up, and even with the best intentions and good backups, data are lost.  Think about the Google search results for your name.  Are they the same as they were 1, 3, or 5 years ago?  While it is likely that you could tell me tons about new results that have come online over that time period, could you tell me about the ones that have gone offline?

So let’s just take a second and put the assumption that the internet is a giant cache to bed.  The internet is dynamic, fragile, and designed to lose things.  The internet has probably forgotten more about you than it remembers.  The next question generally brought up is “What about Google!”  If you want an answer to that question, send out a Tweet and then delete it.  Wait a few days and search for it.  The Tweet is gone, because Google isn’t in the business of sending you to 404s.  Thank the market for that one.  After we knock down the Google straw man, the next assumption generally covers the suspicious “other” person who is stalking you and creating a giant portfolio of everything you do.  I hate to pop everyone’s bubble, but unless you’re a really, really significant public figure, this person doesn’t exist for you.

So why is it that we all assume that the content we share publicly will be around forever?  I think this is a classic case of selection on the dependent variable.  When we Google ourselves, we are confronted with what’s there as opposed to what’s not there.  The stuff that goes away gets forgotten, and we concentrate on things that we see or remember (like a persistent page about us that we don’t like).  In reality, our online identities decay, decay being a stochastic process.  The internet is actually quite bad at remembering.

The Library of Congress, on the other hand, is quite good at remembering.  Magnificently good at it, most likely the best in the world.  And that is what’s troubling.  Up until Twitter sent its archives over to the Library of Congress, Twitter users could realistically expect they could make things go away.  They could delete Tweets.  They could change their account name.  They could remove their account.  Without consulting their users, privacy advocates, rights organizations, or any other voices of reason, Twitter has summarily taken these very real privacy remedies away from their users.

This gets me to what is so frustrating about Twitter’s move: a frighteningly cavalier attitude towards shipping around the data of tens of millions of consumers.  Twitter has literally passed the personal information of millions of users to a permanent, public archive without so much as pre-notification, consultation, or the opportunity for debate.  And even though it appears legal for the LoC to have the data, big questions remain regarding whether Twitter has actually violated its own contract with users.  How can I meaningfully own my content after it has been shipped to a government archive?

In all my years of using Twitter, the idea of canceling my account has never even vaguely crossed my mind.  Until last Wednesday, that is.

Update: American Prospect has a great interview with Martha Anderson of the Library of Congress.  Regarding the deal:

The agreement has been signed, but we still have a lot of technical details to work out — how we’ll technically transfer it, and when.

Regarding opt-out:

You know, I don’t know. I think that’s a question for Twitter. There’s several questions about that which they are still working out. We asked them to deal with the users; the library doesn’t want to mediate that.

Regarding user information:

I think that’s one of the big issues for us to understand in terms of privacy. And there’s a lot of work going on, especially over at [the National Institutes of Health] about how to anonymize data and still make it useful. We’re really big on partnering with people to learn what they’re learning, so I think that’s an area we’ll look into. In serving it, what can we do to make it useful to research but not identify personal information?