Posts Tagged: twitter


3
May 10

On Twitter and Ethnicity

A few days ago, I stumbled upon a post from the blog Business Insider that asked “Why Is Twitter More Popular With Black People Than White People?” Drawing on data from Edison Research, the writer proposed a number of explanations for why “black people represent 25% of Twitter users, roughly twice their share of the population in general.”  This factoid has now been reported by the New York Times, the San Francisco Chronicle, The Atlantic, as well as a number of prominent blogs.  It’s also going viral in the Twittersphere.

I’m loath to trust bloggers to get survey data right, so I requested a copy of the report from Edison Research (available here).  At first glance, the data look good – the research was conducted by Arbitron, and it employs a landline/mobile random digit dialing (RDD) frame, with about 1,750 people age 12 and older interviewed.  “National probability” studies of this sort are generally considered valid for population estimates.

Without getting into too much detail, a study’s validity is dependent on the sampling method and sample size (among many other things).  In terms of method, RDD is not a true equal-probability of selection method, but both industry and academia consider it “good enough” when the sample is weighted to known totals.  As for size, a sample of 1750 people allows us to make claims about a large population at an error rate of about plus or minus 3 percent.

Let’s cut to the chase: Where did the Edison Research interpretation go wrong?  In the report, Tom Webster states:

The percentage of Twitter users who are African-American currently stands at roughly 25%, which is approximately double the percentage of African-Americans in the current U.S. population. Indeed, many of the “trending topics” on Twitter on a typical day are reflective of African-American culture, memes and topics.

From this, we are to believe that of all Twitter users, 25% are African-American.  Not only is this surprising considering current population estimates, but also because Twitter is a global service.  Let’s explore how Edison got to this 25 percent number (conveniently rounded up from 24 percent).

In the phone interview, Edison asked all respondents 12+ (n=1750) if they “currently ever use[d] Twitter.”  7% of respondents said yes, approximately 123 people.  Of those 123, Edison then asked how often they used Twitter.  85% of those respondents (105 people) indicated they used Twitter at least once a month, and were thus recoded as “Monthly Twitter Users.”  Herein lies the problem: It was from these 105 individuals (not the 1750 total respondents) that Edison based its estimates of Twitter use.
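
To make the funnel concrete, here is a minimal sketch of the arithmetic using the rounded counts quoted above.  The variable names are mine, and the last figure is only a rough implication of the reported 24 percent: Edison's estimates are weighted, and the actual cell counts are not given here.

```python
# Counts as quoted in the text above (already rounded by the author).
total_respondents = 1750
ever_use_twitter = 123   # about 7% of all respondents
monthly_users = 105      # about 85% of those who ever use Twitter

share_ever = ever_use_twitter / total_respondents
share_monthly_of_ever = monthly_users / ever_use_twitter
share_monthly_of_all = monthly_users / total_respondents

print(f"Ever use Twitter:        {share_ever:.1%} of the sample")
print(f"Monthly, among those:    {share_monthly_of_ever:.1%}")
print(f"Monthly, of full sample: {share_monthly_of_all:.1%}")

# Rough implication (assumption: treats the 24% as unweighted, which it is not):
# 24% of 105 monthly users is only about 25 individual respondents.
implied_respondents = 0.24 * monthly_users
print(f"Implied African-American monthly users: about {implied_respondents:.0f} people")
```

In other words, the headline rests on a subsample that is roughly 6 percent of the people interviewed.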

Let’s return to sampling error.  Because random samples are asymptotically efficient, a sample of 1750 can speak to a population of hundreds of millions almost as well as a sample of 2000, 3000, or even 5000.  But a sample of 105 people speaking to the very large userbase (self-reported at 100 million) of Twitter? Not so efficient.  The margins of error are approximately +/- 10% at an alpha of .05, +/- 12.5% at an alpha of .01.  And these margins assume true equal probability of selection, and no nonresponse bias.  With weighting for proportionality, it is almost certain these margins increase substantially (1).
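
For anyone who wants to check these margins, here is a minimal sketch of the textbook calculation.  It assumes simple random sampling, the normal approximation, and the conservative worst case of p = 0.5; the function name is mine, not anything from the Edison report.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, alpha=0.05, p=0.5):
    """Half-width of a two-sided normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 at alpha = .05
    return z * sqrt(p * (1 - p) / n)


for n in (1750, 105):
    for alpha in (0.05, 0.01):
        moe = margin_of_error(n, alpha)
        print(f"n={n:>5}  alpha={alpha}: +/- {100 * moe:.1f} points")
```

The full sample of 1750 lands at roughly 2 to 3 points depending on alpha, while the 105 monthly users land at roughly 10 and 12.5 points, before any penalty for weighting.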

Let’s explore what this means practically.  First, Edison Research can’t speak to all Twitter users, because all Twitter users weren’t potentially included in the sample.  Edison can, however, speak to USA Twitter use, from its sample of 105 monthly users.  If we assume that only 5 million Twitter users in the USA use the service every month, Edison is still using 105 people to speak about these 5 million people (the margins of error don’t change).  Unfortunately, this is highly unreliable.

The American Community Survey finds that approximately 13.1% of the US population self identifies as Black or African American.  At an alpha of .05, the range of potentially true estimates of African-American Twitter use in the US is actually anywhere from 14% to 34%.  At an alpha of .01, this estimate ranges anywhere from 11% to almost 38%, a range that includes the 13.1% population share, so at that level we cannot reject the hypothesis that the gap is attributable to sampling error or random effects.  If we then include weights in our estimates of error (likely the case because Edison’s sample over-represents people under 24), the growth in error causes us to fail to reject the null hypothesis at the .05 level as well.  We just can’t trust that the demographics of Twitter actually do vary from current population estimates.
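
Here is a rough sketch of that comparison, using the same conservative p = 0.5 variance as the margins above.  The `design_effect` parameter is my own stand-in for the cost of weighting; Edison does not publish one, so the 1.5 below is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def interval(estimate, n, alpha, design_effect=1.0):
    """Normal-approximation interval using the conservative p = 0.5 variance.

    A design_effect above 1 inflates the variance to mimic the cost of weighting.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(design_effect * 0.25 / n)
    return estimate - half_width, estimate + half_width


def covers(bounds, value):
    """True if value falls inside the (low, high) interval."""
    low, high = bounds
    return low <= value <= high


estimate = 0.24    # Edison's figure before rounding up to 25%
benchmark = 0.131  # ACS population share
n = 105            # monthly Twitter users in the sample

# (alpha, design effect); the 1.5 is a hypothetical value, not from the report.
for alpha, deff in [(0.05, 1.0), (0.01, 1.0), (0.05, 1.5)]:
    low, high = interval(estimate, n, alpha, deff)
    verdict = "covers" if covers((low, high), benchmark) else "excludes"
    print(f"alpha={alpha}, deff={deff}: {low:.1%} to {high:.1%} ({verdict} 13.1%)")
```

Under these assumptions, even a modest inflation from weighting is enough for the .05 interval to reach the population share.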

Is Twitter “disproportionately” African American, White, Hispanic, or Green?  The simple fact is that from this data, we can’t say so with confidence.  If Edison had been a little more forthcoming with their sample sizes, it might be more likely that the blogger/journalist who reported these data would have sensed something wrong.  But I wouldn’t bank on it, because it seems like Edison Research was pushing this spin from the get-go.

A final note: as I was researching/considering this piece, it was interesting to see the “spin” being placed on this “fact” around the blogosphere.  Of course, you had your standard racist comments/tweets of the “there goes the neighborhood” variety, but there also appeared to be a large swath of users who were heralding this as a point of pride.  Before you examine my subconscious racist motives for examining this question, please just know I like getting surveys right.  And if Edison wanted to get this right, they could start by giving us a topline cross-tab of ethnicity, Twitter use, and the respective margins of error.

Ugh, footnotes on a blog!

1. Research consistently demonstrates a negative relationship between age and nonresponse; young users are less likely to respond, increasing their odds of being weighted up in a population (and increasing their margins of error).  Research is mixed on the relationship between ethnicity and nonresponse.


19
Apr 10

Privacy in Social Software

Last week, I wrote a number of essays critical of Twitter’s decision to provide a collection of public Tweets to the Library of Congress for permanent archiving.  I argued that by taking user data and putting it into a public archive, Twitter had meaningfully restricted the privacy rights of users.  Some of you agreed with my position, many didn’t; but all who commented or wrote to me helped shape my thinking.  In this post, I want to provide a little more context on the nature of privacy in systems like Twitter.

Last week, I gave a talk on the dynamics of privacy in Facebook.  In the research, we modeled a behavior that is increasingly pervasive in Facebook: having a friends-only profile.  I want to draw attention to one slide from the talk:

In this slide, the two slopes you see are the growth of Facebook, and the proportion of UNC undergraduates with friends-only profiles.  Now, the data are on different axes, and Excel is fitting the lines, but the trend is meaningful.  With growth in the service we see a correlated turn towards privacy.

While the pattern I observe is only general to Facebook at UNC, other researchers have observed similar patterns of privacy behavior in other social software.  For example, as Friendster scaled,

[S]o too did the diversity of the social networks represented. A growing portion of participants found themselves simultaneously negotiating multiple social groups—social and professional circles, side interests, and so on.  (boyd, 2007)

With the increasing complexity of diverse audiences, individuals turned to a range of strategies to manage their privacy: multiple accounts, limiting disclosure, or simply dropping out of the service.  Regarding Myspace, Caverlee and Webb (2008) reported (bold is mine):

Overall, the fraction of private profiles is increasing with time, indicating that new adopters of social networks may be more attuned to the inherent privacy risks of adopting a public Web presence. We find that women favor private profiles 2-to-1 over men, and that (perhaps, counter-intuitively) younger users are more likely to adopt a private profile than older users. We also find that the more connected a user is in the social network, the more likely she is to adopt a private profile.

And now in Facebook, our research finds a similar movement towards privacy as the service grows and networks diversify.  One can only suspect that Facebook’s recent “privacy upgrades” and changes to the terms of service prohibiting privacy of certain information have something to do with this normative shift.

Looking at the data across systems, I’d like to speculate that there’s a general property at work.  In a social software system, as the system grows and diversity of networks increases, so does utilization of privacy.  Here’s a graph I’ve constructed illustrating the trend (larger version):

The slope is purposefully convex. In the early stages of adoption, network use is sparse, so individuals are incentivized to lower privacy, to increase the odds of finding others. As time passes and the service grows, individuals form dense, small-world clusters. At this stage, individuals are mainly connected to one another within one context, and there are minimal bridges between contexts. Therefore, individuals can afford to keep privacy low, due to minimal risk of inadvertent sharing across context. As the system expands, however, we see a turn back towards privacy as an increasing number of bridges across context are created. In this moment of context collapse, individuals erect barriers of privacy to facilitate continued disclosure.  Here’s a closer look at the (simulated) networks:

By linking privacy to context collapse, I argue that mobilization towards privacy is largely a function of perceived audiences (and harms).  This distinction is important because it holds privacy attitudes constant.  Research, both mine and by others, has demonstrated that privacy attitudes do not necessarily predict privacy behaviors.  Awareness of privacy-in-context is actually the key variable causing the dynamic shift towards privacy in social software systems.

Let’s return our attention to Twitter.  What does your Twitter network look like?  If you’re an average user, your network probably contains a few offline friends (many, many fewer than Facebook or Myspace) and some celebrities (your definition may vary).  There may also be a few friends you’ve made on Twitter, who you don’t know offline.  Chances are, the average Twitter user’s network looks like the sparse “Early Adopter” or “Small World” network.

We see evidence in cultural practice that users have sparse networks in Twitter.  Going back to my notes on Alice Marwick’s AOIR ’09 talk, the culture of celebrity serves a very functional purpose for Twitterers with sparse networks, who wish to connect out of  limiting contexts.  “Talking” to celebrities (and finding others who talk to the celebs you talk to) is a way of escaping one’s sparse world, finding new people to follow in a known context.  Hashtag culture provides further evidence that individuals are trying to talk “across” or “out” of limited contexts.  If your network is sparse, turning to site-level anchors like hashtags and celebs provides a reliable stream of conversation in networks where conversation is lacking due to structural impediments.

I wonder how long these practices will need to continue.  Just the other day, Twitter announced that 100 million people had created accounts.  You can’t turn the news on without hearing about Twitter.  A large group of people, primed on social software by Facebook, are waiting to join Twitter.  And over the next year or two, they will, raising issues of context collapse, and prompting a turn toward increased privacy among early adopters.

My major problem with the Twitter/LoC agreement is that the people who will be confronted with context collapse and a growing need for privacy have lost meaningful recourse.  As I argued in my last post, it becomes impossible to take back what you’ve shared, a real and useful privacy strategy.  You’ll still be able to make your account private, but it seems there’s little you can do about the Tweets you sent that were archived permanently in the Library of Congress.

Why is this bad?  Let’s consider a hypothetical.  In 2007, Myspace had 100 million users.  Myspace was growing fast, with many users signing on for the first time.  Myspace users had two options for privacy: public or friends-only.  And a lot more people had public profiles in 2007 than they do today.  How would we feel, now, if Myspace had given all of its public profiles to the Library of Congress for permanent archive in 2007? I can only guess that a bunch of people who had public profiles in 2007 might feel a little uncomfortable about it (cue the “it’s their own damn fault” chorus).

I guess I should feel relief that if Twitter is going to do this to users, at least they are partnering with the LoC (an admirable entity).  But, in reading what LoC staff is saying about this effort, I’m not comforted.  Of the dataset, LoC Blogger Matt Raymond writes “I’m certain we’ll learn things that none of us now can even possibly conceive.” National Archivist David Ferriero writes “What will historians be able to glean from our tweets?  We can’t be sure, but it will probably be very interesting” (while also stating “Twitter is not for everyone. If you are anything like me, you don’t really care what someone had for breakfast.”)  It strikes me that the Twitter archive is being treated like a novelty, promising to be an amazing treasure trove when new research methods are developed.

Maybe it’s all these years of running t-tests (developed 1908), but I’m skeptical that these Tweets are going to tell us something that we can’t quite imagine.  Robust methods develop slowly, and are validated over time.  We’ll probably still be doing text mining, linguistic and sentiment analysis, and content analysis 50 years from now.  One area that is improving rapidly, however, is the identification of individuals in large data sets.  The Netflix dataset was de-anonymized by Narayanan and Shmatikov.  Acquisti and Gross demonstrated they were able to guess people’s social security numbers from public data.  And old-fashioned detective work by Michael Zimmer identified the T3 Facebook dataset.  Of the future, we know this: It will be easier to connect you to your archived Twitter identity.

So here’s the thing.  Why won’t Twitter make the archiving a simple, opt-in process?  Or at least allow people to opt out?  Twitter obviously knows that giving user data to a permanent archive is different from sharing an API or allowing a Google spider – they wouldn’t have approached the LoC if this wasn’t the case.  I may be the only voice shouting about this, but this is a big, watershed moment regarding user privacy.  EFF, EPIC, Facebook watchdogs – where are you?  Let’s work with Twitter and make this right.


16
Apr 10

Is it time to cancel your Twitter account?

I was pleased to see that my last post on Twitter and the LoC generated excellent discussion both here in the comments and over in Twitter.   I’ve seen some great defenses of the deal, but unfortunately I’m not buyin’ quite yet.  I thought I’d use this post to quickly raise a few more questions and concerns.

First, a quick review of some of the conversation about the deal.  Zimmer is all over it, raising a number of great open questions, and exploring how private tweets just might end up in the LoC’s archive.  The Atlantic has rounded up opinions, particularly an interesting conversation going on at The Big Money.  Also notable is a BBC interview with Twitter’s general counsel, though it skips over privacy issues.  Now that I think of it, skipping over privacy issues might be the theme of this essay.

One of the central problems with this deal is the set of assumptions around public Tweets.  Particularly, because the Tweets are “already public”, individuals lose all rights to the content.  In my last post, I explored some ways in which content shared in public actually wasn’t public content.  For example, practically obscure public content that is meant for a select audience.  In this post, I want to challenge another assumption that people make about public content: that it lives forever.

If there’s one thing that social media has taught us, it is that if you post anything to the web, it stays there forever.  Of course, this is empirically false.  Companies go out of business, databases corrupt, servers crash, indexes get expunged, identifiers get mixed up, and even with the best intentions and good backups, data are lost.  Think about the Google search results for your name.  Are they the same as they were 1, 3, or 5 years ago?  While it is likely that you could tell me tons about new results that have come online over that time period, could you tell me about the ones that have gone offline?

So let’s just take a second and put the assumption that the internet is a giant cache to bed.  The internet is dynamic, fragile, and designed to lose things.  The internet has probably forgotten more about you than it remembers.  The next question generally brought up is “What about Google!”  If you want an answer to that question, send out a Tweet and then delete it.  Wait a few days and search for it.  The Tweet is gone, because Google isn’t in the business of sending you to 404s.  Thank the market for that one.  After we knock down the Google straw man, the next assumption generally covers the suspicious “other” person who is stalking you and creating a giant portfolio of everything you do.  I hate to pop everyone’s bubble, but unless you’re a really, really significant public figure, this person doesn’t exist for you.

So why is it that we all assume that the content we share publicly will be around forever?  I think this is a classic case of selection on the dependent variable.  When we Google ourselves, we are confronted with what’s there as opposed to what’s not there.  The stuff that goes away gets forgotten, and we concentrate on things that we see or remember (like a persistent page about us that we don’t like).  In reality, our online identities decay, decay being a stochastic process.  The internet is actually quite bad at remembering.

The Library of Congress, on the other hand, is quite good at remembering.  Magnificently good at it, most likely the best in the world.  And that is what’s troubling.  Up until Twitter sent its archives over to the Library of Congress, Twitter users could realistically expect they could make things go away.  They could delete Tweets.  They could change their account name.  They could remove their account.  Without consulting their users, privacy advocates, rights organizations, or any other voices of reason, Twitter has summarily taken these very real privacy remedies away from their users.

This gets me to what is so frustrating about Twitter’s move: a frighteningly cavalier attitude towards shipping around the data of tens of millions of consumers.  Twitter has literally passed the personal information of millions of users to a permanent, public archive without so much as pre-notification, consultation, or the opportunity for debate.  And even though it appears legal for the LoC to have the data, big questions remain regarding whether Twitter has actually violated its own contract with users.  How can I meaningfully own my content after it has been shipped to a government archive?

In all my years of using Twitter, the idea of canceling my account has never even vaguely crossed my mind.  Until last Wednesday, that is.

Update: American Prospect has a great interview with Martha Anderson of the Library of Congress.  Regarding the deal:

The agreement has been signed, but we still have a lot of technical details to work out — how we’ll technically transfer it, and when.

Regarding opt-out:

You know, I don’t know. I think that’s a question for Twitter. There’s several questions about that which they are still working out. We asked them to deal with the users; the library doesn’t want to mediate that.

Regarding user information:

I think that’s one of the big issues for us to understand in terms of privacy. And there’s a lot of work going on, especially over at [the National Institutes of Health] about how to anonymize data and still make it useful. We’re really big on partnering with people to learn what they’re learning, so I think that’s an area we’ll look into. In serving it, what can we do to make it useful to research but not identify personal information?


14
Apr 10

Twitter and the Library of Congress

I’m currently at the CHI conference, which is commanding all of my attention, but the news about Twitter and the Library of Congress is too big to ignore (see also Zimmer, RWW).  Quoting the LoC:

Have you ever sent out a “tweet” on the popular Twitter social media service? Congratulations: Your 140 characters or less will now be housed in the Library of Congress.

According to Biz Stone, Twitter will begin transferring all of their public tweets, after a six-month embargo, to a permanent, public archive at the Library of Congress.  Let me say something (probably) unpopular: I’m a little horrified.

If you talk to people about things shared online, you generally run into two assumptions.  The first is that things shared publicly are meant for the general public.  The second is that things shared publicly are meant for posterity.  Both of these assumptions are dangerous.  Some of my recent work has identified that people do share privately in public, and that individuals do engage in the grooming (i.e. removal) of content shared publicly.  danah’s found this.  So have lots of others.  If there’s anything we should know by now about social media, it is that a deterministic, one-size-fits-all approach to privacy is a bad approach to privacy.

This is what makes Twitter’s “gift” troubling.  It assumes that all content shared publicly is truly public and for posterity.  Let’s consider some edge cases.  Bob has two Twitter accounts, one for work and a personal account.  Both are public, but the only way people find out about his personal account is that he tells people the obscure handle.  Bob wants to be practically obscure – private in public – without going to all the trouble of setting up complicated privacy controls.  So what happens, two years from now, when Bob accidentally discloses his handle in the wrong context, and he needs to remove some Tweets?

There’s probably a certain class of reader that looks at Bob and says, well, Bob’s out of luck.  There’s Google cache and third party tools and a whole host of other ways tweets are preserved.  The difference I’d argue is that these tools have certain properties – they react to API calls, they decay, etc. – that make them qualitatively different from a professionally managed archive.  Through the creation of a permanent, public, third-party archive, Twitter changes the privacy-management strategies that are going to be available to users in the future.  This is critical, because if Bob can’t trust his down-the-road privacy management strategy, Bob might share less today.

This is a great opportunity to plug the work of Helen Nissenbaum, whose most recent book Privacy in Context extends the argument for privacy as contextual integrity.  Nissenbaum argues that disclosures have contextual expectations, and that shifting these expectations constitutes a meaningful violation of privacy and freedom.  Even though the tweets are public, it is a fallacy to assume that digital content shared in public was created with an understanding that the content would end up in a third-party, government-managed archive.  Facebook’s helped us demonstrate again and again that privacy is both qualitative and quantitative.

Practically, there are some questions that Twitter needs to address about this move.  First, Twitter’s terms of service specify that:

You retain your rights to any Content you submit, post or display on or through the Services. By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed).

The way I read this is that as long as your content is on Twitter, Twitter can do what they want with it.  Fine.  But what if you remove your content from Twitter?  Wouldn’t Twitter’s licensing of your content to the LoC also expire?  Twitter needs to address exactly how we can pull our content out of the archive when we want.  Michael Zimmer thinks that Twitter users won’t have the ability to remove tweets from LoC, so how will Twitter rectify this in the terms of service?

A broader question is why Twitter didn’t just build this as an opt-in service.  Or even, less preferably, an opt-out service.  Is the collection so important that it is worth compromising user privacy?  I’ve got a feeling that there are certain assumptions around “public” content and the feel-good vibe of the Library of Congress that led to a lack of critical thinking about the implications of this move.  It’s time for Twitter to start sharing more information, opening up an earnest conversation about this move.


5
Aug 09

Teens Don’t Tweet, or, How to Read a Web Panel

In the past few months, we’ve seen a number of studies of dubious methodology make sweeping generalizations about Twitter.  Examples include the Twitter gender study; another study asserted that because only one out of every five teens tweets, teens don’t use the service (wouldn’t you like one out of every five teens to use your product?).  Nielsen joins this discussion by stating that “Teens Don’t Tweet”, based on findings from their online panel.  They assert that “In June 2009, only 16 percent of Twitter.com website users were under the age of 25. Bear in mind persons under 25 make up nearly one quarter of the active US Internet universe, which means that Twitter.com effectively under-indexes on the youth market by 36 percent.”  Oh noes!

[Figure: twitter_by_age]

As these data will undoubtedly be reported breathlessly elsewhere, I thought it might be useful to step back and explore some of the issues with the methodology and conclusions.  So first, a note about the methodology.  The Nielsen NetView panel contains an impressive 250,000 users.  Metering software located on client machines records the websites visited by panel members.  A vast majority of the panel is recruited online; the panel is “calibrated” (weighted) against gold-standard sampling methods (Random Digit Dialing, etc.).

Survey weighting is a standard, fairly uncontroversial process.  It is commonly used and is thought of as preferable to census-type approaches that may systematically under-represent some populations.  However, reliable survey weighting gets tricky when the population is small.  Since teens are a notoriously hard-to-reach population, we generally see inflated standard errors around weighted teen respondents in a population survey.  Nielsen does not report standard errors, and the makeup of their panel is confidential, so it is impossible to know how much error there is around the estimate of use.  If the panel is like other panels, though, there may be more error in estimates for young people than for a high-response population, such as adults.  We’re very familiar with margins of error (the things you see in political polls, where error is reported as plus or minus 3 percent, etc).  An inflated error means the margin is larger, meaning that the estimate may vary by a larger amount.

This is not to put down Nielsen.  With 250,000 members, the panel likely has good coverage of young people.  Since my purpose is to use this example to critique web panels, we must point out two other issues.  First, bigger is not necessarily better if the sample is convenience driven.  Nielsen’s panel is very large, but simply because it is large doesn’t mean it is representative.  In fact, Nielsen is likely more interested in the larger size for better sparse-market coverage, as opposed to statistical reliability.  Second, the particular nature of recruitment into the main panel introduces selection bias.  If people aren’t selected randomly, then there may be characteristics of the population that covary with the variables of interest.  This is an omnipresent issue with polling, but it must be noted.

So when we read a web poll of this particular nature, what are the critical questions we should be asking?  First, we should be concerned about cell size (the number of respondents) for a hard-to-reach population.  If young users are underrepresented, the standard errors on the estimates can be quite large (which may push an estimate around by +/- 10 percent).  We should also question the method of recruitment; if the majority of the panel comes in via the web, then who gets left out?  Since this poll is designed to represent online users, it seems likely that heavy web users are participants (my guess).  But what if Twitter users actually aren’t like heavy web users?  There are a whole host of other questions we should ask regarding polls (response rate, sampling frame, etc) that generally aren’t answered in online polls.

It is important to understand the potential methodological issues when reading research.  Nielsen’s methods are standard for the industry, and they acknowledge the drawbacks and limitations.  In my opinion, the major problem isn’t the methods component, it is Nielsen’s spinning/presentation of its results.  In the Nielsen study (and the previous Participatory Media Network study), the findings focus on lack of teen use of Twitter.  However, the findings reported by Nielsen cover the following age ranges: 2-24, 25-54 and 55+.  The critical category, 2-24, covers a wide range of users – incredibly young children, adolescents, teens and adults.  The grand mean reported by Nielsen is affected by variation inside the different age categories.  Using census data, we can look at age breakdown over the ranges 2-24.  According to the census, there are 80MM Americans under age 24 (0-24).  There are approximately 15-16MM Americans in each of the age ranges 0-4, 5-9, 10-14, 15-19, and 20-24.  Therefore, each category is pretty much weighted equally.  So let’s (hypothetically) assume that no one age 0-9 uses Twitter, and that 5% of people age 10-14, 35% of people age 15-19, and 40% of people age 20-24 use Twitter.  To calculate the grand mean we would weight the percentages and then sum.  Such a formulation would give us 16% use for the demographic age 0-24 (0+0+.01+.07+.08).
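
The grand-mean arithmetic above can be sketched in a few lines.  The usage rates are the hypothetical ones from the paragraph, not measured data, and the equal weights rest on the assumption that each five-year band holds roughly the same number of people.

```python
# Hypothetical usage rates by five-year age band (from the paragraph above).
usage_by_band = {
    "0-4": 0.00,
    "5-9": 0.00,
    "10-14": 0.05,
    "15-19": 0.35,
    "20-24": 0.40,
}

# Assumption: the bands are about the same size (15-16MM each), so equal weights.
weight = 1 / len(usage_by_band)
grand_mean = sum(weight * rate for rate in usage_by_band.values())

# The same hypothetical data, restricted to the bands people mean by "teens"
# and "young adults".
older_bands = ["15-19", "20-24"]
older_mean = sum(usage_by_band[band] for band in older_bands) / len(older_bands)

print(f"Grand mean, ages 0-24: {grand_mean:.0%}")
print(f"Mean, ages 15-24 only: {older_mean:.1%}")
```

The point of the exercise: a 16 percent grand mean is perfectly compatible with more than a third of 15-24 year olds using the service, because the bands with no plausible users drag the average down.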

The second problem with Nielsen’s presentation is the comparison range.  Comparing the age ranges 2-24 and 25-54 is not fair on a number of levels.  The first category can really only meaningfully cover age 13-24, while the 25-54 age range meaningfully covers 30 years.  If we weight the estimates (16%/64%) by volume coverage (1:3 ratio), then the category volume for older users would be ~21% (I didn’t bother to weight by census, just an estimate).  And what if we compared just teens against an adult category – we might even find that teens use Twitter more than adults.  Keep in mind, with all the advantages afforded to older users (no Internet restrictions, etc) there are major differences between older users and teen/young users in their capacity to partake in online community.
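
A quick sketch of that back-of-the-envelope normalization follows.  The 1:3 ratio is the rough estimate used above; the per-year version is my own variation, assuming 12 usable years (13-24) against 30 (25-54), and neither is weighted by census counts.

```python
# Shares of Twitter.com users by Nielsen age category, as discussed above.
share_young = 0.16   # ages 2-24
share_older = 0.64   # ages 25-54

# Rough estimate from the text: the older category covers about 3x the span.
coverage_ratio = 3
normalized_older = share_older / coverage_ratio

# Variation (my assumption): divide each share by the years it meaningfully covers.
years_young = 12     # ages 13-24
years_older = 30     # ages 25-54
young_per_year = share_young / years_young
older_per_year = share_older / years_older

print(f"Older share scaled to a comparable span: {normalized_older:.1%}")
print(f"Share per year of age, young: {young_per_year:.2%}")
print(f"Share per year of age, older: {older_per_year:.2%}")
```

Either way, the 16 versus 64 gap in the raw categories shrinks to something far less dramatic once the categories cover comparable spans.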

My analysis is simplistic and speculative, but in certain configurations, “young people” could plausibly use Twitter at higher rates than “adults.”  I don’t have a guess regarding what is right, but my gut tells me that if Nielsen was more upfront regarding their sampling, and less misleading with their infographics, we’d have a different story.  And that story would not be as catchy and headline-grabbing as “Teens Don’t Tweet.”


4
Jun 09

Rethinking Twitter and Gender Differences

On Monday, the Harvard Business School posted a “conversation starter” study on gender differences in Twitter use. The authors found that “men have 15% more followers than women” and “an average man is almost twice as likely to follow another man than a woman.” The authors suggest, without empirical data, that men find the content produced by women less compelling “because of a lack of photo sharing.” Is everyone else offended by this base characterization?

As it happens, the study has serious flaws.  I’d like to point those out, and then suggest an alternative method for addressing these questions.  Let’s start by talking methods.  This study is a survey; using a random sample of 300,000 Twitter users, the authors attempt to draw population-level inferences about “friending” behavior in Twitter.

When conducting a population survey, researchers collect a sample and attempt to use that sample to draw inferences about a population. The difference between the “sampled” population value and the “true” population value is known as error. Total survey error (usually summarized as mean squared error, MSE) has two components: sampling error and non-sampling error. We are most familiar with sampling error; it is the difference between the “sample” value and the “true” value attributable to the sample selection. Non-sampling error comprises all other error not attributable to sampling, such as data entry error, instrument error, etc.

For the purpose of this analysis, we are going to focus primarily on sampling error. At the study’s sample size of 300,000, there is very little sampling error, even for an infinite population. While we generally associate a large sample size with better quality data because of this small sampling error, there are two caveats. First, above a certain sample size, say 20,000, there is little marginal gain in the addition of sample. The difference between an n of 500 and an n of 1000 is vast, but the difference between an n of 20,000 and an n of 40,000 is much smaller, because sampling error shrinks with the square root of n rather than with n itself.
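The diminishing return on sample size is easy to see with a rough calculation. This sketch assumes simple random sampling, a proportion near 50% (the worst case), and a 95% confidence level; the function name `margin_of_error` is my own.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the 95% interval for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


# Sample sizes mentioned in the text, plus the study's 300,000.
for n in (500, 1000, 20000, 40000, 300000):
    print(f"n={n:>7,}: +/- {margin_of_error(n):.2%}")

# How much precision each doubling of the sample buys.
gain_small = margin_of_error(500) - margin_of_error(1000)
gain_large = margin_of_error(20000) - margin_of_error(40000)
print(f"500 -> 1,000 narrows the margin by {gain_small:.2%}")
print(f"20,000 -> 40,000 narrows it by {gain_large:.2%}")
```

Doubling from 500 to 1,000 buys more than a full percentage point of precision. Doubling from 20,000 to 40,000 buys about a fifth of a point.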

Here is the second caveat: a larger n is only better on paper. When dealing with very large samples, the confidence intervals used to determine significance are smaller – meaning even the most minute differences become “significant.” Furthermore, discovery of influential data is more difficult, as those data may be sufficient in number (i.e., a pattern emerges in influential data) to influence the distribution. As any Twitter user with a public profile knows, there are certainly some “patterns” that emerge in follower behavior.
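As a sketch of how sample size alone can manufacture “significance,” here is the same one-point difference tested at two sample sizes. It uses a standard pooled two-proportion z-test. The proportions and group sizes are invented for illustration and are not taken from the HBS study.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference between two proportions (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical: 51% vs. 50%, first with 1,000 per group, then 150,000 per group.
p_small = two_sided_p(two_proportion_z(0.51, 0.50, 1000, 1000))
p_large = two_sided_p(two_proportion_z(0.51, 0.50, 150000, 150000))

print(f"n = 1,000 per group:   p = {p_small:.3f}")
print(f"n = 150,000 per group: p = {p_large:.2e}")
```

The difference is identical in both cases; only the n changes. At the smaller size it is nowhere near significant, and at the larger size it clears any conventional threshold.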

Let us revisit the purpose of the survey, which is to use a sample to draw inferences about a population with as little total error as possible. The goal is not to achieve significant differences on wild hypotheses; it is to collect good data that represents a population. To achieve this goal, survey designers expend a lot of effort understanding their populations, defining their sample, and working to achieve high data quality (while keeping costs under control).

Let’s say that I wanted to know the 2008 income of everyone over 18 born in my city. So I go down to city hall and ask for the names of everyone who was born in my city before 1991. I then take this very large list, cross-reference it with my magical 2008 tax records, and produce a wonderful study. Can you spot some problems with the data? At first, you might point out that not everyone over 18 born in my city earns an income. Ok, that’s fine – I want to know that. Now here’s the real problem: my city started keeping records in 1830, meaning well over half of the people in my sample are dead, and they report no income. Now I’ve got some highly influential data that actually looks “normal” due to attrition.

Let’s consider what we know about Twitter.  If we believe Nielsen, about 60% of people who create Twitter accounts abandon them within a month.  And if we believe the fair and balanced news organization Fox News, Twitter has a spam problem (Ok, anyone who has a public profile knows that).   What might these trends tell us about our population?  First, there will be a large cluster of inactive (attrition) users.  Second, there will likely be a large cluster of users who do not follow anyone, or follow a very small number of people (characteristic of attrited users).  Finally, since following is non-reciprocal, these attrited users (and active users) likely have their follower numbers inflated by Twitter spammers.

What do the HBS numbers tell us? The authors find that the mean number of tweets per user is 26, but the median is 1 and the 75th percentile is 4 tweets. This indicates a highly non-normal, heavily skewed distribution (it most likely approximates a bimodal distribution): half of the sample has 0 or 1 tweets, and three quarters have 4 or fewer. This suggests that a large portion of the sample is inactive. (Of course, a number of these accounts could be “follower” accounts, i.e., people who do not post but follow, but I would argue this would constitute a small portion of the population.) This provides good support for my first point.

My second point, non-follower data, is not addressed by the study. They do not present information regarding the percentage of users who do not follow back, instead presenting an odds ratio that would hide the distribution of followers. I would guess that at least 40% of the sample does not follow anyone (or follows only “suggested” users). My third point, that follower counts would be inflated, seems to be upheld, as 80% of the sample has at least one follower. There is likely some spam inflation there, and information about the distribution would tell us a lot.

As we can see, all signs point to low data quality, which casts all of the hypotheses and findings in serious doubt. Just because a sample is large, and significance can be easily achieved, it doesn’t mean that data quality is good. Unfortunately, it appears that the Harvard authors have made the error I describe in my income study – yes, they’ve collected a lot of people, but they failed to see who had died. What good is an inference about a population if it is heavily influenced by bad data? Don’t we actually want to know what real users are doing?

Beyond these data quality problems, there is also an issue with the gender classification; the authors rely on a corpus of names to predict the gender of users. As each name is a prediction, there is an error component associated with each name classification. This error component must be included in the total variance – meaning all of the things that looked significant may not actually be significant.

Since this is a “discussion,” I’d like to propose a method to re-run the study with better data quality (but larger standard errors). The two main problems that will be addressed are compensation for attrition and gender classification. To deal with Twitter attrition, let us first define it. If we follow Nielsen’s numbers, a Twitter user who has posted at least once more than 30 days ago and at least once within the last 30 days has a decent chance of being an active user. We may want to make this criterion more lenient – perhaps just requiring one post in the last 30 days. Either way, we must define a criterion to decide who is an active user (and this definition must be informed by data and theory).
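Here is a rough sketch of how such an activity criterion could be coded. It is one possible reading of the rule above, not a validated definition. The function `is_active`, its parameters, and the example timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def is_active(post_times, now, window_days=30, require_older_post=True):
    """Hypothetical activity rule: a post inside the window, optionally one before it."""
    cutoff = now - timedelta(days=window_days)
    has_recent = any(t >= cutoff for t in post_times)
    has_older = any(t < cutoff for t in post_times)
    return has_recent and (has_older or not require_older_post)


now = datetime(2009, 6, 4)

# Invented posting histories for three kinds of account.
steady = [datetime(2009, 3, 1), datetime(2009, 5, 28)]
newcomer = [datetime(2009, 5, 30)]
abandoned = [datetime(2009, 1, 15)]

for label, posts in (("steady", steady), ("newcomer", newcomer), ("abandoned", abandoned)):
    strict = is_active(posts, now)
    lenient = is_active(posts, now, require_older_post=False)
    print(f"{label:>9}: strict={strict}, lenient={lenient}")
```

The strict rule drops the newcomer, who has only posted in the last month. The lenient rule keeps them. Which one is appropriate is exactly the kind of decision that should be informed by data and theory.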

The problem with gender is a little more difficult. I don’t spend a lot of time in the TREC community so I’m not sure how good automated techniques are, so I’m going to propose human classification. The most efficient way to do this is with Mechanical Turk. Turkers could be shown a profile and asked to decide the gender of the profile owner; you’d repeat with a different rater to get an estimate of reliability. Your guess is as good as mine about agreement – I’m generally skeptical of ethnicity ratings by third parties, but I tend to think that gender can be reasonably assessed. Update: @yardi brings up a good point regarding brand/persona/promotional/shared accounts. My (too simple) answer is exclusion. If we’re truly interested in this gender question, then non-gendered accounts fit an a priori exclusion criterion. My gut instinct is that in a population sample, we would see low incidence of these accounts, and they could be analyzed separately to see how they would affect our data.
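One common way to summarize the two-rater reliability check described above is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. I am swapping it in as an illustration; nothing here says it is the only reasonable statistic. The labels below are invented, and the “unclear” code stands in for the brand/shared accounts that would be excluded.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codes for ten profiles from two hypothetical Turkers.
first_pass = ["F", "F", "M", "M", "F", "M", "unclear", "M", "F", "M"]
second_pass = ["F", "F", "M", "F", "F", "M", "unclear", "M", "M", "M"]

kappa = cohens_kappa(first_pass, second_pass)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 profiles, but once chance agreement is accounted for, kappa is noticeably lower than 0.8. That is the point of using it rather than raw percent agreement.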

So the study would be simple – collect a first-stage sample of profiles and assess whether they meet the activity criteria (this can be done automatically). Then run a second-stage random sample on the eligibles and send them to Mechanical Turk. You could send 3000 profiles to MTurk and have them assessed, with a goal of ending up with 2400 profiles, giving you +/-2% at p<.05. Of course, all of the “friend lists” would also have to be gender coded, so if you have an average of 10 friends you’re looking at 24,000 extra codings (minus overlap). If we account for overlap and say we’ll have 25,000 unique profiles, and each profile has to be rated twice, at $.01 a HIT we’re looking at a total price of $500.00. Of course, if we pull our sample back we can reduce this cost substantially.
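The back-of-the-envelope numbers above can be checked with a few lines. This sketch assumes simple random sampling for the margin of error and uses the figures from the paragraph (2400 profiles, 10 friends each, 25,000 unique profiles, two ratings, one cent per HIT). The function names are my own.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


def coding_cost_dollars(unique_profiles, ratings_per_profile=2, cents_per_hit=1):
    """Total Mechanical Turk cost in dollars, computed in whole cents."""
    return unique_profiles * ratings_per_profile * cents_per_hit / 100


seed_profiles = 2400
friend_codings = seed_profiles * 10          # average of 10 friends per profile
upper_bound = seed_profiles + friend_codings  # before removing overlap

moe = margin_of_error(seed_profiles)
cost = coding_cost_dollars(25000)             # the post's post-overlap estimate

print(f"margin of error at n={seed_profiles}: +/- {moe:.1%}")
print(f"profiles to code before overlap: {upper_bound:,}")
print(f"cost for 25,000 profiles rated twice: ${cost:,.2f}")
```

Halving the second-stage sample to 1200 profiles would roughly halve the coding cost while widening the margin to a bit under +/-3%.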

There are a couple of questions. First, we can’t really say how much better humans will perform at gender-coding until we run a comparison to the machine-coded results. My gut is that humans will perform at a higher level of accuracy, but there is still a variance component with the classification. We also don’t know what kind of bias we introduce by cutting out “follower” profiles. I don’t know how many of these unique profiles would show up in a population survey, but it is an open question. And what about the findings, how would they change? My gut is that a lot of these “stunning” findings would go away, and we’d see greater gender homogeneity in “following” behavior. “Follower” behavior would still be influenced by spam, so it might be useful to assign a spam attribute to profiles to be used as a covariate (you could have Turkers code them, run them through SpamAssassin to get a naive score, or simply use standard techniques to find influential data).

The important takeaways from this discussion are that “bigger” is not always better with social data, that data should be looked at critically before running analysis (using existing information and theory), and that data that wildly contravenes existing findings should always be re-run to produce robust estimates.

Final note: The authors state that “On a typical online social network, most of the activity is focused around women – men follow content produced by women they do and do not know, and women follow content produced by women they know.” Mike Thelwall’s (2008) large-scale analysis of MySpace friending behaviors found that while females tend to friend females, there was not a significant gender effect for males. In Mayer and Puller’s (2008) analysis of Facebook, they found that same gender was a significant predictor of friendship (in a potentially overfitted model). Overall, studies commonly find gender differences regarding SNS/internet use; females are generally found to use communicative tools with greater intensity (e.g., Joinson, 2008; Lenhart & Madden, 2007; Jones et al., 2009).

References:

Joinson, A. N.  (2008).  Looking at, looking up or keeping up with people?: motives and use of facebook.  In CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, New York, NY, USA, 2008 (pp. 1027-1036).  ACM.

Jones, S., Johnson-Yale, C., Millermaier, S., and Perez, F. S.  (2009).  U.S. College Students’ Internet Use: Race, Gender and Digital Divides.  Journal of Computer-Mediated Communication, 14(2), 244-264.

Lenhart, A. and Madden, M.  (April 18, 2007).  Teens, Privacy and Online Social Networks: How teens manage their online identities and personal information in the age of MySpace.  Pew Internet and American Life Project.  Retrieved March 9, 2008 from http://www.pewinternet.org/PPF/r/211/report_display.asp.

Mayer, A. and Puller, S. L.  (2008).  The old boy (and girl) network: Social network formation on university campuses.  Journal of Public Economics, 92(1-2), 329-347.

Thelwall, M.  (2008).  Social networks, gender and friending: An analysis of MySpace member profiles.  Journal of the American Society for Information Science and Technology, 59(8), 1321-1330.


1
Jun 09

Second Class Citizens on the social web

Over the past few days, I’ve seen a few blog posts referencing various “studies” that claim that young people don’t use Twitter.  Apparently, this is a problem.

As reported on CNET, “99 percent of 18- to 24-year-olds have profiles on social networks, only 22 percent use Twitter, according to a new survey from Pace University and the Participatory Media Network.” Never mind that it’s not particularly fair to compare a sector to a single product; what does the study’s methodology look like? More bad news – the question was posed to 200 members of a volunteer panel. A small convenience sample provides very little inferential power; it is just as likely that this survey’s statistics looked like Pew’s numbers by chance occurrence. However, my main goal here isn’t to rail against small or convenience samples being reported as representative – this is a pervasive problem and there’s not much that Unit Structures can do.

Rather, I’d like to question this problematization of the “fact” that Twitter’s users aren’t young.  The inherent bias in media coverage of social software is that social software is for “the young.”  If we look at the history of social networking websites, we find mixed evidence to support this theory.  For example, danah boyd’s ethnography (and my personal recollection) of Friendster was that it was a place for the late-twenties and thirty-something set.  If it weren’t for bonehead moves on behalf of Friendster’s staff, we might still be using the service.  LinkedIn, a popular and pervasive social network, has existed with an older skew for years.  Facebook’s growth after opening up?  It has been primarily dominated by older users.

This is not to say that young people aren’t important.  They are the lifeblood of a number of popular social networks, including large communities and countless smaller ones you’ll never hear about.  But why do we accept youth adoption as social fact ensuring community success?  One reason is surely that young people are trendsetters.  However, this theory of “trending” is an artifact of a pre-digital age, in which exclusivity and first-mover capitalization were required in the context of a production cycle.  What is a trend in the digital age, if I can have a perfect replica of what the kids have, streamed via cable modem?  Another reason is that young people are more connected.  There is truth here; young people are disproportionately more connected than older people, but this is also changing.

It might help to think of connectivity in two ways. The first is traditional connectivity – the ability to access the internet. If you look at Pew’s numbers[1], you’ll see that older users are less connected. However, if you cut off the tail of the distribution and consider only users 60 and younger, you find that 71% of them have connectivity. Users in their 40s report connectivity rates in the 80s (percent), about 10 points less than teenagers. For a large segment of users, we actually find that teens aren’t that much more connected.

Let’s consider a second notion of connectivity, which is the saturation of your online connections with friends or contacts. Here, teens have old people beat hands down. Teens interact more with their friends online, they manage their lives online – overall, they are more connected to their personal networks through computers. Revisiting our first definition of connectivity, we can see that the explanation for the second definition must be heavily cultural, and not only technical. That is, this high saturation of connectivity is because of norms among younger users, and not just because they’re so much more connected than adults.

So what does this mean for Twitter? If Twitter’s users truly do skew older (and the difference between youngsters “18-24” and oldsters “24-35” was not significant in Pew’s study), then Twitter benefits from what I think of as an identity-participation shift. My basic theory argues that as social norms and personal networks reward non-deceptive identities, people are more likely to share and participate in online communities. Put another way, as it becomes more OK to share (it stops being weird to use your real name on your Facebook profile), and more of your friends do it, you’re more likely to extend this type of participation to other parts of the web. Notably, the driving force of this theory is simple connectivity, which establishes the preconditions for the social shifts. For Twitter, there is a whole new old generation of web users coming online and embracing social software – because it is now socially OK to do so, because they have the connectivity and connections they need to feel that sharing is worthwhile, etc. And it just so happens that a lot of these people seem to have found Twitter.

The core problem here is that we’re treating older users as second-class citizens on the social web. I think that Twitter and Facebook are going to serve as very useful testbeds to bat down this stereotype. In fact, I think we may see the older user emerge as the truly first-class citizen on the social web. As these users tend to be more settled, and are going through fewer transitions that lead to upheaval of their personal social networks, they may be longer-term users, less prone to “delete and move on” from one social site to the next. Of course, these ideas need to be tested, and I’m right now embarking on a long-term project to explore questions like these. If you are an older user of social software and might like to participate in my research interviews, keep watching this space for announcements.

[1] Jones, S. and Fox, S.  (January 28, 2009).  Generations Online in 2009.  Pew Internet and American Life Project.  Retrieved January 28, 2009 from http://www.pewinternet.org/PPF/r/275/source/rss/report_display.asp.