A few days ago, I stumbled upon a Business Insider post that asked “Why Is Twitter More Popular With Black People Than White People?” Drawing on data from Edison Research, the writer proposed a number of explanations for why “black people represent 25% of Twitter users, roughly twice their share of the population in general.” This factoid has since been repeated by the New York Times, the San Francisco Chronicle, and The Atlantic, as well as a number of prominent blogs. It’s also going viral in the Twittersphere.
I’m loath to trust bloggers to get survey data right, so I requested a copy of the report from Edison Research (available here). At first glance, the data looks good – the research was conducted by Arbitron, it employs a landline/mobile random digit dialing (RDD) frame, and about 1,750 people age 12 and older were interviewed. “National probability” studies of this sort are generally considered valid for population estimates.
Without getting into too much detail, a study’s validity depends on the sampling method and the sample size (among many other things). In terms of method, RDD is not a true equal-probability-of-selection method, but both industry and academia consider it “good enough” when the sample is weighted to known totals. As for size, a sample of 1,750 people allows us to make claims about a large population with a margin of error of roughly plus or minus 2 to 3 percent.
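As a sanity check, the conservative margin of error for a simple random sample can be computed directly. This is only a sketch – it uses the worst-case p = 0.5 and ignores Edison’s weighting, which would widen the margin somewhat:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Conservative margin of error for a simple random sample.

    p = 0.5 maximizes p*(1-p), giving the worst-case margin;
    z = 1.96 corresponds to a 95% confidence level (alpha = .05).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Full sample of 1,750 respondents:
print(round(margin_of_error(1750), 3))  # ~0.023, i.e. about +/- 2.3 points
```

So the full sample is, if anything, a touch better than the familiar plus-or-minus-3 rule of thumb.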
Let’s cut to the chase: Where did the Edison Research interpretation go wrong? In the report, Tom Webster states:
The percentage of Twitter users who are African-American currently stands at roughly 25%, which is approximately double the percentage of African-Americans in the current U.S. population. Indeed, many of the “trending topics” on Twitter on a typical day are reflective of African-American culture, memes and topics.
From this, we are to believe that 25% of all Twitter users are African-American. This is surprising not only given current population estimates, but also because Twitter is a global service. Let’s explore how Edison got to this 25 percent number (conveniently rounded up from 24 percent).
In the phone interview, Edison asked all respondents 12+ (n=1,750) whether they “currently ever use[d] Twitter.” 7% of respondents said yes – approximately 123 people. Of those 123, Edison then asked how often they used Twitter. 85% of those respondents (105 people) indicated they used Twitter at least once a month and were thus recoded as “Monthly Twitter Users.” Herein lies the problem: it was on these 105 individuals (not the 1,750 total respondents) that Edison based its estimates of Twitter use.
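The shrinkage here is just arithmetic, but it’s worth making explicit (the percentages are the report’s; fractional people get rounded to whole respondents):

```python
total = 1750                    # all respondents, age 12+
twitter_ever = 0.07 * total     # 7% report using Twitter -> ~122.5, i.e. ~123 people
monthly = 0.85 * twitter_ever   # 85% of those use it at least monthly -> ~104, i.e. ~105
print(twitter_ever, monthly)
```

Every estimate about “Twitter users” in the report rests on that final handful of respondents.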
Let’s return to sampling error. Because the margin of error depends almost entirely on the sample size rather than the population size, a sample of 1,750 can speak to a population of hundreds of millions almost as well as a sample of 2,000, 3,000, or even 5,000. But a sample of 105 people speaking to the very large user base (self-reported at 100 million) of Twitter? Not so much. The margins of error are approximately +/- 10% at an alpha of .05, and +/- 12.5% at an alpha of .01. And these margins assume true equal probability of selection and no nonresponse bias. With weighting for proportionality, it is almost certain these margins increase substantially (1).
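Those two margins fall straight out of the conservative (p = 0.5) standard-error formula – again a simplification, since it ignores the weighting that would inflate them further:

```python
import math

n = 105                                    # monthly Twitter users in the sample
se = math.sqrt(0.5 * 0.5 / n)              # conservative standard error (p = 0.5)
moe_95 = 1.96 * se                         # z for alpha = .05
moe_99 = 2.576 * se                        # z for alpha = .01
print(round(moe_95, 3), round(moe_99, 3))  # ~0.096 and ~0.126
```

In other words, roughly the +/- 10% and +/- 12.5% figures above.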
Let’s explore what this means practically. First, Edison Research can’t speak to all Twitter users, because not all Twitter users could have been included in the sample. Edison can, however, speak to Twitter use in the USA, from its sample of 105 monthly users. Even if we assume that only 5 million Twitter users in the USA use the service every month, Edison is still using 105 people to speak for those 5 million (the margins of error don’t change). Unfortunately, that is highly unreliable.
The American Community Survey finds that approximately 13.1% of the US population self-identifies as Black or African American. At an alpha of .05, the confidence interval around the estimate of African-American Twitter use in the US runs from roughly 14% to 34%. At an alpha of .01, it runs from roughly 11% to 37%, which means we fail to reject the null hypothesis that the difference from the population share is simply sampling error. If we then account for weighting in our estimates of error (likely necessary, because Edison’s sample over-represents people under 24), the added error causes us to fail to reject the null hypothesis at the .05 level as well. We just can’t trust that the demographics of Twitter actually do vary from current population estimates.
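The hypothesis-test logic can be sketched directly: build the conservative intervals around the 24% point estimate and check whether the Census figure of 13.1% falls inside them. (This again ignores Edison’s weighting, which would widen both intervals.)

```python
import math

def conf_interval(p_hat, n, z):
    # Conservative interval using p = 0.5 for the standard error,
    # mirroring the worst-case margins used in the text.
    moe = z * math.sqrt(0.25 / n)
    return p_hat - moe, p_hat + moe

census_black_share = 0.131                        # ACS estimate
lo95, hi95 = conf_interval(0.24, 105, z=1.96)     # alpha = .05
lo99, hi99 = conf_interval(0.24, 105, z=2.576)    # alpha = .01

print(round(lo95, 2), round(hi95, 2))             # ~0.14 to ~0.34
print(round(lo99, 2), round(hi99, 2))             # ~0.11 to ~0.37
print(lo95 <= census_black_share <= hi95)         # False: outside the 95% interval
print(lo99 <= census_black_share <= hi99)         # True: inside the 99% interval
```

So before weighting is even considered, the “twice the population share” claim already fails at the stricter .01 level.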
Is Twitter “disproportionately” African American, White, Hispanic, or Green? The simple fact is that from this data, we can’t say so with confidence. If Edison had been a little more forthcoming with its sample sizes, the bloggers and journalists who reported these data might have sensed something was wrong. But I wouldn’t bank on it, because it seems Edison Research was pushing this spin from the get-go.
A final note: as I was researching this piece, it was interesting to see the “spin” being placed on this “fact” around the blogosphere. Of course, you had your standard racist comments/tweets of the “there goes the neighborhood” variety, but there also appeared to be a large swath of users heralding this as a point of pride. Before you probe my subconscious racist motives for examining this question, please just know I like getting surveys right. And if Edison wanted to get this right, they could start by giving us a topline cross-tab of ethnicity, Twitter use, and the respective margins of error.
Ugh, footnotes on a blog!
1. Research consistently demonstrates a negative correlation between age and response rates: young people are less likely to respond, so the young respondents who do answer get weighted up in population estimates (which inflates their margins of error). Research is mixed on the relationship between ethnicity and nonresponse.