

3
May 10

On Twitter and Ethnicity

A few days ago, I stumbled upon a post from the blog Business Insider that asked “Why Is Twitter More Popular With Black People Than White People?” Drawing on data from Edison Research, the writer proposed a number of explanations for why “black people represent 25% of Twitter users, roughly twice their share of the population in general.”  This factoid has now been reported by the New York Times, the San Francisco Chronicle, The Atlantic, as well as a number of prominent blogs.  It’s also going viral in the Twittersphere.

I’m loath to trust bloggers to get survey data right, so I requested a copy of the report from Edison Research (available here).  At first glance, the data look good: the research was conducted by Arbitron and employs a landline/mobile random digit dialing (RDD) frame, with about 1,750 people age 12 and older interviewed.  “National probability” studies of this sort are generally considered valid for population estimates.

Without getting into too much detail, a study’s validity depends on the sampling method and sample size (among many other things).  In terms of method, RDD is not a true equal-probability-of-selection method, but both industry and academia consider it “good enough” when the sample is weighted to known totals.  As for size, a sample of 1750 people allows us to make claims about a large population with a margin of error of roughly plus or minus 2 to 3 percentage points.
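For readers who haven't seen "weighted to known totals" in practice, here is a minimal sketch of post-stratification weighting. The age groups, population shares, and sample counts are made up for illustration; they are not from the Edison report, and real weighting schemes usually rake over several variables at once.

```python
# Hypothetical figures for illustration only (not from the Edison report).
population_share = {"12-24": 0.20, "25-54": 0.50, "55+": 0.30}  # known totals
sample_counts = {"12-24": 150, "25-54": 500, "55+": 350}        # who answered

n = sum(sample_counts.values())

# Each respondent in a group is weighted by (population share) / (sample share).
weights = {
    group: population_share[group] / (sample_counts[group] / n)
    for group in sample_counts
}

# After weighting, each group's share of the sample matches the population.
weighted_share = {
    group: weights[group] * sample_counts[group] / n
    for group in sample_counts
}

for group in sample_counts:
    print(f"{group:>6}: weight {weights[group]:.2f}, "
          f"weighted share {weighted_share[group]:.2f}")
```

Groups that are under-represented among respondents get weights above 1, and over-represented groups get weights below 1. The price is extra variance, which matters a great deal once the subgroups get small.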

Let’s cut to the chase: Where did the Edison Research interpretation go wrong?  In the report, Tom Webster states:

The percentage of Twitter users who are African-American currently stands at roughly 25%, which is approximately double the percentage of African-Americans in the current U.S. population. Indeed, many of the “trending topics” on Twitter on a typical day are reflective of African-American culture, memes and topics.

From this, we are to believe that of all Twitter users, 25% are African-American.  This is surprising not only given current population estimates, but also because Twitter is a global service.  Let’s explore how Edison got to this 25 percent number (conveniently rounded up from 24 percent).

In the phone interview, Edison asked all respondents 12+ (n=1750) if they “currently ever use[d] Twitter.”  7% of respondents said yes, approximately 123 people.  Edison then asked those 123 how often they used Twitter.  85% of them (105 people) indicated they used Twitter at least once a month, and were thus recoded as “Monthly Twitter Users.”  Herein lies the problem: It was on these 105 individuals (not the 1750 total respondents) that Edison based its estimates of Twitter use.
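The funnel is easy to check with a few lines of arithmetic. This sketch uses the counts quoted above; the last line is my own back-of-envelope inference, and it ignores weighting, so the true unweighted count in Edison's file may differ.

```python
total_respondents = 1750   # everyone interviewed, age 12+
ever_users = 123           # said they "currently ever use" Twitter
monthly_users = 105        # use Twitter at least once a month
reported_share = 0.24      # share of monthly users reported as African-American

share_ever = ever_users / total_respondents
share_monthly_of_ever = monthly_users / ever_users
share_monthly_of_all = monthly_users / total_respondents

# Unweighted approximation: how many actual respondents sit behind "24%"?
implied_respondents = reported_share * monthly_users

print(f"Ever use Twitter:         {share_ever:.1%} of the sample")
print(f"Monthly users:            {share_monthly_of_ever:.1%} of those")
print(f"Monthly users overall:    {share_monthly_of_all:.1%} of the sample")
print(f"Respondents behind '24%': about {implied_respondents:.0f}")
```

In other words, the headline figure rests on roughly two dozen people, give or take whatever the weights did.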

Let’s return to sampling error.  Because sampling error shrinks with the square root of the sample size, a sample of 1750 can speak to a population of hundreds of millions almost as well as a sample of 2000, 3000, or even 5000.  But a sample of 105 people speaking to the very large user base (self-reported at 100 million) of Twitter? Not so precise.  The margins of error are approximately +/- 10% at an alpha of .05 and +/- 12.5% at an alpha of .01.  And these margins assume true equal probability of selection and no nonresponse bias.  With weighting for proportionality, it is almost certain these margins increase substantially (1).
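If you want to reproduce these margins, here is a small sketch using the textbook normal approximation for a proportion. It assumes simple random sampling and the worst case of p = 0.5, which is my assumption about how the figures above were derived; it says nothing about weighting or nonresponse.

```python
import math
from statistics import NormalDist

def margin_of_error(n, alpha, p=0.5):
    """Half-width of a normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return z * math.sqrt(p * (1 - p) / n)

sizes = [105, 1750, 2000, 3000, 5000]
table = {n: (margin_of_error(n, 0.05), margin_of_error(n, 0.01)) for n in sizes}

for n, (m95, m99) in table.items():
    print(f"n={n:>5}: +/- {m95 * 100:4.1f} points (alpha=.05)   "
          f"+/- {m99 * 100:4.1f} points (alpha=.01)")
```

Going from 1750 to 5000 respondents buys less than one point of precision, while dropping to 105 costs more than seven.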

Let’s explore what this means practically.  First, Edison Research can’t speak to all Twitter users, because not all Twitter users could have been included in the sample.  Edison can, however, speak to USA Twitter use, from its sample of 105 monthly users.  If we assume that only 5 million Twitter users in the USA use the service every month, Edison is still using 105 people to speak about these 5 million people (the margins of error don’t change).  Unfortunately, this is highly unreliable.

The American Community Survey finds that approximately 13.1% of the US population self-identifies as Black or African American.  At an alpha of .05, the range of plausible values for the African-American share of US Twitter users runs from about 14% to 34%.  At an alpha of .01, it runs from about 11% to 37%; because that range includes 13.1%, we fail to reject the null hypothesis that the gap between the estimate and the population share is attributable to sampling error or random effects.  If we then include weights in our estimates of error (likely necessary, because Edison’s sample over-represents people under 24), the growth in error causes us to fail to reject the null hypothesis at the .05 level as well.  We just can’t trust that the demographics of Twitter actually do vary from current population estimates.
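Here is a rough sketch of that comparison. It uses the same worst-case (p = 0.5) margins as above, and the design effect of 1.5 is purely an assumed value to show how weighting widens the interval; I don't know the actual design effect in Edison's data.

```python
import math
from statistics import NormalDist

def interval(estimate, n, alpha, deff=1.0):
    """Conservative (p = 0.5) interval, inflated by an assumed design effect."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(deff * 0.25 / n)
    return estimate - half_width, estimate + half_width

estimate = 0.24    # Edison's unrounded figure
n = 105            # monthly Twitter users in the sample
benchmark = 0.131  # ACS population share

# deff=1.5 is a hypothetical design effect, not a figure from the report.
for alpha, deff in [(0.05, 1.0), (0.01, 1.0), (0.05, 1.5)]:
    low, high = interval(estimate, n, alpha, deff)
    verdict = "cannot rule out" if low <= benchmark <= high else "rules out"
    print(f"alpha={alpha}, deff={deff}: {low:.1%} to {high:.1%} "
          f"{verdict} the 13.1% benchmark")
```

Only the unweighted interval at the .05 level excludes the population share, and even a modest design effect erases that.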

Is Twitter “disproportionately” African American, White, Hispanic, or Green?  The simple fact is that from this data, we can’t say so with confidence.  If Edison had been a little more forthcoming with their sample sizes, the bloggers and journalists who reported these data might have sensed something was wrong.  But I wouldn’t bank on it, because it seems like Edison Research was pushing this spin from the get-go.

A final note: as I was researching/considering this piece, it was interesting to see the “spin” being placed on this “fact” around the blogosphere.  Of course, you had your standard racist comments/tweets of the “there goes the neighborhood” variety, but there also appeared to be a large swath of users who were heralding this as a point of pride.  Before you examine my subconscious racist motives for examining this question, please just know I like getting surveys right.  And if Edison wanted to get this right, they could start by giving us a topline cross-tab of ethnicity, Twitter use, and the respective margins of error.

Ugh, footnotes on a blog!

1. Research consistently finds a negative relationship between age and nonresponse: younger people are less likely to respond, so the young respondents who do take part are more likely to be weighted up to match the population (which increases their margins of error).  Research is mixed on the relationship between ethnicity and nonresponse.


1
Sep 09

The trouble with Internet surveys

Gary Langer, the director of polling at ABC News, shares the bad news regarding Internet surveys.

In the most extensive such analysis to date, David Yeager and Prof. Jon Krosnick compared seven non-random internet surveys with two others based instead on random or so-called probability samples. The non-probability internet surveys were less accurate, and customary adjustments did not uniformly improve them.

While the random-sample surveys were “consistently highly accurate,” the internet surveys based on self-selected or “opt-in” panels “were always less accurate, on average, than probability sample surveys, and were less consistent in their level of accuracy,” the researchers said. Further, they said, adjusting these samples to known population values had no effect on accuracy (and in one case even worsened it) as often as that process, known as weighting, improved it.

Also noteworthy:

While this paper is the first to evaluate the subject in such detail, intimations of these problems were posted in a blog item this summer by Reg Baker, COO of the research firm Market Strategies International. Estimates of smoking prevalence were similar in three probability samples, he reported, but less similar – with variation of as many as 14 points – in 17 opt-in online panels. In such panels, he said, “the results we get for any given study are highly dependent (and mostly unpredictable) on the panel we use. This is not good news.”

Yeager and Krosnick, meanwhile, provide one more eye-opener: The average highest weight for any one respondent across the opt-in online samples was 30 – one respondent, that is, standing for the equivalent of 30 in the full dataset. (And one went as high as 70.) The highest weights in the two probability samples, by contrast, were 5 and 8.

Nothing new or groundbreaking here, and yes, a little inside baseball, but relevant in light of all of these web surveys showing that “Teens don’t tweet.”  Convenience-sampled web surveys can’t offer meaningful standard errors, and the weighting process used in their place is highly susceptible to inflation in areas where data are sparse.  This sparseness commonly occurs when studying the behavior of a low-response population such as young people, and is compounded when studying an early-adoption phenomenon like tweeting.
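To see why a single weight of 30 is alarming, here is a sketch using Kish's approximation for effective sample size. The weight vectors are invented for illustration, and this is my own back-of-envelope check, not the method Yeager and Krosnick used.

```python
def kish_effective_n(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

# Hypothetical samples of 1000 respondents each.
even = [1.0] * 1000             # everyone counts equally
skewed = [1.0] * 999 + [30.0]   # one respondent stands in for 30

print(f"Equal weights:      effective n = {kish_effective_n(even):.0f}")
print(f"One weight of 30:   effective n = {kish_effective_n(skewed):.0f}")
```

Under these assumptions, one heavily weighted respondent in a thousand cuts the effective sample almost in half, and the margins of error grow accordingly.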

Langer’s blog is a worthwhile resource if you’re interested in survey methods.  And I hope to resume blogging – updating my syllabus, posting some recent papers, etc. – when I get a spare moment.

via Study Finds Trouble for Internet Surveys – The Numbers.