On Monday, the Harvard Business School posted a “conversation starter” study on gender differences in Twitter use. The authors found that “men have 15% more followers than women” and “an average man is almost twice as likely to follow another man than a woman.” The authors suggest, without empirical data, that men find the content produced by women less compelling “because of a lack of photo sharing.” Is everyone else offended by this base characterization?
As it happens, the study has serious flaws. I’d like to point those out, and then suggest an alternative method for addressing these questions. Let’s start by talking about methods. This study is a survey; using a random sample of 300,000 Twitter users, the authors attempt to draw population-level inferences about “friending” behavior on Twitter.
When conducting a population survey, researchers collect a sample and attempt to use that sample to draw inferences about a population. The difference between the “sampled” population value and the “true” population value is known as survey error. Survey error (MSE) has two components: sampling error and non-sampling error. We are most familiar with sampling error; it is the difference between the “sample” value and the “true” value that is attributable to the sample selection. Non-sampling error comprises all error not attributable to sampling, such as data entry error, instrument error, and so on.
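For readers who like the textbook formulation, the usual decomposition of the mean squared error of an estimate is:

```latex
\mathrm{MSE}(\hat{\theta}) \;=\; \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] \;=\; \mathrm{Var}(\hat{\theta}) \;+\; \mathrm{Bias}(\hat{\theta})^2
```

Sampling error shows up mostly in the variance term, while systematic non-sampling errors (coverage, nonresponse, measurement) show up mostly as bias – which is why a huge n, on its own, can’t save you.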
For the purpose of this analysis, we are going to focus primarily on sampling error. At the study’s sample size of 300,000, sampling error is tiny, even for an effectively infinite population. While we generally associate a large sample size with better-quality data because of this small sampling error, there are two caveats. First, above a certain sample size, say 20,000, there is little marginal gain from adding more sample. The difference between an n of 500 and an n of 1,000 is vast, but the difference between an n of 20,000 and an n of 40,000 is much smaller, because the standard error shrinks with the square root of n.
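To put numbers on those diminishing returns, here is a minimal sketch (assuming simple random sampling and a worst-case proportion of 0.5) of how the 95% margin of error shrinks as n grows:

```python
# Minimal sketch: 95% margin of error for a proportion under simple
# random sampling, at the worst case p = 0.5.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

for n in [500, 1000, 20000, 40000, 300000]:
    print(f"n = {n:>7}: +/- {margin_of_error(n) * 100:.2f} points")

# n =     500: +/- 4.38 points
# n =    1000: +/- 3.10 points
# n =   20000: +/- 0.69 points
# n =   40000: +/- 0.49 points
# n =  300000: +/- 0.18 points
```

Going from 500 to 1,000 buys you more than a full point of precision; going from 20,000 to 40,000 buys you a fifth of a point.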
On paper, a larger n is always better; that brings us to the second caveat. When dealing with very large samples, the confidence intervals used to determine significance become very narrow – meaning even the most minute differences turn up as “significant.” Furthermore, spotting influential data becomes more difficult, as those observations may be numerous enough (i.e., a pattern emerges in the influential data) to shape the distribution itself. As any Twitter user with a public profile knows, there are certainly some “patterns” that emerge in follower behavior.
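A toy illustration of that first point (invented numbers, not the HBS data): with samples in the hundreds of thousands, even a one-point difference between two groups sails past conventional significance thresholds.

```python
# Toy example: the same 51% vs 50% difference is noise at n = 1,000 per
# group, but wildly "significant" at n = 150,000 per group.
import math

def two_prop_z(p1, p2, n1, n2):
    """z statistic for the difference between two independent proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

print(two_prop_z(0.51, 0.50, 1000, 1000))      # ~0.45, nowhere near significant
print(two_prop_z(0.51, 0.50, 150000, 150000))  # ~5.5, p << .001
```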
Let us revisit the purpose of a survey, which is to use a sample to draw inferences about a population with as little total error as possible. The goal is not to achieve significant differences on wild hypotheses; it is to collect good data that represent the population. To achieve this goal, survey designers expend a lot of effort understanding their populations, defining their samples, and working to achieve high data quality (while keeping costs under control).
Let’s say that I wanted to know the 2008 income of everyone over 18 born in my city. So I go down to city hall and ask for the names of everyone who was born in my city before 1991. I then take this very large list, cross-reference it with my magical 2008 tax records, and produce a wonderful study. Can you spot some problems with the data? At first, you might point out that not everyone over 18 born in my city earns an income. Ok, that’s fine – I want to know that. Now here’s the real problem: my city started keeping records in 1830, meaning well over half of the people in my sample are dead, and they report no income. Now I’ve got some highly influential data that actually looks “normal” due to attrition.
Let’s consider what we know about Twitter. If we believe Nielsen, about 60% of people who create Twitter accounts abandon them within a month. And if we believe the fair and balanced news organization Fox News, Twitter has a spam problem (Ok, anyone who has a public profile knows that). What might these trends tell us about our population? First, there will be a large cluster of inactive (attrited) users. Second, there will likely be a large cluster of users who do not follow anyone, or follow a very small number of people (characteristic of attrited users). Finally, since following is non-reciprocal, these attrited users (and active users) likely have their follower numbers inflated by Twitter spammers.
What do the HBS numbers tell us? The authors find that the mean number of tweets per user is 26, but the median is 1 and the 75th percentile is 4 tweets. This indicates a highly non-normal distribution (most likely approximating a bimodal one): a large number of users have 0 or 1 tweets (50% of the sample), and 75% of the sample has at most 4 tweets. This indicates that a large portion of the sample is inactive. (Of course, some of these accounts could be “follower” accounts, i.e., people who do not post but follow, but I would argue these constitute a small portion of the population.) This provides good support for my first point.
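To see how easily summary statistics like these arise from a mostly-inactive user base, here is a toy simulation; the mixture weights are invented by hand to roughly reproduce the reported numbers, not fit to the HBS data.

```python
# Toy simulation (hand-picked mixture, not fit to the HBS data): a
# mostly-abandoned user base plus a small active minority reproduces
# the reported pattern of mean ~26, median ~1, 75th percentile ~4.
import random
import statistics

random.seed(42)

def simulate_tweet_count():
    r = random.random()
    if r < 0.55:                                # abandoned accounts: 0-1 tweets
        return random.randint(0, 1)
    if r < 0.80:                                # briefly-active accounts: 2-4 tweets
        return random.randint(2, 4)
    return int(random.expovariate(1 / 125))     # active minority, long tail

tweets = sorted(simulate_tweet_count() for _ in range(100000))
print("mean:", round(statistics.mean(tweets), 1))
print("median:", statistics.median(tweets))
print("75th percentile:", tweets[int(0.75 * len(tweets))])
```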
My second point, about users who follow no one, is not addressed by the study. The authors do not report the percentage of users who follow nobody, instead presenting an odds ratio that hides the distribution of following counts. I would guess that at least 40% of the sample does not follow another user (or follows only “suggested” users). My third point, that follower counts would be inflated, seems to be upheld, as 80% of the sample has at least one follower. There is likely some spam inflation there, and information about the distribution would tell us a lot.
As we can see, all signs point to low data quality, which casts all of the hypotheses and findings in serious doubt. Just because a sample is large and significance is easily achieved does not mean the data quality is good. Unfortunately, it appears that the Harvard authors have made the error I describe in my income study – yes, they’ve collected a lot of people, but they failed to see who had died. What good is an inference about a population if it is heavily influenced by bad data? Don’t we actually want to know what real users are doing?
Beyond these data quality problems, there is also an issue with the gender classification; the authors rely on a corpus of names to predict the gender of users. Each name assignment is a prediction, so there is an error component associated with each classification. That error must be propagated into the total variance of the estimates – meaning that differences which looked significant may not actually be significant.
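One rough way to fold that classification error into the analysis (a sketch with invented data and an assumed classifier accuracy, not the authors’ method) is to repeatedly perturb the predicted labels at the assumed error rate and watch how much the estimate of interest moves – non-differential misclassification both attenuates observed gender differences and adds uncertainty.

```python
# Sketch: propagate name-classifier error by resampling labels.
# The 90% accuracy figure and the data below are invented for illustration.
import random
import statistics

random.seed(0)
accuracy = 0.90  # assumed (hypothetical) name-classifier accuracy

# hypothetical data: (predicted_gender, follower_count)
users = [("m", random.randint(0, 200)) for _ in range(5000)] + \
        [("f", random.randint(0, 180)) for _ in range(5000)]

def mean_gap(sample):
    m = [fc for g, fc in sample if g == "m"]
    f = [fc for g, fc in sample if g == "f"]
    return statistics.mean(m) - statistics.mean(f)

def perturb(sample):
    # flip each predicted label with probability (1 - accuracy)
    flip = {"m": "f", "f": "m"}
    return [(g if random.random() < accuracy else flip[g], fc) for g, fc in sample]

gaps = [mean_gap(perturb(users)) for _ in range(200)]
print("point estimate of the m-f gap:", round(mean_gap(users), 2))
print("mean gap across perturbed replicates:", round(statistics.mean(gaps), 2))
print("extra spread from classification error:", round(statistics.stdev(gaps), 2))
```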
Since this is a “discussion,” I’d like to propose a method to re-run the study with better data quality (but larger standard errors). The two main problems to address are attrition and gender classification. To deal with Twitter attrition, let us first define it. If we follow Nielsen’s numbers, a Twitter user who has posted at least once more than 30 days ago and at least once within the last 30 days has a decent chance of being an active user. We may want to make this criterion more lenient – perhaps requiring just one post in the last 30 days. Either way, we must define a criterion for deciding who counts as an active user (and this definition must be informed by data and theory).
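As a sketch, the two screens might look something like this (assuming we already have each account’s tweet timestamps; how they are fetched is left out):

```python
# Minimal sketch of the two activity screens described above.
from datetime import datetime, timedelta

def is_active_strict(tweet_times, now, window_days=30):
    """At least one tweet more than `window_days` ago AND at least one within them."""
    window = timedelta(days=window_days)
    return (any(now - t > window for t in tweet_times)
            and any(now - t <= window for t in tweet_times))

def is_active_lenient(tweet_times, now, window_days=30):
    """The looser rule: at least one tweet within the last `window_days`."""
    return any(now - t <= timedelta(days=window_days) for t in tweet_times)

# toy check
now = datetime(2009, 6, 15)
old_and_recent = [datetime(2009, 1, 10), datetime(2009, 6, 10)]
only_old = [datetime(2009, 1, 10)]
print(is_active_strict(old_and_recent, now), is_active_strict(only_old, now))  # True False
```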
The problem with gender is a little more difficult. I don’t spend a lot of time in the TREC community, so I’m not sure how good automated techniques are; I’m going to propose human classification instead. The most efficient way to do this is with Mechanical Turk. Turkers could be shown a profile and asked to judge the gender of the profile owner; you’d repeat with a different rater to get an estimate of reliability. Your guess is as good as mine about agreement – I’m generally skeptical of ethnicity ratings by third parties, but I tend to think that gender can be reasonably assessed. Update: @yardi brings up a good point regarding brand/persona/promotional/shared accounts. My (too simple) answer is exclusion. If we’re truly interested in this gender question, then non-gendered accounts fit an a priori exclusion criterion. My gut instinct is that in a population sample we would see a low incidence of these accounts, and they could be analyzed separately to see how they would affect our data.
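For the reliability piece, a minimal sketch: have two Turkers rate the same profiles, then report percent agreement alongside Cohen’s kappa (the labels below are invented for illustration).

```python
# Minimal sketch: inter-rater agreement for double-coded gender labels.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["f", "f", "m", "m", "f", "m", "f", "m", "f", "f"]  # rater 1 (toy data)
b = ["f", "m", "m", "m", "f", "m", "f", "f", "f", "f"]  # rater 2 (toy data)
print("percent agreement:", sum(x == y for x, y in zip(a, b)) / len(a))
print("kappa:", round(cohens_kappa(a, b), 2))
```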
So the study would be simple – collect a first-stage sample of profiles and assess whether they meet the activity criterion (this can be done automatically). Then draw a second-stage random sample of the eligibles and send them to Mechanical Turk. You could send 3,000 profiles to MTurk and have them assessed, with a goal of ending up with 2,400 profiles, giving you +/-2% at p<.05. Of course, all of the “friend lists” would also have to be gender-coded, so with an average of 10 friends per profile you’re looking at 24,000 extra codings (minus overlap). If we account for overlap and say we’ll have 25,000 unique profiles, each rated twice, at $0.01 a HIT we’re looking at a total price of $500.00. Of course, if we scale the sample back we can reduce this cost substantially.
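A quick back-of-the-envelope check of those numbers (assuming simple random sampling and a worst-case p = 0.5):

```python
# Sanity check on the sample size and MTurk cost figures above.
import math

n_profiles = 2400
moe = 1.96 * math.sqrt(0.25 / n_profiles)
print(f"margin of error: +/-{moe * 100:.1f} points")   # ~ +/-2.0

unique_profiles = 25000       # sampled profiles plus their friends, de-duplicated
ratings_per_profile = 2
price_per_hit = 0.01
print(f"total cost: ${unique_profiles * ratings_per_profile * price_per_hit:.2f}")
# -> $500.00
```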
There are a couple of open questions. First, we can’t really say how much better humans will perform at gender-coding until we run a comparison against the machine-coded results. My gut is that humans will perform at a higher level of accuracy, but there is still a variance component associated with the classification. We also don’t know what kind of bias we introduce by cutting out “follower” profiles. I don’t know how many of these profiles would show up in a population survey, but it is an open question. And what about the findings – how would they change? My gut is that a lot of these “stunning” findings would go away, and we’d see greater gender homogeneity in “following” behavior. “Follower” behavior would still be influenced by spam, so it might be useful to assign a spam attribute to profiles to use as a covariate (you could have Turkers code it, run profiles through SpamAssassin to get a naive score, or simply use standard techniques to find influential data).
The important takeaways from this discussion are that “bigger” is not always better with social data, that data should be examined critically (using existing information and theory) before analysis is run, and that analyses whose findings wildly contravene existing research should always be re-run to produce robust estimates.
Final note: The authors state that “On a typical online social network, most of the activity is focused around women – men follow content produced by women they do and do not know, and women follow content produced by women they know.” Mike Thelwall’s (2008) large-scale analysis of MySpace friending behavior found that while females tend to friend females, there was no significant gender effect for males. In Mayer and Puller’s (2008) analysis of Facebook, same gender was a significant predictor of friendship (in a potentially overfitted model). Overall, studies commonly find gender differences in SNS/internet use; females are generally found to use communicative tools with greater intensity (e.g., Joinson, 2008; Lenhart & Madden, 2007; Jones et al., 2009).
Joinson, A. N. (2008). Looking at, looking up or keeping up with people? Motives and use of Facebook. In CHI ’08: Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems (pp. 1027-1036). New York, NY: ACM.
Jones, S., Johnson-Yale, C., Millermaier, S., and Perez, F. S. (2009). U.S. College Students’ Internet Use: Race, Gender and Digital Divides. Journal of Computer-Mediated Communication, 14(2), 244-264.
Lenhart, A. and Madden, M. (April 18, 2007). Teens, Privacy and Online Social Networks: How teens manage their online identities and personal information in the age of MySpace. Pew Internet and American Life Project. Retrieved March 9, 2008 from http://www.pewinternet.org/PPF/r/211/report_display.asp.
Mayer, A. and Puller, S. L. (2008). The old boy (and girl) network: Social network formation on university campuses. Journal of Public Economics, 92(1-2), 329-347.
Thelwall, M. (2008). Social networks, gender and friending: An analysis of MySpace member profiles. Journal of the American Society for Information Science and Technology, 59(8), 1321-1330.