Noticed


24
May 10

Social Network Analysis in R

Last week, I had the pleasure of attending the 2010 Political Networks Conference.  The first day of the conference included workshop sessions led by Matthew Jackson and Carter Butts, two eminent networks researchers.  Both sessions are now online.

The lecture by Carter Butts will be of particular interest to individuals looking to use R for social network analysis.  Butts is the author of a number of network analysis packages for R (many of which come bundled in the amazing statnet package).

Network Analysis with statnet for Individual, Organizational, and International Relations Applications by Carter Butts, University of California-Irvine

Advanced Network Analysis by Matthew O. Jackson, Stanford University

If you find these materials useful, you might also wish to check out Steve Goodreau and David Hunter’s tutorial Advanced Social Network Analysis Using R and statnet available at the Complexity and Social Networks blog.


6
May 10

What News Organizations Share With Facebook

Last month, Facebook announced a number of new features, including “personalization” (which generated significant controversy) and “social plugins.”  The plugins are described as follows:

Social plugins let you see what your friends have liked, commented on or shared on sites across the web. All social plugins are extensions of Facebook and are specifically designed so none of your data is shared with the sites on which they appear.

According to Mashable, over 50,000 plugins have been installed since the rollout.  Seeing one’s Facebook friends suddenly start showing up on third party sites has raised privacy concerns, which Facebook quickly addressed in a blog post, stating “Because [third party sites] have given Facebook this “real estate” on their sites, they do not receive or interact with the information that is contained or transmitted there.”

Here’s the rub.  By giving “real estate” to Facebook, third party sites have created a one-way mirror, allowing Facebook to peer in on what we’re doing.  If you’re logged in to Facebook, and you visit a third party page with a social plugin, Facebook knows where you’ve been.  The mechanism is simple – cookies and referrals – and it will allow Facebook to create personalized behavioral profiles that, combined with the information we articulate in Facebook, will be tremendously valuable.

To explore the privacy implications of Facebook’s social plugins, I visited the websites of the top 15 U.S. online news destinations (based on some 2009 Nielsen data), and a few honorable mentions.  I then selected a news story from the front page, and loaded the page.  I checked to see if social plugins were enabled, if the Facebook cookie was called, and if the referring page was sent to Facebook (basically, did the site identify you to Facebook, and share the page you were on).
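A rough way to automate the first of these checks (is a social plugin present on the page?) is sketched below.  It is only an illustration: the helper name and the choice to look at iframe and script tags served from facebook.com are assumptions, and it says nothing about whether the cookie or the referring page was actually sent, which requires watching the browser's requests.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: an <iframe> or <script> served from this domain counts as a plugin embed.
FACEBOOK_DOMAIN = "facebook.com"

class EmbedFinder(HTMLParser):
    """Collect the src URLs of iframes and scripts that point at Facebook."""

    def __init__(self):
        super().__init__()
        self.facebook_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("iframe", "script"):
            return
        src = dict(attrs).get("src") or ""
        host = urlparse(src).hostname or ""
        # Match the domain itself or any subdomain, but not look-alikes.
        if host == FACEBOOK_DOMAIN or host.endswith("." + FACEBOOK_DOMAIN):
            self.facebook_srcs.append(src)

def find_facebook_embeds(html):
    """Return the Facebook-hosted embed URLs found in an HTML string."""
    finder = EmbedFinder()
    finder.feed(html)
    return finder.facebook_srcs

# Hypothetical page fragment, not taken from any of the sites I visited.
sample = (
    '<html><body>'
    '<iframe src="http://www.facebook.com/plugins/like.php?href=http://example.com/story"></iframe>'
    '<script src="/js/site.js"></script>'
    '</body></html>'
)
print(find_facebook_embeds(sample))
```

A page that returns a non-empty list is a candidate for the cookie and referrer checks; an empty list only means no embed was found in the static HTML, since plugins can also be injected by script after the page loads.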

I found that of the top 15 online news destinations, 9 were sharing information with Facebook (MSNBC, CNN, CBS, ABC, Fox News, Washington Post, and the Tribune, McClatchy and Gannett Companies[1]).  Notably, The New York Times, BBC, Yahoo News, AOL News, and Google News did not share information.  I then checked a few favorites of mine: NPR (yes), Drudge (no), Huffington Post (yes), and Politico (no).  I’ve included all of the details in a spreadsheet, embedded below (also available as an HTML version).

According to Nielsen, the 9 news organizations sharing information with Facebook account for over 177,161,000 monthly unique visitors.  Granted, not all of these views will go to social plugin enabled pages, and not all visitors will be logged-in Facebook users.  But with 400 million users, it is safe to assume that a substantial proportion of that information will go to Facebook.  If you stay logged in to Facebook, it is increasingly likely that Facebook will know what news you read.

My beef here isn’t necessarily with Facebook; Google and other behavioral-targeting firms have very similar SOPs.  Rather, I’m uncomfortable that so many news organizations felt comfortable sharing the news-reading behaviors of their customers who just so happen to be logged in to Facebook.  And really, what do they get for trading this tremendously valuable asset?  I get to see that a random friend liked an article?

I think it is time that someone wrote a Firefox plugin that specifically manages the Facebook cookie, only allowing it to be accessed when someone is on Facebook proper.  Clearly, we can’t trust third parties – even reputable news organizations – to protect our data.  Here’s the spreadsheet from my analysis:

Note: For media conglomerates (Tribune, McClatchy, Gannett) I visited the flagship outlet (Chicago Trib, Sac Bee, and USA Today, respectively).
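The cookie rule I have in mind for such a plugin is simple enough to sketch.  What follows is Python rather than actual Firefox extension code, and the function names are hypothetical; it only illustrates the decision, assuming the plugin can see both the URL being requested and the URL of the page in the address bar.

```python
from urllib.parse import urlparse

def is_facebook_host(url):
    """True if the URL's host is facebook.com or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "facebook.com" or host.endswith(".facebook.com")

def block_facebook_cookie(request_url, page_url):
    """Withhold the Facebook cookie when a non-Facebook page calls out to Facebook."""
    return is_facebook_host(request_url) and not is_facebook_host(page_url)

# A like button embedded in a news story: the cookie should be withheld.
print(block_facebook_cookie("http://www.facebook.com/plugins/like.php",
                            "http://www.cnn.com/story"))
# Browsing Facebook proper: the cookie is sent as usual.
print(block_facebook_cookie("http://www.facebook.com/home.php",
                            "http://www.facebook.com/"))
```

Under this rule the plugin still loads, but Facebook can't tie the request to a logged-in account.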


3
May 10

On Twitter and Ethnicity

A few days ago, I stumbled upon a post from the blog Business Insider that asked “Why Is Twitter More Popular With Black People Than White People?” Drawing on data from Edison Research, the writer proposed a number of explanations for why “black people represent 25% of Twitter users, roughly twice their share of the population in general.”  This factoid has now been reported by the New York Times, the San Francisco Chronicle, The Atlantic, as well as a number of prominent blogs.  It’s also going viral in the Twittersphere.

I’m loath to trust bloggers to get survey data right, so I requested a copy of the report from Edison Research (available here).  At first glance, the data looks good – the research was conducted by Arbitron, it employs a landline/mobile random digit dialing (RDD) frame, with about 1,750 people age 12 and older interviewed.  “National probability” studies of this sort are generally considered valid for population estimates.

Without getting into too much detail, a study’s validity is dependent on the sampling method and sample size (among many other things).  In terms of method, RDD is not a true equal-probability of selection method, but both industry and academia consider it “good enough” when the sample is weighted to known totals.  As for size, a sample of 1750 people allows us to make claims about a large population at an error rate of about plus or minus 3 percent.

Let’s cut to the chase: Where did the Edison Research interpretation go wrong?  In the report, Tom Webster states:

The percentage of Twitter users who are African-American currently stands at roughly 25%, which is approximately double the percentage of African-Americans in the current U.S. population. Indeed, many of the “trending topics” on Twitter on a typical day are reflective of African-American culture, memes and topics.

From this, we are to believe that of all Twitter users, 25% are African-American.  Not only is this surprising considering current population estimates, but also because Twitter is a global service.  Let’s explore how Edison got to this 25 percent number (conveniently rounded up from 24 percent).

In the phone interview, Edison asked all respondents 12+ (n=1750) if they “currently ever use[d] Twitter.”  7% of respondents said yes, approximately 123 people.  Of those 123, Edison then asked how often they used Twitter.  85% of those respondents (105 people) indicated they used Twitter at least once a month, and were thus recoded as “Monthly Twitter Users.”  Herein lies the problem: It was from these 105 individuals (not the 1750 total respondents) that Edison based its estimates of Twitter use.

Let’s return to sampling error.  Because sampling error shrinks only with the square root of the sample size (and barely depends on the size of the population), a sample of 1750 can speak to a population of hundreds of millions almost as well as a sample of 2000, 3000, or even 5000.  But a sample of 105 people speaking to the very large userbase (self reported at 100 million) of Twitter? Not so efficient.  The margins of error are approximately +/- 10% at an alpha of .05, +/- 12.5% at an alpha of .01.  And these margins assume true equal probability of selection, and no nonresponse bias.  With weighting for proportionality, it is almost certain these margins increase substantially (1).
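To make the arithmetic concrete, here is a minimal sketch using the textbook worst-case margin of error for a proportion, z * sqrt(0.25 / n).  It assumes simple random sampling with no weighting, which understates the real error, and the function name is just illustrative.

```python
import math

def worst_case_margin(n, z=1.96):
    """Margin of error for a proportion at p = 0.5 under simple random sampling."""
    return z * math.sqrt(0.25 / n)

# Subgroup sizes rebuilt from the percentages reported above.
respondents = 1750
ever_use = math.ceil(respondents * 7 / 100)  # 7% said yes, rounded up to whole people
monthly = round(ever_use * 85 / 100)         # 85% of those use Twitter at least monthly

for n in (5000, 3000, 2000, respondents, monthly):
    m95 = 100 * worst_case_margin(n)           # alpha = .05
    m99 = 100 * worst_case_margin(n, z=2.576)  # alpha = .01
    print(f"n={n:>4}: +/- {m95:.1f} points (alpha .05), +/- {m99:.1f} points (alpha .01)")
```

Going from 1750 to 5000 respondents buys about one point of precision at the .05 level; dropping to 105 costs more than seven.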

Let’s explore what this means practically.  First, Edison Research can’t speak to all Twitter users, because all Twitter users weren’t potentially included in the sample.  Edison can, however, speak to USA Twitter use, from its sample of 105 monthly users.  If we assume that only 5 million Twitter users in the USA use the service every month, Edison is still using 105 people to speak about these 5 million people (the margins of error don’t change).  Unfortunately, this is highly unreliable.

The American Community Survey finds that approximately 13.1% of the US population self identifies as Black or African American.  At an alpha of .05, the range of potentially true estimates of African-American Twitter use in the US is actually anywhere from 14% to 34%.  At an alpha of .01, this estimate ranges anywhere from 11% to almost 38%, a range that includes the 13.1% population figure, so we cannot rule out that the difference is attributable to sampling error or random effects.  If we then include weights in our estimates of error (likely the case because Edison’s sample over-represents people under 24), the growth in error causes us to fail to reject the null hypothesis at the .05 level as well.  We just can’t trust that the demographics of Twitter actually do vary from current population estimates.
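The comparison can be sketched in a few lines.  This uses the same worst-case margins as above around the 24% estimate, and asks whether the 13.1% population share falls inside the interval.  The design effect of 1.5 in the last row is purely hypothetical, chosen only to show what weighting could do; it is not a figure from the Edison report.

```python
import math

def interval(p_hat, n, z, deff=1.0):
    """Worst-case (p = 0.5) interval around p_hat; deff inflates the variance for weighting."""
    margin = z * math.sqrt(deff * 0.25 / n)
    return p_hat - margin, p_hat + margin

def covers(bounds, value):
    """True if value lies inside the (low, high) bounds."""
    low, high = bounds
    return low <= value <= high

P_HAT = 0.24       # share of monthly Twitter users reported as African-American
N = 105            # monthly Twitter users in the sample
ACS_SHARE = 0.131  # American Community Survey population share

scenarios = [
    ("alpha .05", 1.96, 1.0),
    ("alpha .01", 2.576, 1.0),
    ("alpha .05, deff 1.5 (hypothetical)", 1.96, 1.5),
]
for label, z, deff in scenarios:
    bounds = interval(P_HAT, N, z, deff)
    print(f"{label}: {bounds[0]:.1%} to {bounds[1]:.1%}, "
          f"covers 13.1%: {covers(bounds, ACS_SHARE)}")
```

Even a modest inflation of the variance is enough to pull the population share inside the .05 interval.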

Is Twitter “disproportionately” African American, White, Hispanic, or Green?  The simple fact is that from this data, we can’t say so with confidence.  If Edison had been a little more forthcoming with their sample sizes, it might be more likely that the blogger/journalist who reported these data would have sensed something wrong.  But I wouldn’t bank on it, because it seems like Edison Research was pushing this spin from the get-go.

A final note: as I was researching/considering this piece, it was interesting to see the “spin” being placed on this “fact” around the blogosphere.  Of course, you had your standard racist comments/tweets of the “there goes the neighborhood” variety, but there also appeared to be a large swath of users who were heralding this as a point of pride.  Before you examine my subconscious racist motives for examining this question, please just know I like getting surveys right.  And if Edison wanted to get this right, they could start by giving us a topline cross-tab of ethnicity, Twitter use, and the respective margins of error.

Ugh, footnotes on a blog!

1. Research consistently demonstrates a negatively correlated relationship between age and nonresponse; young users are more likely to under-respond, increasing their odds of being weighted in a population (and increasing their margins of error).  Research is mixed on the relationship between ethnicity and nonresponse.


22
Apr 10

Social Technology and Teenage Discussion Networks

On Tuesday, the Pew Internet and American Life Project released a new, must-read report on Teens and Mobile Phones.  The project was a collaboration between Pew and the University of Michigan’s Communication Studies department, and it involves some of the top researchers of teens and technology (Amanda Lenhart, Richard Ling, Scott Campbell and Kristen Purcell).

In addition to releasing the great report, Pew did something new by simultaneously releasing the data sets used in the report (if I’m not mistaken, they’re usually embargoed a few months).  As someone who pays very close attention to Pew’s research, I was very pleased to see this – if I have questions or want to explore something further, I can go right to the data.

One of the questions in the Pew report was a modification of the General Social Survey’s (GSS) “discussion networks” question.  Questions of this sort ask individuals how many people they can discuss personal matters with, which seems to be a good proxy for one’s close, supportive network.  Using the GSS data, Peter Marsden found in 1987 that Americans, on average, have three discussants.  Replicating the analysis in 2006, McPherson and colleagues found that discussion networks had shrunk to an average of two.  There’s been plenty of criticism of the measure (my favorite being Bearman and Parigi’s “Headless Frogs” paper; see also Fischer, 2009).  Most recently, Hampton and colleagues explored the effect of technology on discussion networks in a great Pew report entitled Social Isolation and New Technology.

One of the great promises of “social technologies” is that they connect us to important others.  By participating in a social network site, for example, we’re able to keep in touch with a broader range of diverse contacts.  Critics are quick to point out that all those ties may be meaningless; in research, we draw distinctions between tie strength.   Ellison and colleagues have demonstrated that use of Facebook among undergraduates increases a form of bridging (weak-tie) social capital.  The “important matters” question, on the other hand, is more reflective of bonding (strong-tie) relations.  Therefore we can use Pew’s new data to explore the relationship between use (and intensity of use) of social technologies and a teenager’s strong-tie supportive network.

First, some important notes.  From here on I am going to be talking about novel data analysis.  This is a blog post, so I am going to keep the reporting informal.  If you wish to explore my analysis, or re-run it, I have included a zip file that contains the questionnaire, data, reasonably commented do-file and output log.  Sorry, R fans, Stata wins for survey analysis; these files are compatible with Stata 11.  The analysis I’ll be talking about is weighted (individuals as PSU, using PSRAI’s omnibus weight).  The dependent variable is an overdispersed (mean=~5, variance=~10) count, so the proper regression is negative binomial (confirmed with an LR test on the alpha).  Finally, the question explored in this analysis is not a direct match to the GSS question; it is actually quite different (the GSS question is a name generator).  Therefore, the results are not directly comparable, but they are likely informative.  See the Pew report methodology section for a full description of the sample.
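A quick way to see why Poisson is a poor fit here: Poisson assumes the variance equals the mean, while the negative binomial (in its common NB2 form) allows Var = mean + alpha * mean^2.  Plugging in the rough mean and variance quoted above gives a back-of-envelope alpha.  This is a method-of-moments sketch on rounded summary numbers, not the likelihood-ratio test from the do-file, and the function name is hypothetical.

```python
def nb2_alpha(mean, variance):
    """Method-of-moments dispersion, assuming Var = mean + alpha * mean**2 (NB2)."""
    return (variance - mean) / mean ** 2

mean, variance = 5.0, 10.0  # approximate values quoted above
print(f"variance / mean = {variance / mean:.1f}")  # a ratio near 1 would suit Poisson
print(f"alpha estimate  = {nb2_alpha(mean, variance):.2f}")
```

An alpha of zero collapses the model back to Poisson, which is what the LR test on alpha checks.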

Teenage Discussion Networks

For the Teens and Mobile Technology study, interviewers spoke to 800 teens age 12-17, asking a range of questions about technology use.  Included in the questionnaire was the question about discussion networks.  In this question, interviewers asked how many people the individual “feel[s] very close to and with whom you are frequently in contact to discuss various things, including your personal issues and feelings.”  The mean response was a little over 5, with a standard deviation of three.  The density plot is included at right.

First, I explored if demographic and socio-economic factors were associated with the size of teenage discussion networks.  Pew collected data on age, gender, family income, parent’s ethnicity, and total number of kids in the household.  These variables could impact the teen’s ability to form discussion networks for a variety of reasons, so it is worthwhile to retain them as control variables.  I found only one variable significant: being of “black, non-hispanic” parentage.  Compared to teens of “white, non-hispanic” parentage, teens of “black, non-hispanic” parentage have a lower incidence rate of reported discussants (IRR=.8041, p=0.011, Model1.pdf).

Next, I wanted to explore the effects of internet use, social network site use, and mobile phone ownership on the size of teenage discussion networks, controlling for demographic factors.  I found that use of the internet, use of social network sites, and ownership of a mobile phone were all positively and significantly (p<.05) associated with the size of the support network (Model2.pdf).  Importantly, ethnicity remained negative and significant, indicating that teens of “black, non-hispanic” parentage do not make up the gap in the support network size due to technology use.

Of course, most teens do not use technology in isolation.  In fact, Pew’s report indicates that most teens use the internet, SNS, and mobile phones in combination.  Therefore, we should explore the effects of these technologies simultaneously to identify the robust contribution to the size of the discussion network.  When we evaluate these simultaneously controlling for demographic factors, we find that internet use and mobile phone use no longer significantly contribute to the size of a teen’s discussion network.  Use of social network sites, however, remains significant (IRR=1.142, p=.028, Model3.pdf).  It appears that teens who use social network sites are more likely to report larger discussion networks.  This is pretty impressive!

Before we get too excited about the promise of social network sites, let’s consider what we know about them.  For most teens, the social network site represents an online space for interacting with offline friends.  If use of the social network site really adds people to the core discussion network, where are they coming from?  Couldn’t an alternate explanation be that individuals who are more social offline are also more social online?  Pew also asked about frequency of offline socialization, and we can enter this measure as a control in our model.  When we do, we see that none of the technologies remain significant, and offline interaction emerges as a significant predictor (IRR=1.074, p=.010, Model3.pdf).  It turns out that teens who are more active with their friends have larger discussion networks, controlling for demographics and social technology use.
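For readers not used to incidence rate ratios: in a negative binomial model with the usual log link, the IRR is the exponentiated coefficient, and (IRR - 1) * 100 is the percent change in the expected count associated with that predictor.  The sketch below applies that to the three IRRs reported above; the helper name is hypothetical, and for the offline socialization measure the change is per unit of Pew's scale.

```python
import math

def describe_irr(irr):
    """Return (log-scale coefficient, percent change in expected count) for an IRR."""
    return math.log(irr), (irr - 1.0) * 100.0

reported = {
    "black, non-hispanic parentage (vs. white, non-hispanic)": 0.8041,
    "social network site use": 1.142,
    "offline socialization (per unit of the measure)": 1.074,
}
for label, irr in reported.items():
    coef, pct = describe_irr(irr)
    print(f"{label}: coefficient {coef:+.3f}, {pct:+.1f}% expected discussants")
```

Read this way, an IRR of .8041 is roughly a 20% smaller expected discussion network, holding the other variables in the model constant.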

Some Discussion

It should be noted that Pew’s report did contain a number of “technology intensity” or “differential technology use” variables (e.g. how often do you…).  I included these in my exploratory analysis and none were significant, so I focused on use effects.  In the study of “social impact of technology”, there is a long history of attribution error regarding the “effects of technology.”  My goals in this analysis were twofold: first, to explore a recurring question that is addressable with Pew’s data (is technology use robustly associated with larger discussion networks), and second, to explore some alternate hypotheses to the findings (a common theme in “discussion networks” research).

What I see in this data is a manifestation of the ubiquity of technology in teenage life.  If our technology is used to connect to those around us, the effects of the technology will be constrained within the social setting.  What we may be seeing here is that teens who are already outgoing are more likely to use social technologies.  That is, the use of the network is built into the everyday processes that would be associated with the growth of a discussion/support network.  This finding is mundane, but it raises the question – how might we leverage technologies to enable less outgoing teenagers to expand their support networks?

Finally, please treat this post as a rough draft, a work in progress.  The fact I feel it is acceptable to write a blog post like this is evidence I’ve been in grad school too long, so it is time to get back to my dissertation.

Ugh, Citations on a blog!

  • Bearman, P. and Parigi, P.  (2004).  Cloning Headless Frogs and Other Important Matters: Conversation Topics and Network Structure. Social Forces, 83(2), 535–557.
  • Ellison, N. B., Steinfield, C., and Lampe, C.  (2007).  The Benefits of Facebook “Friends:” Social Capital and College Students’ Use of Online Social Network Sites.  Journal of Computer-Mediated Communication, 12(4).
  • Fischer, C. S.  (2009).  The 2004 GSS Finding of Shrunken Social Networks: An Artifact?  American Sociological Review, 74(4), 657–669.
  • Hampton, K., Sessions, L., Her, E. J., and Rainie, L.  (November 4, 2009).  Social Isolation and New Technology.  Pew Internet and American Life Project.  Retrieved November 4, 2009 from http://www.pewinternet.org/Reports/2009/18--Social-Isolation-and-New-Technology.aspx.
  • Marsden, P. V.  (1987).  Core Discussion Networks of Americans.  American Sociological Review, 52(1), 122-131.
  • McPherson, M., Smith-Lovin, L., and Brashears, M.  (2006).  Social Isolation in America: Changes in Core Discussion Networks over Two Decades.  American Sociological Review, 71(3), 353-375.

20
Apr 10

Announcing Freedom for Windows

I’m very pleased to announce that Freedom, my internet-blocking productivity software, is now available for Windows!

Over the past two years, countless people have written to me, asking if there is a version of Freedom for Windows.  I hated telling people that they couldn’t have Freedom.  I’m happy to report that if you’ve got a Windows XP, Vista, or 7 computer, you too can now experience Freedom.

Want to know a little more about Freedom?  Read about it in the New York Times Magazine, Salon.com, USA Today, Chronicle of Higher Education, LifeHacker, and others.  I’m also quite partial to the recent article on Freedom in the Guardian that starts: “With the help of a lovely man called Fred, I’m no longer in thrall to SamCam’s cape and Guido Fawkes.”

Let me know what you think!


29
Mar 10

Facebook Again to Test Privacy Boundaries

I’ve been paying attention to the discussion regarding Facebook’s proposed changes to the privacy policy (so have Michael Zimmer, TechCrunch, RWW and VentureBeat).  The most controversial is a proposal for Facebook to automatically share personal information with third party websites.  The mechanics go something like this: If you’re logged in to Facebook, and you visit a third-party site that has an established relationship with Facebook, Facebook will provide the website with your General Information, which is:

“your and your friends’ names, profile pictures, gender, user IDs, connections, and any content shared using the Everyone privacy setting.”

How would this work in practice?  Let’s imagine that CNN and Facebook team up.  If you’re logged into Facebook and visit CNN, the website would be able to welcome you by your full name, display gender-relevant content, show you recommendations from the people in your network who also visit CNN, and so on.  Going a little further, if you share your interest information, CNN might be able to dynamically display stories that match your interests.

The level of disclosure proposed in this new policy is similar (or even identical) to the information disclosure required for use of a Facebook app.  The critical difference in the new policy is that while applications require an opt-in, it appears that this new process will require an opt-out.  Facebook spokesperson Barry Schnitt:

“The opt-out hasn’t been built yet. We just want people to know they’ll be able to opt out. We’ve made that commitment. There will be an opt-out right when the user gets to the site, and there will be some opt-out functionality on Facebook. But as to where the button will be or how it will look, I don’t know, because they don’t exist right now.”

In theory, there will be two opt-outs.  The first will be the hypothetical button that Schnitt talks about.  The second will be to log out of Facebook and remove the Facebook cookie.  In reality, though, if you’re a Facebook user, you can never really opt out, because any time a Facebook friend visits a third party site Facebook will share some of your information with that site.

Although it is a good sign that Facebook has gone on record regarding privacy control, the previous comment reveals Facebook’s cavalier attitude towards privacy.  Quite literally, they’re talking about pushing identity information of 400 million people around, yet privacy is treated as an afterthought – something they’ll figure out later.  When will companies like Facebook and Google start bringing privacy teams in at the beginning of the design process, rather than at the end?

Shifting topics a little bit, I see this move as notable because it marks Facebook’s first foray into large-scale warehoused behavioral targeting.  Targeting companies like DoubleClick (owned by Google) routinely mine our travels around the web, allowing third-party consumers to generate targeted recommendations based on our habits.  Because this happens behind the scenes, we’re less likely to notice it (which doesn’t make it any less troubling).  Facebook’s move stands to confront us with behavioral targeting, and they should consider the boundary they’re confronting.  It may not seem to be a big thing to have a third party website welcome you by your first and last name, but it is a paradigm shift on the web.

TechCrunch argues that it is time to sharpen the pitchforks, in preparation for the major backlash against the service.  Let me explain why this is frustrating.  In my opinion, the role of the privacy team is to navigate the necessary tension between our freedoms to disclose and how companies can ethically and morally profit from our data.  Facebook’s failures with Beacon or Google’s failure with Buzz are not “wins” for privacy; rather, they are losses for companies, consumers, and the market.

This brings me back to what is troubling about the “sharpening pitchforks” mentality.  It doesn’t and shouldn’t have to be this way.  Compared to DoubleClick, Facebook’s move really isn’t any more troubling – if the system is implemented properly.  And if the system is implemented properly, it could be a win – for consumers, for Facebook, and for third parties.  So how can Facebook navigate this challenge?  Let’s start with research, sensible design, and a different style of rollout than the traditional ask-for-forgiveness-later approach Facebook seems to believe in.

At Facebook’s current size and scale, they can’t afford to get this wrong.  Through research, testing, and a willingness to put the customer first, Facebook could navigate the challenges of this new feature.  But make no mistake, more than anyone, they are in the bull’s-eye right now.  And if Facebook does decide to play cavalier with privacy, the mobs TechCrunch describes will be waiting.


11
Feb 10

Google Buzz as Experience Pattern

Although Google has made a number of ventures into the social space (Wave, Latitude, Profiles, etc.), they have yet to capture audience and mindshare like competitors Twitter, Facebook, or Myspace. Google Buzz, an interesting and troubling new foray by the company, is the most recent shot across the bow of the major social players. Having spent a day or two looking at the product, I’ll share some of my early feedback.

At its core, Buzz is a content sharing network that exists loosely within Gmail and Google profiles; content shared publicly within Buzz is populated within Gmail and to the profile. To participate in Buzz, one must agree to have a Google profile, a place “visible on the web so friends can find and recognize you.” Notably, the Google profile is at the center of Google’s social search efforts.

As I’ve written in the past, Google has had to walk a very fine line with how they “reveal” what they know about your social circle. Realistically, Google sits on behavioral social network data that is of equal value to what is created in Facebook or Myspace. Mining our web search patterns, our chat and email logs, and our travels across the web with analytics, Google knows who we connect with. The challenge Google has always faced is putting this information into play in a way that doesn’t freak everyone out. Google Profiles were a first step in that direction, asking people to list their sites and friends with a promise of better search positioning (sound familiar?).

As the profile and social network play an increasingly important role in search relevance judgments, it is in Google’s interest to leverage the vast social network data it already has. With Buzz, Google can pre-populate friends lists in a slightly less than creepy way, and then leverage that information in social search via the profile. The big win with Buzz isn’t Google’s entry into Twitter’s or Facebook’s space (Buzz isn’t even vaguely a Facebook or Twitter killer), but rather the data value Google is going to reap through a massive profile-creation effort. Buzz just might be the glue necessary to encourage people to articulate the extant network connections in Google. For Google, it is more important that you use Buzz once than that you use it on an ongoing basis.

What are the implications of a system like Buzz? It is a pretty interesting case of what might be thought of as data leveraging. As more of our patterns are analyzed in a range of systems, corporations are going to be challenged to manage our confrontation with our patterns.  I recall a conference I attended where some senior developers at a messaging analytics firm were discussing the creepiness factor inherent in showing people their behavioral patterns.  Experience and design patterns are rare for this sort of confrontation, and Buzz is an interesting case.

If such data-leveraged confrontations are going to become more frequent, we are challenged by implications of ubiquitous recording.  For example, Buzz is pre-populated with the “people you email and chat with most.”  One doesn’t need to be a Goffman scholar to know that a public listing of the people we chat with most presents social hazard.  It is remarkable to observe how often companies get the defaults of sharing wrong.

While we’re talking about privacy implications of Buzz, here are a few other points I’ve noticed as I look over the terms of service.

  • As I’ve mentioned, to use Google Buzz you must agree to have a Google profile created in your name.  Doing so shares things like your contact network and your accounts on other Google sites.  These are the defaults, which can be changed with effort.
  • You can’t delete your Google Buzz account.  If you create a Google Buzz account and wish to delete it, you have to delete your entire Google profile (killing your search listing, etc. at the same time).
  • If you wish to remove Google Buzz items, you must find them and delete them (“You have the option to remove your comments on others’ posts individually if you’d like”)
  • Finally, you are required to use your first and last name in Google Buzz (“you need to have a public Google profile which at a minimum includes your first and last name”)

Google Buzz is a truly interesting attempt from Google to leverage the vast social network resources that exist within their systems.  The implications of such a move are profound – for the company and for our experience with large-scale behavioral information.

Update: Google has responded to widespread criticism of the privacy defaults issue.  Many are uncomfortable with the following/follower pre-population, so Google has offered some new ways to manage these lists.  Notably, Google’s post still asks that you create a public profile – something completely unnecessary if you don’t have one already (that is, no one can see your pre-assigned contacts until they’re shared with a Google Profile).

I know this is cynical, but I’ve seen this happen enough to know that these “mistakes” often aren’t accidental.  As danah boyd has written previously, “In other words, this is “slippery slope” software development. Given what I’ve learned from interviewing teens and college students over the years, they have *no* idea that these changes are taking place (until an incident occurs).”  It is disturbing to see Google going down this road, trading PR for personal information as a calculated trust violation.