Unit Structures Fred Stutzman’s thoughts about information, social networks and technology.

Posted
Jun 26 2009, 10:29 am

Comments
5 Comments

Categories
Research

Recruiting Participants for a Research Study

Woody Hartzog and I are recruiting subjects for a study of privacy behaviors in online social networks (note: Twitter counts!).  If you’d like to participate (and you can remotely – via phone, Skype, etc), please be in touch.  The official recruitment script follows:

UNC-Chapel Hill researchers are conducting a study of privacy behaviors in social networking sites (Facebook, Myspace).  We seek individuals who maintain multiple profiles, and purposefully keep their profiles separate on social networking sites.  For example, people who maintain a “work profile” and a “personal profile” on a social networking site.  If you meet these criteria, we are interested in your opinions about privacy, as well as the social implications of maintaining multiple profiles.

To qualify for this research, you must be age 24 or older, have started using social networking sites within the last two years, and maintain multiple profiles (e.g. a “work profile” and a “personal profile”) on social networking sites.  Participation is entirely voluntary.  Individuals who wish to participate will be interviewed for one hour, and will be compensated $10.00 for their time.  Interviews can be in person, or remotely (over the phone, via Skype, etc.).  To volunteer for participation, or ask any questions about the project, please email Principal Investigator Fred Stutzman at fred@fredstutzman.com.  If you prefer, you may call 919-260-8508.

This research has been approved by the University of North Carolina Institutional Review Board, IRB-09-1078.  Gary Marchionini, Ph.D., Cary C. Boshamer Distinguished Professor in the School of Information and Library Science, is faculty supervisor of this study.

Put simply, we’re looking for people who are fairly recent adopters of SNS and who maintain more than one profile on a site or sites (e.g. maintain separate “personal” and “professional” identities), or who attempt to segment their identity in social network sites (create “friend lists” for coworkers, friends, or family, for instance).  If you have any questions, please feel free to leave them in the comments or email me directly.


Posted
Jun 24 2009, 2:16 pm

Comments
2 Comments

Categories
Thoughts

The Great Wall of Facebook

Fred Vogelstein has an interesting article in the new edition of Wired, previewing Facebook’s full-on assault on Google for targeted advertising territory.  The article makes news, and includes some great (and painfully ironic) quotes from Mark Zuckerberg in which he accuses Google of contributing to the surveillance society (Pot, Kettle, Black).  The article reads like a preview for the Super Bowl, with notoriously tight-lipped executives tossing bombs back and forth.  Congrats to Vogelstein for successfully stoking the ire of these monoliths.

The fundamental conflict of the article lies in the comparison of the advertising products offered by the two companies.  Google’s product, targeted text ads, is the single most successful product on the Internet.  The tiny, unobtrusive ads have fueled Google’s dominance in multiple markets; today, 90% of Google’s revenue comes from AdSense.  Facebook’s product is nascent – it is the concept that advertising works better when it is socially mediated.  That is, we are more likely to click on ads, content, and links when the content is funneled through our friends.  This theory is sensible, but to date, Facebook’s concept remains vaporware, with a majority of their revenue coming through traditional targeted text and banner campaigns.

Framed by Zuckerberg, the contrast between Facebook and Google is personal vs. impersonal.  Of Google he states: “You have a bunch of machines and algorithms going out and crawling the Web and bringing information back.  That only gets stuff that is publicly available to everyone. And it doesn’t give people the control that they need to be really comfortable.”  Vogelstein writes:

Facebook CEO Mark Zuckerberg envisions a more personalized, humanized Web, where our network of friends, colleagues, peers, and family is our primary source of information, just as it is offline. In Zuckerberg’s vision, users will query this “social graph” to find a doctor, the best camera, or someone to hire—rather than tapping the cold mathematics of a Google search. It is a complete rethinking of how we navigate the online world, one that places Facebook right at the center. In other words, right where Google is now.

Personal vs. impersonal.  Wouldn’t you rather get a doctor recommendation from ten of your friends than a text link?  The value of peer recommendations has driven many communities, including countless bulletin boards and fora, sites like epinions and Yelp, and members-only specialist communities.  The fundamental problem with monetization in Facebook’s case lies with norms that govern the exchange of advice, particularly that the advice be truthful and unbiased.  If we are to trust advice, we must know that external agents aren’t corrupting or influencing the transmission of advice.  We can get advice from Facebook regarding doctors, but we won’t trust the advice if Facebook pays our friends to recommend certain doctors.

Facebook’s grand vision involves a wholly-contained world of social information that is brokered out through the web.  With enough critical mass, it is argued, most of our common information needs can be answered by our social networks.  As with most technological main-effect hypotheses, the formulation is suspect.  Researchers of social support argue that support is more effectively derived from certain actors, that support is contextual, etc.  In a traditional model, where the people around you are the primary producers of information, your personal support network is crucial.  With the advent of the Internet, however, most of us no longer exist in a traditional model where the people around us are our only support vector (1).

The reality is that Google, and other search engines, have restructured expectations regarding everyday information seeking.  It is no longer good enough to simply get recommendations from a personal network when there is a vast quantity of electronic information available at one’s fingertips.  You can certainly get doctor recommendations from your friends, but the online search for information about the doctor is now a natural part of the information seeking process.  In this sense, Facebook is complementary, providing an important but not all-encompassing factor in our decision making process.  The argument that individuals will move their information seeking to a social network, and away from the mechanistic site Google, simply assumes too much.  Google has already won by making itself an integral part of our everyday information seeking processes.

If Facebook (a proxy for “socially mediated search”) is a complementary and useful part of everyday information seeking, we must consider the relevance of information we get from the site.  We generally assess relevance in information systems through “recall” and “precision.”  In Facebook, recall is strictly bound to our known social world – the people who we have connected with.  Therefore, precision is a function of how well the various others producing results match our needs.  If you have 500 friends, spaced across a variety of age ranges, is it safe to assume that information you get from the network will actually be all that relevant?  Our core social networks are generally homophilous, but our core social networks are very small.  Expand past a certain network size and it becomes likely the interests and experience of your “friends” will vary significantly from yours.
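To make the two measures concrete, here is a minimal Python sketch.  The friend names and the sets of “returned” and “relevant” answerers are invented for illustration, and nothing here reflects how Facebook actually computes anything.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned items.

    precision: share of returned items that are relevant
    recall:    share of relevant items that were returned
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical scenario: five friends answer my question, but only two of
# them share my tastes.  Four people in the world could have given a
# relevant answer; two of those are outside my network entirely.
returned = {"ann", "bob", "cam", "dee", "eli"}
relevant = {"ann", "bob", "fay", "gus"}

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case recall can never reach 1.0, because two of the relevant answerers are not among my connections; that is the sense in which recall is bound to the known social world.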

Facebook could address this problem with friend lists, the privacy feature that compels individuals to place their friends in groups.  Perhaps friend lists could be converted to interest groups (People whose book recommendations I trust), but the mechanics of such a process would require a good bit of intervention on behalf of the user.  The participation gap is also problematic – if the people who you really trust for book recommendations are not heavy users of Facebook, then it is unlikely you’ll have your information needs addressed.

Facebook could develop algorithms that look for similarity between question askers and answerers – if I ask for a book recommendation, perhaps Facebook could weight responses from people who share my stated book tastes.  This compels participation and broadcast of information, one of Michael Zimmer’s new laws of social networking.
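As a rough sketch of what such weighting might look like, the Python below ranks answers by the overlap between the asker’s and each answerer’s stated tastes.  Jaccard similarity is my own choice of measure here, and the tastes, book titles, and function names are all hypothetical; this is not a description of any actual Facebook algorithm.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of stated interests."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_answers(asker_tastes, answers):
    """Sort (answerer_tastes, recommendation) pairs by similarity to the asker.

    Returns a list of (similarity, recommendation), most similar first.
    """
    scored = [(jaccard(asker_tastes, tastes), rec) for tastes, rec in answers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical profiles and recommendations
my_tastes = {"sci-fi", "history", "essays"}
answers = [
    ({"romance", "thrillers"}, "Book A"),
    ({"sci-fi", "history", "poetry"}, "Book B"),
    ({"essays"}, "Book C"),
]

ranked = rank_answers(my_tastes, answers)
for score, rec in ranked:
    print(f"{rec}: similarity {score:.2f}")
```

Note that the ranking only works if both parties have filled in their tastes, which is exactly the participation requirement described above.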

Although the debate framed by Vogelstein and Zuckerberg is Facebook vs. Google, there is actually very little opportunity for Facebook to significantly edge into Google’s core market – targeted text-link ads.  Text link ads are served as a by-product of information search, which is an integral part of our everyday information seeking processes.  Facebook is likely to emerge as a complement to search, and in some areas it may perform better than search, but search will remain relevant.  The challenge to Facebook is to find a way to monetize their value areas without being in contravention of social norms.  The challenge to Google is to get access to the wealth of personal data Facebook is collecting (and no, Google Friend Connect and all of their other terrifically lame social products will not solve this problem).  For the consumer, the battle between Google and Facebook is a win-win, with the obvious exception of privacy matters.

(1) Those with “impoverished life-worlds” (those with limited access to information and resources) are unlikely to incorporate search engines or social networks into their everyday information search processes.


Posted
Jun 23 2009, 2:01 pm

Comments
No Comments

Categories
Noticed

Appearance on Morning Edition

I made a very, very brief appearance on Morning Edition this AM:

Fred Stutzman, who studies social networks at the University of North Carolina, thinks charging for services will turn out to be the best way for social networks to get profitable.

“People will pay for good technology,” he says. “People will pay for a responsive company.”

He points to the professional networking site LinkedIn. It offers some free services, but users pay for a premium level with more features. With only 40 million users, LinkedIn is significantly smaller than Facebook or MySpace, but it’s making a profit.

Facebook, though, may face a bit of a conundrum. There are two groups on the site called “We Will Not Pay To Use Facebook. If This Happens We Are Gone.” Their combined membership? Nearly 8 million.

Stutzman thinks that ultimately Facebook, MySpace and Twitter are going to be around for a long time. They just might not be the big cash cows that some people expect.

Unfortunately I missed the live broadcast.


Posted
Jun 18 2009, 6:40 pm

Comments
No Comments

Categories
Noticed

Zimmer on the Facebook Dataset

Michael Zimmer has released a new critique of the “Facebook Dataset” – and it is well worth reading.

Recall that last fall, a group of researchers affiliated with the Berkman Center for Internet & Society at Harvard University released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.” While the researchers took good faith steps to preserve the anonymity of the source of the data (and, presumably, the privacy of the subjects), I quickly narrowed it down to 7 possible universities, and then with only a little more effort, identified the source (with some confidence) as Harvard College. All this without ever even downloading or looking at the actual data.

Download the draft of Michael’s paper.


Posted
Jun 15 2009, 7:21 pm

Comments
No Comments

Categories
Noticed

Facebook passes Myspace

Via Inside Facebook: comScore: Facebook Passed MySpace in the US for the First Time in May.

It’s been a long time coming, but Facebook has finally passed MySpace in terms of total US uniques, according to comScore. In May, comScore reported 70.28 million US uniques for Facebook up 97% year over year, compared to 70.26 million for MySpace down 5% year over year.

Blogging this for posterity’s sake.


Posted
Jun 4 2009, 8:58 pm

Comments
3 Comments

Categories
Thoughts

Rethinking Twitter and Gender Differences

On Monday, the Harvard Business School posted a “conversation starter” study on gender differences in Twitter use.  The authors found that “men have 15% more followers than women” and “an average man is almost twice as likely to follow another man than a woman.”  The authors suggest, without empirical data, that men find the content produced by women less compelling “because of a lack of photo sharing.”  Is everyone else offended by this base characterization?

As it happens, the study has serious flaws.  I’d like to point those out, and then suggest an alternative method for addressing these questions.  Let’s start by talking methods.  This study is a survey; using a random sample of 300,000 Twitter users, the authors attempt to draw population-level inferences about “friending” behavior in Twitter.

When conducting a population survey, researchers collect a sample and attempt to use that sample to draw inferences about a population.  The difference between the “sampled” population value and the “true” population value is known as error.  Total survey error (often summarized as mean squared error, MSE) has two components: sampling error and non-sampling error.  We are most familiar with sampling error; it is the difference between the “sample” value and the “true” value attributable to the sample selection.  Non-sampling error comprises all other error not attributable to sampling, such as data entry error, instrument error, etc.

For the purpose of this analysis, we are going to focus primarily on sampling error.  At the study sample size of 300,000, there is very little sampling error in an infinite population.  While we generally associate a large sample size with better quality data because of this small sampling error, there are two caveats.  First, above a certain sample size, say 20,000, there is little marginal gain in the addition of sample.  The difference between an n of 500 and an n of 1000 is vast, but the difference between an n of 20,000 and an n of 40,000 is much smaller, because sampling error shrinks only with the square root of the sample size.
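The diminishing return is easy to see with the textbook margin-of-error formula for a proportion.  The sketch below assumes a simple random sample, the worst-case proportion p = 0.5, and a 95% confidence level (z = 1.96); these are my simplifying assumptions, not details taken from the HBS study.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a
    simple random sample of size n (z=1.96 corresponds to ~95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (500, 1000, 20000, 40000, 300000):
    print(f"n={n:>7}: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

Doubling from 500 to 1000 trims the margin by more than a full percentage point, while doubling from 20,000 to 40,000 trims it by roughly a fifth of a point.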

On paper, a larger n is always better; here is the second caveat.  When dealing with very large samples, confidence intervals used to determine significance are smaller – meaning even the most minute differences become “significant.”  Furthermore, discovery of influential data is more difficult, as those data may be sufficient in number (i.e., a pattern emerges in influential data) to influence the distribution. As any Twitter user with a public profile knows, there are certainly some “patterns” that emerge in follower behavior.
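To illustrate how sample size alone can manufacture “significance,” here is a small sketch using a pooled two-proportion z-test.  The half-point gap (50.0% vs 50.5%) and the group sizes are invented for illustration; they are not figures from the study.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference between two sample proportions,
    using the pooled estimate of the common proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se


# The same hypothetical half-point gap at two very different sample sizes
z_small = two_proportion_z(0.500, 0.505, 1_000, 1_000)
z_large = two_proportion_z(0.500, 0.505, 300_000, 300_000)

print(f"n=1,000 per group:   z = {z_small:.2f}")
print(f"n=300,000 per group: z = {z_large:.2f}")
```

With 1,000 cases per group the gap is nowhere near the usual 1.96 cutoff; with 300,000 per group the identical gap clears it comfortably, even though its practical size has not changed.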

Let us revisit the purpose of the survey, which is to use a sample to draw inferences about a population with as little total error as possible.  The goal is not to achieve significant differences on wild hypotheses; it is to collect good data that represents a population.  To achieve this goal, survey designers expend a lot of effort understanding their populations, defining their sample, and working to achieve high data quality (while keeping costs under control).

Let’s say that I wanted to know the 2008 income of everyone over 18 born in my city.  So I go down to city hall, I ask for the names of everyone who was born in my city before 1991.  I then take this very large list, and cross-reference it with my magical 2008 tax records, and produce a wonderful study.  Can you spot some problems with the data?  At first, you might point out that not everyone over 18 born in my city earns an income.  Ok, that’s fine – I want to know that.  Now here’s the real problem: my city started keeping records in 1830, meaning well over half of the people in my sample are dead, and they report no income.  Now I’ve got some highly influential data that actually looks “normal” due to attrition.

Let’s consider what we know about Twitter.  If we believe Nielsen, about 60% of people who create Twitter accounts abandon them within a month.  And if we believe the fair and balanced news organization Fox News, Twitter has a spam problem (Ok, anyone who has a public profile knows that).   What might these trends tell us about our population?  First, there will be a large cluster of inactive (attrition) users.  Second, there will likely be a large cluster of users who do not follow anyone, or follow a very small number of people (characteristic of attrited users).  Finally, since following is non-reciprocal, these attrited users (and active users) likely have their follower numbers inflated by Twitter spammers.

What do the HBS numbers tell us?  The authors find that the mean number of tweets/user is 26, but the median is 1 and the 75th percentile is 4 tweets.  This indicates a highly non-normal distribution (it most likely approximates a bimodal distribution): there are a large number of users with 0 or 1 tweets (50% of the sample – and 75% of the sample have 4 or fewer tweets).  This indicates that a large portion of the sample is inactive.  (Of course, a number of these accounts could be “follower” accounts (i.e. people who do not post but follow), but I would argue this would constitute a small portion of the population). This provides good support for my first point.
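A toy dataset shows how these three summary numbers can coexist.  The 100 accounts below are entirely made up, constructed only so that they reproduce a mean of 26, a median of 1, and a 75th percentile of 4; they are not the HBS data.

```python
import statistics

# Invented tweet counts for 100 hypothetical accounts:
# a large inactive cluster, a few light users, and a small heavy-use cluster.
tweets = [0] * 30 + [1] * 25 + [2] * 10 + [3] * 5 + [4] * 10 + [125] * 20

mean = statistics.mean(tweets)
median = statistics.median(tweets)
q1, q2, q3 = statistics.quantiles(tweets, n=4)  # quartile cut points

print(f"mean={mean} median={median} 75th percentile={q3}")
print(f"share with 0 or 1 tweets: {sum(t <= 1 for t in tweets) / len(tweets):.0%}")
```

A fifth of the accounts carry nearly all of the activity here, which is why the mean alone says almost nothing about the typical account.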

My second point, non-follower data, is not addressed by the study.  They do not present information regarding the percentage of users who do not follow back, instead presenting an odds ratio that would hide the distribution of followers.  I would guess that at least 40% of the sample does not follow any users (or follows only “suggested” users).  My third point, that more people would be followed, seems to be upheld, as 80% of the sample has at least one follower.  There is likely some spam inflation there, and information about the distribution would tell us a lot.

As we can see, all signs point to low data quality, which casts all of the hypotheses and findings in serious doubt.  Just because a sample is large, and significance can be easily achieved, it doesn’t mean that data quality is good.  Unfortunately, it appears that the Harvard authors have made the error I describe in my income study – yes, they’ve collected a lot of people, but they failed to see who had died.  What good is an inference about a population if it is heavily influenced by bad data?  Don’t we actually want to know what real users are doing?

Beyond these data quality problems, there is also an issue with the gender classification; the authors rely on a corpus of names to predict gender of users.  As each name is a prediction, there is an error component associated with each name classification.  This error component must be taken into account as a function of the total variance component – meaning all of the things that looked significant may not actually be significant.

Since this is a “discussion,” I’d like to propose a method to re-run the study with better data quality (but larger standard errors).  The two main problems that will be addressed are compensation for attrition and gender classification.  To deal with Twitter attrition, let us first define it.  If we follow Nielsen’s numbers, a Twitter user who has posted at least once more than 30 days ago and at least once within the last 30 days has a decent chance of being an active user.  We may want to make this criterion more lenient – perhaps just requiring one post in the last 30 days.  Either way, we must define criteria to decide who is an active user (and this definition must be informed by data and theory).
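One way to read the activity rule above is sketched below: a user counts as active with at least one post inside the last 30 days and, under the stricter version, at least one post older than that.  The function name, the parameters, and the example timestamps are all hypothetical; a real definition would need to be tuned against data.

```python
from datetime import datetime, timedelta


def is_active(post_times, now, window_days=30, require_older_post=True):
    """Hypothetical activity rule.

    Strict version: at least one post within the window AND at least one
    post older than the window.  Lenient version (require_older_post=False):
    a single post within the window is enough.
    """
    cutoff = now - timedelta(days=window_days)
    recent = any(t >= cutoff for t in post_times)
    older = any(t < cutoff for t in post_times)
    return recent and (older or not require_older_post)


now = datetime(2009, 6, 4)
veteran = [datetime(2009, 3, 1), datetime(2009, 5, 30)]   # old and recent posts
newcomer = [datetime(2009, 6, 1)]                          # only a recent post
abandoned = [datetime(2009, 1, 15)]                        # only an old post

for name, posts in [("veteran", veteran), ("newcomer", newcomer), ("abandoned", abandoned)]:
    strict = is_active(posts, now)
    lenient = is_active(posts, now, require_older_post=False)
    print(f"{name}: strict={strict} lenient={lenient}")
```

The lenient rule admits brand-new accounts that may yet be abandoned, which is the trade-off between the two definitions.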

The problem with gender is a little more difficult.  I don’t spend a lot of time in the TREC community so I’m not sure how good automated techniques are, so I’m going to propose human classification.  The most efficient way to do this is with Mechanical Turk.  Turkers could be shown a profile and asked to decide the gender of the profile owner; you’d repeat with a different rater to get an estimate of reliability.  Your guess is as good as mine about agreement – I’m generally skeptical of ethnicity ratings by third parties, but I tend to think that gender can be reasonably assessed.  Update: @yardi brings up a good point regarding brand/persona/promotional/shared accounts.  My (too simple) answer is exclusion.  If we’re truly interested in this gender question, then non-gendered accounts fit an a priori exclusion criteria.  My gut instinct is that in a population sample, we would see low incidence of these accounts, and they could be analyzed separately to see how they would affect our data.
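For the two-rater reliability estimate mentioned above, one common statistic is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance.  Kappa is my suggestion rather than anything specified in the proposal, and the ten ratings below are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of ten profiles by two different raters
rater_a = ["F", "F", "M", "M", "F", "M", "F", "M", "M", "F"]
rater_b = ["F", "F", "M", "M", "F", "M", "M", "M", "M", "F"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Items on which the raters disagree could be sent to a third rater or set aside for separate analysis.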

So the study would be simple – collect a first-stage sample of profiles and assess if they meet the activity criteria (this can be done automatically).  Then run a second stage random sample on the eligibles and send them to Mechanical Turk.  You could send 3000 profiles to MTurk and have them assessed, with a goal of ending up with 2400 profiles, giving you +/-2% at p<.05.  Of course, all of the “friend lists” would also have to be gender coded, so if you have an average of 10 friends you’re looking at 24,000 extra codings (minus overlap).  If we include overlap and say we’ll have 25,000 unique profiles, and each profile has to be rated twice, at $.01 a HIT we’re looking at a total price of $500.00.  Of course, if we pull our sample back we can reduce this cost substantially.
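The arithmetic behind these figures can be checked in a few lines.  The margin-of-error formula assumes a simple random sample with p = 0.5 at 95% confidence, which is my assumption about how the +/-2% was derived; the sample size, friend count, rating count, and HIT price are the ones given above.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n)


def coding_cost_dollars(unique_profiles, ratings_per_profile, cents_per_hit):
    """Total Mechanical Turk cost, computed in whole cents to avoid float drift."""
    return unique_profiles * ratings_per_profile * cents_per_hit / 100


final_sample = 2400
avg_friends = 10
friend_codings = final_sample * avg_friends   # 24,000 before removing overlap
unique_profiles = 25_000                      # round figure used in the post

moe = margin_of_error(final_sample)
cost = coding_cost_dollars(unique_profiles, ratings_per_profile=2, cents_per_hit=1)

print(f"margin of error at n={final_sample}: +/- {moe * 100:.1f}%")
print(f"friend codings: {friend_codings}")
print(f"total cost: ${cost:.2f}")
```

Halving the final sample would roughly halve the cost while widening the margin of error by a factor of about 1.4, since error scales with the square root of n.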

There are a couple of questions: First, we can’t really say how much better humans will perform at gender-coding until we run a comparison to the machine-coded results.  My gut is that humans will perform at a higher level of accuracy, but there is still a variance component with the classification.  We also don’t know what kind of bias we introduce by cutting out “follower” profiles.  I don’t know how many of these unique profiles would show up in a population survey, but it is an open question.  And what about the findings, how would they change?  My gut is that a lot of these “stunning” findings would go away, and we’d see greater gender homogeny in “following” behavior.  “Follower” behavior would still be influenced by spam, so it might be useful to assign a spam attribute to profiles to be used as a covariate (you could have MTers code them, run them through spamassassin to get a naive score, or simply use standard techniques to find influential data).

The important takeaways from this discussion are that “bigger” is not always better with social data, that data should be looked at critically before running analysis (using existing information and theory), and that data that wildly contravenes existing findings should always be re-run to produce robust estimates.

Final note: The authors state that “On a typical online social network, most of the activity is focused around women – men follow content produced by women they do and do not know, and women follow content produced by women they know.”  Mike Thelwall’s (2008) large scale analysis of Myspace friending behaviors found that while females tend to friend females, there was not a significant gender effect for males.  In Mayer and Puller’s (2008) analysis of Facebook, they found that same gender was a significant predictor of friendship (in a potentially overfitted model).   Overall, studies commonly find gender differences regarding SNS/internet use; females are generally found to use communicative tools with greater intensity. (e.g. Joinson, 2008; Lenhart & Madden, 2007; Jones et al., 2009)

References:

Joinson, A. N.  (2008).  Looking at, looking up or keeping up with people?: motives and use of facebook.  In CHI ‘08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, New York, NY, USA, 2008 (pp. 1027-1036).  ACM.

Jones, S., Johnson-Yale, C., Millermaier, S., and Perez, F. S.  (2009).  U.S. College Students’ Internet Use: Race, Gender and Digital Divides.  Journal of Computer-Mediated Communication, 14(2), 244-264.

Lenhart, A. and Madden, M.  (April 18, 2007).  Teens, Privacy and Online Social Networks: How teens manage their online identities and personal information in the age of MySpace.  Pew Internet and American Life Project.  Retrieved March 9, 2008 from http://www.pewinternet.org/PPF/r/211/report_display.asp.

Mayer, A. and Puller, S. L.  (2008).  The old boy (and girl) network: Social network formation on university campuses.  Journal of Public Economics, 92(1-2), 329-347.

Thelwall, M.  (2008).  Social networks, gender and friending: An analysis of MySpace member profiles.  Journal of the American Society for Information Science and Technology, 59(8), 1321-1330.


Posted
Jun 1 2009, 9:29 pm

Comments
9 Comments

Categories
Research, Thoughts

Second Class Citizens on the social web

Over the past few days, I’ve seen a few blog posts referencing various “studies” that claim that young people don’t use Twitter.  Apparently, this is a problem.

As reported on CNET, “99 percent of 18- to 24-year-olds have profiles on social networks, only 22 percent use Twitter, according to a new survey from Pace University and the Participatory Media Network.”  Never mind that it’s not particularly fair to compare a sector to a single product; what does the study’s methodology look like?  More bad news – the question was posed to 200 members of a volunteer panel.  A small convenience sample provides very little inferential power; it is just as likely that this survey’s statistics looked like Pew’s numbers by chance occurrence.  However, my main goal here isn’t to rail against small or convenience samples being reported as representative – this is a pervasive problem and there’s not much that Unit Structures can do.

Rather, I’d like to question this problematization of the “fact” that Twitter’s users aren’t young.  The inherent bias in media coverage of social software is that social software is for “the young.”  If we look at the history of social networking websites, we find mixed evidence to support this theory.  For example, danah boyd’s ethnography (and my personal recollection) of Friendster was that it was a place for the late-twenties and thirty-something set.  If it weren’t for bonehead moves on behalf of Friendster’s staff, we might still be using the service.  LinkedIn, a popular and pervasive social network, has existed with an older skew for years.  Facebook’s growth after opening up?  It has been primarily dominated by older users.

This is not to say that young people aren’t important.  They are the lifeblood of a number of popular social networks, including large communities and countless smaller ones you’ll never hear about.  But why do we accept youth adoption as social fact ensuring community success?  One reason is surely that young people are trendsetters.  However, this theory of “trending” is an artifact of a pre-digital age, in which exclusivity and first-mover capitalization were required in the context of a production cycle.  What is a trend in the digital age, if I can have a perfect replica of what the kids have, streamed via cable modem?  Another reason is that young people are more connected.  There is truth here; young people are disproportionately more connected than older people, but this is also changing.

It might help to think of connectivity in two ways.  The first is traditional connectivity – the ability to access the internet.  If you look at Pew’s numbers[1], you’ll see that older users are less connected.  However, if you cut off the tail of the distribution and consider only users 60 and younger, you still find that 71% of them have connectivity.  Users in their 40s report connectivity rates in the 80s, about 10% less than teenagers.  For a large segment of users, we actually find that teens aren’t that much more connected.

Let’s consider a second notion of connectivity, which is the saturation of your online connections with friends or contacts.  Here, teens have old people beat hands down.  Teens interact more with their friends online, they manage their lives online – overall, they are more connected to their personal networks through computers.  Revisiting our first definition of connectivity, we can see that the explanation for the second definition must be heavily cultural, and not only technical.  That is, this high saturation of connectivity is because of norms among younger users, and not just because they’re so much more connected than adults.

So what does this mean for Twitter?  If Twitter’s users truly do skew older (and the difference between youngsters ‘18-24’ and oldsters ‘24-35’ was not significant in Pew’s study), then Twitter benefits from what I think of as an identity-participation shift.  My basic theory argues that as social norms and personal networks reward non-deceptive identities, people are more likely to share and participate in online communities.  Put another way, as it becomes more OK to share (it stops being weird to use your real name on your Facebook profile), and more of your friends do it, you’re more likely to extend this type of participation to other parts of the web.  Notably, the driving force of this theory is simple connectivity, which establishes the preconditions for the social shifts.  For Twitter, there is a whole new old generation of web users coming online and embracing social software – because it is now socially OK to do so, because they have the connectivity and connections they need to feel worthwhile sharing, etc.  And it just so happens that a lot of these people seem to have found Twitter.

The core problem here is that we’re treating older users as second-class citizens on the social web.  I think that Twitter and Facebook are going to serve as very useful testbeds to bat down this stereotype.  In fact, I think we may see the older user emerge as the truly first-class citizen on the social web.  As these users tend to be more settled, and are going through fewer transitions that lead to upheaval of their personal social networks, they may be more long-time users, less prone to “delete and move on” from one social site to the next.  Of course, these ideas need to be tested, and I’m right now embarking on a long-term project to explore questions like these.  If you are an older user of social software and might like to participate in my research interviews, keep watching this space for announcements.

[1] Jones, S. and Fox, S.  (January 28, 2009).  Generations Online in 2009.  Pew Internet and American Life Project.  Retrieved January 28, 2009 from http://www.pewinternet.org/PPF/r/275/source/rss/report_display.asp.

