Research


23
Aug 10

Next Steps

I’m pleased to report that I have accepted an offer to join Carnegie Mellon University’s Heinz College as a post-doctoral fellow.  At Carnegie Mellon, I will be working with Alessandro Acquisti.  I have been following Alessandro’s excellent work on privacy and technology for many years, so I am thrilled to join his team and have him as a mentor.

Alessandro’s team has extensive experience studying privacy in online social networks.  Alessandro and Ralph Gross wrote one of the earliest (and most cited) Facebook privacy papers: Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook. Last summer, the team published a truly head-turning study, showing that information gleaned from social network profiles could be used to predict social security numbers.  Most recently, Alessandro’s work was featured in Jeffrey Rosen’s New York Times Magazine article The Web Means the End of Forgetting.

I look forward to building on my current areas of research – privacy, identity and support in social networks – while being exposed to new opportunities and new challenges at CMU.  Speaking of challenges, the next challenge is a dissertation defense (later this fall) and then a move to Pittsburgh.  It has been a while since I’ve been to Pittsburgh, so I’m open to advice!


16
Aug 10

Pricing a used Honda Odyssey

One of the fascinating things about Craigslist is its informal post-sale sanctioning system.  That is, if you don’t take down your post after you sold the item, you get an increasingly annoying stream of emails from people asking questions about the item.  This continues, of course, until you actually remove the post offering the item you sold.  It is a great example of virtual community gardening.

Because of this sanctioning system, we can make a reasonable inference that items that have been taken off of Craigslist have been sold.  The items that have short lifespans on Craigslist are desirable – they are a good value, priced properly – and those with long lifespans are either unwanted or improperly priced.  I’ve recently been in the market for a used car (cough, a minivan), so I’ve been collecting information about the cars offered on Craigslist and their lifespans on the service. By looking at prices and lifespans (and a few other variables), can we automatically identify cars that offer the greatest value?

What follows are some charts from a simple survival analysis of the last 30 days of Honda Odyssey sales on Craigslist in Raleigh/Durham.  The de-duped dataset includes 55 cars (out of about 130 posts). Before you read too much into the data, note that many of the variables I explored (mileage, model year, etc.) weren’t significant predictors of “hazard” (that is, sale). If you were able to get this data on a larger scale, it does seem likely you’d be able to identify patterns of value. That said, there is a lot of randomness in a car’s quality once it has been driven, so the value of such a model-based approach would only be in prioritizing potentially under-priced cars.
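For anyone who wants to try this on their own scrape, here is a minimal sketch of the Kaplan-Meier survival estimate, one common starting point for this kind of analysis (the post does not say which estimator produced the charts). The function name and the listing data are hypothetical and are not the Raleigh/Durham dataset. A listing that was still online when collection stopped is treated as censored rather than sold.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(day, share of listings still up)] at each day with a removal.

    durations: days each listing was seen online
    observed:  True if the listing was taken down (treated as a sale),
               False if it was still up when collection ended (censored)
    """
    removals = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every listing leaves the risk set at its duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = removals.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical listings, for illustration only
days = [2, 3, 3, 5, 8, 8, 12, 20, 30, 30]
sold = [True, True, True, True, True, False, True, True, False, False]

for t, s in kaplan_meier(days, sold):
    print(f"day {t:>2}: {s:.3f} of listings still up")
```

A curve that drops quickly for one price band and slowly for another is the kind of pattern that would flag under-priced cars, though with only 55 listings any such split would be noisy.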

n.b.: You could also do this sort of analysis on want-ads. Want-ads have a great sanctioning system, as it is pointless to pay for an ad after you’ve sold your car.

p.s.: Perhaps what is charming about Craigslist is that there isn’t any meaningful historical data. This likely generates more variability in price, leading to the perception that you can find great deals (which you can!).


4
Aug 10

Why Gender is Important in Facebook

If you recall, a few years ago Facebook forced all users to select a gender if they wanted to continue using the site.  This move generated a little controversy – some individuals didn’t feel comfortable with sharing the information, or fitting into a gender classification.  Facebook responded:

However, we’ve gotten feedback from translators and users in other countries that translations wind up being too confusing when people have not specified a sex on their profiles. People who haven’t selected what sex they are frequently get defaulted to the wrong sex entirely in Mini-Feed stories. For this reason, we’ve decided to request that all Facebook users fill out this information on their profile.

Just today, I discovered (via the R Bloggers news feed) a video on the use of R in corporations like Google and Facebook.  The representative of the Facebook data team talked about some exploratory data analysis they did in 2007.  The finding?  “If a user comes on more than once and is willing to give Facebook a very basic piece of information – their gender – that seems to be the strongest predictor of whether they will stay on the site.”

I’m not looking to stir up any controversy.  Rather, I think it is an interesting example of analytics-based development, of research informing design.  Of course, the challenge of translating research into practice is immense.  Are there critical differences between individuals that share gender and those that don’t?  Did a forced gender-selection process invalidate the predictive model?  Was the controversy over gender selection worth the predicted benefit?  Perhaps Facebook’s 500 million users owe more to gender selection than we can imagine.

Anyway, the video has some age on it, but I did enjoy hearing about Facebook’s use of R (the other analytic examples provided are cited in the “Maintained Relationships on Facebook” report, plus there are a few ICWSM papers, I believe).  You can find the full video here (doesn’t look like embed is supported).


21
Jul 10

iTunes vs. Amazon as Survey Incentive

When surveying college-age students, Amazon and iTunes e-gift cards are frequently offered as incentive for participation [1].  While I’ve frequently heard that students prefer iTunes, the administrative burden of sending iTunes gift cards is high.  The iTunes store limits each account to $100 in gift card purchases per month, so if your compensation needs go over $100, you have to schlep to the store, buy gift cards, and put them in the mail.  Amazon, on the other hand, offers an effortless interface for sending gift cards and does not appear to have an unreasonable monetary restriction.  So if you choose the ease of Amazon over the shiny iTunes brand, do you lose anything?

Recently, I ran a survey of first-year students at UNC that tested preferences toward compensation.  The survey offered a dual-tier lottery compensation: Participants were entered to win an iPod touch or their choice of three gift certificates (see [2] for more on dual-tier incentives).  The three gift card choices were iTunes, Amazon, or a popular on-campus cafe, in the amount of ten dollars.  Response to the survey was good, by email-solicitation standards, at 31% (n~1200).  Males were slightly underrepresented, as is commonly the case.

So, what gift cards did my students prefer?  Clearly, the students preferred gift cards to iTunes (n=442) and Amazon (n=442) over the local cafe (n=131).  And we don’t really need any significance tests to see that the difference between iTunes and Amazon is a wash (p=.8406).

When conducting surveys, we’re not always interested in a large homogeneous population.  Sometimes we’re interested in sub-populations, such as certain genders, ages, or ethnicities.  Breaking the preferences out by gender, visual inspection indicates that female students prefer iTunes over Amazon, while male students prefer Amazon over iTunes.  Since neither population comes close to preferring the local cafe, I will focus on the difference between iTunes and Amazon for the rest of the analysis (i.e. drop the people who prefer the Local Cafe).

Of the students who selected Amazon or iTunes, we see that 53% of female students prefer iTunes, 47% Amazon.  Of males, 58% prefer Amazon, 42% iTunes.  The Chi-square test indicates a relationship between gender and preference (p=.001), and within-gender Chi-square goodness of fit tests indicate that while the female student preference difference is insignificant (p=.0922), the male preference towards Amazon is significant (p=.0064).
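The two kinds of chi-square test used above can be sketched in a few lines of standard-library Python. This is an illustration, not the original analysis: the function names are made up, the counts are hypothetical (the per-gender counts are not reported in this post), and the p-values it prints will not match the reported ones. With one degree of freedom, the upper-tail p-value of a chi-square statistic x equals erfc(sqrt(x/2)), which avoids needing a stats package.

```python
import math

def chi2_p_df1(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def goodness_of_fit(a, b):
    """Test whether two counts differ from an even 50/50 split."""
    expected = (a + b) / 2
    stat = (a - expected) ** 2 / expected + (b - expected) ** 2 / expected
    return stat, chi2_p_df1(stat)

def independence_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_p_df1(stat)

# Hypothetical counts chosen only to resemble the percentages above
women = (290, 257)  # (iTunes, Amazon)
men = (142, 195)

print("women, iTunes vs. Amazon: chi2=%.2f, p=%.4f" % goodness_of_fit(*women))
print("men, iTunes vs. Amazon:   chi2=%.2f, p=%.4f" % goodness_of_fit(*men))
print("gender x preference:      chi2=%.2f, p=%.4f" % independence_2x2([women, men]))
```

The goodness-of-fit test asks whether one group splits evenly between the two cards, while the 2x2 test asks whether the split differs between groups. That is why the two can disagree, as they do for the female students here.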

To test some higher order interactions, I employed a logistic regression model to test the effects of gender and a few other covariates.  First, since much of my sample is from NC, I tested to see if NC residency might contribute towards a preference.  In this model, gender remained significant, but NC residence was not significant (p=.828).  Next, I looked to see if GPA might be a factor in preference.  Gender remained significant, and GPA’s p-value was low (p=.081) but did not reach significance (directionally, higher GPAs leaned towards Amazon).

In the last two models, I looked at ethnicity and age.  In the ethnicity model, gender is significant, and only one ethnicity is significant.  Compared to other ethnicities, students who self-report as Asian demonstrate a preference towards Amazon (OR=.158, p<.001).  With age, gender again remained significant, but 19 year old students (compared to 18 year old students) seem to prefer iTunes (OR=1.49, p=.004).  Notably, a gender by age interaction was not significant.
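For readers less used to odds ratios, here is a small sketch of how they relate to the percentages reported earlier. It computes an unadjusted odds ratio straight from the 53% and 42% iTunes shares, which is not the same as the covariate-adjusted ORs from the models above. The function names are illustrative, and treating iTunes as the outcome being modeled is an assumption, since the post does not state how the outcome was coded.

```python
import math

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

def odds_ratio(p_group, p_reference):
    """Odds of the outcome in one group relative to a reference group."""
    return odds(p_group) / odds(p_reference)

# Shares preferring iTunes among students who chose iTunes or Amazon
female_itunes = 0.53
male_itunes = 0.42

unadjusted = odds_ratio(female_itunes, male_itunes)
print(f"female vs. male odds of choosing iTunes: {unadjusted:.2f}")

# A logistic regression reports the same kind of quantity as exp(coefficient)
coefficient = math.log(unadjusted)
print(f"equivalent logit coefficient: {coefficient:.3f}")
```

Read this way, an OR above 1 means higher odds of the modeled outcome than the comparison group and an OR below 1 means lower odds.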

To briefly review, it seems that among my population, the anecdotal preference towards iTunes is just that: anecdotal.  This is good news for me, because it is much more complicated to process iTunes gift cards than Amazon gift cards.  Some final notes: This is not really a proper experiment – such an experiment would use completely randomized solicitation.  Also, the presence of the third category (Local Cafe) is potentially troubling if being a fan of a Local Cafe also correlates to, say, being an iTunes fan or an Amazon fan.  Caveat emptor, blog post, not peer reviewed, etc.

1.  I don’t have a citation for this, but I do monitor a number of email lists that frequently offer research solicitations.  YMMV.

2. See Li, Kaiwen (2006).  Student Preference for Survey Incentive.  UC Davis Student Affairs Research & Information Tech Report.

Finally, I promise that Amazon has not compensated me in any way, say, by sending me a bunch of gift certificates or a Nikon 12-24mm DX lens or anything like that.


24
May 10

Social Network Analysis in R

Last week, I had the pleasure of attending the 2010 Political Networks Conference.  The first day of the conference included workshop sessions led by Matthew Jackson and Carter Butts, two eminent networks researchers.  Both sessions are now online.

The lecture by Carter Butts will be of particular interest to individuals looking to use R for social network analysis.  Butts is the author of a number of network analysis packages for R (many of which come bundled in the amazing statnet package).

Network Analysis with statnet for Individual, Organizational, and International Relations Applications by Carter Butts, University of California-Irvine

Advanced Network Analysis by Matthew O. Jackson, Stanford University

If you find these materials useful, you might also wish to check out Steve Goodreau and David Hunter’s tutorial Advanced Social Network Analysis Using R and statnet available at the Complexity and Social Networks blog.


6
May 10

What News Organizations Share With Facebook

Last month, Facebook announced a number of new features, including “personalization” (which generated significant controversy) and “social plugins.”  The plugins are described as follows:

Social plugins let you see what your friends have liked, commented on or shared on sites across the web. All social plugins are extensions of Facebook and are specifically designed so none of your data is shared with the sites on which they appear.

According to Mashable, over 50,000 plugins have been installed since the rollout.  Seeing one’s Facebook friends suddenly start showing up on third party sites has raised privacy concerns, which Facebook quickly addressed in a blog post, stating “Because [third party sites] have given Facebook this “real estate” on their sites, they do not receive or interact with the information that is contained or transmitted there.”

Here’s the rub.  By giving “real estate” to Facebook, third party sites have created a one-way mirror, allowing Facebook to peer in on what we’re doing.  If you’re logged in to Facebook, and you visit a third party page with a social plugin, Facebook knows where you’ve been.  The mechanism is simple – cookies and referrals – and it will allow Facebook to create personalized behavioral profiles that, combined with the information we articulate in Facebook, will be tremendously valuable.

To explore the privacy implications of Facebook’s social plugins, I visited the websites of the top 15 U.S. online news destinations (based on some 2009 Nielsen data), and a few honorable mentions.  I then selected a news story from the front page, and loaded the page.  I checked to see if social plugins were enabled, if the Facebook cookie was called, and if the referring page was sent to Facebook (basically, did the site identify you to Facebook, and share the page you were on).
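Part of this check can be automated. The sketch below uses only the Python standard library to scan a page's HTML for iframes or scripts served from Facebook hosts. It is not the procedure used for the spreadsheet. The host list, class and function names, and the sample page are all assumptions for illustration. It also only detects the embed: whether the Facebook cookie and the referring page are actually sent still has to be confirmed by watching the browser's requests.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

FACEBOOK_HOSTS = ("facebook.com", "fbcdn.net")  # assumed list, not exhaustive

class EmbedFinder(HTMLParser):
    """Collect iframe and script URLs served from Facebook-controlled hosts."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("iframe", "script"):
            return
        src = dict(attrs).get("src") or ""
        host = urlparse(src).hostname or ""
        if any(host == h or host.endswith("." + h) for h in FACEBOOK_HOSTS):
            self.embeds.append(src)

def find_facebook_embeds(html):
    finder = EmbedFinder()
    finder.feed(html)
    return finder.embeds

# Hypothetical page, for illustration only
sample_page = """
<html><body>
  <h1>Hypothetical news story</h1>
  <iframe src="http://www.facebook.com/plugins/like.php?href=http://news.example.com/story"></iframe>
  <script src="http://static.example.com/site.js"></script>
</body></html>
"""
for url in find_facebook_embeds(sample_page):
    print("browser would request:", url)
```

Any URL this finds is a request the reader's browser makes to Facebook while loading the story, which is the moment the cookie and referrer would travel along with it.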

I found that of the top 15 online news destinations, 9 were sharing information with Facebook (MSNBC, CNN, CBS, ABC, Fox News, Washington Post, and the Tribune, McClatchy and Gannett Companies[1]).  Notably, The New York Times, BBC, Yahoo News, AOL News, and Google News did not share information.  I then checked a few favorites of mine: NPR (yes), Drudge (no), Huffington Post (yes), and Politico (no).  I’ve included all of the details on a spreadsheet, embedded below or html version.

According to Nielsen, the 9 news organizations sharing information with Facebook account for over 177,161,000 monthly unique visitors.  Granted, not all of these views will go to social plugin enabled pages, and not all visitors will be logged-in Facebook users.  But with 400 million users, it is safe to assume that a substantial proportion of that information will go to Facebook.  If you stay logged in to Facebook, it is increasingly likely that Facebook will know what news you read.

My beef here isn’t necessarily with Facebook; Google and other behavioral-targeting firms have very similar SOPs.  Rather, I’m uncomfortable that so many news organizations felt comfortable sharing the news-reading behaviors of their customers who just so happen to be logged in to Facebook.  And really, what do they get for trading this tremendously valuable asset?  I get to see that a random friend liked an article?

I think it is time that someone wrote a Firefox plugin that specifically manages the Facebook cookie, only allowing it to be accessed when someone is on Facebook proper.  Clearly, we can’t trust third parties – even reputable news organizations – to protect our data.  Here’s the spreadsheet from my analysis:

Note: For media conglomerates (Tribune, McClatchy, Gannett) I visited the flagship outlet (Chicago Trib, Sac Bee, and USA Today, respectively).
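The cookie-managing plugin suggested above comes down to one rule: send the Facebook cookie only when the page in the address bar is itself on Facebook. Here is a rough sketch of that rule in Python, purely to make the logic concrete. A real Firefox add-on would be written against the browser's own extension APIs, and the function names here are hypothetical.

```python
from urllib.parse import urlparse

def on_facebook(url):
    """True if the URL's host is facebook.com or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "facebook.com" or host.endswith(".facebook.com")

def allow_facebook_cookie(request_url, page_url):
    """Send the Facebook cookie only when the top-level page is Facebook itself."""
    return on_facebook(request_url) and on_facebook(page_url)

# A like button embedded in a (hypothetical) news story: cookie withheld
print(allow_facebook_cookie(
    "http://www.facebook.com/plugins/like.php",
    "http://news.example.com/story",
))

# Browsing Facebook directly: cookie allowed
print(allow_facebook_cookie(
    "http://www.facebook.com/home.php",
    "http://www.facebook.com/",
))
```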


22
Apr 10

Social Technology and Teenage Discussion Networks

On Tuesday, the Pew Internet and American Life Project released a new, must-read report on Teens and Mobile Phones.  The project was a collaboration between Pew and the University of Michigan’s Communication Studies department, and it involves some of the top researchers of teens and technology (Amanda Lenhart, Richard Ling, Scott Campbell and Kristen Purcell).

In addition to releasing the great report, Pew did something new by simultaneously releasing the data sets used in the report (if I’m not mistaken, they’re usually embargoed a few months).  As someone who pays very close attention to Pew’s research, I was very pleased to see this – if I have questions or want to explore something further, I could go right to the data.

One of the questions in the Pew report was a modification of the General Social Survey’s (GSS) “discussion networks” question.  Questions of this sort ask individuals to list how many people with whom they can discuss personal matters, which seems to be a good proxy for one’s close, supportive network.  Using the GSS data, Peter Marsden found in 1987 that Americans, on average, have three discussants.  Replicating the analysis in 2006, McPherson and colleagues found that discussion networks had shrunk to an average of two.  There’s been plenty of criticism of the measure (my favorite being Peter Bearman’s Headless frogs paper, see also Fischer, 2009).  Most recently, Hampton and colleagues explored the effect of technology on discussion networks in a great Pew report entitled Social Isolation and New Technology.

One of the great promises of “social technologies” is that they connect us to important others.  By participating in a social network site, for example, we’re able to keep in touch with a broader range of diverse contacts.  Critics are quick to point out that all those ties may be meaningless; in research, we draw distinctions between tie strength.   Ellison and colleagues have demonstrated that use of Facebook among undergraduates increases a form of bridging (weak-tie) social capital.  The “important matters” question, on the other hand, is more reflective of bonding (strong-tie) relations.  Therefore we can use Pew’s new data to explore the relationship between use (and intensity of use) of social technologies and a teenager’s strong-tie supportive network.

First, some important notes.  From here on I am going to be talking about novel data analysis.  This is a blog post, so I am going to keep the reporting informal.  If you wish to explore my analysis, or re-run it, I have included a zip file that contains the questionnaire, data, reasonably commented do-file and output log.  Sorry, R fans, Stata wins for survey analysis; these files are compatible with Stata 11.  The analysis I’ll be talking about is weighted (individuals as PSU, using PSRAI’s omnibus weight).  The dependent variable is an overdispersed (mean=~5, variance=~10) count, the proper regression being negative binomial (confirmed with LR test on the alpha).  Finally, the question explored in this analysis is not a direct match to the GSS question; it is actually quite different (GSS is a name generator).  Therefore, the results are not directly comparable, but they are likely informative.  See the Pew report methodology section for a full description of the sample.
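To see why the count is called overdispersed, a quick back-of-the-envelope check helps. A Poisson model forces the variance to equal the mean, while the common NB2 form of the negative binomial allows variance = mean + alpha * mean^2. The sketch below plugs in the approximate mean and variance quoted above. It is a method-of-moments illustration on the raw outcome, not the likelihood-ratio test from the do-file, and the function name is made up.

```python
def nb2_alpha(mean, variance):
    """Method-of-moments dispersion for Var = mean + alpha * mean**2."""
    return (variance - mean) / mean ** 2

mean, variance = 5.0, 10.0  # approximate values reported above
alpha = nb2_alpha(mean, variance)

print(f"variance/mean ratio: {variance / mean:.1f}")  # 1.0 would be Poisson-like
print(f"implied alpha: {alpha:.2f}")                  # 0 would be Poisson-like
```

An alpha clearly above zero is what the LR test on alpha checks formally, and it is the reason to prefer negative binomial over Poisson regression here.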

Teenage Discussion Networks

For the Teens and Mobile Technology study, interviewers spoke to 800 teens ages 12-17, asking a range of questions about technology use.  Included in the questionnaire was the question about discussion networks.  In this question, interviewers asked how many people the individual “feel[s] very close to and with whom you are frequently in contact to discuss various things, including your personal issues and feelings.”  The mean response was a little over 5, with a standard deviation of three.  The density plot is included at right.

First, I explored if demographic and socio-economic factors were associated with the size of teenage discussion networks.  Pew collected data on age, gender, family income, parent’s ethnicity, and total number of kids in the household.  These variables could impact the teen’s ability to form discussion networks for a variety of reasons, so it is worthwhile to retain them as control variables.  I found only one variable significant: being of “black, non-hispanic” parentage.  Compared to teens of “white, non-hispanic” parentage, teens of “black, non-hispanic” parentage have a lower incidence rate of reported discussants (IRR=.8041, p=0.011, Model1.pdf).

Next, I wanted to explore the effects of internet use, social network site use, and mobile phone ownership on the size of teenage discussion networks, controlling for demographic factors.  I found that use of the internet, use of social network sites, and ownership of a mobile phone were all positively and significantly (p<.05) associated with the size of the support network (Model2.pdf).  Importantly, ethnicity remained negative and significant, indicating that teens of “black, non-hispanic” parentage do not make up the gap in the support network size due to technology use.

Of course, most teens do not use technology in isolation.  In fact, Pew’s report indicates that most teens use the internet, SNS, and mobile phones in combination.  Therefore, we should explore the effects of these technologies simultaneously to identify the robust contribution to the size of the discussion network.  When we evaluate these simultaneously controlling for demographic factors, we find that internet use and mobile phone use no longer significantly contribute to the size of a teen’s discussion network.  Use of social network sites, however, remains significant (IRR=1.142, p=.028, Model3.pdf).  It appears that teens who use social network sites are more likely to report larger discussion networks.  This is pretty impressive!

Before we get too excited about the promise of social network sites, let’s consider what we know about them.  For most teens, the social network site represents an online space for interacting with offline friends.  If use of the social network site really adds people to the core discussion network, where are they coming from?  Couldn’t an alternate explanation be that individuals who are more social offline are also more social online?  Pew also asked about frequency of offline socialization, and we can enter this measure as a control in our model.  When we do, we see that none of the technologies remain significant, and offline interaction emerges as a significant predictor (IRR=1.074, p=.010, Model3.pdf).  It turns out that teens that are more active with their friends have larger discussion networks, controlling for demographics and social technology use.
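The incidence rate ratios (IRRs) quoted in this section are easier to read as percent changes in the expected number of discussants, holding the other variables constant. Here is a small sketch using the three IRRs reported above. The helper name is made up, and describing the offline-socializing effect as "per one-unit increase" is an assumption, since the post does not say how that variable is coded.

```python
def percent_change(irr):
    """Percent change in the expected count implied by an incidence rate ratio."""
    return (irr - 1) * 100

reported = {
    "black, non-hispanic parentage (Model 1)": 0.8041,
    "social network site use (Model 3)": 1.142,
    "offline socializing, per one-unit increase (final model)": 1.074,
}

for label, irr in reported.items():
    print(f"{label}: IRR={irr} -> {percent_change(irr):+.1f}% expected discussants")
```

An IRR of 1 means no difference, so the further a ratio sits from 1, the larger the associated gap in discussion network size.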

Some Discussion

It should be noted that Pew’s report did contain a number of “technology intensity” or “differential technology use” variables (e.g. how often do you…).  I included these in my exploratory analysis and none were significant, so I focused on use effects.  In the study of “social impact of technology”, there is a long history of attribution error regarding the “effects of technology.”  My goals in this analysis were twofold: first, to explore a recurring question that is addressable with Pew’s data (is technology use robustly associated with larger discussion networks), and second, to explore some alternate hypotheses to the findings (a common theme in “discussion networks” research).

What I see in this data is a manifestation of the ubiquity of technology in teenage life.  If our technology is used to connect to those around us, the effects of the technology will be constrained within the social setting.  What we may be seeing here is that teens who are already outgoing are more likely to use social technologies.  That is, the use of the network is built into the everyday processes that would be associated with the growth of a discussion/support network.  This finding is mundane, but it raises the question – how might we leverage technologies to enable less outgoing teenagers to expand their support networks?

Finally, please treat this post as a rough draft, a work in progress.  The fact I feel it is acceptable to write a blog post like this is evidence I’ve been in grad school too long, so it is time to get back to my dissertation.

Ugh, Citations on a blog!

  • Bearman, P. and Parigi, P.  (2004).  Cloning Headless Frogs and Other Important Matters: Conversation Topics and Network Structure. Social Forces, 83(2), 535–557.
  • Ellison, N. B., Steinfield, C., and Lampe, C.  (2007).  The Benefits of Facebook “Friends:” Social Capital and College Students’ Use of Online Social Network Sites.  Journal of Computer-Mediated Communication, 12(4).
  • Fischer, C. S.  (2009).  The 2004 GSS Finding of Shrunken Social Networks: An Artifact?  American Sociological Review, 74(4), 657–669.
  • Hampton, K., Sessions, L., Her, E. J., and Rainie, L.  (November 4, 2009).  Social Isolation and New Technology.  Pew Internet and American Life Project.  Retrieved November 4, 2009 from http://www.pewinternet.org/Reports/2009/18–Social-Isolation-and-New-Technology.aspx.
  • Marsden, P. V.  (1987).  Core Discussion Networks of Americans.  American Sociological Review, 52(1), 122-131.
  • McPherson, M., Smith-Lovin, L., and Brashears, M.  (2006).  Social Isolation in America: Changes in Core Discussion Networks over Two Decades.  American Sociological Review, 71(3), 353-375.