Identity and the Web: Information Science Must Pay Attention

I’ve come across a number of articles in prominent publications about the problems individuals encounter when potential employers Google them. In a BusinessWeek article entitled “You Are What You Post: Bosses are using Google to peer into places job interviews can’t take them,” the author highlights a number of horror stories regarding job seekers and their “Google resume.” The New York Daily News ran a similar story, “What a tangled Web we weave: Being Googled can jeopardize your job search.” In both stories, we see the huge information problem individuals face in their relationship with search engines.

A few months ago, Terrell and I sat down at a whiteboard and began diagramming this problem. The result of our work was claimID. An individual has a relationship with the information about them online; when they Google their name, they are the only possible arbiter of truth as to what is really about them. This is further complicated by the fact that the individual often has no control over what is displayed about them, or in what order the information is presented. At the same time, full-text search only returns matches; things that are about us, but don’t directly mention our names, fail to exist for the people searching for us.

These problems are distinct, but they interact, and their effects compound. I’ll explore each of them a little; primarily, they shake out to be problems of disambiguation, aboutness, authorship and presentation.

  1. Disambiguation. We share names. In technology, this is referred to as a namespace problem. When two things share a name, a machine cannot tell them apart from the name alone. The developers of the internet solved the namespace problem by fiat; they simply made a rule that nothing can share a name. As humans, we can’t really retrofit a namespace solution, so this problem is essentially unsolvable. As long as there is free text on the net referencing a name, there will always be a question of who that name references. There are “solutions” to the namespace problem; federation of identity is the logical step. Regardless of how widely these tools are adopted and used, the namespace problem will always persist.
  2. Aboutness. Aboutness is a fancy term for what a thing is about. When we search for someone, we are essentially asking a search engine to show us things “about” the person; unfortunately, these are the types of queries that search engines handle worst. As with asking a search engine to describe how a vintage wine tastes, the true answers to questions of identity are incredibly complex. The search engine simply relies on brute force and returns anything that matches the name. However, think about all the things on the web that are about individuals – the schools they went to, the towns they lived in. This is just a start. Think about more subtle things – articles in a newspaper that refer to projects an individual worked on (but don’t mention the individual’s name), or a flattering blog post by a friend that uses only the individual’s first name. It is very easy to argue that all of these information nodes on the web make up a person’s identity. A search engine figuring this out is obviously beyond the scope of any technology we’d ever be comfortable with, so holistic answers to questions of aboutness are very difficult.
  3. Authorship. Another seemingly simple concept, authorship gets complicated from two perspectives in the context of identity search. The first is the traditional understanding: who wrote the things about an individual online? If your name shows up on a forum posting attached to a handle, how would an outside observer know who wrote it? The namespace problem forces “handles”, and while we might try to use the same handle across the net, chances are we’ve got any number of different logins. The simple fact is that names can show up anyplace, attached to any handle – and for the most part, only the person whose name it is can make that authorship disambiguation. The second is more conceptual: our identity is also made up of what we author. Of course, my blog is about me – even when I’m not ostensibly writing about myself. The case of a newspaper reporter is probably the clearest: when that reporter writes about someone else, they are adding more evidence to their identity as a writer. With handles and famously nonexistent bylines, authorship becomes much more complicated on the internet than it needs to be.
  4. Presentation. The simplest, and perhaps most frustrating, concept. When someone searches for an individual, it is Google that gets to decide how the results are presented. A great article mentioning that person could end up on page 1 or page 10 of the results – it is really a crapshoot. Imagine if we approached our resumes this way. When an employer looks at a job candidate, it seems a fair assumption that the employer Googles them – and it is scary to think how that is weighted against the actual paper resume. When I was going through my Google interviews, I answered many a question regarding stuff found out via, well, Google.
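
To make the first two problems concrete, here is a minimal sketch in Python of what a plain name match does. Everything in it is invented for illustration: the pages, the name “Pat Lee”, and the `about` field, which stands in for the ground truth that only the individual actually knows. It is not how any real search engine works, just the brute-force matching described above.

```python
# Hypothetical mini-corpus. The "about" field is the ground truth that only
# the individual could supply; a search engine never sees it.
PAGES = [
    {"url": "example.org/a", "about": "pat-designer", "text": "Pat Lee wins regional design award"},
    {"url": "example.org/b", "about": "pat-chemist", "text": "Pat Lee publishes chemistry paper"},
    {"url": "example.org/c", "about": "pat-designer", "text": "Library redesign project finishes early"},
    {"url": "example.org/d", "about": "pat-designer", "text": "Had lunch with Pat today, what a talent"},
]


def name_search(query, pages):
    """Return pages whose text contains the query string (case-insensitive)."""
    needle = query.lower()
    return [page for page in pages if needle in page["text"].lower()]


hits = name_search("Pat Lee", PAGES)
hit_urls = [page["url"] for page in hits]

# Disambiguation: the hits mix two different people who share a name.
people_in_hits = {page["about"] for page in hits}

# Aboutness: pages about the designer that the name match never returns.
missed = [
    page["url"]
    for page in PAGES
    if page["about"] == "pat-designer" and page not in hits
]

print("returned:", hit_urls)
print("people mixed together:", sorted(people_in_hits))
print("about the designer but missed:", missed)
```

The match returns pages about two different people and skips the project article and the first-name-only blog post, which is exactly the gap that only the individual can close by saying what is and isn’t about them.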

In designing claimID, we took the stance that only the individual can speak for what is about them online; therefore, the process of identity sharing must be taken on by the individual. In a sense, I really think that claimID is a nice solution; our tool is extremely simple to use, built for the “rest of us” – the average individual who wishes to speak for their identity online. We explicitly didn’t build a tool to wow A-list bloggers; the focus in claimID has always been service, with as little navel-gazing as possible. The audience for a tool like claimID is so huge, it simply must be built to do one thing, and to do it very well.

However, as I read the articles about this “growing problem” of employers Googling potential employees, the prevailing approach seems to be throwing one’s hands up in the air, and it strikes me that we may be facing one of the biggest problems of the net’s future. Working on a college campus, I see tens of thousands of students putting their identity online every day – be it on MySpace, in forum postings, on email listservs or in other services. This is our future – a good deal of us will live publicly, and we’re simply going to need tools to cope with what is about us online – especially the things we can’t control. I imagine how stressful it is for a person who has something embarrassing about them online to take on a job search – search engine results could literally affect the course of their lives, their mental and emotional state.

If there’s anything that surprised me about the claimID beta, it is the scope of the international coverage we’re getting in the blogs. We’ve let a fair number of users in (think very low thousands), and Technorati shows 170 posts about claimID, many in foreign languages. This strongly suggests that identity problems are faced across the whole internet. People are searching for a solution, and while claimID goes a long way, this problem is bigger than our solution. It is imperative that information science takes note, and that the greater community starts working to assist people. We have the choice of either throwing our hands up or doing something about it. Indeed, there are problems we may never be able to solve (namespace, etc.), but creative thinking and discussion will lead to solutions that can benefit us all.

It feels a little weird to be essentially inviting competition to join our space, but the simple fact is that this is a real problem that only gets more important over time. It is also inspiring because the problem is such a classic information science problem. Those of us who’ve joined IS departments did so because we want to develop solutions for the issues that emerge in the ongoing relationship between humans and computers. That we could join together and start solving this problem is deeply inspiring, but even more so, absolutely essential.

One comment

  1. While it’s true that disambiguation of names in unstructured text is a tricky issue, there are methods (which require more than just a name) that can get us closer.

    These methods are necessary, and hopefully they will see more general adoption, as free text is everywhere and will be for some time to come.

    That said, I love the concept of ClaimID! I would also be interested in seeing if by incorporating some unstructured text learning techniques it could be pushed further.

    Keep up the great work!

    Christopher
