Posts Tagged: tinyurl


6
Jun 08

Searching Twitter Better

Update: See Backtweets.com.

My experience watching the percolation of Freedom throughout the web was instructive – a chunk of viral traffic is moving from blogs to Twitter. If you’re not monitoring your blog/company/brand in Twitter, you probably should.

There are two major Twitter search services, Tweetscan and Summize. I’ve adopted Tweetscan – it is blazing fast and seems to have a larger corpus (i.e. more data) than Summize. Both offer RSS, so you can easily set up searches and stick them in your newsreader.

There is a major drawback to these services when it comes to searching for links. As URL shortening is very common in Twitter, and there are hundreds of URL shortening services, it is often impossible to search exhaustively for links to your domain. Unless you search for all shortened versions of your page (i.e. your link shortened by TinyUrl, Snurl, MooUrl, and so on..), you’re not going to find all of the conversation.

This problem is solvable. For a few minutes I though about building a bookmark that would compute shortened URL’s and search all of them in Tweetscan/Summize. However, this approach is horribly inefficient and I didn’t want to submit my el cheapo hosting service to the load if it went viral. Instead, the Twitter search services need to post-process URL’s they find and build an index of the canonical URL’s. This would allow me to search a URL and find all of the URL’s that eventually point to my domain, regardless of the link-shortened context.

The upside of a service building such an index would be I’d be able to find all links into my blog in one search, rather than individually searching each permalink. If Tweetscan has a post-processed index of all links pointing to permalinks inside of Unit Structures, I’d be able to find all of these links by searching on my domain.

In the meantime, has anyone run into viable stopgap solutions for this problem?


26
Jan 05

Random TinyURL Browser (Updated)

Today, I mistyped a TinyURL that was sent to me by a friend, and sure enough a random, interesting page popped up (something about hiking New Zealand, which sounded really good considering the temperature outside). I decided to write a little javascript random tinyURL browser, which you can try out below:

Click here to visit a random TinyURL page (Note – pops up a new browser. Keep clicking this link to refresh that page.)

This clicking began to fascinate me, as I realized I was in essence snooping on information transactions conducted by hundreds or thousands (depending on how many times you click, I guess) nameless people around the globe. I discovered maps, company documents, pictures, links to forums, technical documents, blog postings, ebay pages, and a whole host of other strange content. It was really quite interesting. This got me thinking about how TinyURL works, which then led me to start questioning why my “random” browser was actually giving me so many good results.

So, this is a simplistic implementation of the TinyURL hashing algorithm. According to my initial calculations (based on their published stats of 4.2 M established TinyURL’s), only 7.2 percent of clicks (Total dimension space is &Sigma k=0…5 (36^k)) will result in valid (non-TinyURL notfound) pages. I’ve been absent-mindedly clicking the link and I’m definitely not seeing 9.3 out of every 10 clicks returning the TinyURL notfound page.

The question is, why does my random implementation perform well against TinyURL’s data set? And if borne out over longer intervals, how well would it perform? I’ve managed to dig up an answer, which has two parts. The first part of the answer deals with how TinyURL assigns names for compressed URL’s, and the second involves how my algorithm assigns random values.

As we have established, there are 62,193,780 possible values for TinyURL’s. TinyURL’s are generated by a Base 36 hash (36 indicating the number of characters a-z and 0-9, the array of possible values out of which a TinyURL can be constructed), autoincremented by MySQL with an initial value count of zero. What is a hash function? Imagine five buckets in a row. Now, starting with bucket one, we’ll drop one of 36 characters into the bucket, and check to make sure that for all other bucket sets out there, there isn’t a bucket that also has that particular character in the first bucket. If there is, we’d insert a character in the second bucket and run that comparison again, and so on and so forth . A hash is extremely efficient computationally, because instead of doing a tree comparison against all 62 million values, we’re incrementally comparing a set of 5 characters that has only 36 possible values.

Imagine the scenario where you’d filled 4 buckets with characters, but you’d still found matches and had to extend to the fifth bucket. For each of the 36 values you might drop into the bucket, there exists a probability of 1 in 62 million that you’d have a match and have to start the computation over again. All things considered, those are good odds!

As I’ve attempted to demonstrate, the hashing algorithm naturally prefers to use less characters to represent the URL value. By that, I mean, if the algorithm decides that a tinyurl extention of “7ds” is not already in existence, it will use this value and not append any more characters. This means that for our two-dimensional variable space, we see a heavy front-loading effect for the starting characters of the hash. What I mean by this is that for shorter URL’s, there is a greater chance that a TinyURL has been assigned to that value.

So, how does my algorithm work? It first gets a random number 1-5, and assigns a random character for each value of the random number. For example, if 2 is generated, we might get back a result of “x2″, “ce”, “0s” and so on. Since we’ve established that the hash tends to populate smaller values first, an interesting situation arises. Here’s the power of exponents:

36^1 = 36 (eg, http://tinyurl.com/h)
36^2 = 1296 (eg, http://tinyurl.com/h3)
36^3 = 46656 (eg, http://tinyurl.com/h3j)

So, for all possible hash combinations, I’d say its fairly likely that our greedy algorithm has assigned 99.999 percent of combinations for all 3 character tinyURL’s. Expanding this to include a fourth character

36^4 = 1,679,620 (eg, http://tinyurl.com/h3jz)

Even that dimension space would be a good deal utilized, possibly 90 percent.

Seeing as my algorithm arbitrarily assigns a total string length, having something appear in our strong confidence space (4 char TinyURL or less) is 80 percent likely. In fact, getting a 5 character TinyURL that works is the real sort of anomaly in this system. I don’t have the numbers I’d need to bear out a more precise calculation, but I’d be willing to wager that 8 out of every ten clicks on that randomizer will give you a TinyURL that works (or worked at one time).

Update
Now that I basically understand how TinyURL works, it was fairly easy to implement a random TinyURL generator that displays zero preference, and uses a base 36 hash. I don’t think I’ve got TinyURL’s algorithm right, though, because I’m getting lots of 404 erors.

Almost too random TinyURL page (Note – pops up a new browser. Keep clicking this link to refresh the page.)