I just got a very nice email from Slideshare that says

"Transactional Data Mining" is being tweeted more than any other document on SlideShare right now. So we've put it on the homepage of SlideShare.net (in the "Hot on Twitter" section).

That sounds sooo exciting. People magazine, here I come!

Of course, when you look into the matter, I really don't think that I am going to have to worry about dodging paparazzi any time soon. What they mean by "tweeted more" seems to be 3 times in a day. Sounds like they should have read my paper about surprise and coincidence!

In fact, for applications like popularity ranking it is often really important to have a robust idea about what is hot and what is not. There are two very common problems to solve, one is to find out which items are rapidly rising and the other is to put a reasonable rank onto things.

The problem of what is rising is surprisingly well well solved by ranking items by the log of the ratio of the number of hits/views/tweets in a recent period of time to the number of hits in a somewhat older (and typically longer) period of time. Well, not quite the ratio. In practice, it is better to offset both numerator and denominator by a constants which are well chosen by looking at the results, but which are actually well-founded in terms of MAP estimators for rates. The reason that this works so well is that popularities are typically distributed in a roughly Zipfian way so taking log of the rate is linearly related to the log of the rank.

A large change in the log-rank of an item is a great indicator of dramatically rising popularity, but rank itself is a pretty fuzzy measurement with small counts (the guys at SlideShare need to hear this). Log of the rate, however, is linearly related to the log of rank, so change in log-rank is proportional to change in log-rate. Moreover, the offset trick when computing the rate ratio largely deals with the small count problems that you have when something goes from no hits to one hit in a time period.

The second problem is computing what the rank of an item really is given bursty nasty data like hit counts. The problem is that you only have a nice tight estimate of the hit rate for the most popular items. Since you want recent rankings, you want to use a shorter period of time for your estimates so the problem of small counts is even worse. Conceptually, a very principled way to do deal with the problem is to embrace the uncertainty and sample the rates associated with each item given the data you have and rank the items. Do this a bunch of times and you have the posterior distribution of the ranks for each item. Moreover, you also have a lookup table that tells you what any particular observed rate means relative to what rank the item might have.

This sampling can be done in a few seconds in R. I think it is cooler to add dithering do the ranked list of items so that each time you look at the list, you get another sample from this distribution. This means that items that might plausibly be in the top 100 or 1000 actually get to be there some of the time and items that are definitely in the top few places are always where they should be. It also adds dynamism to your top-n list which is always good on the web.