Sunday, February 22, 2015

Using minhash for set clustering

In my last post, I explained that I was trying to push web sessions into similar clusters so that I could figure out which user experiences required more customer service help.

I ran up against a computational barrier because, even though I could easily compare the similarity of two sessions, comparing every session to every other session was a quadratically greater task.

I learned about an approach called minhash, which is a shortcut to clustering. It creates clusters of sets that are probably similar, trading accuracy for processing time.

After weeks of trying to get my mind around the approach, I finally settled on an analogy that made the most sense to me.

Imagine a dartboard in the shape of a Venn diagram. The set members that belong to both sets are inside the football-shaped intersection area.

You throw several darts at the dartboard, and each dart is guaranteed to hit a single set member, whether inside or outside the intersection. But which member you hit is totally random. Pretend you're bad at darts.

As it turns out, mathematically, the probability of a dart landing inside the intersection area is exactly the Jaccard similarity score: the dart hits one member of the union at random, so the chance it lands in the intersection is the size of the intersection divided by the size of the union. This turns out to be very useful.

The more darts you throw that all land inside the intersection, the more likely it is that the two sets are nearly identical.

The way the algorithm works is:
  1. Every possible set member in the universe is assigned a random hash value for each dart that is to be thrown.  E.g.  Homepage -> 4 darts -> hash1 = AABB, hash2 = 88EE, hash3 = 22CC, hash4 = 21EC
  2. For each set (session), find the member of the set with the lowest (min) hash value (for the current iteration) and choose that member to represent the set (this is the dart hitting the board).  Do this four times with the four different hash values.
  3. For each set, the min hashes from the four darts, strung together, become the signature of the set e.g. session 1 signature = AABB-88EE-BBCC-87EA
All of the sets (sessions) that have the same signature are part of the same cluster.
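
Here's a minimal sketch of the signature step in Python, assuming each session is just a set of page names. The salted MD5 hashes play the role of the darts, and the function names are my own, not from any particular library:

  import hashlib

  def member_hash(member, seed):
      # One dart: a deterministic hash of a set member for a given seed
      return hashlib.md5(f"{seed}:{member}".encode()).hexdigest()[:4]

  def minhash_signature(session, num_darts=4):
      # For each dart, keep the lowest hash across the set's members,
      # then string the min hashes together into one signature
      return "-".join(
          min(member_hash(page, seed) for page in session)
          for seed in range(num_darts)
      )

  session1 = {"Homepage", "Post Login", "User Dashboard", "Wire Money",
              "Confirm Details", "Success Message", "Logout"}
  print(minhash_signature(session1))  # e.g. something like '0a3f-88ee-22cc-21ec'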

This is insanely fast because the signatures can all be assigned in one pass through the sets. There is no need to actually calculate the true Jaccard similarity for the sets.
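
To make the one-pass point concrete, here is a sketch that builds on the hypothetical minhash_signature above: bucket every session by its signature, and the buckets are your clusters.

  from collections import defaultdict

  def cluster_by_signature(sessions, num_darts=4):
      # One pass: sessions whose signatures collide land in the same cluster
      clusters = defaultdict(list)
      for session_id, pages in sessions.items():
          clusters[minhash_signature(pages, num_darts)].append(session_id)
      return clusters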

Even though I had access to an Aster system with a minhash function, I didn't actually need it. Doing this in plain old SQL is super fast.

But when I got my first clusters, it still took me a bit to get my mind around what I really had. I knew I had some false positives and false negatives in these clusters so I wasn't sure how much I should trust them.

I knew I must have clusters that were almost duplicates - that had different hash signatures - but where the members were almost the same.

The way I explained this to myself was with another analogy: imagine you have a bunch of people in a gymnasium and you want to arrange them into clusters. You decide that you're going to arrange them by marital status and zodiac sign.

So you put all the married Virgos into one cluster. But only some of the Virgos are in this cluster. There are some other Virgos in the single and divorced clusters also.

Does that mean your married Virgo cluster is incomplete? No. It represents a cluster of some people in your group who happen to share the exact same attributes.

Even though the approach is probabilistic, sets that are exactly the same have a 100% chance of being in the same cluster.

Once I got to the point of analyzing the results, I understood the importance of sample size (big data!) and cluster size.

These are the levers you have to create cluster data that is actionable.

If your clusters are too small, you can't draw many conclusions from them because your results may not be statistically significant.

If your clusters are too large, they may be too vague and not reflect the differences that still may exist between their members.

If you want to make your clusters bigger:
  • Use a larger data set
  • Use fewer grams in your shingles
  • Don't use shingles at all - make set sequence irrelevant
  • Use fewer hashes in the signature - throw fewer darts (see the sketch below)
It is true that some of these options make the clusters less precise, but you have to go to war with the data you have. If your data isn't big enough, you may have to draw conclusions from less precision.

That's life.

At least you didn't have to buy a super computer.
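
To illustrate the last lever with the hypothetical cluster_by_signature sketch from above: fewer darts means shorter signatures, shorter signatures mean more collisions, and more collisions mean bigger (but looser) clusters.

  # sessions is assumed to be a dict of {session_id: set_of_pages}
  loose = cluster_by_signature(sessions, num_darts=2)   # bigger, vaguer clusters
  tight = cluster_by_signature(sessions, num_darts=8)   # smaller, tighter clusters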

Once I got my clusters to a reasonable size, I was able to calculate the frequency at which the underlying sessions required help from customer service.

I sorted the clusters by those that required the most customer hand-holding (the red dots in the image) and brought those to our design team as feedback on the usability of these features.

Data Science!

Clustering similar web session paths

I have managed to live for quite a while without having a strong need for data clustering. I always thought of it as something for marketers to push folks with similar demographics together, or maybe to push movies into piles with similar plot characteristics like Netflix does.

But now that I am working on a big banking website, I have some large datasets to work with and some challenging questions have arisen.

For example, I wanted to know if there were certain types of experiences that website users have that are more likely to lead to customer service calls. If we can identify those experiences clearly, we can focus on them for improvements.

A website session is essentially an ordered set of requests for particular web pages. For example, consider two web sessions.

Session 1:
  1. Homepage
  2. Post Login
  3. User Dashboard
  4. Wire Money
  5. Confirm Details
  6. Success Message
  7. Logout

Session 2:
  1. Homepage
  2. Post Login
  3. User Dashboard
  4. Read Inbox Messages
  5. Wire Money
  6. Confirm Details
  7. Success Message

These two sessions are similar.  Both users logged in and sent a wire transfer successfully.  But their exact sequences weren't precisely the same. The second user stopped to read inbox messages first and didn't bother to log out.

For my purposes, I would like to put these two sessions in the same category or cluster so that I can measure them for how likely they are to call in for service.

But how do we compare the sets for similarity?

Well, I poked around and found that the most common way to compare sets is to generate a Jaccard similarity score. This is basically the size of the intersection divided by the size of the union. In other words, the number of set members shared between two sets divided by the total number of unique set members.

Using the first letter of each page in the example above to indicate a set member, the two sets are:
  • {H,P,U,W,C,S,L}
  • {H,P,U,R,W,C,S}
The intersection is {H,P,U,W,C,S} (6) and the union is {H,P,U,W,R,C,S,L} (8).  

So the Jaccard similarity = 6/8 = 75%
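
In Python the score is one line of set algebra; this little snippet (the name jaccard is just my own shorthand) reproduces the 75% above:

  def jaccard(a, b):
      # Size of the intersection divided by size of the union
      return len(a & b) / len(a | b)

  s1 = set("HPUWCSL")     # {H,P,U,W,C,S,L}
  s2 = set("HPURWCS")     # {H,P,U,R,W,C,S}
  print(jaccard(s1, s2))  # 0.75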

Note that this score does not take sequence into consideration. If one user reads their inbox message before sending a wire and another user reads their inbox after sending a wire, their similarity score could still be 100%.

Whether this matters or not depends on what you're trying to learn.  

But if sequence does matter, you can employ a method called shingling, where your set members become sequences of members rather than individual members. The w, or the number of grams in your shingles, is something you can vary for precision. But if we take a simple example where w=2, our sets would look like this:
  • {HP, PU, UW, WC, CS, SL}
  • {HP, PU, UR, RW, WC, CS}
Taking the union {HP*, PU*, UW, WC*, CS*, SL, UR, RW} and starring the members that are in the intersection, the Jaccard score = 4/8 = 50%
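
Here's a quick sketch of shingling that reuses the jaccard function from above; sliding a window of w pages over the ordered session turns sequences into set members, and it reproduces the 50% score:

  def shingles(pages, w=2):
      # Slide a window of w pages over the ordered session
      return {"".join(pages[i:i + w]) for i in range(len(pages) - w + 1)}

  seq1 = list("HPUWCSL")  # ordered pages of session 1
  seq2 = list("HPURWCS")  # ordered pages of session 2
  print(jaccard(shingles(seq1), shingles(seq2)))  # 0.5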

So now we have a way to compare two sets that is fairly simple.

But I have millions of web sessions to compare.  To get them into clusters, I would have to compare every session to every other session to get a similarity score for every pair.

If I had one million sessions to compare, that would come out to one million squared divided by two - or half a trillion comparisons.  That number is computationally challenging to say the least.

And one million sessions may be too small a sample set anyway.  Although many of our sessions are very similar, the gold we are mining for is those sessions we don't know a lot about - the infrequent sessions - the ones that occur only once every ten thousand sessions or less.  To get a decent sample size, we may actually need an order of magnitude more data - or more - than a mere million sessions.

I tried to come up with ways to cut down the number of comparisons, to no avail.

Our company has access to Aster, which is a big data system with a number of commonly needed functions for situations like this.  I found a promising function called minhash, which sounded like it might be able to help, but I wanted to understand it inside and out before I used it.

The next article - Using minhash for set clustering - describes how I managed to get over the computational hurdle for this problem.