But now that I am working on a big banking website, I have some large datasets to work with and some challenging questions have arisen.
For example, I wanted to know whether certain types of user experiences on the website are more likely to lead to customer service calls. If we can identify those experiences clearly, we can focus on them for improvements.
Website sessions are essentially an ordered sequence of requests for particular web pages. For example, consider two web sessions.
[Two example sessions, each a list of page visits, appeared here as images in the original post.]
For my purposes, I would like to put these two sessions in the same category or cluster so that I can measure them for how likely they are to call in for service.
But how do we compare the sets for similarity?
Well, I poked around and found that the most common way to compare sets is to generate a Jaccard similarity score. This is basically the intersection divided by the union: the number of members the two sets share divided by the total number of distinct members across both sets.
Using the first letter of each page in the example above to indicate a set member, the two sets are:
- {H,P,U,W,C,S,L}
- {H,P,U,R,W,C,S}
The intersection is {H,P,U,W,C,S} (6) and the union is {H,P,U,W,R,C,S,L} (8).
So the Jaccard similarity = 6/8 = 75%
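The calculation above is simple enough to sketch in a few lines of Python, using sets of single-letter page codes as in the example (a minimal illustration, not production code):

```python
def jaccard(a, b):
    """Jaccard similarity: intersection size divided by union size."""
    return len(a & b) / len(a | b)

session1 = {"H", "P", "U", "W", "C", "S", "L"}
session2 = {"H", "P", "U", "R", "W", "C", "S"}

print(jaccard(session1, session2))  # → 0.75
```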
Note that this score does not take sequence into consideration. If one user reads their inbox message before sending a wire and another user reads their inbox after sending a wire, their similarity score could still be 100%.
Whether this matters or not depends on what you're trying to learn.
But if sequence does matter, you can employ a method called shingling where your set members become sequences of members rather than individual members. The w, or the number of grams in your shingles is something you can vary for precision. But if we take a simple example where w=2, our sets would look like this:
- {HP, PU, UW, WC, CS, SL}
- {HP, PU, UR, RW, WC, CS}
The union of these two sets is {HP, PU, UW, WC, CS, SL, UR, RW} (8) and the intersection is {HP, PU, WC, CS} (4), so the Jaccard similarity = 4/8 = 50%
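Shingling is also a short Python function: slide a window of width w across the ordered page sequence and collect the w-grams into a set (again a minimal sketch using the single-letter page codes from the example):

```python
def shingles(pages, w=2):
    """Break an ordered page sequence into a set of w-grams (shingles)."""
    return {"".join(pages[i:i + w]) for i in range(len(pages) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = shingles(list("HPUWCSL"))  # {'HP', 'PU', 'UW', 'WC', 'CS', 'SL'}
s2 = shingles(list("HPURWCS"))  # {'HP', 'PU', 'UR', 'RW', 'WC', 'CS'}

print(jaccard(s1, s2))  # → 0.5
```

Because the shingles preserve local order, swapping two page visits now changes the set and lowers the score, which is exactly the sensitivity the plain-set comparison lacked.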
So now we have a way to compare two sets that is fairly simple.
But I have millions of web sessions to compare. To get them into clusters, I would have to compare every session to every other session to get a similarity score for every pair.
If I had one million sessions to compare, that would come out to roughly one million squared divided by two - or half a trillion comparisons. That number is computationally challenging to say the least. And one million sessions may be too small of a sample set anyway. Although many of our sessions are very similar, the gold we are mining for are those sessions we don't know a lot about - the infrequent sessions - the ones that occur only every ten thousand sessions or less. To get a decent sample size, we may actually need an order of magnitude or more data than a mere million sessions.
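A quick back-of-the-envelope check confirms the scale of the all-pairs approach (the exact count is n choose 2, i.e. n(n-1)/2):

```python
import math

n = 1_000_000
pairs = math.comb(n, 2)  # n * (n - 1) // 2 unordered pairs
print(f"{pairs:,}")      # 499,999,500,000 - roughly half a trillion
```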
I tried to come up with ways to cut down the number of comparisons, to no avail.
Our company has access to Aster, which is a big data system with a number of commonly needed functions for situations like this. I found a promising function called minhash which sounded like it might be able to help, but I wanted to understand it inside and out before I used it.
The next article, *Using minhash for set clustering*, describes how I managed to get over the computational hurdle for this problem.