I ran up against a computational barrier: even though I could easily compare the similarity of two sessions, comparing every session to every other session is a quadratically greater task. With n sessions there are n(n-1)/2 pairs to compare, and that number becomes unmanageable fast.
I learned about an approach called minhash, which is a shortcut to clustering: it creates clusters of sets that are probably similar, trading accuracy for processing time.
After weeks of trying to get my mind around the approach, I finally settled on an analogy that made the most sense to me.
Imagine a dartboard in the shape of a Venn diagram. The set members that belong to both sets are inside the football-shaped intersection area.
You throw several darts at the dartboard, and each dart is guaranteed to hit a single set member, whether inside or outside the intersection. But the member you hit is totally random (pretend you're bad at darts). As it turns out, mathematically, the probability of each dart landing inside the intersection area is exactly the Jaccard similarity score of the two sets. This turns out to be very useful.
And the more darts you throw that all land inside the intersection, the more likely it is that the sets are exactly the same. If the Jaccard similarity is J, the chance that all k darts land inside the intersection is J^k.
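Here's a quick way to convince yourself of that relationship, using two made-up session sets and a dart simulator:

```python
import random

# Two invented "sessions", each represented as a set of pages.
a = {"home", "search", "cart", "checkout"}
b = {"home", "search", "help"}

jaccard = len(a & b) / len(a | b)  # |intersection| / |union| = 2/5 = 0.4

# Each "dart" picks a uniformly random member of the combined dartboard.
universe = list(a | b)
darts = 100_000
hits = sum(random.choice(universe) in (a & b) for _ in range(darts))
print(jaccard, hits / darts)  # the empirical hit rate converges on the Jaccard score
```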
The way the algorithm works is:
- Every possible set member in the universe is assigned a random hash value for each dart that is to be thrown, e.g. Homepage -> 4 darts -> hash1 = AABB, hash2 = 88EE, hash3 = 22CC, hash4 = 21EC
- For each set (session), find the member of the set with the lowest (min) hash value (for the current iteration) and choose that member to represent the set (this is the dart hitting the board). Do this four times with the four different hash values.
- For each set, the four min hashes together become the signature of the set, e.g. session 1 signature = AABB-88EE-BBCC-87EA
All of the sets (sessions) that have the same signature are part of the same cluster.
This is insanely fast because the signatures can all be assigned in one pass through the sets. There is no need to actually calculate the true Jaccard similarity for the sets.
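Here's a minimal sketch of those steps in Python; the sessions, page names, and hashing scheme are all invented for illustration:

```python
import hashlib
from collections import defaultdict

def dart_hash(member, seed):
    # One "dart": a deterministic pseudo-random value for a set member.
    # Salting with the seed gives each dart its own independent ordering
    # of the universe of members.
    digest = hashlib.md5(f"{seed}:{member}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(members, num_darts=4):
    # For each dart, the member with the lowest hash value represents
    # the set; the tuple of those minimums is the signature.
    return tuple(min(dart_hash(m, seed) for m in members) for seed in range(num_darts))

# Made-up sessions, each a set of pages visited.
sessions = {
    "s1": {"home", "search", "cart"},
    "s2": {"home", "search", "cart"},
    "s3": {"home", "help"},
}

# One pass: compute each signature and bucket identical signatures together.
clusters = defaultdict(list)
for session_id, pages in sessions.items():
    clusters[minhash_signature(pages)].append(session_id)

print(dict(clusters))  # s1 and s2 share a signature, so they land in one cluster
```

Because the final step is just a group-by on the signature, this maps naturally onto plain SQL as well.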
Even though I had access to an Aster system with a minhash function, I didn't actually need it. Doing this in plain old SQL is super fast.
But when I got my first clusters, it still took me a bit to get my mind around what I really had. I knew I had some false positives and false negatives in these clusters, so I wasn't sure how much I should trust them.
I also knew I must have clusters that were almost duplicates of each other - clusters with different hash signatures, but whose members were almost the same.
The way I explained this to myself was with another analogy: imagine you have a bunch of people in a gymnasium and you want to arrange them into clusters. You decide that you're going to arrange them by marital status and zodiac sign.
So you put all the married Virgos into one cluster. But only some of the Virgos are in this cluster. There are some other Virgos in the single and divorced clusters also.
Does that mean your married Virgo cluster is incomplete? No. It represents a cluster of some people in your group who happen to share the exact same attributes.
Even though the approach is probabilistic, sets that are exactly the same always produce exactly the same signature, so they have a 100% chance of being in the same cluster.
Once I got to the point of analyzing the results, I understood the importance of sample size (big data!) and cluster size.
These are the levers you have to create cluster data that is actionable.
If your clusters are too small, you can't draw many conclusions from them because your results may not be statistically significant.
If your clusters are too large, they may be too vague and not reflect the differences that still may exist between their members.
If you want to make your clusters bigger:
- Use a larger data set
- Use fewer grams in your shingles (see the sketch after this list)
- Don't use shingles at all, making set sequence irrelevant
- Use fewer hashes in the signature - throw fewer darts
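To make the shingle levers concrete, here's a hypothetical sketch (the shingles helper and page names are invented). Smaller shingles are shared by more sessions, which raises Jaccard scores and grows the clusters:

```python
def shingles(pages, k=2):
    # Turn an ordered session into a set of k-grams, so the minhash sets
    # capture some of the page ordering.
    return {tuple(pages[i:i + k]) for i in range(len(pages) - k + 1)}

session = ["home", "search", "product", "cart", "checkout"]
print(shingles(session, k=3))  # stricter: rarer, more specific shingles -> smaller clusters
print(shingles(session, k=2))  # looser: more widely shared shingles -> bigger clusters
print(set(session))            # no shingles at all: page sequence is ignored entirely
```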
That's life.
At least you didn't have to buy a supercomputer.
Once I got my clusters to a reasonable size, I was able to calculate the frequency at which the underlying sessions required help from customer service.
I sorted the clusters by those that required the most customer hand holding (the red dots in the image) and brought those to our design team as feedback on the usability of these features.
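As a sketch of that last step, with invented cluster and customer-service data:

```python
# Hypothetical inputs: clusters of session ids (from the minhash step)
# and the set of sessions that contacted customer service.
clusters = {
    "sig-A": ["s1", "s2", "s3", "s4"],
    "sig-B": ["s5", "s6"],
}
needed_help = {"s1", "s3", "s5"}

# Fraction of each cluster's sessions that needed help.
help_rate = {
    sig: sum(s in needed_help for s in ids) / len(ids)
    for sig, ids in clusters.items()
}

# Sort so the clusters needing the most hand holding come first.
for sig, rate in sorted(help_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(sig, f"{rate:.0%}")
```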
Data Science!
[Image: clusters, with red dots marking the sessions that required customer service]