Tuesday, December 15, 2015

How to find web sessions that match marketing user personas

Our user experience group has developed user personas to represent different types of people who use our website.  These personas are fictional characters with fictional names, jobs, and personalities.

They sometimes turn to us in the analytics group for data about how real users who fit these personas use the website.  For example, how frequently do they log in and use certain features?

That would be easy to answer if we had each user labeled in the database with their corresponding persona.  But we don't.  It's their behavior that helps define their persona.

So the question is:  how can we categorize real life users according to these personas using just our web log data?

We have been kicking this one around for a while with little success.

We know that users wear multiple hats sometimes and so their behavior overlaps.  We also have some data with hints about users such as their job title.  But that data has turned out to be unreliable and incomplete.  Titles don't always match real behavior.

I decided to approach the task as a machine learning classification problem.  I would try to build a set of training data and build a model that could predict the proper classification.

After trying and failing to get a list of human-classified examples to train the model, I built a quick scoring system to find clear examples of prototype persona usage. 

Example:
  • Persona A
    • Does more than average of task X (1 point)
    • Does less than average of task Y (2 points)
    • Has title Business Manager (2 points)
  • Persona B
    • Does more than average of task Y (2 points)
    • Does less than average of task Z (1 point)
    • Has title Accountant (2 points)
    • Has title Clerk (1 point)
Then I cherry-picked the users who scored the highest for each persona and used them as training data, which I uploaded to a new Azure ML experiment.
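
For illustration, here is roughly what that scoring heuristic might look like in Python.  This is only a sketch - the task counts, titles, and point values below are hypothetical placeholders, not our real rules.

    # Sketch of the persona-scoring heuristic described above.
    # Feature names, titles, and point values are hypothetical placeholders.
    def score_personas(user):
        """Return a score per persona for one user's aggregated web log stats."""
        scores = {"Persona A": 0, "Persona B": 0}

        if user["task_x_count"] > user["avg_task_x_count"]:
            scores["Persona A"] += 1
        if user["task_y_count"] < user["avg_task_y_count"]:
            scores["Persona A"] += 2
        if user["title"] == "Business Manager":
            scores["Persona A"] += 2

        if user["task_y_count"] > user["avg_task_y_count"]:
            scores["Persona B"] += 2
        if user["task_z_count"] < user["avg_task_z_count"]:
            scores["Persona B"] += 1
        if user["title"] == "Accountant":
            scores["Persona B"] += 2
        elif user["title"] == "Clerk":
            scores["Persona B"] += 1

        return scores

    # Keep only the highest-scoring users for each persona as training examples.
    def pick_training_examples(users, persona, top_n=100):
        ranked = sorted(users, key=lambda u: score_personas(u)[persona], reverse=True)
        return ranked[:top_n]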

Then I had to pick a classification model.

Azure currently provides a number of models.  I am still learning the strengths and weaknesses of each, but here is my understanding of the most common:
  • Multiclass Logistic Regression
    • Best when your features are mainly numeric and have a linear or other simple relationship to the outcome
    • Approaches estimating the right answer using algebraic formulas
    • This type of model can be trained continuously, one new example at a time
  • Multiclass Decision Forest
    • Best when many of your features are categorical rather than numeric
    • Approaches estimating the right answer using a twenty-questions-style scheme
  • Multiclass Neural Network
    • Similar to logistic regression except it allows multiple layers and combinations of formulas
The handy thing about using Azure ML is that it is super simple to test how well these different types of models work through quick trial and error.  Whichever works the best with your training data is the one to use.
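
For readers who prefer code to drag and drop, the same trial-and-error comparison can be sketched outside of Azure ML with a library like scikit-learn.  The file name and the "persona" label column below are hypothetical - the point is simply to train each model type on the same data and keep whichever scores best.

    # Compare several multiclass models on the same training data.
    # "persona_training_examples.csv" and the "persona" column are placeholders
    # standing in for the cherry-picked training set described above.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    training = pd.read_csv("persona_training_examples.csv")
    X = training.drop(columns=["persona"])
    y = training["persona"]

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision forest": RandomForestClassifier(n_estimators=100),
        "neural network": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000),
    }

    for name, model in models.items():
        accuracy = cross_val_score(model, X, y, cv=5).mean()
        print("{}: {:.2%} cross-validated accuracy".format(name, accuracy))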

Although some models do work noticeably better than others in certain cases, the training data has the greatest impact on the quality of the predictions.

If your training data doesn't contain any real patterns that can be used to make good predictions, then good predictions simply cannot be made.





Monday, August 3, 2015

What is machine learning?

I came to the machine learning table kind of late in the game.

I have been working closely with data for many years - using it to operate business systems and report results to executives.

In the past when I heard the term machine learning, it conjured up images of university labs studying artificial intelligence. For a while, that was probably a pretty accurate image. But bits and pieces of the work done over the decades have resulted in a growing set of tools that are being used today in a number of everyday applications.

But the term machine learning doesn't really mean artificial intelligence per se.

Well then what is it?

Machine learning is basically just a catchy brand name for a variety of sophisticated statistical models that, given some input data, can spit out an educated guess at a pre-defined question.

At a high level, they do one of three things:

  • Classification - Guessing a Category
    • Given some input data like text, images, or descriptive features, the system will guess what category the input belongs in
    • Common examples are spam detection and optical character recognition
  • Regression - Guessing a Number
    • Given some input data, the system will return a useful number
    • The useful number could be the price a house will likely sell for, the right price to sell a stock, or a guess at the rating a person will give a particular movie
  • Clustering - Guessing Similarity
    • Given some input data containing a variety of features, the system can push all of the data into a pre-defined number of piles based on what seems most similar
    • These piles don't have names because the system doesn't necessarily have any preconceptions about categories
Most machine learning algorithms fall into one of those categories.
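
To make those three categories concrete, here is a toy sketch using scikit-learn.  The tiny datasets are made up purely for illustration.

    # Toy illustration of the three kinds of guesses, using scikit-learn.
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.cluster import KMeans

    # Classification: guess a category (spam vs. not spam) from simple features.
    emails = [[0, 1], [8, 0], [1, 1], [9, 1]]    # made-up [link count, known sender]
    labels = ["ham", "spam", "ham", "spam"]
    classifier = LogisticRegression().fit(emails, labels)
    print(classifier.predict([[7, 0]]))          # likely ['spam']

    # Regression: guess a number (sale price) from square footage.
    houses = [[1000], [1500], [2000]]
    prices = [100000, 150000, 200000]
    regressor = LinearRegression().fit(houses, prices)
    print(regressor.predict([[1750]]))           # about 175000

    # Clustering: push unlabeled data into a pre-defined number of piles.
    sessions = [[1, 0], [1, 1], [9, 8], [10, 9]]
    print(KMeans(n_clusters=2, n_init=10).fit_predict(sessions))  # e.g. [0 0 1 1]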

How do the algorithms know how to make their guesses?

Classification and regression problems require training data. That is, as many examples as possible with sample inputs and correct answers. This is also referred to as supervised learning.

Generally speaking, the more training data you have, the better your guesses will be. That's why the term big data has taken off. In machine learning, more is better.

Clustering is an example of unsupervised learning. Training data is great if you have it. But if the right answers aren't known at the beginning of the exercise, clustering is a good way to mine your input data for similarities. From there, you can examine the output to see if the patterns make sense and potentially create training data from there.

After learning some of the basics, I began my search for tools that might help me explore these approaches.  

In the past, statistical tools like MATLAB, R, and Python have been the tools of choice. But these are essentially programming environments with libraries of individual statistical functions: a developer has to string together code to examine data and test models.

I recently came upon Microsoft's entry into the Machine Learning arena. Their Azure Machine Learning Studio is a good place to start for beginners in this area. Many of the most popular algorithms are made available for free with a drag and drop interface.

I'll be evaluating each of its features and explaining what makes them useful here on this blog.

Stay tuned!




Sunday, February 22, 2015

Using minhash for set clustering

In my last post, I explained that I was trying to push web sessions into similar clusters so that I could figure out which user experiences required more customer service help.

I ran up against a computational barrier because, even though I could easily compare the similarity of two sessions, comparing every session to every other session was a quadratically greater task.

I learned about an approach called minhash which is a shortcut to clustering. It creates clusters of sets that are probably similar.  It trades accuracy for processing time.

After weeks of trying to get my mind around the approach, I finally settled on an analogy that made the most sense to me.

Imagine a dartboard in the shape of a Venn diagram. The set members that belong to both sets are inside the football-shaped intersection area.

You throw several darts at the dartboard and you are guaranteed to hit a single set member, whether inside or outside the intersection. But the member you hit is totally random. Pretend you're bad at darts.

As it turns out, mathematically, the probability of each dart landing inside the intersection area is exactly the same as the Jaccard similarity score. This turns out to be very useful.

The more darts you throw that all land inside the intersection, the more likely it is that the two sets are nearly identical.

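If you don't trust the math, a quick simulation makes the point.  This sketch uses the two example sessions from my last post and counts how often a random pick from the union (a dart) lands inside the intersection.

    # Simulate the dart analogy: a dart is a uniformly random pick from the union.
    import random

    a = {"H", "P", "U", "W", "C", "S", "L"}
    b = {"H", "P", "U", "R", "W", "C", "S"}

    union = list(a | b)
    intersection = a & b
    jaccard = float(len(intersection)) / len(union)

    throws = 100000
    hits = sum(1 for _ in range(throws) if random.choice(union) in intersection)

    print("Jaccard similarity:     {:.3f}".format(jaccard))              # 6/8 = 0.750
    print("Darts in intersection:  {:.3f}".format(float(hits) / throws)) # about 0.750
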
The way the algorithm works is:
  1. Every possible set member in the universe is assigned a random hash value for each dart that is to be thrown.  E.g.  Homepage -> 4 darts -> hash1 = AABB, hash2 = 88EE, hash3 = 22CC, hash4 = 21EC
  2. For each set (session), find the member of the set with the lowest (min) hash value (for the current iteration) and choose that member to represent the set (this is the dart hitting the board).  Do this four times with the four different hash values.
  3. For each set, the four min hash values together become the signature of the set, e.g. session 1 signature = AABB-88EE-BBCC-87EA
All of the sets (sessions) that have the same signature are part of the same cluster.

This is insanely fast because the signatures can all be assigned in one pass through the sets. There is no need to actually calculate the true Jaccard similarity for the sets.
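
If you want to see the mechanics in code, here is a rough Python sketch of the same idea.  The four hash functions are simulated by salting Python's built-in hash, which is good enough for illustration but not what I used in practice.

    # Rough sketch of minhash signatures: one hash function per dart.
    from collections import defaultdict

    NUM_DARTS = 4

    def member_hash(member, dart):
        """Deterministic pseudo-random hash of a set member for one dart."""
        return hash((dart, member)) & 0xFFFF   # short values, like AABB in the example

    def signature(pages):
        """For each dart, keep the minimum hash value over the set's members."""
        return tuple(min(member_hash(page, dart) for page in pages)
                     for dart in range(NUM_DARTS))

    sessions = {
        "session 1": {"Homepage", "Post Login", "User Dashboard", "Wire Money",
                      "Confirm Details", "Success Message", "Logout"},
        "session 2": {"Homepage", "Post Login", "User Dashboard", "Read Inbox Messages",
                      "Wire Money", "Confirm Details", "Success Message"},
    }

    # One pass through the sets: sessions sharing a signature form a cluster.
    clusters = defaultdict(list)
    for name, pages in sessions.items():
        clusters[signature(pages)].append(name)

    for sig, members in clusters.items():
        print(sig, members)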

Even though I had access to an Aster system with a minhash function, I didn't actually need it. Doing this in plain old SQL is super fast.

But when I got my first clusters, it still took me a bit to get my mind around what I really had. I knew I had some false positives and false negatives in these clusters so I wasn't sure how much I should trust them.

I knew I must have clusters that were almost duplicates - that had different hash signatures - but where the members were almost the same.

The way I explained this to myself was with another analogy: Imagine you have a bunch of people in a gymnasium and you want to arrange them into clusters. You decide that you're going to arrange them by marital status and zodiac sign.

So you put all the married Virgos into one cluster. But only some of the Virgos are in this cluster. There are some other Virgos in the single and divorced clusters also.

Does that mean your married Virgo cluster is incomplete? No. It represents a cluster of some people in your group that happen to share the exact same attributes.

Even though the approach is probabilistic, sets that are exactly the same have a 100% chance of being in the same cluster.

Once I got to the point of analyzing the results, I then understood the importance of sample size (big data!) and cluster size.

These are the levers you have to create cluster data that is actionable.

If your clusters are too small, you can't draw many conclusions from them because your results may not be statistically significant.

If your clusters are too large, they may be too vague and not reflect the differences that still may exist between their members.

If you want to make your clusters bigger:
  • Use a larger data set
  • Use fewer grams in your shingles
  • Don't use shingles at all - make set sequence irrelevant
  • Use fewer hashes in the signature - throw fewer darts
It is true that some of these options have the effect of making the clusters less precise, but you have to go to war with the data you have. If your data isn't big enough, you may have to draw conclusions with less precision.

That's life.

At least you didn't have to buy a super computer.

Once I got my clusters to a reasonable size, I was able to calculate the frequency at which the underlying sessions required help from customer service.

I sorted the clusters by those that required the most customer hand holding (the red dots in the image) and brought those to our design team as feedback on the usability of these features.

Data Science!

Clustering similar web session paths

I have managed to live for quite a while without having a strong need for data clustering. I always thought of it as something marketers use to push together folks with similar demographics, or maybe something Netflix uses to push movies into piles with similar plot characteristics.

But now that I am working on a big banking website, I have some large datasets to work with and some challenging questions have arisen.

For example, I wanted to know if there were certain types of experiences that website users have that are more likely to lead to customer service calls. If we can identify those experiences clearly, we can focus on them for improvements.

A website session is essentially an ordered set of requests for particular web pages. For example, consider two web sessions:

Session 1:
  1. Homepage
  2. Post Login
  3. User Dashboard
  4. Wire Money
  5. Confirm Details
  6. Success Message
  7. Logout

Session 2:
  1. Homepage
  2. Post Login
  3. User Dashboard
  4. Read Inbox Messages
  5. Wire Money
  6. Confirm Details
  7. Success Message

Both of these sessions are similar.  Both users logged in and sent a wire transfer successfully.  But their exact sequence wasn't precisely the same. The second user stopped to read inbox messages first and didn't bother to log out.

For my purposes, I would like to put these two sessions in the same category or cluster so that I can measure them for how likely they are to call in for service.

But how do we compare the sets for similarity?

Well I poked around and found that the most common way to compare sets is to generate a Jaccard similarity score. This is basically the intersection divided by the union. In other words, the number of set members that are the same between two sets divided by the total number of unique set members.

Using the first letter of each page in the example above to indicate a set member, the two sets are:
  • {H,P,U,W,C,S,L}
  • {H,P,U,R,W,C,S}
The intersection is {H,P,U,W,C,S} (6) and the union is {H,P,U,W,R,C,S,L} (8).  

So the Jaccard similarity = 6/8 = 75%
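
In code, the calculation is a one-liner.  Here is a minimal sketch using the single-letter page labels from above.

    # Jaccard similarity: size of the intersection divided by size of the union.
    def jaccard(a, b):
        a, b = set(a), set(b)
        return float(len(a & b)) / len(a | b)

    session1 = ["H", "P", "U", "W", "C", "S", "L"]
    session2 = ["H", "P", "U", "R", "W", "C", "S"]

    print(jaccard(session1, session2))   # 6 / 8 = 0.75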

Note that this score does not take sequence into consideration. If one user reads their inbox message before sending a wire and another user reads their inbox after sending a wire, their similarity score could still be 100%.

Whether this matters or not depends on what you're trying to learn.  

But if sequence does matter, you can employ a method called shingling, where your set members become short sequences of members rather than individual members. The shingle width w, or the number of grams in each shingle, is something you can vary for precision. If we take a simple example where w=2, our sets would look like this:
  • {HP, PU, UW, WC, CS, SL}
  • {HP, PU, UR, RW, WC, CS}
The union of these two sets is {HP*, PU*, UW, WC*, CS*, SL, UR, RW}, with the intersection marked by asterisks, so the Jaccard score is 4/8 = 50%.
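
Here is the same sketch with shingling added (w=2), again using the single-letter page labels.

    # Shingling: turn each ordered session into a set of w-grams so sequence matters.
    def shingles(session, w=2):
        return {tuple(session[i:i + w]) for i in range(len(session) - w + 1)}

    def jaccard(a, b):
        return float(len(set(a) & set(b))) / len(set(a) | set(b))

    session1 = ["H", "P", "U", "W", "C", "S", "L"]
    session2 = ["H", "P", "U", "R", "W", "C", "S"]

    print(jaccard(shingles(session1), shingles(session2)))   # 4 / 8 = 0.50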

So now we have a way to compare two sets that is fairly simple.

But I have millions of web sessions to compare.  To get them into clusters, I would have to compare every session to every other session to get a similarity score for every pair.

If I had one million sessions to compare, that would come out to one million squared divided by two - or half a trillion comparisons.  That number is computationally challenging to say the least.  And one million sessions may be too small of a sample set anyway.  Although many of our sessions are very similar, the gold we are mining for is the sessions we don't know a lot about - the infrequent ones - those that occur only once every ten thousand sessions or less.  To get a decent sample size, we may actually need an order of magnitude or more data beyond a mere million sessions.

I tried to come up with ways to cut down the number of comparisons, to no avail.

Our company has access to Aster which is a big data system with a number of commonly needed functions for situations like this.  I found a promising function called minhash which sounded like it might be able to help, but I wanted to understand it inside and out before I used it.

The next article, Using minhash for set clustering, describes how I managed to get over the computational hurdle for this problem.