They sometimes turn to us in the analytics group to give them data about how real users that fit these personas use the website. For example, how frequently to they log in and use certain features.
That would be easy to answer if we had each user labeled in the database with their corresponding persona. But we don't. It's their behavior that helps define their persona.
So the question is: how can we categorize real life users according to these personas using just our web log data?
We have been kicking this one around for awhile with little success.
We know that users wear multiple hats sometimes and so their behavior overlaps. We also have some data with hints about users such as their job title. But that data has turned out to be unreliable and incomplete. Titles don't always match real behavior.
I decided to approach the task as a machine learning classification problem. I would try to build a set of training data and build a model that could predict the proper classification.
After trying and failing to get a list of human-classified examples to train the model, I built a quick scoring system to find clear examples of prototype persona usage.
Example:
- Persona A
- Does more than average of task X (1 points)
- Does less than average of task Y (2 points)
- Has title Business Manager (2 points)
- Persona B
- Does more than average of task Y (2 points)
- Does less than average of task Z (1 point)
- Has title Accountant (2 points)
- Has title Clerk (1 point)
Then I had to pick a classification model.
Azure currently provides a number of models. I am still learning the strengths and weaknesses of each but this is what I understand from the most common:
- Multiclass Logistic Regression
- Best when your features are mainly numeric and have a linear or simple exponential relationship to one another
- Approaches estimating the right answer using algebraic formulas
- This type of model can be trained continuously one new example at a time
- Multiclass Decision Forest
- Best when many of your features are categorical rather than numeric
- Approaches estimating the right answer using a twenty questions type scheme
- Multiclass Neural Network
- Similar to logistic regression except it allows multiple layers and combinations of formulas
Although some models to work noticeably better than others in certain cases, the training data has the greatest impact on the quality of the predictions.
If your training data doesn't contain any real patterns that can be used to make good predictions, then good predictions just simply cannot be made.