I have been working closely with data for many years, using it to run business systems and report results to executives.
In the past, when I heard the term machine learning, it conjured up images of university labs studying artificial intelligence. For a while, that was probably a pretty accurate picture. But bits and pieces of the work done over the decades have resulted in a growing set of tools that are used today in a number of everyday applications.
But the term machine learning doesn't really mean artificial intelligence per se.
Well then what is it?
Machine learning is basically just a catchy brand name for a variety of sophisticated statistical models that, given some input data, can spit out an educated answer to a pre-defined question.
At a high level, they do one of three things:
- Classification - Guessing a Category
  - Given some input data like text, images, or descriptive features, the system guesses which category the input belongs to
  - Common examples are spam detection and optical character recognition
- Regression - Guessing a Number
  - Given some input data, the system returns a useful number
  - That number could be the price a house will likely sell for, the right price to sell a stock at, or the rating a person will probably give a particular movie
- Clustering - Guessing Similarity
  - Given some input data containing a variety of features, the system sorts all of the data into a pre-defined number of piles based on what seems most similar
  - These piles don't have names because the system doesn't start with any preconceptions about categories
Most machine learning algorithms fall into one of those three categories. To make them concrete, I'll sketch a toy example of each as we go.
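Here's what classification looks like in practice: a minimal spam-detector sketch in Python using the scikit-learn library (my choice of tool here; any similar library would do). The messages and labels are made up purely for illustration.

```python
# Toy spam classifier: a minimal sketch using scikit-learn.
# The messages and labels below are invented illustrative data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now",
    "Lowest prices on meds",
    "Meeting moved to 3pm",
    "Lunch tomorrow?",
]
labels = ["spam", "spam", "ham", "ham"]  # the known "correct answers"

# Turn raw text into word-count features the model can work with.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Fit a Naive Bayes classifier on the labeled examples.
model = MultinomialNB()
model.fit(X, labels)

# Ask it to guess the category of a message it has never seen.
print(model.predict(vectorizer.transform(["Free meds, win now"])))
# Likely output: ['spam']
```

Notice the shape of the exercise: examples in, a category out. That shape is what makes it classification, regardless of which algorithm sits in the middle.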
How do the algorithms know how to make their guesses?
Classification and regression problems require training data: as many examples as possible, each pairing sample inputs with the correct answer. This approach is also referred to as supervised learning.
Generally speaking, the more training data you have, the better your guesses will be. That's why the term big data has taken off. In machine learning, more is better.
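To show what training on examples looks like for regression, here's a minimal house-price sketch, again with scikit-learn. The features (square footage and bedroom count) and prices are invented for illustration; a real model would use far more data and features.

```python
# Toy house-price regression: a minimal sketch using scikit-learn.
# Features and prices are made up purely for illustration.
from sklearn.linear_model import LinearRegression

# Each row is [square_feet, bedrooms]; prices are the known answers.
X_train = [[1400, 3], [1600, 3], [1700, 4], [2100, 4], [2500, 5]]
y_train = [240000, 265000, 285000, 330000, 395000]

model = LinearRegression()
model.fit(X_train, y_train)

# Guess a price for a house the model has never seen.
print(model.predict([[1800, 4]]))  # returns a single estimated price
```

Five examples make for a shaky guess; five thousand would make a much better one, which is exactly the "more is better" point above.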
Clustering is an example of unsupervised learning. Training data is great if you have it, but if the right answers aren't known at the start of the exercise, clustering is a good way to mine your input data for similarities. You can then examine the output to see if the patterns make sense, and potentially use those groupings to create training data.
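A clustering sketch makes the difference obvious: you supply only the inputs and the number of piles you want, with no correct answers anywhere. Here's a minimal example using scikit-learn's KMeans on made-up customer data (annual spend and visits per month).

```python
# Toy customer clustering: a minimal sketch using scikit-learn's KMeans.
# The data [annual_spend, visits_per_month] is invented for illustration.
from sklearn.cluster import KMeans

customers = [
    [200, 1], [250, 2], [300, 1],      # light spenders
    [1200, 8], [1100, 9], [1300, 10],  # heavy spenders
]

# Ask for two piles; the algorithm decides what goes in each.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)  # e.g. [0 0 0 1 1 1]: pile numbers, not named categories
```

The output is just pile numbers. It's up to you to look at each pile and decide whether "light spenders" and "heavy spenders" are meaningful categories worth labeling.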
After learning some of the basics, I began my search for tools that might help me explore these approaches.
In the past, statistical environments like MATLAB, R, and Python's scientific libraries have been the tools of choice. But these mostly expose individual statistical functions; a developer has to string together code to examine data and test models.
I recently came across Microsoft's entry into the machine learning arena, Azure Machine Learning Studio. It's a good place for beginners to start: many of the most popular algorithms are available for free through a drag-and-drop interface.
I'll be evaluating each of its features and explaining what makes them useful here on this blog.
Stay tuned!