2013-08-23

Sometimes it's cool to discriminate. (I mean, if Abercrombie does it...) There I said it. I'm sick and tired of people being all PC(A) about it. I think Linear Discriminant Analysis is a great way to classify your data, and I don't care who knows it.

Sorry for the provocative intro, I needed to begin with a win. Now that I've got your attention, let's discriminate - nerd style.

Linear Discriminant Analysis (LDA) is a supervised learning technique that aims to reduce the dimensionality of your data set. Last post we looked at its attention-hoarding cousin, Principal Component Analysis (an unsupervised learning approach), which was based on the idea that we could combine our variables in such a way that they explain the most variability in our data. It does this by projecting the data onto these new dimensions. We could then remove the dimensions that explained the least, and be left with only the important ones, and as a result, fewer dimensions than we started with.

LDA is another way to achieve a similar result. The goal of LDA is to reduce dimensionality yet preserve what's called class discrimination. It does this by finding combinations of variables that provide the greatest separation between classes of a variable. Using our wine quality data set again, we might be interested in what combinations of variables distinguish high quality wines from low quality wines. I'll briefly summarize the mechanics of this process below, but before doing so, we need to ask ourselves if LDA can be applied to our data set. It's very important for the analyst to know when to use a hammer and when to use a screwdriver. It'll depend on what you're putting in the wall. It also might depend on the weight of the object that you're holding up. You should also check for studs, or else you'll likely need an anchor. And if you intend to remove the object shortly, don't use a screw, as a nail will be easier to remove and leave less residual wall damage.

Random tangent? Or fundamental (and thereby important) home maintenance information? Hmm...

One of the requirements for LDA to be applied is for the independent variables (that is, the alcohol, density, pH, ...) to follow a Gaussian distribution, or a bell-shaped curve. The reason for this assumption is beyond the scope of this post, but I will note that variations of LDA do relax this requirement. So, can we reasonably apply LDA to this data set? How can we check if the variables are normally distributed?

Histograms of course! Just like you learned in elementary school, between visits from your childhood idols.

[Figure: histograms of each of the wine variables, drawn on a common (scaled) axis]
We see that many of the variables are bell-shaped, so we forge ahead with the LDA. You might think that the graph above is useless, but it's important to understand whether it makes sense to apply certain techniques before actually doing so. Analyzing the distribution of your variables is an important first step in exploratory data analysis.

+++++
Nerd Alert!! The following can be skimmed without loss of continuity
+++++

The keen among you will have noticed that I graphed the above variables on the same chart. And if you recall from the PCA post, I mentioned that it's necessary to scale the variables in order to compare them apples to apples, to keep from confusing the results. Alternatively, I could have drawn a histogram for each variable on its own scale, but when I scale/normalize the variables, I can compare them on a similar axis. This allows me to neatly draw each of their histograms on one chart. As I said before, feature scaling is a common Machine Learning technique.
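If you're curious what that looks like in practice, here's a minimal sketch in Python. The file name and the pandas/matplotlib approach are my choices for illustration, not necessarily what was used to make the chart above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; assumes a 'quality' column holds the scores.
wine = pd.read_csv("winequality.csv")
X = wine.drop(columns="quality")

# Standardize each variable: subtract its mean, divide by its std deviation.
X_scaled = (X - X.mean()) / X.std()

# Every variable is now on the same scale, so one shared axis works.
for col in X_scaled.columns:
    plt.hist(X_scaled[col], bins=40, alpha=0.4, label=col)
plt.legend()
plt.show()
```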

Before applying an LDA, let's define what we mean by separation so that we're all on the same page. The Oxford Dictionary defines separation as: "the action or state of moving or being moved apart" and "the division of something into constituent or distinct elements".

How do we define separation? We'll define it as the ratio of the between-group variance to the within-group variance of a set of variables. Simple and intuitive, isn't it?

Between-group variance answers the question: how much does alcohol vary between low quality wines and high quality wines?

Within-group variance answers the question: of the low quality wines, how much does alcohol vary?

Catch the difference? It's subtle yet very important to LDA, and more generally to the branch of statistics known as discrete multivariate analysis. This concept is known as class separability.

And for you gluttons out there, here's what between-group variance looks like for $C$ classes:

$$ S_B \;=\; \sum_{c=1}^{C} N_c \,(\mu_c - \mu)(\mu_c - \mu)^T $$

where $N_c$ is the number of wines in class $c$, $\mu_c$ is the mean vector of class $c$, and $\mu$ is the overall mean.
And here's what within-group variance looks like:

$$ S_W \;=\; \sum_{c=1}^{C} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T $$
The goal of LDA is to maximize the between-group variance and minimize the within-group variance, such that the following objective function is maximized:

$$ J(v) \;=\; \frac{v^T S_B\, v}{v^T S_W\, v} $$

by solving the following equation:

$$ S_B\, v \;=\; \lambda\, S_W\, v $$

I'll note that the $v$'s from above are the generalized eigenvectors. There I said it.
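For the gluttons who want to see the machinery, here's a rough NumPy/SciPy sketch of those two scatter matrices and the generalized eigenproblem. The variable names are mine, and this is one way to do it, not necessarily how any particular library does it internally:

```python
import numpy as np
from scipy.linalg import eig

def lda_directions(X, y):
    """Between/within scatter matrices and the generalized
    eigenproblem S_B v = lambda * S_W v."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * diff @ diff.T          # between-group scatter
        S_W += (Xc - mu_c).T @ (Xc - mu_c)      # within-group scatter
    # Generalized eigenvectors: the v's mentioned above.
    eigvals, eigvecs = eig(S_B, S_W)
    order = np.argsort(eigvals.real)[::-1]      # biggest separation first
    return eigvals.real[order], eigvecs[:, order]
```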

Solving that involves some linear algebra, so I won't bore you with the details. But keep in mind that in today's age of computers, this stuff amounts to one line of code. That's not to say that you shouldn't understand the mechanics of it when implementing it, just that you don't need to reinvent the wheel every time you do some complex analysis.
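For instance, with scikit-learn in Python (one of several libraries that will do this for you), that one line looks something like this, assuming your independent variables sit in X and the quality scores in y:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# The scatter matrices and the eigenproblem all happen under the hood.
lda = LinearDiscriminantAnalysis().fit(X, y)
```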

+++++++

End Nerd Alert. You've earned a glass of wine.
+++++++

The separation ratio for a given variable (say alcohol) tells us how different alcohol is for low quality and high quality wines. Therefore, based on the level of alcohol in a given wine, we're able to classify it as low or high quality. At least in theory. And practice. At least in both.

What LDA attempts to do is combine our variables in such a way as to maximize this separation. The idea is that a (linear, in LDA's case) combination of variables will separate low from high quality wines better than any individual variable can. But don't take my word for it. Actually do. You'll save yourself the trouble of doing the math from above and be happy with your decision. Plus more time for the important stuff.

Assuming you went rogue and didn't take my word for it, below is the separation achieved by each variable individually.

[Figure: separation ratio achieved by each variable individually]
It looks like alcohol provides the greatest separation of quality scores with a value of 320. Don't worry about what the number is in absolute terms, just notice that it's bigger than the others. Great.
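If you want to reproduce that kind of comparison yourself, here's a small sketch of the per-variable separation ratio. It follows the between/within definition from above; the exact numbers you get (320 or otherwise) will depend on how the data was scaled:

```python
import numpy as np

def separation(x, y):
    """Between-group variance over within-group variance for a single
    variable x, given class labels y (the ratio described above)."""
    grand_mean = x.mean()
    between = sum(np.sum(y == c) * (x[y == c].mean() - grand_mean) ** 2
                  for c in np.unique(y))
    within = sum(np.sum((x[y == c] - x[y == c].mean()) ** 2)
                 for c in np.unique(y))
    return between / within

# e.g. separation(wine["alcohol"].values, wine["quality"].values)
```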

At this point you might be asking yourself "What's for dinner?". You might also be asking yourself "Dataman, are you telling me that LDA can yield a separation greater than the almighty alcohol can?" I bet when you began reading this post you didn't think that was possible (you probably also didn't know that was a question you'd be asking...) Well guess what?

[Figure: separation achieved by each of the 6 Linear Discriminants]

Nice. The first Linear Discriminant achieves a separation of 500, which is greater than alcohol achieved on its own. Note that there are only 6 Linear Discriminants because we are separating 7 classes of wine quality (from 3 to 9, inclusive). Whenever you perform an LDA, you'll end up with 1 fewer discriminant than you have groups that you're separating. This is because there are C-1 distinct solutions to the eigenvalue problem from above. But you knew that. :)

Following that logic, if we wanted to separate between red and white wines, for example, we'd have 1 Linear Discriminant (then again, for 2 classes we wouldn't need this fancy matrix math, but that's orthogonal to the point).

The histogram below is an interesting way to visualize the separation achieved by the (only) Linear Discriminant, with respect to red and white wines. As you can hopefully see, there is a certain range of values in which most of the reds lie, and another in which most of the whites lie. The cutoff between them, incidentally, is -1.59, which is just the average of the midpoints of the two histograms. Not super fancy, is it?

[Figure: histogram of the first Linear Discriminant, reds vs. whites]
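For the curious, a chart like that can be sketched roughly as follows, assuming the features sit in an array X and the red/white labels in an array colour (both hypothetical names):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Project onto the single discriminant (2 classes -> 1 LD).
ld1 = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, colour).ravel()

# Overlay the two class histograms to eyeball the separation.
for c in np.unique(colour):
    plt.hist(ld1[colour == c], bins=50, alpha=0.5, label=str(c))
plt.axvline(-1.59, linestyle="--")  # the cutoff quoted above
plt.legend()
plt.show()
```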

This looks like it'll separate reds from whites pretty well, but we can't solely rely on visual inspection to assess our model's accuracy. We can assess accuracy using what is (poorly, in my opinion) called a confusion matrix. Un-confusingly, this is a table which counts how many of our wines were classified correctly and how many were classified incorrectly. Not super complex, huh? If it is, then have another glass of wine and it'll make more sense.

Getting back to our wine quality scores, you might be wondering what the confusion matrix is. Voila.

[Table: confusion matrix of actual vs. predicted wine quality scores]
To assess accuracy, simply add the numbers along the diagonal (aka the trace of the matrix) and divide by the total number of wines. Our accuracy at correctly classifying any given wine into its respective quality bucket is 53.8%. That might not seem very high, but consider that it's nearly four times better than if we had simply guessed at classifying 6000+ wines into one of seven categories. For some perspective, this is better than if we blindfolded a monkey and sat him atop a spinning lazy susan as he tried to classify each wine as red or white (i.e. only 2 categories) by throwing his...opinion...at each bottle. Of course I'm presupposing that this monkey doesn't know the prior distribution of red and white wines, but how many blindfolded monkeys actually do? I mean, they're blindfolded. What do they know? (I would caution, however, against underestimating the technical prowess of our fellow primates). By the way, the point I was trying to make in the preceding example could have been less verbosely stated as someone tossing a coin and having a 50/50 chance at guessing the outcome. However, the spinning blindfolded monkey conjures up a much more interesting picture, don't you agree?
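In code, that calculation is a one-liner or two. Here's a sketch using scikit-learn's confusion_matrix, where y_actual and y_predicted are stand-ins for the true and predicted quality scores:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_actual, y_predicted)  # rows = actual, cols = predicted
accuracy = np.trace(cm) / cm.sum()            # diagonal over the total count
```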

I'd like to make one last point before signing off. And that is the concept of validating a model's accuracy. Basically, the way I did it above is disingenuous of me. That's because this accuracy measure was determined based on a model which was calibrated to this data set. Naturally, we'd expect our model to be very accurate. After all, we've fine tuned this model to fit the data that we're now assessing it on. But would it work well on new data? A more meaningful assessment of accuracy would be to calibrate the model to some data, and then test the accuracy on a completely new, unseen, set of data. And yes, the following paragraph explains this idea further.

The data used in model calibration and building is called the training set. The data used in testing a model's accuracy is called the test set. These two sets of data should be completely separate. What I did above, testing the model on the training set, is a big no-no in data analysis. It's essentially self-serving. In practice, the training set is some random subset of the total data, usually around 80%, and the test set is the remaining data, which is then used to test the model. The goal is to build a robust model which fits new data well. When this is the case, we say that the model generalizes well to new data.
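Here's what that looks like in scikit-learn, sketched with an 80/20 split (the exact fraction and the random seed are arbitrary choices on my part):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Hold out 20% of the wines as a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(lda.score(X_test, y_test))  # accuracy on genuinely new data
```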

The preceding few paragraphs covered two key topics in data analysis and model validation - confusion matrices and testing/training sets. There's much more to say about these concepts (such as cross validation and ROC curves), but I'll cover them within the context of other data analysis techniques in future posts. Let this serve as an introduction. 

Before signing off I'd like to note that we could potentially get more accurate results if we applied a PCA to these groupings to reduce the number of dimensions and then used LDA to classify them based on the remaining dimensions. That is, we could use both PCA and LDA together to solve a problem. It doesn't have to be either/or. No discrimination necessary. And we come back full circle.
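As a sketch, chaining the two in scikit-learn might look like this; the choice of 5 principal components is purely illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# PCA trims the least informative dimensions first; LDA then classifies
# on what's left. n_components=5 is an arbitrary illustrative choice.
pca_lda = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
pca_lda.fit(X_train, y_train)
print(pca_lda.score(X_test, y_test))
```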

Let the above sink in. Maybe re-read it. Maybe have another glass of wine. If you've been following along, you should have finished the bottle by now. In today's post I covered a lot of ground, so I'd like to summarize my points briefly:

Linear Discriminant Analysis is a dimensionality reduction and classification technique that is based on finding linear combinations of variables such that these combinations create the most separation between the classes of the dependent variable. We say that we retain class discrimination. It only works when the dependent variable is categorical (group A, group B, group C) as opposed to continuous (e.g. temperature, or height). Our version of LDA also works only when the independent variables are normally distributed.

Classification techniques such as LDA can be assessed using a confusion matrix. LDA and PCA can also be complementary techniques.

Blindfolded monkeys don't have a good sense of a data set's prior distribution. Idiots.

Models are built using a training set and validated using the test set.

As always, please leave me any comments or questions you have. 

Until next time!
