PCA for Classification Is as Bad as Random
Basically, PCA (Principal Component Analysis) finds projection axes based on total population variance. Since that variance need not have anything to do with the class labels, adding PCA to your classification pipeline is essentially adding a random variable. Or, more precisely, it's like projecting onto a set of random, orthogonal axes.
Here's a simple example of PCA making classification harder…
Load up the Iris data set, and define a function to calculate classification accuracy:
library(rpart)
data(iris)
d <- iris[, c(1, 4, 5)]
accuracy <- function (m) {
    sum(predict(m, type="class") == d$Species) / length(m$y)
}
The data is pretty simple. We can see that the separating boundaries are already axis-aligned, with Petal.Width
being a good discriminator.
plot(d[,1:2], pch=as.numeric(d$Species), col=d$Species)
legend("bottomright", levels(d$Species), pch=1:3, col=1:3)
A decision tree on the plain data does pretty well:
m <- rpart(Species ~ ., d)
plot(m, margin=0.1)
text(m, use.n=T)
Prediction accuracy is good:
accuracy(m)
0.96
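If you want to see which species account for the few errors, a quick confusion table (using the same m and d as above) does the job:
# cross-tabulate the tree's predictions against the true species
table(predicted = predict(m, type="class"), actual = d$Species)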
Now, we apply PCA to the data and build a decision tree off that, limiting it to the same complexity as the first one:
p <- prcomp(~ Sepal.Length + Petal.Width, d)
d.pca <- as.data.frame(predict(p))
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)
plot(m, margin=0.1)
text(m, use.n=T)
It doesn't perform as well.
accuracy(m)
0.853333333333333
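One way to see why: each principal component is a mixture of both variables, so an axis-aligned split on a component no longer corresponds to a split on Petal.Width alone. The loadings make that concrete:
# how each principal component weights the original variables
p$rotation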
Even scaling the data before PCA's rotation doesn't really help.
p <- prcomp(~ Sepal.Length + Petal.Width, d, scale=T)
d.pca <- as.data.frame(predict(p))
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)
plot(m, margin=0.1)
text(m, use.n=T)
accuracy(m)
0.873333333333333
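To make the "random, orthogonal projection axes" analogy from the intro concrete, here is a minimal sketch (the variable names q and d.rot are just for illustration): rotate the same two columns by a random orthogonal matrix and refit the same depth-limited tree.
# build a random 2x2 orthogonal matrix via a QR decomposition
set.seed(1)
q <- qr.Q(qr(matrix(rnorm(4), 2, 2)))
# rotate Sepal.Length and Petal.Width onto the random axes
d.rot <- as.data.frame(as.matrix(d[, 1:2]) %*% q)
d.rot$Species <- d$Species
m <- rpart(Species ~ ., d.rot, maxdepth=2)
accuracy(m)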
Instead of using PCA, use methods designed for classification. Linear Discriminant Analysis will not only get you (non-orthogonal) projection axes optimized for separating the classes, it will simultaneously give you the classification model itself. It does assume roughly Gaussian classes with a shared covariance matrix, which may not fit your data; consider Quadratic Discriminant Analysis in that case.
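For example, a minimal sketch with MASS::lda on the same two variables (m.lda is just an illustrative name):
library(MASS)
# fit LDA: discriminant axes and classifier in one step
m.lda <- lda(Species ~ ., d)
# training accuracy of the LDA classifier itself
sum(predict(m.lda)$class == d$Species) / nrow(d)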
Even if you insist on decision trees or some other classification algorithm but want to compress your variables to a smaller set, use LDA to get those axes, then go ahead and run your classifier of choice on the resulting scores, as sketched below.
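A sketch of that route, using the LDA discriminant scores from m.lda above as the compressed feature set for the same depth-limited tree:
# predict(m.lda)$x holds the discriminant scores (at most min(p, K-1) = 2 axes here)
d.lda <- as.data.frame(predict(m.lda)$x)
d.lda$Species <- d$Species
m2 <- rpart(Species ~ ., d.lda, maxdepth=2)
accuracy(m2)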