
PCA for Classification Is as Bad as Random

Basically, PCA (Principal Component Analysis) finds projection axes based on total population variance. Because that variance is computed without any regard to the class labels, adding PCA to your classification pipeline is essentially adding a step that is random with respect to the classes. Or, more exactly, it's like projecting onto a set of random, orthogonal axes.

Here's a simple example of PCA making classification harder…

Load up the Iris data set, and define a function to calculate classification accuracy:

library(rpart)
data(iris)
d <- iris[, c(1, 4, 5)]  # Sepal.Length, Petal.Width, Species
accuracy <- function (m) {
  # fraction of the training rows that the fitted tree classifies correctly
  sum(predict(m, type="class") == d$Species) / length(m$y)
}

The data is pretty simple. We can see that the separating boundaries are already axis-aligned, with Petal.Width being a good discriminator.

plot(d[,1:2], pch=as.numeric(d$Species), col=d$Species)
legend("bottomright", levels(d$Species), pch=1:3, col=1:3)

[Figure iris.svg: Sepal.Length vs. Petal.Width scatter plot, colored by species]

A decision tree on the plain data does pretty well:

m <- rpart(Species ~ ., d)
plot(m, margin=0.1)
text(m, use.n=T)

[Figure plain.svg: decision tree fit on the original variables]

Prediction accuracy is good:

accuracy(m)
0.96

Now, we apply PCA to the data and build a decision tree off that, limiting it to the same complexity as the first one:

p <- prcomp(~ Sepal.Length + Petal.Width, d)  # PCA rotation of the two predictors
d.pca <- as.data.frame(predict(p))            # principal-component scores PC1, PC2
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)    # cap depth to match the first tree
plot(m, margin=0.1)
text(m, use.n=T)

[Figure pca-unscaled.svg: decision tree fit on the unscaled PC scores]

It doesn't perform as well.

accuracy(m)
0.853333333333333

Even scaling the data before PCA's rotation doesn't really help.

p <- prcomp(~ Sepal.Length + Petal.Width, d, scale.=TRUE)  # scale to unit variance before rotating
d.pca <- as.data.frame(predict(p))
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)
plot(m, margin=0.1)
text(m, use.n=T)

[Figure pca-scaled.svg: decision tree fit on the scaled PC scores]

accuracy(m)
0.873333333333333
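To underline the "random axes" point from the start of the post, here is a minimal sketch (my addition, not part of the original example) that swaps PCA's rotation for a random orthogonal rotation of the same two features and fits the same kind of tree. The seed and variable names are arbitrary, and the resulting accuracy will depend on the particular rotation drawn.

# Sketch only: rotate the two predictors by a random orthogonal matrix
# (the Q factor of a QR decomposition of a random Gaussian matrix)
set.seed(1)                                # arbitrary seed
Q <- qr.Q(qr(matrix(rnorm(4), 2, 2)))      # random 2x2 orthogonal matrix
d.rot <- as.data.frame(as.matrix(d[, 1:2]) %*% Q)
d.rot$Species <- d$Species
m.rot <- rpart(Species ~ ., d.rot, maxdepth=2)
accuracy(m.rot)                            # varies with the rotation drawn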

Instead of using PCA, you should use methods designed for classification. Linear Discriminant Analysis (LDA) will not only give you (non-orthogonal) projection axes that are optimized for separating the classes, it will simultaneously give you the classification model itself. It does make assumptions that may not fit your data, chiefly that each class is roughly Gaussian with a common covariance matrix; consider Quadratic Discriminant Analysis (QDA), which drops the common-covariance assumption, in that case.
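As a sketch of that route (my code, not from the original post), MASS::lda on the same two features fits and classifies in one step:

library(MASS)
m.lda <- lda(Species ~ ., d)                       # fit LDA on Sepal.Length + Petal.Width
sum(predict(m.lda)$class == d$Species) / nrow(d)   # training accuracy of LDA itself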

Even if you insist on decision trees or some other classification algorithm but want to compress your variables into a smaller set, use LDA to get those discriminant axes, then go ahead and run your classifier of choice on the projected data.
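Continuing from the lda fit above, a sketch of that combination (again my code, not the author's): take the discriminant scores as the reduced feature set and grow the same kind of tree on them.

d.lda <- as.data.frame(predict(m.lda)$x)   # discriminant scores LD1, LD2
d.lda$Species <- d$Species
m.lda.tree <- rpart(Species ~ ., d.lda, maxdepth=2)
accuracy(m.lda.tree)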
