PCA for Classification Is as Bad as Random
Basically, PCA (Principal Component Analysis) finds projection axes based on total population variance. Since that variance need not have anything to do with the class labels, adding PCA to your classification pipeline is essentially adding a random variable. Or, more precisely, it's like projecting onto a set of random, orthogonal axes.
Here's a simple example of PCA making classification harder…
Load up the Iris data set, and define a function to calculate classification accuracy:
library(rpart)
data(iris)
d <- iris[, c(1, 4, 5)]
accuracy <- function (m) {
    sum(predict(m, type="class") == d$Species) / length(m$y)
}
The data is pretty simple. We can see that the separating boundaries are already axis-aligned, with Petal.Width
being a good discriminator.
plot(d[,1:2], pch=as.numeric(d$Species), col=d$Species)
legend("bottomright", levels(d$Species), pch=1:3, col=1:3)
A decision tree on the plain data does pretty well:
m <- rpart(Species ~ ., d)
plot(m, margin=0.1)
text(m, use.n=T)
Prediction accuracy is good:
accuracy(m)
0.96
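If you want to see which species account for the few errors, a quick confusion table (using the same m and d as above) does the job:
# cross-tabulate the tree's predictions against the true species
table(predicted = predict(m, type="class"), actual = d$Species)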
Now, we apply PCA to the data and build a decision tree off that, limiting it to the same complexity as the first one:
p <- prcomp(~ Sepal.Length + Petal.Width, d)
d.pca <- as.data.frame(predict(p))
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)
plot(m, margin=0.1)
text(m, use.n=T)
It doesn't perform as well.
accuracy(m)
0.853333333333333
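One way to see why: each principal component is a mixture of both variables, so an axis-aligned split on a component no longer corresponds to a split on Petal.Width alone. The loadings make that concrete:
# how each principal component weights the original variables
p$rotation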
Even scaling the data before PCA's rotation doesn't really help.
p <- prcomp(~ Sepal.Length + Petal.Width, d, scale=T)
d.pca <- as.data.frame(predict(p))
d.pca$Species <- d$Species
m <- rpart(Species ~ ., d.pca, maxdepth=2)
plot(m, margin=0.1)
text(m, use.n=T)
accuracy(m)
0.873333333333333
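To make the "random, orthogonal projection axes" analogy from the intro concrete, here is a minimal sketch (the variable names q and d.rot are just for illustration): rotate the same two columns by a random orthogonal matrix and refit the same depth-limited tree.
# build a random 2x2 orthogonal matrix via a QR decomposition
set.seed(1)
q <- qr.Q(qr(matrix(rnorm(4), 2, 2)))
# rotate Sepal.Length and Petal.Width onto the random axes
d.rot <- as.data.frame(as.matrix(d[, 1:2]) %*% q)
d.rot$Species <- d$Species
m <- rpart(Species ~ ., d.rot, maxdepth=2)
accuracy(m)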
Instead of using PCA, use methods designed for classification. Linear Discriminant Analysis will not only get you (non-orthogonal) projection axes optimized for separating the classes, it will simultaneously give you the classification model itself. It does assume roughly Gaussian classes with a shared covariance matrix, which may not fit your data; consider Quadratic Discriminant Analysis in that case.
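For example, a minimal sketch with MASS::lda on the same two variables (m.lda is just an illustrative name):
library(MASS)
# fit LDA: discriminant axes and classifier in one step
m.lda <- lda(Species ~ ., d)
# training accuracy of the LDA classifier itself
sum(predict(m.lda)$class == d$Species) / nrow(d)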
Even if you insist on decision trees or some other classification algorithm but want to compress your variables to a smaller set, use LDA to get those axes, then go ahead and run your classifier of choice on the resulting scores, as sketched below.
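A sketch of that route, using the LDA discriminant scores from m.lda above as the compressed feature set for the same depth-limited tree:
# predict(m.lda)$x holds the discriminant scores (at most min(p, K-1) = 2 axes here)
d.lda <- as.data.frame(predict(m.lda)$x)
d.lda$Species <- d$Species
m2 <- rpart(Species ~ ., d.lda, maxdepth=2)
accuracy(m2)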