## Principal Component Analysis (PCA) vs Ordinary Least Squares (OLS): A Visual Explanation

Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was “Thanks to my college class I can do the math, but what does it **MEAN**?”

I have felt like this a number of times in my life. Many of my classes were so focused on the technical implementation that they kinda missed the section titled "Why I give a shit." A perfect example was my Mathematical Principles of Economics class, which taught me how to manually calculate a bordered Hessian but, for the life of me, I had no idea why I would ever want to calculate such a monster. OK, that's a lie. Later in life I learned that bordered Hessian matrices are used as a second derivative test in some constrained optimizations. Not that I would EVER do that shit by hand. I'd use some R package and blindly trust that it was coded properly.

So back to PCA: as I was reading the aforementioned stats question, I was reminded of a recent presentation that Paul Teetor gave at an August Chicago R User Group meeting. In his presentation on spread trading with R, he showed a graphic that illustrated the difference between OLS and PCA. I took some notes, went home, and made sure I could recreate the same thing. If you have ever wondered what makes OLS and PCA different, open up an R session and play along.

**Your Independent Variable Matters:**

The first observation to make is that regressing x ~ y is not the same as y ~ x even in a simple univariate regression. You can illustrate this by doing the following:

```r
set.seed(2)
x <- 1:100
y <- 20 + 3 * x
e <- rnorm(100, 0, 60)
y <- 20 + 3 * x + e
plot(x, y)

yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red")

xy.lm <- lm(x ~ y)
lines(predict(xy.lm), y, col = "blue")
```

You should get something that looks like this:

So it’s obvious they give different lines. But why? Well, OLS minimizes the error between the dependent variable and the model. Two of these errors are illustrated for the y ~ x case in the following picture:

But when we flip the model around and regress x ~ y then OLS minimizes these errors:
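One way to see the difference numerically: the y ~ x slope is cov(x, y)/var(x), while the x ~ y fit, redrawn as a line in (x, y) space, has slope var(y)/cov(x, y). Here's a quick sketch using the same simulated data (the `slope.yx` and `slope.xy` names are just for illustration):

```r
# sketch: the two OLS fits minimize different errors, so they have different slopes
set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)

slope.yx <- unname(coef(lm(y ~ x))[2])      # minimizes vertical errors
slope.xy <- 1 / unname(coef(lm(x ~ y))[2])  # x ~ y fit, redrawn in (x, y) space

all.equal(slope.yx, cov(x, y) / var(x))  # TRUE: closed form for y ~ x
all.equal(slope.xy, var(y) / cov(x, y))  # TRUE: closed form for x ~ y
slope.xy > slope.yx                      # TRUE: the x ~ y line is steeper here
```

The two slopes only coincide when the correlation is perfect; the weaker the correlation, the farther apart the two lines swing.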

Ok, so what about PCA?

Well let’s draw the first principal component the old school way:

```r
# normalize means and cbind together
xyNorm <- cbind(x = x - mean(x), y = y - mean(y))
plot(xyNorm)

# covariance
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors

plot(xyNorm, ylim = c(-200, 200), xlim = c(-200, 200))
lines(xyNorm[x], eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[x])
lines(xyNorm[x], eigenVectors[2, 2] / eigenVectors[1, 2] * xyNorm[x])

# the largest eigenValue is the first one,
# so that's our principal component.
# but the principal component is in normalized terms (mean = 0)
# and we want it back in real terms like our starting data,
# so let's denormalize it
plot(x, y)
lines(x, (eigenVectors[2, 1] / eigenVectors[1, 1] * xyNorm[x]) + mean(y))
# that looks right: line through the middle as expected

# what if we bring back our other two regressions?
lines(x, predict(yx.lm), col = "red")
lines(predict(xy.lm), y, col = "blue")
```

PCA minimizes the error orthogonal (perpendicular) to the model line. So the first principal component looks like this:

The two yellow lines, as in the previous images, are examples of two of the errors which the routine minimizes.
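You can check numerically that the first principal component really does minimize these perpendicular errors. A quick sketch (`orth.ss()` is just an illustrative helper that sums squared orthogonal distances to a line of a given slope through the centroid):

```r
# sketch: sum of squared orthogonal distances to a line through the centroid
set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)
p <- cbind(x - mean(x), y - mean(y))     # centered data

orth.ss <- function(slope) {
  v <- c(1, slope) / sqrt(1 + slope^2)   # unit vector along the line
  proj <- (p %*% v) %*% t(v)             # project each point onto the line
  sum((p - proj)^2)                      # squared perpendicular residuals
}

ev <- eigen(cov(p))$vectors
pca.slope <- ev[2, 1] / ev[1, 1]
ols.slope <- unname(coef(lm(y ~ x))[2])

orth.ss(pca.slope) < orth.ss(ols.slope)  # TRUE: the PC has smaller orthogonal error
```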

So if we plot all three lines on the same scatter plot we can see the differences:

The x ~ y OLS and the first principal component are pretty close, but click on the image to get a better view and you will see they are not exactly the same.

All the code from the above examples can be found in a gist over at GitHub.com. It’s best to copy and paste from the GitHub gist, as sometimes Wordpress molests my quotes and breaks the codez.

The best introduction to PCA which I have read is the one I link to on Stats.StackExchange.com. It’s titled “A Tutorial on Principal Components Analysis” by Lindsay I Smith.

Thanks for the well-written article, J.D. Thanks, too, for the shout-out. I wish I could take full credit for my recent, brilliant presentation at the Chicago RUG Meet-Up. But, actually, the ideas regarding PCA were taken from the web site of Vincent Zoonekynd (http://zoonek.free.fr/). His web site has an excellent discussion of regression, its asymmetry, and the use of PCA (http://zoonek2.free.fr/UNIX/48_R/09.html#7).

If you have the same amount of variability on the x and y axes, the pca line should go through the middle of the other two.

Thanks for the link, Paul. I really like the examples on Zoonekynd’s site. It helps me so much to be able to play with code.

Hadley, that makes perfect sense. Although I hadn’t thought of it!

PS – Should have mentioned: This type of regression is sometimes called “orthogonal regression” or “total least squares” (TLS). (Knowing the names is useful for googling.) Doing the PCA gives the point estimates for the regression coefficients, of course, but I’d like to have the regression stats and the coefficient’s confidence intervals, too. So far, I haven’t found the necessary mathematics, much less found the code; i.e., no TLS package in R.
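For what it's worth, the TLS point estimates fall straight out of the SVD of the centered data. A minimal sketch (`tls()` is just an illustrative helper, not a package function, and it gives point estimates only, none of the inference asked for above):

```r
# sketch: total least squares slope and intercept via SVD of the centered data
# (point estimates only -- no standard errors or confidence intervals)
tls <- function(x, y) {
  s <- svd(cbind(x - mean(x), y - mean(y)))
  v <- s$v[, 1]            # right singular vector of the largest singular value
  slope <- v[2] / v[1]     # same line as the first principal component
  c(intercept = mean(y) - slope * mean(x), slope = slope)
}

set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)
tls(x, y)   # matches the eigen() line drawn in the post
```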

If you think of the pair (x,y) as coming from a bivariate distribution, and you draw, say, the 95% prediction ellipse, then the PCA line is the major axis of the ellipse. A lot of these ideas were discovered by Galton (waayyyyy back in 1886), but they are neat to rediscover. You can also interpret the correlation between x and y graphically from the scatter plot. See Fig. 1 of Friendly (JCGS, June 2007) at http://www.datavis.ca/papers/jcgs-heplots.pdf

Rick, you are exactly right. Certainly nothing here is a discovery. The best I hope for is to communicate a mostly rational explanation.

Great article, J.D.! Helped me understand what is actually going on.

Typo: I was playing along at home with R and you have a line in your code that is:

plot(xy)

I think it should be:

plot(x,y)

And thank you for the nice explanation, code, and visuals! Very, very enlightening!

Matt, thanks for pointing out my typo! I’ve fixed it in the blog post.

Not to nitpick, but the typo is repeated in the second plot.

Paul: I’ve noticed that other commercial software packages like Matlab also rely on PCA for TLS regression, so R is certainly not alone in this respect (for what it’s worth!).


I may be wrong, but I thought “orthogonal regression” and “total least squares” were both synonyms of what in some fields are called either “reduced (or standardized) major axis regression” or “model II regression”. These are techniques designed for ‘errors-in-variables’ situations, i.e., when the x variable is not ‘fixed’ by the investigator.

If I’m right in suggesting that they are the same thing, then there are two packages in R that implement these models, with full coefficients and confidence intervals etc.: lmodel2 and smatr.


Typo in the line after the first picture. “Two of these errors are illustrated for the x ~ y case in the following picture” Should be “y ~ x”, as that is the red line.

Nice post, interesting stuff and appreciate the “sitting on the couch with a beer” tone of your writing ^_^

Matt, thanks for catching the typo!

I enjoy writing in the tone of “on the couch with a beer.” I find it helps simple things seem simple and complex things seem approachable. Writing formally as if in a journal article tends to make simple things seem obfuscated and complex things impossible… at least that’s my experience. Thanks for reading.

Thanks very much for the comparison with regression. Very helpful. This kind of comparison is very useful for organizing and synthesizing information, but is often left out of introductory discussions.

To better illustrate the difference between PCA and OLS on y ~ x, you should scale the variables to the same order of magnitude (e.g. divide them by their standard deviations).

At the moment y matters more, since the errors over y are larger than the ones over x, so the PCA line is closer to the x ~ y fit.
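A sketch of that scaling point: once both variables have unit standard deviation, the covariance matrix is just the correlation matrix, whose first eigenvector gives a slope of exactly 1, right between the two OLS slopes (r and 1/r):

```r
# sketch: scale x and y to unit sd, then the first PC slope is exactly 1
set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)
xs <- x / sd(x)
ys <- y / sd(y)

ev <- eigen(cov(cbind(xs, ys)))$vectors  # cov of scaled data = correlation matrix
pc.slope <- ev[2, 1] / ev[1, 1]

r <- cor(x, y)
c(ols.yx = r, pca = pc.slope, ols.xy = 1 / r)  # PC slope is 1, between r and 1/r
```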

There is a small but interesting math error in Smith: “Lastly, all the eigenvectors of a matrix are perpendicular, ie. at right angles to each other, no matter how many dimensions you have. By the way, another word for perpendicular, in maths talk, is orthogonal. This is important because it means that you can express the data in terms of these perpendicular eigenvectors, instead of expressing them in terms of the x and y axes”

Actually you can work with non-perpendicular axes – doing so is a pain in the butt, but so long as they are linearly independent (for 2 vectors in 2D space “independent” boils down to “not parallel”, more complicated in 3D+) then the eigenvectors can be used as the basis for a coordinate system.

But perpendicular axes would help! However, Smith’s comment that “all eigenvectors are perpendicular” only applies to what are called normal matrices. Here’s a non-normal matrix whose eigenvectors are 45 degrees apart: {{1,2},{0,3}}. Fortunately, there’s no possible data for which this is a covariance matrix (can you spot why not?).

In fact, it’s even worse than Smith implies. Not all real n x n matrices even have real eigenvalues: for instance {{0,1},{-1,0}} has imaginary eigenvalues, so it doesn’t have real eigenvectors. And of those that do have real eigenvalues, not all have n independent eigenvectors, e.g. {{1,1},{0,1}} has a repeated eigenvalue with only a one-dimensional eigenspace, so its eigenvectors can’t be used as a coordinate system for the x,y plane.

So how do we make sure that there is a full set of real eigenvectors that can form a basis for a coordinate system for our data, and which are conveniently orthogonal (i.e. mutually perpendicular) to boot? The great news is something that Smith pointed out earlier: the covariance matrix is a symmetric matrix. What Smith doesn’t note is that the eigenvectors of a symmetric matrix have all those lovely properties because symmetric matrices have real eigenvalues (as they’re a subset of the Hermitian matrices) and a full set of n orthogonal eigenvectors (because Hermitian matrices are in turn a subset of the normal matrices)! We already saw that {{1,2},{0,3}} has inconveniently non-perpendicular eigenvectors, but its more symmetric cousin {{1,2},{2,3}} works just fine.

I know this departs rather from the “what’s the point” tone of this blog, but this quibble apart I recommend the Smith piece. I just know “properties of matrices” is something that causes trouble for my students, so I can imagine this bit sowing confusion! Hopefully what I’ve managed to explain “under the hood” is why we know the eigenvectors will come out perpendicular, and conversely what the Big Deal is about the covariance matrix being symmetric.
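The example matrices above are easy to check in an R session. A quick sketch (eigen() normalizes each eigenvector to unit length, so the dot product is the cosine of the angle between them):

```r
# non-normal matrix {{1,2},{0,3}}: eigenvectors are NOT perpendicular
A <- matrix(c(1, 0, 2, 3), nrow = 2)       # column-major fill
vA <- eigen(A)$vectors
abs(sum(vA[, 1] * vA[, 2]))                # 0.707...: a 45 degree angle

# symmetric cousin {{1,2},{2,3}}: eigenvectors ARE perpendicular
S <- matrix(c(1, 2, 2, 3), nrow = 2)
vS <- eigen(S)$vectors
abs(sum(vS[, 1] * vS[, 2]))                # 0 (up to rounding)

# and {{0,1},{-1,0}} has no real eigenvalues at all
eigen(matrix(c(0, -1, 1, 0), nrow = 2))$values   # complex: +1i and -1i
```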

Hello,

As far as I know, Total Least Squares minimizes the Euclidean distance.

You also say the PCA minimizes it.

Yet the calculation is different (Though both could be calculated using SVD).

Is it the exact solution?

If so, why bother with the TLS?

Royi, yes, TLS and PCA both effectively do the same thing. But, as you point out, they do it differently. So the methods likely give different computational efficiency. I did a little googling and came up with this paper: http://cds.ismrm.org/ismrm-2000/PDF7/1941.PDF It looks like the methods have different efficiency as well as different accuracy. And neither is the ‘exact’ solution. Nothing ever is.

I like that way of looking at it. I’ve recently started thinking of PCA by way of analogy to physics. In physics, if you kick a football 35° north and 65° east, then the maths will be really terrible in that coordinate system. It will be much simpler if you align the x axis with the direction you’re kicking the ball.

The first dimension of PCA is doing the same thing. You rotate your coordinate system (hence the linear combo) to be more friendly; PCA just happens to choose the “widest” direction in your data cloud automatically as the 1st axis.
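That rotation intuition is easy to check with prcomp (a quick sketch): in the rotated coordinates the scores are uncorrelated, and the first axis carries the widest spread.

```r
# sketch: PCA as a coordinate rotation
set.seed(2)
x <- 1:100
y <- 20 + 3 * x + rnorm(100, 0, 60)

pc <- prcomp(cbind(x, y))    # centers the data, then rotates onto the PCs
scores <- pc$x               # the data in the new coordinate system

round(cor(scores[, 1], scores[, 2]), 10)  # 0: the new axes "line up with the kick"
var(scores[, 1]) > var(scores[, 2])       # TRUE: PC1 is the widest direction
```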

Wow! All it took was your well constructed graphs to help me “get it”. Thanks!