Principal Component Analysis#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import wrds

In the previous chapter, we decomposed stock returns using a factor model — the Fama-French 3-factor model. We regressed each stock’s excess return on the market, SMB, and HML factors, and showed that these three factors explain a substantial portion of the variation in individual stock returns.

But where did these factors come from? Fama and French chose them based on economic theory and empirical observation. What if instead of starting with a theory, we let the data tell us the most important sources of variation in stock returns?

This is exactly what Principal Component Analysis (PCA) does. PCA is a statistical technique that finds the directions of greatest variance in a dataset. Applied to stock returns, PCA will discover the common factors that drive returns — without any economic theory at all.

Dow Jones constituents#

As and example, we’ll begin by getting data for the stocks that make up the Dow Jones Industrial Average.

dow = pd.read_csv('https://raw.githubusercontent.com/stoffprof/qf_data/main/dow_current_components.csv')

dow
company ticker permno
0 3M Company MMM 22592
1 Amazon.com, Inc. AMZN 84788
2 American Express Company AXP 59176
3 Amgen Inc. AMGN 14008
4 Apple Inc. AAPL 14593
5 Caterpillar Inc. CAT 18542
6 Cisco Systems, Inc. CSCO 76076
7 Honeywell International Inc. HON 10145
8 JPMorgan Chase & Co. JPM 47896
9 Johnson & Johnson JNJ 22111
10 Microsoft Corporation MSFT 10107
11 Nike, Inc. NKE 57665
12 Nvidia Corporation NVDA 86580
13 Salesforce, Inc. CRM 90215
14 The Boeing Company BA 19561
15 The Coca-Cola Company KO 11308
16 The Goldman Sachs Group, Inc. GS 86868
17 The Home Depot, Inc. HD 66181
18 The Procter & Gamble Company PG 18163
19 The Sherwin-Williams Company SHW 36468
20 The Travelers Companies, Inc. TRV 59459
21 UnitedHealth Group Incorporated UNH 92655
22 Verizon Communications Inc. VZ 65875
23 Visa Inc. V 92611
24 Walmart Inc. WMT 55976
25 Merck & Co., Inc. MRK 22752
26 The Walt Disney Company DIS 26403
27 McDonald's Corporation MCD 43449
28 Chevron Corporation CVX 14541
29 International Business Machines Corporation IBM 12490
db = wrds.Connection()
Loading library list...
Done
sql = """
      SELECT permno, ticker, dlycaldt, dlyret
      FROM crsp.dsf_v2
      WHERE permno IN %(permnos)s
      AND dlycaldt >= '2020-01-01'
      """

params = {'permnos': tuple(dow['permno'].to_list())}

rets = db.raw_sql(sql, params=params, date_cols=['dlycaldt'])
rets
permno ticker dlycaldt dlyret
0 10107 MSFT 2020-01-02 0.018516
1 10107 MSFT 2020-01-03 -0.012452
2 10107 MSFT 2020-01-06 0.002585
3 10107 MSFT 2020-01-07 -0.009118
4 10107 MSFT 2020-01-08 0.015928
... ... ... ... ...
45235 92655 UNH 2025-12-24 0.008559
45236 92655 UNH 2025-12-26 0.012974
45237 92655 UNH 2025-12-29 -0.008709
45238 92655 UNH 2025-12-30 0.009789
45239 92655 UNH 2025-12-31 -0.006172

45240 rows × 4 columns

This dataframe is in a “long” format, with the data for each firm stacked one after another. In order to look at the correlations among firm returns, we need to reshape the data so each stock is in it’s own column.

rets = rets.pivot(index='dlycaldt', columns='ticker', values='dlyret').sort_index(axis=1)

rets
ticker AAPL AMGN AMZN AXP BA CAT CRM CSCO CVX DIS ... MSFT NKE NVDA PG SHW TRV UNH V VZ WMT
dlycaldt
2020-01-02 0.022816 -0.004024 0.027151 0.014379 0.023207 0.019299 0.026746 0.016889 0.007634 0.024684 ... 0.018516 0.008785 0.019592 -0.01193 -0.020033 0.004089 -0.005034 0.017137 -0.0057 0.000841
2020-01-03 -0.009722 -0.006789 -0.012139 -0.009932 -0.00168 -0.013884 -0.00491 -0.016316 -0.003459 -0.011471 ... -0.012452 -0.00274 -0.016006 -0.006726 -0.012941 -0.003563 -0.01012 -0.007953 -0.010647 -0.008828
2020-01-06 0.007968 0.007674 0.014886 -0.004334 0.002945 -0.000674 0.043811 0.003569 -0.003388 -0.005802 ... 0.002585 -0.000883 0.004194 0.001387 -0.001843 0.001095 0.006942 -0.002162 -0.002152 -0.002036
2020-01-07 -0.004703 -0.009405 0.002092 -0.005239 0.010607 -0.013213 0.014702 -0.006485 -0.01277 0.000343 ... -0.009118 -0.000491 0.012107 -0.006191 -0.005413 -0.014653 -0.006037 -0.002643 -0.011117 -0.009265
2020-01-08 0.016086 0.000756 -0.007809 0.01726 -0.017523 0.008881 0.007557 0.000632 -0.011423 -0.002059 ... 0.015928 -0.00226 0.001876 0.004263 0.015918 0.010728 0.021084 0.017118 0.001846 -0.003432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2025-12-24 0.005324 0.007451 0.001034 0.002407 0.006041 0.002301 0.006947 0.0 -0.000066 0.011129 ... 0.002403 0.04639 -0.003171 0.009149 0.004012 0.004719 0.008559 0.00498 0.01002 0.006402
2025-12-26 -0.001497 -0.003084 0.000602 -0.005377 -0.007884 -0.001302 0.003091 0.001794 -0.003189 -0.008036 ... -0.000635 0.0155 0.01018 0.00173 0.000277 -0.005854 0.012974 -0.000394 0.003968 0.001165
2025-12-29 0.001317 -0.009912 -0.001935 -0.015037 0.003742 -0.00753 0.000564 -0.004734 0.006466 0.005548 ... -0.001251 0.004595 -0.012124 -0.001175 -0.000645 -0.00065 -0.008709 -0.001099 0.0 0.00707
2025-12-30 -0.002484 -0.002852 0.001982 -0.005142 0.005754 -0.002109 -0.001164 -0.004885 0.008742 0.005254 ... 0.00078 -0.000327 -0.003613 -0.003597 0.00286 0.001541 0.009789 -0.002792 0.005435 -0.005421
2025-12-31 -0.004468 -0.004198 -0.007354 -0.009213 -0.006316 -0.007828 -0.003798 -0.004909 0.000657 -0.008886 ... -0.007918 0.041183 -0.005545 -0.005137 -0.006531 -0.007935 -0.006172 -0.008229 0.000737 -0.004557

1508 rows × 30 columns

rets.corr()
ticker AAPL AMGN AMZN AXP BA CAT CRM CSCO CVX DIS ... MSFT NKE NVDA PG SHW TRV UNH V VZ WMT
ticker
AAPL 1.000000 0.372698 0.588565 0.464588 0.408966 0.369408 0.521615 0.540079 0.344762 0.449628 ... 0.709334 0.479990 0.578697 0.397282 0.472431 0.323185 0.320916 0.572172 0.255233 0.370929
AMGN 0.372698 1.000000 0.227641 0.304566 0.182768 0.327351 0.242629 0.403563 0.288820 0.252920 ... 0.347816 0.251003 0.214736 0.453415 0.352975 0.330391 0.377750 0.380859 0.359040 0.306147
AMZN 0.588565 0.227641 1.000000 0.350820 0.302305 0.285735 0.556095 0.402166 0.182239 0.418601 ... 0.665179 0.400072 0.578519 0.195819 0.352554 0.160185 0.183208 0.403118 0.121580 0.290397
AXP 0.464588 0.304566 0.350820 1.000000 0.616399 0.601308 0.401060 0.520066 0.598165 0.627924 ... 0.461831 0.469285 0.388715 0.307739 0.470060 0.564530 0.376842 0.689667 0.277706 0.240530
BA 0.408966 0.182768 0.302305 0.616399 1.000000 0.458273 0.329436 0.387195 0.505067 0.499486 ... 0.371455 0.409898 0.336690 0.218302 0.344662 0.432992 0.276272 0.484157 0.196765 0.177394
CAT 0.369408 0.327351 0.285735 0.601308 0.458273 1.000000 0.307555 0.470481 0.565401 0.487352 ... 0.365237 0.395954 0.350437 0.235023 0.387264 0.450491 0.258833 0.473855 0.285614 0.219436
CRM 0.521615 0.242629 0.556095 0.401060 0.329436 0.307555 1.000000 0.420942 0.265028 0.411995 ... 0.613328 0.373062 0.525674 0.234544 0.408523 0.252194 0.254354 0.483644 0.149421 0.243560
CSCO 0.540079 0.403563 0.402166 0.520066 0.387195 0.470481 0.420942 1.000000 0.418337 0.474927 ... 0.552759 0.421750 0.439537 0.443482 0.465585 0.390867 0.327715 0.548290 0.332182 0.388439
CVX 0.344762 0.288820 0.182239 0.598165 0.505067 0.565401 0.265028 0.418337 1.000000 0.465362 ... 0.322090 0.347589 0.252549 0.243176 0.315675 0.534679 0.359316 0.497087 0.296481 0.180972
DIS 0.449628 0.252920 0.418601 0.627924 0.499486 0.487352 0.411995 0.474927 0.465362 1.000000 ... 0.463806 0.475274 0.386416 0.283895 0.405213 0.388714 0.255559 0.577465 0.272288 0.247206
GS 0.483475 0.328065 0.386454 0.746153 0.556916 0.637624 0.414591 0.524205 0.575054 0.585542 ... 0.479904 0.448737 0.419982 0.302804 0.460269 0.552762 0.347310 0.589768 0.285640 0.272102
HD 0.536499 0.393623 0.418300 0.507309 0.421740 0.429114 0.447851 0.511961 0.411454 0.476504 ... 0.539301 0.507317 0.425609 0.460134 0.689482 0.463651 0.386319 0.544729 0.334142 0.405170
HON 0.478002 0.386366 0.303614 0.676312 0.586627 0.596636 0.363728 0.542813 0.571305 0.569285 ... 0.462178 0.503209 0.338508 0.422328 0.507583 0.543798 0.399333 0.623669 0.337499 0.275358
IBM 0.388039 0.363489 0.251397 0.515687 0.410131 0.469301 0.302104 0.527384 0.452859 0.404337 ... 0.363539 0.343250 0.284737 0.419280 0.412361 0.446925 0.323580 0.507143 0.360402 0.289566
JNJ 0.325489 0.539199 0.125530 0.311145 0.223201 0.296031 0.183663 0.414222 0.345219 0.267851 ... 0.303010 0.270663 0.083326 0.591703 0.360348 0.397157 0.412243 0.427085 0.426305 0.354220
JPM 0.422049 0.321624 0.298973 0.776750 0.569423 0.618827 0.350898 0.508758 0.612632 0.571619 ... 0.427012 0.415975 0.341754 0.338611 0.417677 0.616895 0.362399 0.612432 0.330122 0.261934
KO 0.390138 0.404943 0.166679 0.469218 0.401141 0.344236 0.259800 0.456227 0.420166 0.397577 ... 0.366908 0.365330 0.155023 0.646636 0.443732 0.501940 0.384966 0.533396 0.472627 0.371085
MCD 0.440202 0.367915 0.234836 0.496450 0.432544 0.353689 0.353726 0.461637 0.452902 0.420084 ... 0.426040 0.419180 0.275184 0.479922 0.511896 0.519601 0.413106 0.581576 0.339063 0.306192
MMM 0.382275 0.351190 0.274413 0.509391 0.364273 0.532627 0.294764 0.470543 0.400574 0.440202 ... 0.351507 0.407161 0.271369 0.364037 0.464411 0.423952 0.291942 0.471409 0.289976 0.269999
MRK 0.280452 0.477731 0.130039 0.260542 0.186023 0.256410 0.186759 0.302858 0.318486 0.196915 ... 0.266531 0.274815 0.118893 0.440882 0.313612 0.345820 0.360194 0.355980 0.338002 0.234759
MSFT 0.709334 0.347816 0.665179 0.461831 0.371455 0.365237 0.613328 0.552759 0.322090 0.463806 ... 1.000000 0.428815 0.675745 0.390881 0.469396 0.334754 0.341997 0.579932 0.212997 0.364336
NKE 0.479990 0.251003 0.400072 0.469285 0.409898 0.395954 0.373062 0.421750 0.347589 0.475274 ... 0.428815 1.000000 0.364749 0.342250 0.438689 0.315364 0.237059 0.501999 0.227829 0.258685
NVDA 0.578697 0.214736 0.578519 0.388715 0.336690 0.350437 0.525674 0.439537 0.252549 0.386416 ... 0.675745 0.364749 1.000000 0.170916 0.391632 0.178149 0.217265 0.448263 0.056020 0.223718
PG 0.397282 0.453415 0.195819 0.307739 0.218302 0.235023 0.234544 0.443482 0.243176 0.283895 ... 0.390881 0.342250 0.170916 1.000000 0.427550 0.403076 0.373357 0.446810 0.471544 0.480696
SHW 0.472431 0.352975 0.352554 0.470060 0.344662 0.387264 0.408523 0.465585 0.315675 0.405213 ... 0.469396 0.438689 0.391632 0.427550 1.000000 0.425959 0.349141 0.543030 0.300739 0.325499
TRV 0.323185 0.330391 0.160185 0.564530 0.432992 0.450491 0.252194 0.390867 0.534679 0.388714 ... 0.334754 0.315364 0.178149 0.403076 0.425959 1.000000 0.366954 0.515575 0.360751 0.283174
UNH 0.320916 0.377750 0.183208 0.376842 0.276272 0.258833 0.254354 0.327715 0.359316 0.255559 ... 0.341997 0.237059 0.217265 0.373357 0.349141 0.366954 1.000000 0.422002 0.290063 0.244062
V 0.572172 0.380859 0.403118 0.689667 0.484157 0.473855 0.483644 0.548290 0.497087 0.577465 ... 0.579932 0.501999 0.448263 0.446810 0.543030 0.515575 0.422002 1.000000 0.337077 0.313942
VZ 0.255233 0.359040 0.121580 0.277706 0.196765 0.285614 0.149421 0.332182 0.296481 0.272288 ... 0.212997 0.227829 0.056020 0.471544 0.300739 0.360751 0.290063 0.337077 1.000000 0.282519
WMT 0.370929 0.306147 0.290397 0.240530 0.177394 0.219436 0.243560 0.388439 0.180972 0.247206 ... 0.364336 0.258685 0.223718 0.480696 0.325499 0.283174 0.244062 0.313942 0.282519 1.000000

30 rows × 30 columns

Hide code cell source

import seaborn as sns

corrmat = rets.corr()

mask = np.triu(np.ones_like(corrmat, dtype=bool))

fig, ax = plt.subplots(figsize=(8, 5))
sns.heatmap(corrmat, mask=mask, cmap='Greens', annot=False, fmt=".2f", linewidths=.5, ax=ax,
            cbar_kws={'shrink': 0.6, 'ticks': np.linspace(-1, 1, 13)})
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=6)

ax.set_xlabel("")
ax.set_ylabel("")
ax.tick_params(axis='both', which='both', labelsize=5)
ax.set_xticks(np.arange(len(corrmat.columns)) + 0.5)
ax.set_yticks(np.arange(len(corrmat.columns)) + 0.5)
ax.set_xticklabels(corrmat.columns, rotation=90)
ax.set_yticklabels(corrmat.columns)

plt.show()
_images/e9137056889da9359412f99af1f72c10c577fbb0ca4cefd1db6138ef82346af9.png

Eigenvalues and Eigenvectors#

For an \(N\times N\) matrix \(A\), a scalar \(\lambda\) and non-zero vector \(\mathbf{v}\) satisfying

\[A \mathbf{v} = \lambda \mathbf{v}\]

are called an eigenvalue and eigenvector of \(A\), respectively.

We want to find \(\lambda\) and non-zero \(\mathbf{v}\) such that \(A \mathbf{v} = \lambda \mathbf{v}\). Rearranging:

\[\begin{split}\begin{align} A \mathbf{v} - \lambda \mathbf{v} &= \mathbf{0} \\ \Rightarrow (A - \lambda I)\mathbf{v} &= \mathbf{0}. \end{align}\end{split}\]

This has a non-zero solution \(\mathbf{v}\) if and only if the matrix \((A - \lambda I)\) is singular — that is, it cannot be inverted. A matrix is singular if and only if its determinant is zero, so we require:

\[\det(A - \lambda I) = 0.\]

This is the characteristic equation. Any \(\lambda\) satisfying it is an eigenvalue, and the corresponding eigenvector \(\mathbf{v}\) is then found by solving \((A - \lambda I)\mathbf{v} = \mathbf{0}\) directly.

Since the characteristic equation actually defines a system of \(N\) equations, there are \(N\) solutions, each of which gives an eigenvalue and corresponding eigenvector.

Having found the eigenvalues and eigenvectors, we can collect them into two matrices:

  • \(\boldsymbol{\Lambda}\) is a diagonal matrix of eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N\); and

  • \(\mathbf{V}\) is a matrix whose columns are the corresponding eigenvectors.

This allows us to decompose the matrix \(A\) as follows:

\[A = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}'.\]

In other words, the matrix \(A\) is completely characterized by its eigenvalues and eigenvectors. This is called the eigendecomposition.

Calculation example#

Let

\[\begin{split}A = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}.\end{split}\]

Begin by finding the characteristic equation:

\[\begin{split}\begin{align*} \det(A - \lambda I) = \det\begin{bmatrix} 4-\lambda & 2 \\ 2 & 3-\lambda \end{bmatrix} &= 0 \\ \Rightarrow\; (4-\lambda)(3-\lambda) - 4 &= 0 \\ \Rightarrow\; \lambda^2 - 7\lambda + 12 - 4 &= 0 \\ \Rightarrow\; \lambda^2 - 7\lambda + 8 &= 0 \end{align*}\end{split}\]

Next we solve the equation to find the eigenvalues. There are two in this case, because \(A\) is \(2\times 2\).

\[\lambda = \frac{7 \pm \sqrt{49 - 32}}{2} = \frac{7 \pm \sqrt{17}}{2}\]
\[\lambda_1 = \frac{7 + \sqrt{17}}{2} \approx 5.561, \qquad \lambda_2 = \frac{7 - \sqrt{17}}{2} \approx 1.439.\]

Note that \(\lambda_1 + \lambda_2 = 7 = \text{tr}(A)\).

Next, we solve for the eigenvectors. For each \(\lambda_i\), solve \((A - \lambda_i I)\mathbf{v} = \mathbf{0}\). The first row gives:

\[(4 - \lambda_i)v_1 + 2v_2 = 0 \implies v_2 = \frac{\lambda_i - 4}{2} v_1\]

For \(\lambda_1 \approx 5.561\):

\[v_2 = \frac{5.561 - 4}{2} v_1 = 0.781 \, v_1\]

Setting \(v_1 = 1\), the unnormalized vector is \(\begin{bmatrix} 1 \\ 0.781 \end{bmatrix}\), with length \(\sqrt{1 + 0.781^2} = \sqrt{1.610} \approx 1.269\). Dividing through by this, we get

\[\begin{split}\mathbf{v}_1 = \begin{bmatrix} 0.788 \\ 0.615 \end{bmatrix}\end{split}\]

For \(\lambda_2 \approx 1.439\):

\[v_2 = \frac{1.439 - 4}{2} v_1 = -1.281 \, v_1\]

Setting \(v_1 = 1\), the unnormalized vector is \(\begin{bmatrix} 1 \\ -1.281 \end{bmatrix}\), with length \(\sqrt{1 + 1.281^2} = \sqrt{2.641} \approx 1.625\). Again, dividing by this, we find

\[\begin{split}\mathbf{v}_2 = \begin{bmatrix} 0.615 \\ -0.788 \end{bmatrix}.\end{split}\]

We can now use

\[\begin{split}V = \begin{bmatrix} 0.788 & 0.615 \\ 0.615 & -0.788 \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} 5.561 & 0 \\ 0 & 1.439 \end{bmatrix}\end{split}\]

to confirm the decomposition

\[A = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}'.\]
V = np.array([[0.788, 0.615],
              [0.615, -0.788]])

L = np.array([[5.561, 0],
              [0, 1.439]])

V @ L @ V.T
array([[3.99733536, 1.99760364],
       [1.99760364, 2.99684764]])

There’s some small rounding error here, but this is very close to the original \(A\) matrix.

What is PCA?#

Suppose we have a matrix of returns \(\mathbf{R}\) with \(T\) rows (time periods) and \(N\) columns (stocks). The covariance matrix of returns is the \(N \times N\) matrix

\[\bs = \cov(\mathbf{R}).\]

PCA is the eigendecomposition of the covariance matrix. Each eigenvector (also called the loadings or weights) defines a principal component — a linear combination of the original variables. In the case of stock returns, each principal component is therefore a portfolio return, where the eigenvector elements are the portfolio weights. The corresponding eigenvalue tells us how much variance that portfolio captures.

Geometrically, since \(\mathbf{v}\) is a direction that the matrix only stretches (by factor \(\lambda\)) rather than rotates, the eigenvectors of a covariance matrix identify the directions of greatest spread in the data, and the corresponding eigenvalues measure the amount of variance along each of those directions. For a symmetric matrix like \(\Sigma\), eigenvectors corresponding to distinct eigenvalues are always orthogonal, meaning the principal directions are perpendicular to one another. This orthogonality is what makes PCA a rotation of the original coordinate axes rather than an arbitrary transformation: the data is re-expressed in a new basis of mutually perpendicular directions, ordered from most to least variance.

The scatter plot below shows 5,000 draws from a bivariate normal distribution with positive covariance between X and Y. The two arrows are the eigenvectors of the covariance matrix, and their lengths are proportional to \(\sqrt{\lambda_i}\) — the standard deviation along each principal axis.

The longer axis Z points in the direction of greatest variance in the data: it aligns roughly with the major axis of the elliptical cloud. The shorter axis W is perpendicular to Z and captures the remaining variance. Together, Z and W form a new coordinate system that is rotated relative to the original X-Y axes.

The dashed lines extend each eigenvector direction across the full plot, making it clear that PCA is identifying the natural axes of the data rather than the arbitrary axes we happened to measure it on.

If we project each data point onto Z and W instead of X and Y, the resulting coordinates will be uncorrelated, as shown in the second figure. Projection here simply means measuring how far each point lies along each axis — the same way an X coordinate measures how far a point lies along the horizontal axis, but now measured along the tilted PC axes instead. This is the core operation of PCA.

Hide code cell source

from sklearn.decomposition import PCA

# Generate correlated 2D data
np.random.seed(42)
mean = [4, 4]
cov = [[4, 3], [3, 3]]
X = np.random.multivariate_normal(mean, cov, 5000)

# Fit PCA
pca = PCA(n_components=2)
pca.fit(X)

# Eigenvectors scaled by eigenvalues for arrow length
origin = X.mean(axis=0)
vectors = pca.components_
values = pca.explained_variance_

def draw_original_axes(ax):
    ax.scatter(X[:, 0], X[:, 1], alpha=0.15, s=5, color='gray')

    for vec, val, label in zip(vectors, values, ['Z', 'W']):
        scaled = vec * np.sqrt(val) * 2
        ax.annotate('', xy=origin + scaled, xytext=origin,
                    arrowprops=dict(arrowstyle='->', color='black', lw=2))
        ax.text(*(origin + scaled * 1.15), label, fontsize=12, fontweight='bold')

    for vec in vectors:
        t = np.linspace(-10, 10, 100)
        line = origin[:, None] + vec[:, None] * t
        ax.plot(line[0], line[1], color='red', linestyle='-.', linewidth=1)

fig, axes = plt.subplots(2, 1, figsize=(6, 10))

# Top panel: original data with PC axes
ax = axes[0]
draw_original_axes(ax)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(-7, 12)
ax.set_ylim(-6, 12)
ax.set_aspect('equal')
ax.set_title('Original coordinates (X, Y)')

# Bottom panel: projected (PCA) coordinates
X_transformed = pca.transform(X)

ax = axes[1]
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], alpha=0.15, s=5, color='gray')
ax.axhline(0, color='red', linestyle='-.', linewidth=1)
ax.axvline(0, color='red', linestyle='-.', linewidth=1)
ax.annotate('', xy=(np.sqrt(values[0]) * 2, 0), xytext=(0, 0),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.annotate('', xy=(0, np.sqrt(values[1]) * 2), xytext=(0, 0),
            arrowprops=dict(arrowstyle='->', color='black', lw=2))
ax.text(np.sqrt(values[0]) * 2.15, 0.2, 'Z', fontsize=12, fontweight='bold')
ax.text(0.2, np.sqrt(values[1]) * 2.15, 'W', fontsize=12, fontweight='bold')

corr = np.corrcoef(X_transformed[:, 0], X_transformed[:, 1])[0, 1]
ax.set_xlabel('PC1 (Z)')
ax.set_ylabel('PC2 (W)')
ax.set_title(f'PC coordinates (Z, W)  —  correlation = {corr:.4f}')
ax.set_aspect('equal')

plt.tight_layout()
plt.show()
_images/96b17798b26d84d07cc99a79b09ca23ce29cc7c0efd6190a31c87c1e427970f1.png

Key fact

PCA finds orthogonal (uncorrelated) linear combinations of the original variables that capture the maximum amount of variance. The fraction of total variance explained by the \(k\)-th principal component is

\[\frac{\lambda_k}{\sum_{j=1}^N \lambda_j}.\]

The first principal component captures the most variance, the second captures the most of what’s left, and so on.

The sum of the eigenvalues equals the sum of variances, which equals the trace of the covariance matrix:

\[\sum_{j=1}^N \lambda_j = \text{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^N \var(R_i).\]

Example: The 2×2 Case#

We can gain some intuition by considering the simplest case of two stocks. Let the variance-covariance matrix be:

\[\begin{split}\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}\end{split}\]

Step 1: Find the Characteristic Equation

\[\begin{align*} \det(\Sigma - \lambda I) = \det\begin{bmatrix} \sigma_1^2 - \lambda & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 - \lambda \end{bmatrix} = 0 \\[2em] \Rightarrow (\sigma_1^2 - \lambda)(\sigma_2^2 - \lambda) - \rho^2\sigma_1^2\sigma_2^2 = 0 \\[2em] \Rightarrow \lambda^2 - (\sigma_1^2 + \sigma_2^2)\lambda + \sigma_1^2\sigma_2^2(1 - \rho^2) = 0 \end{align*}\]

Step 2: Solve for the eigenvalues

Applying the quadratic formula, with discriminant:

\[\Delta = (\sigma_1^2 + \sigma_2^2)^2 - 4\sigma_1^2\sigma_2^2(1-\rho^2) = (\sigma_1^2 - \sigma_2^2)^2 + 4\rho^2\sigma_1^2\sigma_2^2\]

gives:

\[\lambda_{1,2} = \frac{(\sigma_1^2 + \sigma_2^2) \pm \sqrt{(\sigma_1^2 - \sigma_2^2)^2 + 4\rho^2\sigma_1^2\sigma_2^2}}{2}\]

Step 3: Find the eigenvectors

For each eigenvalue, solve \((\Sigma - \lambda I)\mathbf{v} = \mathbf{0}\). The first row gives:

\[(\sigma_1^2 - \lambda)v_1 + \rho\sigma_1\sigma_2\, v_2 = 0\]

so the eigenvector corresponding to each \(\lambda\) is proportional to:

\[\begin{split}\mathbf{v} \propto \begin{bmatrix} \rho\sigma_1\sigma_2 \\ \lambda - \sigma_1^2 \end{bmatrix}\end{split}\]

Substituting \(\lambda_1\) and \(\lambda_2\) gives two orthogonal eigenvectors, which can be normalized by dividing by their lengths:

\[\begin{split}\mathbf{v}_i = \frac{1}{\|\mathbf{v}_i\|}\begin{bmatrix} \rho\sigma_1\sigma_2 \\ \lambda_i - \sigma_1^2 \end{bmatrix}, \quad \|\mathbf{v}_i\| = \sqrt{(\rho\sigma_1\sigma_2)^2 + (\lambda_i - \sigma_1^2)^2}\end{split}\]

There are two special cases worth considering:

Special Case #1: Equal Variances \((\sigma_1^2 = \sigma_2^2 = \sigma^2)\)

Substituting into the eigenvalues, the \((\sigma_1^2 - \sigma_2^2)^2\) term vanishes:

\[\lambda_{1,2} = \frac{2\sigma^2 \pm \sqrt{4\rho^2\sigma^4}}{2} = \frac{2\sigma^2 \pm 2\rho\sigma^2}{2}\]
\[\boxed{\lambda_1 = \sigma^2(1+\rho), \qquad \lambda_2 = \sigma^2(1-\rho)}\]

The eigenvalues are symmetric around \(\sigma^2\), and total variance \(\lambda_1 + \lambda_2 = 2\sigma^2\) is preserved. As \(\rho \to 1\), all variance concentrates in PC1 and \(\lambda_2 \to 0\) — the two variables become a single factor. As \(\rho \to 0\), both eigenvalues equal \(\sigma^2\) and the two PCs are equivalent.

For the eigenvectors, substituting \(\sigma_1^2 = \sigma_2^2 = \sigma^2\) into the general formula \(\mathbf{v} \propto \begin{bmatrix} 2\rho\sigma_1\sigma_2 \\ \sigma_2^2 - \sigma_1^2 \pm \sqrt{\Delta} \end{bmatrix}\), the \((\sigma_2^2 - \sigma_1^2)\) terms vanish and \(\sqrt{\Delta} = 2\rho\sigma^2\):

\[\begin{split}\mathbf{v}_1 \propto \begin{bmatrix} 2\rho\sigma^2 \\ 2\rho\sigma^2 \end{bmatrix} \propto \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \mathbf{v}_2 \propto \begin{bmatrix} 2\rho\sigma^2 \\ -2\rho\sigma^2 \end{bmatrix} \propto \begin{bmatrix} 1 \\ -1 \end{bmatrix}\end{split}\]

The principal axes are the 45° diagonals, regardless of \(\rho\) (as long as \(\rho \neq 0\)). PC1 points in the direction of the sum \(X_1 + X_2\) — the direction along which the two variables move together — and PC2 points in the direction of the difference \(X_1 - X_2\), along which they move against each other.

This has a natural financial interpretation: if two assets have equal variance, the first principal component is essentially the average return (long both, like a portfolio) and the second is the spread (long one, short the other, like a pairs trade). The more correlated the assets, the more variance is concentrated in the sum direction and the less in the spread direction.

Special Case #2: \(\rho = 0\)

When the variables are uncorrelated the matrix is already diagonal, and the eigenvalues and eigenvectors reduce to:

\[\begin{split}\lambda_1 = \sigma_1^2, \quad \lambda_2 = \sigma_2^2, \qquad \mathbf{v}_1 = \begin{bmatrix}1\\0\end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix}0\\1\end{bmatrix}\end{split}\]

The principal axes coincide with the original coordinate axes, as expected.

As an example, let’s take only the first three stocks in our data, and calculate the eigenvector decomposition from their covariances.

cov = (rets.iloc[:, :3] * 100).cov()

cov
ticker AAPL AMGN AMZN
ticker
AAPL 4.015538 1.242392 2.652895
AMGN 1.242392 2.767313 0.851792
AMZN 2.652895 0.851792 5.059491
eigvals, eigvecs = np.linalg.eigh(cov)

# need to sort the eigenvalues and eigenvectors in descending order
idx = np.argsort(eigvals)[::-1]   # descending order
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]
eigvals
array([7.67092163, 2.56083097, 1.6105894 ])
np.diag(eigvals)
array([[7.67092163, 0.        , 0.        ],
       [0.        , 2.56083097, 0.        ],
       [0.        , 0.        , 1.6105894 ]])
eigvecs
array([[-0.6245524 , -0.20777194, -0.75283804],
       [-0.28457398, -0.83714691,  0.46712172],
       [-0.72729083,  0.50598011,  0.46371562]])

Note that we can recover the covariance matrix frmo the eigenvectors and eigenvalues using the equation above.

eigvecs @ np.diag(eigvals) @ eigvecs.T
array([[4.01553827, 1.24239198, 2.65289473],
       [1.24239198, 2.76731284, 0.8517921 ],
       [2.65289473, 0.8517921 , 5.05949088]])

As expected, the sum of the eigenvalues equals the sum of the variances:

sum(eigvals)
11.842341993240536
np.diag(cov).sum()
11.84234199324054

The three eigenvalues and corresponding proportion of variance explained by each principal component are:

for i, eigval in enumerate(eigvals):
    print(f"Eigenvalue {i+1}: {eigval:.4f} ({eigval / sum(eigvals) * 100:.1f}%)")
Eigenvalue 1: 7.6709 (64.8%)
Eigenvalue 2: 2.5608 (21.6%)
Eigenvalue 3: 1.6106 (13.6%)

Fitting PCA#

Now we can fit PCA to all 30 stocks in the Dow index.

pca = PCA()
pca.fit(rets)
PCA()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
print(f'Number of components: {pca.n_components_}')
print(f'Variance explained by first PC: {pca.explained_variance_ratio_[0]:.1%}')
print(f'Variance explained by first 3 PCs: {pca.explained_variance_ratio_[:3].sum():.1%}')
print(f'Variance explained by first 10 PCs: {pca.explained_variance_ratio_[:10].sum():.1%}')
Number of components: 30
Variance explained by first PC: 43.8%
Variance explained by first 3 PCs: 60.2%
Variance explained by first 10 PCs: 79.3%

Scree plot#

A scree plot shows the fraction of variance explained by each principal component. The name comes from the geological term for the pile of rocks at the base of a cliff — the plot looks like a steep cliff followed by a flat pile of rubble.

K = 10  # show first K components

fig, ax = plt.subplots(figsize=(6, 3))

# Bar chart of individual explained variance
ax.bar(range(1, K+1), pca.explained_variance_ratio_[:K], 
       color='steelblue', edgecolor='k', alpha=0.7, label='Individual')

# Line plot of cumulative explained variance
ax2 = ax.twinx()
ax2.plot(range(1, K+1), np.cumsum(pca.explained_variance_ratio_[:K]), 
         'o-', color='darkred', lw=2, ms=5, label='Cumulative')

ax.set_xlabel('Principal Component')
ax.set_ylabel('Fraction of Variance Explained')
ax2.set_ylabel('Cumulative Variance Explained')
ax.set_title('Scree Plot')
ax.set_xticks(range(1, K+1))

# Combine legends
lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, loc='center right')

ax.grid(alpha=0.3, axis='y')

plt.show()
_images/f01661d6c98cd8c97c63fdffe0add88c788dd32283c36648caf7633ff4cab1a2.png

The first principal component dominates — it explains far more variance than any other single component. This is a well-known feature of stock returns: stocks tend to move together, driven by a single common “market” factor.

Interpreting principal components#

Each principal component is defined by a vector of loadings — one weight for each stock. Let’s look at the loadings of the first three principal components.

Hide code cell source

fig, axes = plt.subplots(3, 1, figsize=(18, 12), sharey=False)

for k, ax in enumerate(axes):
    pc = pca.components_[k]
    pc = pc * np.sign(pc.sum())
    pc = pc / pc.sum()

    ax.bar(range(len(pc)), pc, edgecolor='k', alpha=0.7)
    ax.set_xlabel('Stock')
    ax.set_title(f'Principal Component {k+1}')
    ax.set_xticks(range(len(pc)))
    ax.set_xticklabels(rets.columns, rotation=90)
    ax.grid(alpha=0.3, axis='y')

axes[0].set_ylabel('Loading')
fig.tight_layout()
plt.show()
_images/6280d53b73b54434b01b3530852089015620f9f355c1537ef1522f432839d3cb.png

The first principal component has positive weight on all stocks, and the weights are fairly close to 0.03, which corresponds to an equal weight on all stocks in the index. In other words, PC1 corresponds pretty closely to the market portfolio. We can see this by looking at the correlations among an equal-weighted Dow portfolio, the Fama–French market portfolio, and PC1.

# 1) Equal-weighted portfolio return across the 30 Dow stocks
ew_port = rets.mean(axis=1)

# 2) Market return from Fama-French daily factors
ff = db.get_table('ff', 'factors_daily')
ff['date'] = pd.to_datetime(ff['date'])
ff = ff.set_index('date').sort_index()
mkt = ff['mktrf'] + ff['rf']

# 3) First principal-component score time series
pc1 = pd.Series(pca.transform(rets)[:, 0], index=rets.index, name='pc1')

# Align dates and compute pairwise correlations
corr_df = pd.concat([
    ew_port.rename('ew_port'),
    mkt.rename('mkt'),
    pc1
], axis=1).dropna()

corr_df.corr()
ew_port mkt pc1
ew_port 1.000000 0.953132 0.992252
mkt 0.953132 1.000000 0.964445
pc1 0.992252 0.964445 1.000000

PC1 maximizes average \(R^2\)#

We’ve seen that PC1 looks a lot like the market — it has positive weight on every stock and explains the largest share of total variance. But we can make a stronger claim: PC1 is the single factor that best explains individual stock returns, in the sense of maximizing the average \(R^2\) across all stocks.

To see this, consider forming a candidate factor as a portfolio return:

\[f_t = \mathbf{w}'\mathbf{r}_t\]

where \(\mathbf{w}\) is some weight vector. For each stock \(i\), regress its return on the factor:

\[r_{i,t} = \alpha_i + \beta_i f_t + \varepsilon_{i,t}\]

and record \(R^2_i\). Then compute the average across all 30 stocks:

\[\bar{R}^2 = \frac{1}{30}\sum_{i=1}^{30} R^2_i.\]

The claim is: the weight vector \(\mathbf{w}\) that maximizes \(\bar{R}^2\) is precisely the first eigenvector.

We can verify this numerically. Below, we try 100,000 random weight vectors and compute \(\bar{R}^2\) for each. We then compare the best random vector to the first eigenvector from PCA.

def r2s_vectorized(R, f):
    '''
    Compute R^2 between each column of R and factor f.

    Parameters
    ----------
    R : array (T, N)
        Matrix of returns (T time periods, N assets)
    f : array (T,)
        Factor

    Returns
    -------
    r2s : array (N,)
        R^2 for regression of each column of R on f with intercept
    '''

    R = np.asarray(R)
    f = np.asarray(f)

    # Demean (equivalent to including intercept)
    Rc = R - R.mean(axis=0)
    fc = f - f.mean()

    # Covariance with factor
    cov = fc @ Rc

    # Variances
    var_f = fc @ fc
    var_R = np.sum(Rc**2, axis=0)

    # Correlation
    corr = cov / np.sqrt(var_f * var_R)

    return corr**2

def avg_r2(w, R):
    """Average R² from regressing each column of R on the factor w'R."""
    f = R @ w
    r2s = r2s_vectorized(R, f)
    return r2s.mean()


R = rets.to_numpy(dtype=float)
N = R.shape[1]

# --- Random search over 100,000 weight vectors ---
n_trials = 100_000
rng_search = np.random.default_rng(42)
best_r2_random = 0
best_w_random = None

for _ in range(n_trials):
    w = rng_search.normal(size=N)
    w = w / np.linalg.norm(w)   # normalize so weights sum to one
    r2 = avg_r2(w, R)
    if r2 > best_r2_random:
        best_r2_random = r2
        best_w_random = w

# --- First eigenvector from PCA ---
w_pc1 = pca.components_[0]
r2_pc1 = avg_r2(w_pc1, R)

print(f'Average R² using PC1 weights:           {r2_pc1:.6f}')
print(f'Best average R² from 100k random tries: {best_r2_random:.6f}')
Average R² using PC1 weights:           0.415639
Best average R² from 100k random tries: 0.410664

No random weight vector can beat PC1. This is not a coincidence — it is a mathematical property of the eigendecomposition. The first eigenvector of the covariance matrix defines the direction that maximizes the variance of the projected data, and maximizing the variance of a single factor is equivalent to maximizing the average \(R^2\) of individual regressions on that factor.

Interestingly, the random weights that come closest to explaining the most variation are nothing like the weights in PC1:

best_w_random
array([-0.12656536, -0.0178549 , -0.02130421, -0.09306356, -0.12076946,
       -0.13255932,  0.03796464, -0.22180042,  0.02787833, -0.10674491,
       -0.08927331, -0.26148374, -0.07184436, -0.1783589 ,  0.15739297,
       -0.30351353, -0.19744539, -0.06131585, -0.16368019, -0.44736973,
       -0.21245458, -0.07185023,  0.0275325 , -0.50145917, -0.08734835,
        0.03076371,  0.03924356,  0.05684753, -0.22145143, -0.12161234])
w_pc1
array([0.20253139, 0.11101268, 0.18420322, 0.26813552, 0.3009445 ,
       0.19434087, 0.21281547, 0.16601456, 0.19503764, 0.20906604,
       0.2318724 , 0.17974964, 0.18446533, 0.15243533, 0.08168023,
       0.21578408, 0.10660011, 0.12974605, 0.16362666, 0.0893042 ,
       0.19075013, 0.20321203, 0.30073486, 0.09092254, 0.16866732,
       0.15474742, 0.1457521 , 0.18887908, 0.07421308, 0.08463696])
def angle_between(u, v):
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(cos_theta)  # radians

angle_rad = angle_between(best_w_random, w_pc1)
angle_deg = np.degrees(angle_rad)

print(f'Angle between best random weights and PC1: {angle_deg:.2f} degrees')
Angle between best random weights and PC1: 122.24 degrees

Now suppose we want to find the best second factor. We already know PC1 is the optimal single factor. The natural next question is: keeping PC1 fixed, what additional factor \(g_t = \mathbf{w}'\mathbf{r}_t\) maximizes the average \(R^2\) from a two-factor regression?

\[r_{i,t} = \alpha_i + \beta_{i,1} f_t + \beta_{i,2} g_t + \varepsilon_{i,t}\]

In other words, we want the second factor that explains the most of what PC1 leaves unexplained.

By appealing to something called the Frisch-Waugh theorem, we know that the \(R^2\) improvement from adding \(g_t\) to a regression that already includes \(f_t\) is the same as the \(R^2\) from regressing the residuals from that regression on \(g_t\). So we can implement this in two steps: first regress each stock on PC1 and collect the residuals \(\hat{\varepsilon}_{i,t}\), then search for the weight vector \(\mathbf{w}\) that maximizes \(\bar{R}^2\) when we regress each stock’s residuals on \(g_t = \mathbf{w}'\hat{\boldsymbol{\varepsilon}}_t\). The claim is that this optimal \(\mathbf{w}\) is the second eigenvector.

# Step 1: regress each stock on PC1 and collect residuals
f1 = R @ w_pc1
fc1 = f1 - f1.mean()
betas = (fc1 @ (R - R.mean(axis=0))) / (fc1 @ fc1)
resids = R - R.mean(axis=0) - np.outer(fc1, betas)   # (T, N)

# Step 2: search over random weight vectors applied to the residuals
best_r2_random2 = 0
best_w_random2 = None
rng_search2 = np.random.default_rng(123)

for _ in range(n_trials):
    w = rng_search2.normal(size=N)
    w = w / np.linalg.norm(w)
    r2 = avg_r2(w, resids)
    if r2 > best_r2_random2:
        best_r2_random2 = r2
        best_w_random2 = w

# Step 3: compare to the second eigenvector
w_pc2 = pca.components_[1]
r2_pc2 = avg_r2(w_pc2, resids)

print(f'Average R² of residuals using PC2 weights:  {r2_pc2:.6f}')
print(f'Best average R² from 100k random tries:     {best_r2_random2:.6f}')
Average R² of residuals using PC2 weights:  0.132291
Best average R² from 100k random tries:     0.125119

Finally, as expected, the angle between PC1 and PC2 confirms that they are orthogonal.

angle_rad = angle_between(pca.components_[0], pca.components_[1])
angle_deg = np.degrees(angle_rad)

print(f'Angle between PC1 and PC2: {angle_deg:.2f} degrees')
Angle between PC1 and PC2: 90.00 degrees

Variance decomposition#

How does PCA’s variance decomposition compare to the factor model approach from the previous chapter? In the Fama-French 3-factor model, the \(R^2\) tells us how much of a stock’s return variance is explained by the three factors. Here, we can ask: how many principal components do we need to explain a comparable fraction of variance?

cumvar = np.cumsum(pca.explained_variance_ratio_)

# Find how many PCs needed for various thresholds
for threshold in [0.25, 0.50, 0.75, 0.90]:
    n_pcs = np.argmax(cumvar >= threshold) + 1
    print(f'{threshold:.0%} of variance explained by {n_pcs} principal components')
25% of variance explained by 1 principal components
50% of variance explained by 2 principal components
75% of variance explained by 8 principal components
90% of variance explained by 18 principal components

Exercise

In the previous chapter, the average \(R^2\) from 3-factor regressions was roughly 0.15–0.35 depending on the stock. How does this compare to the fraction of total variance explained by the first 3 principal components? Why might these numbers differ?

Hint: Think about what “variance explained” means in each context. The factor model’s \(R^2\) is computed stock-by-stock, while PCA’s explained variance ratio is computed across all stocks simultaneously.