Build statistical factor models with PCA

Many beginners get confused when trying to figure out what really drives asset returns.

There are so many factors, and it’s hard to know which ones matter and which are just random.

Professionals use Python to run Principal Component Analysis, so they can cut through the noise and identify which factors are actually moving markets and driving risk.

By reading today’s newsletter, you’ll get Python code to build a PCA-based factor model, find the true risk drivers, and use those factors for robust risk analysis in your portfolio.

Let's go!

Principal Component Analysis (PCA) is a core technique for extracting statistical factors that drive asset returns in quantitative investing. It lets you decompose noisy return data into a small number of uncorrelated components that capture most of the shared variation across assets.

PCA has roots in multivariate statistics and found early traction in finance for dimensionality reduction.

Since the 1950s, PCA has helped people make sense of high-dimensional data. In finance, statistical factor models gained prominence as computing power and data availability made data-driven risk management practical. Modern quant shops rely on PCA to separate common drivers of returns from asset-specific noise.

These roots in theory connect directly to robust tools in professional investing now.

PCA is used to build better risk models, estimate stable covariance matrices, and construct portfolios that are more resilient to hidden risks. Professionals use it to identify independent return streams, flag market regime changes, and cleanly neutralize exposures in systematic strategies.

Let's see how it works with Python.

Imports and setup

We use matplotlib for plotting charts, numpy for numeric operations, statsmodels for statistical analysis and regression, yfinance for downloading historical price data, and scikit-learn for principal component analysis (PCA).

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import yfinance as yf
from sklearn.decomposition import PCA
from statsmodels import regression

This block selects a group of sector and asset tickers, pulls one year of daily historical closing prices using yfinance, and then calculates daily returns for this group.

tickers = ["SPY", "XLE", "XLY", "XLP", "XLI", "XLU", "XLK", "XBI", "XLB", "XLF", "GLD"]
price_data = yf.download(tickers, period="1y").Close
returns = price_data.pct_change().dropna()
returns = returns.fillna(0)

We set up a list of ticker symbols, fetch one year of their closing price history, and turn that price data into a table of daily returns for each symbol. We clean the results by dropping missing values and filling any leftovers with zeros, making analysis easier.
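To make the cleaning step concrete, here is a minimal sketch on made-up prices (the tickers "AAA" and "BBB" and their values are hypothetical). It suggests that, with default arguments, dropna() already removes every row containing a missing value, so the fillna(0) that follows acts as a safeguard rather than changing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices for two made-up tickers (not market data).
prices = pd.DataFrame(
    {"AAA": [100.0, 101.0, 102.01], "BBB": [50.0, 49.0, 49.98]}
)

# The first row has no previous price, so its return is NaN.
raw_returns = prices.pct_change()

# dropna() removes any row that contains a NaN.
cleaned = raw_returns.dropna()

# Nothing is left to fill, so fillna(0) returns the same values.
filled = cleaned.fillna(0)

print(raw_returns)
print(filled)
```

If you would rather keep days where only some tickers are missing, dropna(how="all") followed by fillna(0) is one alternative, at the cost of treating missing quotes as zero returns.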

Analyze and visualize components

This block fits principal component analysis (PCA) on the returns and keeps the smallest number of components needed to explain 90% of the total variance. It also stores the share of variance explained by each component.

# A float n_components keeps just enough components to explain that share of variance
pca = PCA(n_components=0.9, svd_solver="full")
pca.fit(returns)

n_components_90 = pca.n_components_
components_90 = pca.components_
explained_variance = pca.explained_variance_ratio_

plt.figure(figsize=(10, 6))
plt.bar(range(1, n_components_90 + 1), explained_variance, alpha=0.7)
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title(f"Components needed to explain 90% of variance: {n_components_90}")
plt.tight_layout()
plt.show()

Here we use PCA to break the returns down into a few components that explain most of the variation. Setting n_components=0.9 tells scikit-learn to keep just enough components to explain 90% of the total variance, so we retain only the most informative directions in the data. We save the number of components kept, their weights on each asset, and the share of total variance each one explains.
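To see what the 90% setting does, the sketch below fits PCA on synthetic data (one made-up common factor plus noise, not real returns) and reproduces the component count by hand from the cumulative explained variance ratio. All names here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic "returns": one common factor plus idiosyncratic noise (assumed data).
n_days, n_assets = 500, 8
market = rng.normal(0, 0.01, size=(n_days, 1))
betas = rng.uniform(0.5, 1.5, size=(1, n_assets))
noise = rng.normal(0, 0.005, size=(n_days, n_assets))
toy_returns = market * betas + noise

# Fit every component, then find the smallest k whose cumulative ratio exceeds 90%.
full = PCA(svd_solver="full").fit(toy_returns)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# A float n_components asks scikit-learn to make the same choice for us.
k_sklearn = PCA(n_components=0.9, svd_solver="full").fit(toy_returns).n_components_

print(cumulative.round(3))
print(k_manual, k_sklearn)
```

Printing the cumulative ratios is a quick way to judge how sensitive the component count is to the threshold you pick.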

We then plot the results.

We build a bar chart that shows, for each component PCA kept, the share of total return variance it explains. The chart helps us spot which components matter and how many we need to describe most of our data's behavior.

Visualize relationships between two assets

This block scales each return series to unit standard deviation, picks the first two columns of the returns table (which may not be in the same order as the tickers list), and fits a two-component PCA to find the main directions of joint movement. It then plots the two series together with the directions of both components.

# Scale each return series to unit standard deviation
r = returns / returns.std()
r1_s, r2_s = r.iloc[:, 0], r.iloc[:, 1]

# Fit a separate two-component PCA so both directions are always available
pca_s = PCA(n_components=2)
pca_s.fit(np.vstack((r1_s, r2_s)).T)
components_s = pca_s.components_
evr_s = pca_s.explained_variance_ratio_

plt.figure(figsize=(8, 6))
plt.scatter(r1_s, r2_s, alpha=0.5, s=10)

# Draw each component direction, scaled by its explained variance ratio
xs = np.linspace(r1_s.min(), r1_s.max(), 100)
plt.plot(xs * components_s[0, 0] * evr_s[0], xs * components_s[0, 1] * evr_s[0], "r")
plt.plot(xs * components_s[1, 0] * evr_s[1], xs * components_s[1, 1] * evr_s[1])
plt.show()

We arrange the returns to all use the same scale, pick out just the first two, and recompute PCA for those.

The result is a chart that looks like this.

The scatter plot shows how these two returns move together. The overlaid lines make it easy to see which direction in the paired data is the most dominant according to PCA, letting us spot which relationship matters most.
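For two series scaled to unit variance, the PCA directions have a simple form: the components lie along the two diagonals, and the first explains a share of (1 + |ρ|) / 2 of the variance, where ρ is the sample correlation. The sketch below checks this on synthetic data. The target correlation of 0.6 is an arbitrary assumption, not an estimate for any real pair.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic pair of correlated series (assumed data, not real returns).
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
pair = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

# Same scaling as returns / returns.std(): unit sample standard deviation.
pair = pair / pair.std(axis=0, ddof=1)

pca_pair = PCA(n_components=2).fit(pair)
sample_rho = np.corrcoef(pair.T)[0, 1]

# Components point along the diagonals, up to sign.
print(pca_pair.components_.round(3))

# The first ratio matches (1 + |rho|) / 2 computed from the sample correlation.
print(pca_pair.explained_variance_ratio_.round(3), round((1 + abs(sample_rho)) / 2, 3))
```

A more correlated pair therefore shows a more elongated cloud and a longer first line in the chart.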

Model a single asset using components

We can use PCA to compute the factors that best describe returns without specifying what the factors represent. These factors will be portfolios of the assets we are considering. After we pick the portfolios we want to use as factors and compute their returns, we can estimate the loadings with linear regression.

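# Each factor's daily return is the component-weighted sum of asset returns
factor_returns = np.array(
    [(components_90[i] * returns).T.sum() for i in range(n_components_90)]
)

# Regress the first column's returns on the factor returns (plus an intercept)
asset = returns.columns[0]
mlr = regression.linear_model.OLS(
    returns[asset], sm.add_constant(factor_returns.T)
).fit()
print(f"Regression coefficients for {asset}:\n{mlr.params.round(4)}")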

We compute each factor's daily return as the component-weighted sum of the asset returns, which makes every factor a portfolio of the assets. Then we regress a single asset's returns on these factor returns. The printed coefficients are the intercept followed by the asset's loading on each factor.
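The factor-return loop can also be written as one matrix product. The sketch below checks that equivalence on synthetic data for four hypothetical assets and then estimates loadings with np.linalg.lstsq, used here in place of statsmodels OLS only to keep the example light on dependencies. The names and data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic returns for four hypothetical assets (assumed data, not market data).
common = rng.normal(0, 0.01, size=(250, 1)) * np.array([[1.0, 0.8, 1.2, 0.5]])
toy = pd.DataFrame(
    common + rng.normal(0, 0.004, size=(250, 4)), columns=["A", "B", "C", "D"]
)

pca_toy = PCA(n_components=2).fit(toy)
weights = pca_toy.components_  # shape (2, 4): one weight vector per factor

# Loop form versus a single matrix product.
loop_form = np.array([(weights[i] * toy).T.sum() for i in range(2)])
factor_returns = toy.to_numpy() @ weights.T  # shape (250, 2)
same = np.allclose(loop_form.T, factor_returns)

# Regress asset "A" on the factor returns, with an intercept column.
X = np.column_stack([np.ones(len(toy)), factor_returns])
y = toy["A"].to_numpy()
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared: share of the asset's variance explained by the factors.
residuals = y - X @ coefs
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(same)
print(coefs.round(4), round(float(r_squared), 3))
```

These factor returns differ from pca_toy.transform(toy) only by a constant per factor, because transform subtracts each asset's mean before projecting. The regression intercept absorbs that offset, so the loadings are the same either way.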

Your next steps

You’ve just broken down asset returns and visualized how risk concentrates using PCA. Now swap in different tickers or extend the date range to see how component structure shifts. Try plotting other asset pairs in the scatter section for a feel of sector relationships. Don’t be afraid to tweak n_components to watch how the summary changes with more or fewer factors.