How to download (and clean) free market data

Bad market data ruins your strategies.
If you use Yahoo Finance in Python, you've seen gaps and incorrect prices. Bad data wrecks backtests and P&L.
Pros fix this problem with Python.
Today’s newsletter includes Python code to download, clean, and check Yahoo Finance data.
Here's how it works.
Getting clean financial market data in Python means building rock-solid workflows to fetch, validate, and document historical prices—essential for any quant or algorithmic trading project. Unclean data leads to misleading backtests, broken models, and results you can't trust.
Historically, as finance moved from spreadsheets to automated trading, easy-access sources like Yahoo Finance became crucial for quant research and prototyping.
Yahoo Finance started out as a free data hub and was quickly adopted by retail traders and professionals for its broad ticker coverage. Although Yahoo offers no official API, its data has underpinned many early-stage strategies. Long-standing gaps and anomalies, however, make raw downloads risky for anything beyond research. As quant finance matured, practitioners learned the hard way that dirty data destroys performance and credibility.
A modern workflow demands strict, transparent cleaning: logging every adjustment, checking splits and dividends, fixing date gaps, syncing to official exchange calendars, and always saving what changed. Automated pipelines, cross-checked against a second data source, keep your research and model code honest and robust in production.
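To make one of those checks concrete, here is a minimal sketch of date-gap detection. The function name find_missing_sessions is our own, and it assumes that weekdays approximate trading sessions. Exchange holidays will show up as false positives unless you swap in a real exchange calendar.

```python
import pandas as pd


def find_missing_sessions(prices: pd.DataFrame) -> pd.DatetimeIndex:
    """Return weekdays between the first and last row that have no price row.

    Assumption: weekdays stand in for trading sessions, so exchange
    holidays are reported as missing too.
    """
    dates = pd.DatetimeIndex(prices.index)
    if len(dates) == 0:
        return pd.DatetimeIndex([])
    # Strip any timezone and time-of-day so dates compare cleanly.
    if dates.tz is not None:
        dates = dates.tz_localize(None)
    dates = dates.normalize()
    expected = pd.bdate_range(dates.min(), dates.max())
    return expected.difference(dates)


# Synthetic example: two weeks of weekdays with one session removed.
index = pd.bdate_range("2024-01-01", "2024-01-12").delete(3)
sample = pd.DataFrame({"Close": range(len(index))}, index=index)
missing = find_missing_sessions(sample)
print([d.date().isoformat() for d in missing])  # ['2024-01-04']
```

Treat the output as a list of dates to investigate, not as proof that data is missing.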
Real pros don't shortcut data integrity. Anyone running backtests, preparing quant strategies, or building trading tools in Python builds these validation routines into their daily process.
Let's see how it works with Python.
Imports and setup
We rely on these libraries to pull stock, options, and financial data from Yahoo Finance, and to handle and organize the results in convenient tables.
import pandas as pd
import yfinance as yf
We're defining a small list of stock tickers that we'll analyze in the next steps.
stock_symbols = ["AAPL", "MSFT", "GOOGL"]
Here, we're pulling past market prices, available options, and key financial statements for each company in our list.
market_data = {
    symbol: yf.Ticker(symbol).history(period="1y") for symbol in stock_symbols
}
options_data = {symbol: yf.Ticker(symbol).option_chain() for symbol in stock_symbols}
financials_data = {symbol: yf.Ticker(symbol).financials for symbol in stock_symbols}
This block sets up everything we need to work with data from Yahoo Finance. We start with our list of ticker symbols. Using yfinance's Ticker object, we grab a year of daily trading data, the option chain for the nearest expiration (what option_chain() returns when called without a date), and the annual income statement data for each ticker. We organize all of this into dictionaries with the ticker symbol as the key for easy reference later.
Organize and standardize financial data
We're reformatting and cleaning the stock market price data so it's consistent and easy to work with later.
standardized_market_data = []
for symbol, df in market_data.items():
    if not df.empty:
        # Keep only the price and volume columns.
        df = df[["Open", "High", "Low", "Close", "Volume"]]
        # Drop incomplete rows, then cast everything to float.
        df = df.dropna()
        df = df.astype(float)
        df["Symbol"] = symbol
        standardized_market_data.append(df)
# Stack all symbols and turn the date index into a regular column.
market_df = pd.concat(standardized_market_data).reset_index()
This section loops through each stock's daily trading data. We keep the price and volume columns, drop rows with missing values, and cast everything to float. We add a Symbol column so we always know which row belongs to which company. At the end, we stack all the companies' data into a single table and reset the index so the date becomes a regular column.
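Dropping missing values does not catch prices that are present but wrong. As a hedged sketch, the checks below flag bars that are internally inconsistent. The function name flag_bad_bars and the check names are our own, and the code assumes the date column is called Date, as it is for daily data after reset_index().

```python
import pandas as pd


def flag_bad_bars(bars: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; True marks a suspicious row."""
    flags = pd.DataFrame(index=bars.index)
    # The high should be the largest of the four prices, the low the smallest.
    flags["high_too_low"] = bars["High"] < bars[["Open", "Close", "Low"]].max(axis=1)
    flags["low_too_high"] = bars["Low"] > bars[["Open", "Close", "High"]].min(axis=1)
    # Prices should be strictly positive and volume should never be negative.
    price_cols = ["Open", "High", "Low", "Close"]
    flags["non_positive_price"] = (bars[price_cols] <= 0).any(axis=1)
    flags["negative_volume"] = bars["Volume"] < 0
    # The same symbol should not appear twice on the same date.
    flags["duplicate_bar"] = bars.duplicated(subset=["Symbol", "Date"], keep=False)
    return flags


# Synthetic example with three deliberately broken rows.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-02"]),
        "Open": [10.0, 10.0, 20.0, 20.0],
        "High": [11.0, 9.5, 21.0, 21.0],
        "Low": [9.0, 9.0, 19.0, 19.0],
        "Close": [10.5, 10.0, 20.0, 20.0],
        "Volume": [100.0, 100.0, -5.0, 50.0],
        "Symbol": ["AAPL", "AAPL", "MSFT", "MSFT"],
    }
)
flags = flag_bad_bars(sample)
suspicious = sample[flags.any(axis=1)]
print(len(suspicious))  # 3
```

On real data you would call flag_bad_bars(market_df) and review the flagged rows before deciding whether to drop or repair them. Adjusted prices can differ by tiny rounding errors, so a small tolerance may be needed in practice.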
We reshape the options data so each company’s calls and puts are formatted the same way and combined in one spot.
standardized_options_data = []
for symbol, option_chain in options_data.items():
    calls = option_chain.calls
    puts = option_chain.puts
    for opt_type, opt_df in [("call", calls), ("put", puts)]:
        if not opt_df.empty:
            opt_df = opt_df[
                [
                    "contractSymbol",
                    "strike",
                    "lastPrice",
                    "bid",
                    "ask",
                    "volume",
                    "openInterest",
                ]
            ]
            opt_df = opt_df.dropna()
            opt_df = opt_df.astype(
                {
                    "strike": float,
                    "lastPrice": float,
                    "bid": float,
                    "ask": float,
                    "volume": float,
                    "openInterest": float,
                }
            )
            opt_df["Type"] = opt_type
            opt_df["Symbol"] = symbol
            standardized_options_data.append(opt_df)
options_df = pd.concat(standardized_options_data).reset_index(drop=True)
For each company, we work through its option contracts, both calls and puts. We keep the contract symbol, strike price, last traded price, bid, ask, volume, and open interest (the number of contracts still outstanding). We drop rows with missing values, cast the numeric columns to float, and tag each row with the option type and ticker. Finally, we combine all companies' options into a single table, ready for analysis.
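Free option quotes are often stale or one-sided, so it can help to check the bid and ask before trusting a price. The sketch below is one way to do that. The function name add_quote_checks and the new columns mid, spread, and valid_quote are our own, and the rule for a "valid" quote is an assumption you may want to tighten.

```python
import pandas as pd


def add_quote_checks(options: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with mid price, spread, and a basic quote-quality flag."""
    out = options.copy()
    out["mid"] = (out["bid"] + out["ask"]) / 2
    out["spread"] = out["ask"] - out["bid"]
    # Assumed rule: a usable quote has a positive bid and an ask at or above it.
    out["valid_quote"] = (out["bid"] > 0) & (out["ask"] >= out["bid"])
    return out


# Synthetic example: one clean quote, one with no bid, one crossed quote.
sample = pd.DataFrame({"bid": [1.0, 0.0, 2.0], "ask": [1.5, 0.5, 1.5]})
checked = add_quote_checks(sample)
print(checked["valid_quote"].tolist())  # [True, False, False]
```

With the table built above, you would call add_quote_checks(options_df) and filter on valid_quote.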
We’re setting up the financial statement data so every company is displayed in the same way, making it easy to compare performance.
standardized_financials_data = []
for symbol, df in financials_data.items():
    if not df.empty:
        df = df.transpose()
        df["Symbol"] = symbol
        df = df.dropna(axis=1, how="all")
        standardized_financials_data.append(df)
financials_df = pd.concat(standardized_financials_data).reset_index()
Here, we loop through the companies and gather the income statement data that yfinance returns in .financials. We transpose the layout so each reporting period is a row, which is easier to slice later. Within each company's table, we drop any columns that are entirely empty, and we label every row by company. The final product is one stacked table of the most recent financials for each stock on our list.
Clean and finalize data frames
We're making a final check for missing values in our results and removing any incomplete rows for reliable analysis.
if market_df.isnull().values.any():
    market_df = market_df.dropna()
if options_df.isnull().values.any():
    options_df = options_df.dropna()
Before our data is ready for use, we take one last sweep through the key tables and delete any rows with blanks. This makes sure future analysis isn’t thrown off by empty entries. Now, everything is tidy and set for use in models, charts, or whatever next step we want to take.
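Silently deleting rows makes it hard to know later what changed. As a minimal sketch of keeping a record, the helper below drops incomplete rows and logs how many were removed and which columns caused it. The name drop_missing_with_log and the log format are our own choices, not part of pandas or yfinance.

```python
import pandas as pd


def drop_missing_with_log(df: pd.DataFrame, name: str, log: list) -> pd.DataFrame:
    """Drop rows with missing values and append a summary record to `log`."""
    missing_per_column = df.isna().sum()
    cleaned = df.dropna()
    log.append(
        {
            "table": name,
            "rows_before": len(df),
            "rows_dropped": len(df) - len(cleaned),
            # Only list the columns that actually had missing values.
            "missing_by_column": missing_per_column[missing_per_column > 0].to_dict(),
        }
    )
    return cleaned


# Synthetic example: one of three rows is missing a close price.
cleaning_log = []
sample = pd.DataFrame({"Close": [1.0, None, 3.0], "Volume": [10.0, 20.0, 30.0]})
cleaned_sample = drop_missing_with_log(sample, "sample", cleaning_log)
print(cleaning_log[0]["rows_dropped"])  # 1
```

In the workflow above, the same call would look like market_df = drop_missing_with_log(market_df, "market", cleaning_log), and the log can be saved alongside the cleaned data.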
Your next steps
You’ve set up clean, ready-to-use data frames for stocks, options, and financials. Swap in new tickers in stock_symbols to try different companies. Adjust the period in .history() to pull longer or shorter timeframes. Once comfortable, dig into options_df and market_df to run your own calculations or build charts—you’ve got almost all you need.
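As a first calculation to try, here is a hedged sketch that adds daily returns per symbol. The function name add_daily_returns and the Return column are our own, and the code assumes the table has the Symbol, Date, and Close columns built earlier. Whether Close is adjusted for splits and dividends depends on how the data was downloaded, so check that before relying on the numbers.

```python
import pandas as pd


def add_daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Append a simple close-to-close return, computed separately per symbol."""
    out = prices.sort_values(["Symbol", "Date"]).reset_index(drop=True)
    # Shift within each symbol so one ticker's close never leaks into another's.
    previous_close = out.groupby("Symbol")["Close"].shift(1)
    out["Return"] = out["Close"] / previous_close - 1
    return out


# Synthetic example with two symbols in mixed order.
sample = pd.DataFrame(
    {
        "Symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
        "Date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
        "Close": [100.0, 200.0, 110.0, 190.0],
    }
)
returns = add_daily_returns(sample)
print(returns[["Symbol", "Date", "Return"]])
```

The first row for each symbol has no prior close, so its return is missing by design.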
