8 ways pandas is really losing to Polars for quick market data analysis
In today’s newsletter, you’ll use Polars, a high-speed data-handling tool that's becoming essential in quantitative finance and algorithmic trading.
You’ll see how to compare its performance to pandas for many common data manipulation techniques.
By the end of this post, you'll understand how Polars can improve your data processing speed, especially when working with large datasets.
So if you've been looking for a more efficient DataFrame library, this is one issue you don't want to miss.
Let's dive in!
Polars is a DataFrame library designed for speed and efficiency.
It’s written in Rust and uses parallel execution to process data across multiple CPU cores. This makes it faster than many other DataFrame libraries, including pandas, making it a good choice for tasks that involve large amounts of data.
Despite being written in Rust, Polars provides a Python API that is easy to use and familiar to those who have experience with Python.
This makes it accessible to a wide range of users, including data scientists and researchers.
The choice between pandas and Polars will depend on the size of your data and how crucial performance is for your work.
Imports and set up
Make sure you run the code in a Jupyter Notebook so you can use the %timeit magic. Then, start by importing pandas, Polars, and OpenBB.
import pandas as pd
import polars as pl

from openbb_terminal.sdk import openbb
We’ll run our tests with 30 years of price data for the 500 stocks currently in the S&P 500. The resulting DataFrame is 32.5MB, which is not huge but big enough for testing.

url = "http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
table = pd.read_html(url)[0]
tickers = table.Symbol.tolist()

df_pandas = openbb.economy.index(tickers, start_date="1990-01-01")
As you might expect, you can convert a pandas DataFrame to a Polars DataFrame.

df_polars = pl.from_pandas(df_pandas)
Now we're ready.
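If you are working outside Jupyter, the %timeit magic is not available. A rough stand-in can be sketched with the standard library's timeit module. The helper name time_call is hypothetical, and the defaults of 7 runs and 100 loops are assumptions chosen to mirror the %timeit output shown in this post.

```python
import timeit


def time_call(func, repeat=7, number=100):
    """Return the mean seconds per call over `repeat` runs of `number` calls each."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return sum(runs) / (repeat * number)


# Time a trivial placeholder operation; swap in any zero-argument callable.
mean_seconds = time_call(lambda: sum(range(1_000)))
print(f"{mean_seconds * 1e6:.1f} µs per loop")
```

To time one of the operations in this post, pass it as a lambda, for example time_call(lambda: df_pandas.sort_values("GE")).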
Reading data from CSV
Reading data from CSVs is common. Here’s how to do it.
# pandas
%timeit pd.read_csv("data.csv")
# 458 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# polars (scan_csv is lazy: it builds a query plan and does not read the file yet)
%timeit pl.scan_csv("data.csv")
# 3.57 ms ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
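If you’ve never seen the output of %timeit before, the first number is the average time it takes to run the operation. In this case pandas took 458 ms per loop and Polars took 3.57 ms per loop.
On these numbers Polars takes about 99% less time than pandas, but the comparison is not like for like. pd.read_csv parses the whole file, while pl.scan_csv is lazy: it returns a LazyFrame and reads no data until you call collect. The 3.57 ms measures building the query plan, not reading the CSV. For a fair comparison, time pl.read_csv("data.csv") or pl.scan_csv("data.csv").collect().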
Selecting data
Selecting data from columns is also common.
selected = tickers[:100]

# pandas
%timeit df_pandas[selected]
# 673 µs ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# polars
%timeit df_polars.select(pl.col(selected))
# 399 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Notice the difference in syntax. Here the list of columns is wrapped in pl.col, which builds a Polars expression. select also accepts the list of column names directly.
Filtering data
How about filtering data?
# pandas
%timeit df_pandas[df_pandas["GE"] > 100]
# 2.55 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# polars
%timeit df_polars.filter(pl.col("GE") > 100)
# 1.27 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Polars takes about half the time for simple filter operations.
Grouping data
# pandas
%timeit df_pandas.groupby("GE").mean()
# 113 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# polars (the method is named group_by in Polars 0.19 and later)
%timeit df_polars.groupby("GE").mean()
# 16.5 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas groups and aggregates in 113 ms while Polars does it in 16.5 ms.
Adding new columns
# pandas
%timeit df_pandas.assign(GE_Return=df_pandas["GE"].pct_change())
# 3.67 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# polars
%timeit df_polars.with_columns((pl.col("GE").pct_change()).alias("GE_Return"))
# 89 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Polars adds the return column in 89 µs while pandas takes 3.67 ms, roughly 40 times longer.
Imputing missing data
# pandas
%timeit df_pandas["GE"].fillna(0)
# 33.2 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# polars
%timeit df_polars.with_columns(pl.col("GE").fill_null(0))
# 81.5 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Polars is slower than pandas when filling missing values, by about 2.5x. Note that the pandas call fills a single Series, while the Polars call returns the whole DataFrame with the column replaced, so the two are not doing identical work.
Sorting data
# pandas
%timeit df_pandas.sort_values("GE")
# 6.45 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# polars
%timeit df_polars.sort("GE")
# 4.54 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas and Polars are pretty close in sorting, but Polars is still faster.
Calculating rolling statistics
# pandas
%timeit df_pandas.GE.rolling(window=20).mean()
# 184 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# polars
%timeit df_polars.with_columns(pl.col("GE").rolling_mean(20))
# 103 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Polars beats pandas in a simple 20-day rolling mean calculation.
This demonstration only scratches the surface of Polars. It’s also important to note the syntax is different from pandas so there is a learning curve to use it. And as always, it’s important to use the tool that does the job for you. If you’re dealing with massive data sets of tens or hundreds of GBs, then Polars is a good option. If not, then pandas will work fine.
Action steps
Your action steps today are to get Polars installed and start getting comfortable with the syntax. Then, download a multi-gigabyte dataset from your favorite source and run some tests on your own.