Polars (software)


{{Infobox software

| name = Polars

| logo = Polars software logo.svg

| logo size =

| screenshot =

| caption =

| author = Ritchie Vink

| developer = Community

| released =

| latest release version =

| latest release date =

| latest preview version =

| latest preview date =

| repo = {{URL|https://github.com/pola-rs/polars/}}

| programming language = Rust

| operating system = Cross-platform

| genre = Technical computing

| license = MIT License

| website = {{URL|pola.rs/}}

}}

Polars is an open-source software library for data manipulation. It is built around an OLAP query engine implemented in Rust and uses the Apache Arrow columnar format as its memory model. Although Polars is written in Rust, it provides Python, Node.js, R, and SQL interfaces.

History

The first commit was made on June 23, 2020.{{Cite web |title=Company announcement |url=https://pola.rs/posts/company-announcement/ |access-date=2025-05-13 |website=www.pola.rs |language=en}} Polars started as a "pet project" by Ritchie Vink, who wanted to fill the gap left by the lack of a data processing library in the Rust programming language.{{Cite web |date=2025-05-16 |title=Seed funding of US $4M for Polars, one of the fastest DataFrame libraries in Python & Rust - Xomnia |url=https://xomnia.com/post/ritchie-vink-writes-polars-one-of-the-fastest-dataframe-libraries-in-python-and-rust/ |access-date=2025-05-30 |language=en-US}}

Ritchie Vink and Chiel Peters co-founded a company to develop Polars after working together at Xomnia for five years. In 2023, they closed a seed round of approximately $4 million, led by Bain Capital Ventures.

Features

The core object in Polars is the DataFrame, similar to other data processing software libraries.{{Cite web |last=Python |first=Real |title=Python Polars: A Lightning-Fast DataFrame Library – Real Python |url=https://realpython.com/polars-python/ |access-date=2025-05-13 |website=realpython.com |language=en}} Contexts and expressions are central concepts in Polars' syntax. An expression is a computation or transformation performed on data columns, and a context is the environment in which an expression is evaluated.

Polars has three main contexts:

selection: choosing columns from a DataFrame
filtering: subsetting a DataFrame by keeping rows that meet specified conditions
group by/aggregation: calculating summary statistics within subgroups of the data

Polars was also designed to be "intuitive and [have] concise syntax for data processing tasks".{{Cite web |title=1. Introducing Polars - Python Polars: The Definitive Guide [Book] |url=https://www.oreilly.com/library/view/python-polars-the/9781098156077/ch01.html |access-date=2025-05-30 |website=www.oreilly.com |language=en}}

Compared with other data processing software

= Compared to pandas =

== Feature differences ==

Because Polars was designed to work on a single machine, it is often compared with pandas, a similar data manipulation library.{{Cite web |date=2024-07-04 |title=Polars vs. pandas: What's the Difference? {{!}} The PyCharm Blog |url=https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/ |access-date=2025-05-13 |website=The JetBrains Blog |language=en-US}} A major advantage of Polars over pandas is performance: Polars is reported to be 5 to 10 times faster than pandas on similar tasks. In addition, pandas requires around 5 to 10 times as much RAM as the size of the dataset, compared with 2 to 4 times for Polars. These performance gains may be due to Polars being written in Rust and supporting parallel operations.{{Cite web |title=Using the Polars DataFrame Library |url=https://www.codemag.com/Article/2212051/Using-the-Polars-DataFrame-Library |access-date=2025-05-30 |website=CODE |language=en-us}}

Polars is also designed to support lazy evaluation, in which a query optimizer examines all steps of a query and chooses an efficient way to evaluate them. pandas, by contrast, uses eager evaluation, in which each step is performed immediately. Research comparing pandas and Polars on data analysis tasks shows that Polars is more energy-efficient than pandas:{{Cite book |last1=Nahrstedt |first1=Felix |last2=Karmouche |first2=Mehdi |last3=Bargieł |first3=Karolina |last4=Banijamali |first4=Pouyeh |last5=Nalini Pradeep Kumar |first5=Apoorva |last6=Malavolta |first6=Ivano |chapter=An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries |date=2024-06-18 |title=Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering |chapter-url=https://dl.acm.org/doi/abs/10.1145/3661167.3661203 |series=EASE '24 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=58–68 |doi=10.1145/3661167.3661203 |isbn=979-8-4007-1701-7|url=https://research.vu.nl/en/publications/4c1fb8b0-4d68-4178-92a3-787fd451c87f }} "Polars consumes 63% of the energy needed by pandas on the TPC-H benchmark and uses eight times less energy than pandas on synthetic data".

Unlike pandas, Polars does not use an index for its DataFrame object.

== Syntax differences ==

Polars and pandas have similar syntax for reading in data using a read_csv() method, but have different syntax for calculating a rolling mean.{{Cite web |date=2024-06-19 |title=How to Move From pandas to Polars {{!}} The PyCharm Blog |url=https://blog.jetbrains.com/pycharm/2024/06/how-to-move-from-pandas-to-polars/ |access-date=2025-05-13 |website=The JetBrains Blog |language=en-US}}

Code using pandas:

import pandas as pd

# Read in data
df_temp = pd.read_csv(
    "temp_record.csv", index_col="date", parse_dates=True, dtype={"temp": int}
)

# Explore data
print(df_temp.dtypes)
print(df_temp.head())

# Calculate rolling mean
df_temp.rolling(2).mean()

Code using Polars:

import polars as pl

# Read in data
# (schema_overrides was named dtypes in older Polars releases)
df_temp = pl.read_csv(
    "temp_record.csv", try_parse_dates=True, schema_overrides={"temp": int}
).set_sorted("date")

# Explore data
print(df_temp.dtypes)
print(df_temp.head())

# Calculate rolling average
df_temp.rolling("date", period="2d").agg(pl.mean("temp"))

= Compared to Dask =

Dask is a Python package for parallel computation with NumPy, pandas, and scikit-learn, and is used for datasets that are too large to fit in memory. Polars targets single-machine use, while Dask is geared toward distributed computing.

= Compared to DuckDB =

DuckDB is an in-process SQL OLAP database system for efficient analytical queries on structured data. Both DuckDB and Polars offer strong analytical performance, but DuckDB is more SQL-centric for running queries, while Polars is Python-centric.

= Compared to Spark =

Apache Spark has a Python API, PySpark, for distributed big data processing. Like Dask, Spark is focused on distributed computing, while Polars is for single-machine use. Polars therefore has an advantage when processing data on a single machine, while Spark may be preferred for datasets that do not fit on a single machine.