Master Polars
Process and analyze large datasets with Polars, a high-performance DataFrame library built on Apache Arrow. This skill covers the expression-based API, lazy evaluation, data manipulation, aggregation, and performance optimization for data engineering and analytics workloads.
When to Use This Skill
Choose Master Polars when you need to:
- Process datasets that are too large for pandas to handle efficiently in memory
- Build data pipelines with lazy evaluation and query optimization
- Perform fast aggregations, joins, and window operations on tabular data
- Replace pandas in performance-critical applications while keeping similar syntax
Consider alternatives when:
- You need distributed computing across multiple machines (use PySpark or Dask)
- You need deep integration with scikit-learn or statsmodels (use pandas, which has broader ecosystem support)
- You need GPU-accelerated DataFrames (use cuDF)
Quick Start
```shell
pip install polars
```

```python
import polars as pl

# Create a DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [30, 25, 35, 28],
    "salary": [70000, 55000, 90000, 65000],
    "department": ["Engineering", "Marketing", "Engineering", "Marketing"]
})

# Expression-based operations
result = df.select(
    pl.col("name"),
    pl.col("salary").mean().over("department").alias("dept_avg_salary"),
    (pl.col("salary") - pl.col("salary").mean().over("department")).alias("salary_diff")
)
print(result)
```
Core Concepts
Polars vs Pandas
| Feature | Polars | Pandas |
|---|---|---|
| Backend | Apache Arrow (columnar) | NumPy block manager |
| Evaluation | Lazy + eager | Eager only |
| Parallelism | Multi-threaded by default | Single-threaded |
| Memory | Zero-copy, memory-mapped | Copy-heavy |
| Missing values | Null (Arrow native) | NaN (float-based) |
| String handling | Arrow UTF-8 (fast) | Python objects (slow) |
| API style | Expression-based | Method chaining |
Lazy Evaluation and Query Optimization
```python
import polars as pl

# Lazy mode: build a query plan, then execute it optimized
lazy_df = pl.scan_csv("large_dataset.csv")

result = (
    lazy_df
    .filter(pl.col("year") >= 2020)
    .group_by("category")
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("revenue").mean().alias("avg_revenue"),
        pl.col("customer_id").n_unique().alias("unique_customers"),
        pl.col("order_id").count().alias("order_count")
    ])
    .sort("total_revenue", descending=True)
    .head(20)
)

# Show the optimized query plan
print(result.explain())

# Execute the query
df_result = result.collect()
print(df_result)
```
Advanced Expressions
```python
from datetime import datetime

import polars as pl

# Window functions, conditionals, string operations
df = pl.DataFrame({
    "date": pl.date_range(datetime(2024, 1, 1), datetime(2024, 12, 31), eager=True),
    "value": [i * 1.1 + (i % 7) * 5 for i in range(366)],
    "category": ["A", "B", "C"] * 122
})

result = df.with_columns([
    # Rolling average
    pl.col("value").rolling_mean(window_size=7).alias("7d_avg"),
    # Rank within category
    pl.col("value").rank("dense").over("category").alias("rank_in_category"),
    # Conditional column
    pl.when(pl.col("value") > 200)
      .then(pl.lit("high"))
      .when(pl.col("value") > 100)
      .then(pl.lit("medium"))
      .otherwise(pl.lit("low"))
      .alias("tier"),
    # Date extraction
    pl.col("date").dt.month().alias("month"),
    pl.col("date").dt.weekday().alias("weekday"),
    # Cumulative sum per category
    pl.col("value").cum_sum().over("category").alias("cumulative")
])

# Complex aggregation
summary = result.group_by("month", "tier").agg([
    pl.col("value").mean().alias("avg_value"),
    pl.col("value").std().alias("std_value"),
    pl.len().alias("count")
]).sort(["month", "tier"])

print(summary)
```
Data Pipeline
```python
import polars as pl


def build_analytics_pipeline(sales_path, customers_path):
    """Build an optimized analytics pipeline with lazy evaluation."""
    sales = pl.scan_csv(sales_path)
    customers = pl.scan_csv(customers_path)

    pipeline = (
        sales
        .join(customers, on="customer_id", how="left")
        .with_columns([
            pl.col("order_date").str.to_datetime("%Y-%m-%d"),
            (pl.col("quantity") * pl.col("unit_price")).alias("revenue")
        ])
        .filter(pl.col("order_date").dt.year() == 2024)
        .group_by(["customer_segment", "product_category"])
        .agg([
            pl.col("revenue").sum().alias("total_revenue"),
            pl.col("order_id").n_unique().alias("orders"),
            pl.col("customer_id").n_unique().alias("customers"),
            (pl.col("revenue").sum() / pl.col("order_id").n_unique())
                .alias("avg_order_value")
        ])
        .sort("total_revenue", descending=True)
    )
    return pipeline.collect()


# Execute the pipeline
results = build_analytics_pipeline("sales.csv", "customers.csv")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `n_threads` | Thread pool size | System CPU count |
| `streaming` | Enable streaming mode for large data | `false` |
| `rechunk` | Consolidate memory after operations | `true` |
| `null_values` | Strings to interpret as null | `["", "null", "NA"]` |
| `dtypes` | Column type overrides | Auto-detected |
| `low_memory` | Reduce memory usage (slower) | `false` |
Best Practices
- **Use lazy mode for multi-step pipelines.** Start with `pl.scan_csv()` or `df.lazy()` and chain operations before calling `.collect()`. The query optimizer pushes filters down, projects only the needed columns, and parallelizes operations automatically.
- **Prefer expressions over `apply` and loops.** Polars expressions (`pl.col()`, `pl.when()`, `.over()`) run as optimized Rust code. Using `.map_elements()` or Python loops falls back to slow Python execution, negating Polars' performance advantage.
- **Use `streaming=True` for datasets larger than RAM.** When calling `.collect(streaming=True)`, Polars processes data in batches rather than loading everything at once. This enables processing 100GB+ datasets on machines with 16GB of RAM.
- **Specify dtypes when reading CSV files.** Schema inference scans a sample of rows and can guess wrong (e.g., zip codes as integers). Pass `dtypes={"zip_code": pl.Utf8}` to avoid data corruption and speed up reading.
- **Use `.sink_parquet()` for large output files.** For lazy pipelines producing large results, use `.sink_parquet("output.parquet")` instead of `.collect()` followed by `.write_parquet()`. Sink mode streams results directly to disk without materializing the full DataFrame in memory.
Common Issues
**Expression type mismatch errors.** Polars is strictly typed, unlike pandas. Mixing integers and strings in a column or comparing incompatible types raises errors rather than silently coercing. Cast explicitly with `.cast(pl.Int64)` or `.cast(pl.Utf8)` before operations that combine different types.
"column not found" in lazy mode — Lazy mode doesn't validate column names until execution. Typos in column names pass silently during query building and only fail at .collect(). Use df.columns to verify column names before building lazy queries.
**Pandas-to-Polars conversion losing data.** When converting with `pl.from_pandas(pandas_df)`, pandas' `NaN` values in non-float columns (strings, integers) may not convert correctly. Clean missing values in pandas first (`fillna` or `dropna`), or convert column by column with explicit null handling.