Vaex Dynamic
Production-ready skill for processing and analyzing large tabular datasets with Vaex. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
Process and visualize billion-row tabular datasets with Vaex, a high-performance Python library that uses lazy evaluation, memory mapping, and out-of-core computation. This skill covers lazy DataFrame operations, virtual columns, fast aggregations, interactive visualizations, and efficient data pipeline construction.
When to Use This Skill
Choose Vaex Dynamic when you need to:
- Analyze datasets larger than available RAM without Spark or Dask clusters
- Compute statistics, aggregations, and histograms on billion-row datasets instantly
- Create interactive visualizations of massive datasets with server-side aggregation
- Build memory-efficient data pipelines with lazy evaluation
Consider alternatives when:
- Your data fits in memory, under roughly 10 GB (use pandas or Polars)
- You need distributed computing across a cluster (use Dask or Spark)
- You need SQL-compatible analytics (use DuckDB)
Quick Start
```bash
pip install vaex
```

```python
import vaex
import numpy as np

# Wrap freshly created NumPy arrays in a DataFrame (zero-copy wrap;
# the arrays themselves do occupy memory)
n = 10_000_000
df = vaex.from_arrays(
    x=np.random.randn(n),
    y=np.random.randn(n) * 2 + 1,
    category=np.random.choice(['A', 'B', 'C', 'D'], n),
    value=np.random.exponential(10, n),
)
print(f"Rows: {len(df):,}")
print(f"Memory (approx): {df.byte_size() / 1e9:.2f} GB")

# Lazy operations, computed only when needed
df['distance'] = np.sqrt(df.x**2 + df.y**2)  # virtual column
df['log_value'] = np.log1p(df.value)

# Fast aggregations (vectorized, multi-threaded)
stats = df.groupby('category', agg={'mean_value': vaex.agg.mean('value')})
print("\nMean value by category:")
print(stats)

# Filtering (also lazy)
filtered = df[df.value > 20]
print(f"\nFiltered rows: {len(filtered):,}")
print(f"Mean distance (filtered): {filtered.mean(filtered.distance):.4f}")

# Export results
filtered.export_parquet("high_value.parquet")
```
Core Concepts
Vaex vs Pandas Comparison
| Feature | Vaex | Pandas |
|---|---|---|
| Memory model | Memory-mapped / lazy | In-memory (eager) |
| Max dataset size | Billions of rows | ~10M rows (RAM-limited) |
| Column operations | Virtual (zero-copy) | Creates new arrays |
| Filtering | Lazy (no data copy) | Creates filtered copy |
| Aggregations | Multi-threaded C++ | Single-threaded Python |
| String operations | Arrow-backed (fast) | Python objects (slow) |
| File formats | HDF5, Arrow, Parquet, CSV | CSV, Excel, Parquet |
Efficient Data Pipeline
```python
import vaex
import numpy as np

def build_analysis_pipeline(input_path, output_path):
    """Process a large dataset with a lazy evaluation pipeline."""
    # Open (memory-mapped; instant regardless of file size)
    df = vaex.open(input_path)
    print(f"Loaded: {len(df):,} rows, {len(df.columns)} columns")

    # Virtual columns (zero memory cost)
    df['total'] = df.price * df.quantity
    df['log_total'] = np.log1p(df.total)
    df['year'] = df.date.dt.year
    df['month'] = df.date.dt.month

    # Lazy filter
    df_valid = df[df.total > 0]

    # Grouped statistics (extremely fast)
    yearly_stats = df_valid.groupby('year', agg={
        'total_sum': vaex.agg.sum('total'),
        'quantity_mean': vaex.agg.mean('quantity'),
        'price_mean': vaex.agg.mean('price'),
        'price_std': vaex.agg.std('price'),
    })
    print(yearly_stats)

    # Percentiles (approximate, O(1) memory)
    p50 = df_valid.percentile_approx('total', 50)
    p95 = df_valid.percentile_approx('total', 95)
    p99 = df_valid.percentile_approx('total', 99)
    print(f"Total percentiles: p50={p50:.2f}, p95={p95:.2f}, p99={p99:.2f}")

    # Export filtered results
    df_valid.export_parquet(output_path)
    print(f"Exported {len(df_valid):,} valid rows")

# build_analysis_pipeline("sales.hdf5", "valid_sales.parquet")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `chunk_size` | Rows per processing chunk | Auto (based on RAM) |
| `thread_count` | Parallel threads for aggregation | CPU count |
| `cache` | Cache computed virtual columns | true |
| `dtype` | Default data type for new columns | float64 |
| `file_format` | Preferred storage format | "hdf5" |
| `memory_limit` | Maximum memory usage | System RAM |
| `progress_bar` | Show progress for long operations | true |
| `string_backend` | String storage (arrow, python) | "arrow" |
Best Practices
- Store data in HDF5 or Apache Arrow format for instant loading: HDF5 files are memory-mapped by Vaex, so `vaex.open()` is instant regardless of file size. Convert CSVs once with `df.export_hdf5("data.hdf5")` and use HDF5 for all subsequent analysis. Arrow/Parquet files also work well.
- Use virtual columns instead of creating new DataFrames: virtual columns are computed lazily and use zero memory. `df['new_col'] = df.x + df.y` doesn't allocate memory; the expression is evaluated on the fly during aggregations. Never use `.copy()` or `.values` unless you specifically need to materialize.
- Prefer `binby` aggregations over `groupby` for large cardinalities: `df.mean(df.value, binby=df.x, limits=[-5, 5], shape=100)` computes binned statistics in one pass without creating intermediate groups. This is orders of magnitude faster than `groupby` for continuous variables.
- Use approximate methods for percentiles and unique counts: exact percentiles require sorting the entire dataset. Use `df.percentile_approx(column, percentile)` for O(1)-memory approximate percentiles and `df.nunique(column)` for unique counts.
- Chain filter operations lazily, materialize only at the end: each filter creates a lazy view, not a copy. Chain multiple filters (`df_filtered = df[condition1][condition2][condition3]`) freely without memory overhead. Only call `.extract()` or `.export()` when you need the final result.
Common Issues
"MemoryError" when calling `.values` or converting to pandas: `.values` and `.to_pandas_df()` materialize the entire dataset into RAM. Use aggregations (`mean`, `sum`, `count`), `head()`, or export to disk instead. If you must convert to pandas, filter down to a manageable subset first.
Virtual columns are slow on repeated access: virtual columns recompute every time they're evaluated. For columns used in multiple aggregations, materialize them once with `df.materialize('column_name', inplace=True)`. This trades memory for computation time.
String operations fail or are slow: ensure string columns use Arrow-backed storage (`df.string_column.values.type` should be an Arrow type, not Python objects). Convert with `df['col'] = df.col.as_arrow()`. Arrow-backed strings are 10-100x faster for operations like `str.contains()` and `str.lower()`.