
Vaex Dynamic

Production-ready skill for processing and analyzing large tabular datasets with Vaex. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Process and visualize billion-row tabular datasets with Vaex, a high-performance Python library that uses lazy evaluation, memory mapping, and out-of-core computation. This skill covers lazy DataFrame operations, virtual columns, fast aggregations, interactive visualizations, and efficient data pipeline construction.

When to Use This Skill

Choose Vaex Dynamic when you need to:

  • Analyze datasets larger than available RAM without Spark or Dask clusters
  • Compute statistics, aggregations, and histograms on billion-row datasets instantly
  • Create interactive visualizations of massive datasets with server-side aggregation
  • Build memory-efficient data pipelines with lazy evaluation

Consider alternatives when:

  • Your data fits in memory (<10GB) (use pandas or Polars)
  • You need distributed computing across a cluster (use Dask or Spark)
  • You need SQL-compatible analytics (use DuckDB)

Quick Start

pip install vaex
```python
import vaex
import numpy as np

# Create a large dataset (lazy — no memory allocation)
n = 10_000_000
df = vaex.from_arrays(
    x=np.random.randn(n),
    y=np.random.randn(n) * 2 + 1,
    category=np.random.choice(['A', 'B', 'C', 'D'], n),
    value=np.random.exponential(10, n),
)
print(f"Rows: {len(df):,}")
print(f"Memory (approx): {df.byte_size() / 1e9:.2f} GB")

# Lazy operations — computed only when needed
df['distance'] = np.sqrt(df.x**2 + df.y**2)  # Virtual column
df['log_value'] = np.log1p(df.value)

# Fast aggregations (vectorized, multi-threaded)
stats = df.groupby('category', agg={'mean_value': vaex.agg.mean('value')})
print("\nMean value by category:")
print(stats)

# Filtering (also lazy)
filtered = df[df.value > 20]
print(f"\nFiltered rows: {len(filtered):,}")
print(f"Mean distance (filtered): {filtered.mean(filtered.distance):.4f}")

# Export results
filtered.export_parquet("high_value.parquet")
```

Core Concepts

Vaex vs Pandas Comparison

| Feature | Vaex | Pandas |
|---|---|---|
| Memory model | Memory-mapped / lazy | In-memory (eager) |
| Max dataset size | Billions of rows | ~10M rows (RAM-limited) |
| Column operations | Virtual (zero-copy) | Creates new arrays |
| Filtering | Lazy (no data copy) | Creates filtered copy |
| Aggregations | Multi-threaded C++ | Single-threaded Python |
| String operations | Arrow-backed (fast) | Python objects (slow) |
| File formats | HDF5, Arrow, Parquet, CSV | CSV, Excel, Parquet |

Efficient Data Pipeline

```python
import vaex
import numpy as np

def build_analysis_pipeline(input_path, output_path):
    """Process a large dataset with a lazy evaluation pipeline."""
    # Open (memory-mapped, instant regardless of file size)
    df = vaex.open(input_path)
    print(f"Loaded: {len(df):,} rows, {len(df.columns)} columns")

    # Virtual columns (zero memory cost)
    df['total'] = df.price * df.quantity
    df['log_total'] = np.log1p(df.total)
    df['year'] = df.date.dt.year
    df['month'] = df.date.dt.month

    # Lazy filter
    df_valid = df[df.total > 0]

    # Grouped statistics (extremely fast)
    yearly_stats = df_valid.groupby('year', agg={
        'total_sum': vaex.agg.sum('total'),
        'quantity_mean': vaex.agg.mean('quantity'),
        'price_mean': vaex.agg.mean('price'),
        'price_std': vaex.agg.std('price'),
    })
    print(yearly_stats)

    # Percentiles (approximate, O(1) memory)
    p50 = df_valid.percentile_approx('total', 50)
    p95 = df_valid.percentile_approx('total', 95)
    p99 = df_valid.percentile_approx('total', 99)
    print(f"Total percentiles: p50={p50:.2f}, p95={p95:.2f}, p99={p99:.2f}")

    # Export filtered results
    df_valid.export_parquet(output_path)
    print(f"Exported {len(df_valid):,} valid rows")

# build_analysis_pipeline("sales.hdf5", "valid_sales.parquet")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| chunk_size | Rows per processing chunk | Auto (based on RAM) |
| thread_count | Parallel threads for aggregation | CPU count |
| cache | Cache computed virtual columns | true |
| dtype | Default data type for new columns | float64 |
| file_format | Preferred storage format | "hdf5" |
| memory_limit | Maximum memory usage | System RAM |
| progress_bar | Show progress for long operations | true |
| string_backend | String storage (arrow, python) | "arrow" |

Best Practices

  1. Store data in HDF5 or Apache Arrow format for instant loading — HDF5 files are memory-mapped by Vaex, meaning vaex.open() is instant regardless of file size. Convert CSVs once with df.export_hdf5("data.hdf5") and use HDF5 for all subsequent analysis. Arrow/Parquet files also work well.

  2. Use virtual columns instead of creating new DataFrames — Virtual columns are computed lazily and use zero memory. df['new_col'] = df.x + df.y doesn't allocate memory — the expression is evaluated on-the-fly during aggregations. Never use .copy() or .values unless you specifically need to materialize.

  3. Prefer binby aggregations over groupby for large cardinalities — df.mean(df.value, binby=df.x, limits=[-5, 5], shape=100) computes binned statistics in one pass without creating intermediate groups. This is orders of magnitude faster than groupby for continuous variables.

  4. Use approximate methods for percentiles and unique counts — Exact percentiles require sorting the entire dataset. Use df.percentile_approx(column, percentile) for O(1) memory approximate percentiles and df.nunique(column) with HyperLogLog for approximate unique counts.

  5. Chain filter operations lazily, materialize only at the end — Each filter creates a lazy view, not a copy. Chain multiple filters (df_filtered = df[condition1][condition2][condition3]) freely without memory overhead. Only call .extract() or .export() when you need the final result.

Common Issues

"MemoryError" when calling .values or converting to pandas — .values and .to_pandas_df() materialize the entire dataset into RAM. Use aggregations (mean, sum, count), head(), or export to disk instead. If you must convert to pandas, filter down to a manageable subset first.

Virtual columns are slow on repeated access — Virtual columns recompute every time they're accessed. For columns used in multiple aggregations, materialize them once with df.materialize('column_name', inplace=True). This trades memory for computation time.

String operations fail or are slow — Ensure string columns use Arrow-backed storage (df.string_column.values.type should be Arrow, not Python objects). Convert with df['col'] = df.col.as_arrow(). Arrow-backed strings are 10-100x faster for operations like str.contains() and str.lower().
