# Advanced Dask

A scientific computing skill for parallel and distributed computing using Dask — the Python library that scales NumPy, pandas, and scikit-learn workflows to larger-than-memory datasets and multi-core/multi-node execution with familiar APIs. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
## When to Use This Skill

Choose Advanced Dask when:
- Processing datasets too large to fit in memory with pandas or NumPy
- Parallelizing computation across multiple CPU cores
- Scaling existing pandas/NumPy code to distributed clusters
- Building parallel ETL pipelines with lazy evaluation

Consider alternatives when:
- Your data fits in memory (use pandas/NumPy directly)
- You need SQL-based analytics (use DuckDB or Spark SQL)
- You need real-time streaming (use Apache Kafka/Flink)
- You need GPU acceleration (use RAPIDS cuDF)
## Quick Start

```bash
claude "Process a 50GB CSV dataset with Dask to compute aggregations"
```

```python
import dask.dataframe as dd

# Read large CSVs — lazy, nothing is loaded into memory yet
ddf = dd.read_csv("data/*.csv", assume_missing=True)
print(f"Partitions: {ddf.npartitions}")
print(f"Columns: {list(ddf.columns)}")

# Familiar pandas-like operations (still lazy)
result = (
    ddf
    .groupby("category")
    .agg({"sales": "sum", "quantity": "mean", "customer_id": "nunique"})
    .rename(columns={"customer_id": "unique_customers"})
)

# Trigger computation
computed = result.compute()
print(computed.sort_values("sales", ascending=False).head(10))
```
## Core Concepts

### Dask Collections

| Collection | Mirrors | Use Case |
|---|---|---|
| `dask.dataframe` | pandas DataFrame | Tabular data, CSV/Parquet |
| `dask.array` | NumPy ndarray | Numerical arrays, HDF5 |
| `dask.bag` | Python iterables | Unstructured/JSON data |
| `dask.delayed` | Custom functions | Arbitrary parallel tasks |
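Of the collections above, `dask.delayed` is the most general: it wraps ordinary Python functions so that calling them builds a task graph instead of executing immediately. A minimal sketch (the functions here are illustrative, not part of any API):

```python
import dask

# Wrapping a function with dask.delayed defers execution:
# calling it returns a graph node, not a result.
@dask.delayed
def double(x):
    return 2 * x

@dask.delayed
def add(a, b):
    return a + b

# No work happens here — this only builds the task graph
doubled = [double(i) for i in range(4)]
total = add(add(doubled[0], doubled[1]), add(doubled[2], doubled[3]))

# Execute the whole graph in parallel
print(total.compute())  # 0 + 2 + 4 + 6 = 12
```

The independent `double` calls can run in parallel because nothing in the graph forces them to be sequential.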
### Schedulers

```python
# Single-machine schedulers (default: threads)
import dask

dask.config.set(scheduler="threads")      # I/O-bound workloads
dask.config.set(scheduler="processes")    # CPU-bound workloads
dask.config.set(scheduler="synchronous")  # Single-threaded, for debugging

# Distributed scheduler (multi-core or cluster)
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)
print(f"Dashboard: {client.dashboard_link}")

# Cluster deployment (e.g. on a SLURM-managed HPC system)
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="32GB")
cluster.scale(jobs=10)
client = Client(cluster)
```
### Lazy Evaluation and Task Graphs

```python
import dask.dataframe as dd

ddf = dd.read_parquet("data/")

# All operations are lazy — they only build the task graph
filtered = ddf[ddf["value"] > 100]
grouped = filtered.groupby("region").sum()
sorted_result = grouped.sort_values("revenue", ascending=False)

# Visualize the task graph (requires graphviz)
sorted_result.visualize(filename="task_graph.png")

# Execute when needed
result = sorted_result.compute()     # Full result
head = sorted_result.head(10)        # Partial computation
sorted_result.to_parquet("output/")  # Write without .compute()
```
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `scheduler` | `threads`, `processes`, `synchronous`, or `distributed` | `threads` |
| `n_workers` | Number of worker processes | CPU count |
| `threads_per_worker` | Threads per worker | 1 |
| `memory_limit` | Per-worker memory limit | auto |
| `blocksize` | Partition size for CSV reading | 128MB |
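These parameters are set in different places — a sketch of where each one goes (file paths and worker counts here are illustrative):

```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

# Scheduler choice goes through the global config
dask.config.set(scheduler="processes")

# Worker count, threading, and memory limits are Client arguments
# (creating a Client switches you to the distributed scheduler)
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")

# Partition size for CSVs is set at read time
ddf = dd.read_csv("data/*.csv", blocksize="64MB")
```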
## Best Practices

- **Use Parquet instead of CSV.** Parquet files are columnar, compressed, and support predicate pushdown. Reading Parquet with Dask is 5-10x faster than CSV and uses less memory. Convert CSV to Parquet once, then use Parquet for all analysis.
- **Right-size your partitions.** Aim for partitions of 100-200MB each. Too many small partitions create scheduling overhead; too few large partitions limit parallelism and may cause memory errors. Use `ddf.repartition(npartitions=N)` to adjust.
- **Avoid calling `.compute()` prematurely.** Chain all transformations first, then call `.compute()` once at the end. Each `.compute()` triggers a full graph execution — intermediate computes waste time reprocessing data.
- **Use the distributed scheduler for visibility.** Even on a single machine, `Client()` provides a dashboard (default port 8787) showing task progress, memory usage, and worker activity. This is invaluable for identifying bottlenecks.
- **Persist intermediate results for iterative analysis.** When you'll reuse a filtered/transformed dataset multiple times, call `ddf = client.persist(ddf)` to keep it in distributed memory rather than recomputing from source each time.
## Common Issues

**Out-of-memory errors despite using Dask.** A single partition may exceed worker memory. Reduce partition size with `blocksize="64MB"` for CSV, or split partitions with `ddf.repartition(npartitions=ddf.npartitions * 2)`. Also avoid operations that collect all data into one partition (like `sort_values` without limits).
**Computation is slower than pandas on small data.** Dask adds scheduling overhead. For datasets under 1GB, pandas is faster. Dask's advantage appears with data larger than available RAM or when parallelizing CPU-intensive operations across many cores.
**Task graph is too complex and crashes the scheduler.** Very long chains of operations create massive task graphs. Use `ddf.persist()` at natural checkpoints to materialize intermediate results and reset the graph complexity. Alternatively, write to Parquet and re-read for a fresh start.