Advanced Dask

A production-ready skill for parallel and distributed computing at scale. Includes structured workflows, validation checks, and reusable patterns for scientific computing.


A scientific computing skill for parallel and distributed computing using Dask — the Python library that scales NumPy, pandas, and scikit-learn workflows to larger-than-memory datasets and multi-core/multi-node execution with familiar APIs.

When to Use This Skill

Choose Advanced Dask when:

  • Processing datasets too large to fit in memory with pandas or NumPy
  • Parallelizing computation across multiple CPU cores
  • Scaling existing pandas/NumPy code to distributed clusters
  • Building parallel ETL pipelines with lazy evaluation

Consider alternatives when:

  • Your data fits in memory (use pandas/NumPy directly)
  • You need SQL-based analytics (use DuckDB or Spark SQL)
  • You need real-time streaming (use Apache Kafka/Flink)
  • You need GPU acceleration (use RAPIDS cuDF)

Quick Start

claude "Process a 50GB CSV dataset with Dask to compute aggregations"
```python
import dask.dataframe as dd

# Read large CSV — lazy, no memory loading yet
ddf = dd.read_csv("data/*.csv", assume_missing=True)
print(f"Partitions: {ddf.npartitions}")
print(f"Columns: {list(ddf.columns)}")

# Familiar pandas-like operations (lazy)
result = (
    ddf
    .groupby("category")
    .agg({"sales": "sum", "quantity": "mean", "customer_id": "nunique"})
    .rename(columns={"customer_id": "unique_customers"})
)

# Trigger computation
computed = result.compute()
print(computed.sort_values("sales", ascending=False).head(10))
```

Core Concepts

Dask Collections

| Collection | Mirrors | Use Case |
| --- | --- | --- |
| dask.dataframe | pandas DataFrame | Tabular data, CSV/Parquet |
| dask.array | NumPy ndarray | Numerical arrays, HDF5 |
| dask.bag | Python iterables | Unstructured/JSON data |
| dask.delayed | Custom functions | Arbitrary parallel tasks |
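To illustrate the last row of the table, here is a minimal sketch of `dask.delayed`: the decorated functions build a task graph instead of executing immediately, and the whole graph runs in parallel on `.compute()`. The functions and values are invented for illustration.

```python
import dask


# Wrap ordinary functions so calls record tasks instead of running
@dask.delayed
def double(x):
    return 2 * x


@dask.delayed
def add(a, b):
    return a + b


# No work happens here — this only builds the graph
parts = [double(i) for i in range(4)]
total = add(parts[0] + parts[1], parts[2] + parts[3])

# Execute the whole graph
result = total.compute()
print(result)  # 0 + 2 + 4 + 6 = 12
```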

Schedulers

```python
# Single-machine parallelism (default)
import dask

dask.config.set(scheduler="threads")      # I/O-bound
dask.config.set(scheduler="processes")    # CPU-bound
dask.config.set(scheduler="synchronous")  # Debugging

# Distributed scheduler (multi-core or cluster)
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)
print(f"Dashboard: {client.dashboard_link}")

# Cluster deployment
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="32GB")
cluster.scale(jobs=10)
client = Client(cluster)
```

Lazy Evaluation and Task Graphs

```python
import dask.dataframe as dd

ddf = dd.read_parquet("data/")

# All operations are lazy — builds task graph
filtered = ddf[ddf["value"] > 100]
grouped = filtered.groupby("region").sum()
sorted_result = grouped.sort_values("revenue", ascending=False)

# Visualize the task graph
sorted_result.visualize(filename="task_graph.png")

# Execute when needed
result = sorted_result.compute()     # Full result
head = sorted_result.head(10)        # Partial computation
sorted_result.to_parquet("output/")  # Write without .compute()
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| scheduler | threads, processes, synchronous, distributed | threads |
| n_workers | Number of worker processes | CPU count |
| threads_per_worker | Threads per worker | 1 |
| memory_limit | Per-worker memory limit | auto |
| blocksize | Partition size for CSV reading | 128MB |
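Settings like these can be scoped to a single block of work by using `dask.config.set` as a context manager, which restores the previous value on exit. A minimal sketch:

```python
import dask

# Override the scheduler only inside the context
with dask.config.set(scheduler="synchronous"):
    active = dask.config.get("scheduler")

print(active)  # "synchronous" — the override was visible inside the block
```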

Best Practices

  1. Use Parquet instead of CSV. Parquet files are columnar, compressed, and support predicate pushdown. Reading Parquet with Dask is 5-10x faster than CSV and uses less memory. Convert CSV to Parquet once, then use Parquet for all analysis.

  2. Right-size your partitions. Aim for partitions of 100-200MB each. Too many small partitions create scheduling overhead; too few large partitions limit parallelism and may cause memory errors. Use ddf.repartition(npartitions=N) to adjust.

  3. Avoid calling .compute() prematurely. Chain all transformations first, then call .compute() once at the end. Each .compute() triggers a full graph execution — intermediate computes waste time reprocessing data.

  4. Use the distributed scheduler for visibility. Even on a single machine, Client() provides a dashboard (default port 8787) showing task progress, memory usage, and worker activity. This is invaluable for identifying bottlenecks.

  5. Persist intermediate results for iterative analysis. When you'll reuse a filtered/transformed dataset multiple times, call ddf = client.persist(ddf) to keep it in distributed memory rather than recomputing from source each time.

Common Issues

Out-of-memory errors despite using Dask. A single partition may exceed worker memory. Reduce partition size with blocksize="64MB" for CSV or ddf.repartition(npartitions=ddf.npartitions*2). Also avoid operations that collect all data to one partition (like sort_values without limits).

Computation is slower than pandas on small data. Dask adds scheduling overhead. For datasets under 1GB, pandas is faster. Dask's advantage appears with data larger than available RAM or when parallelizing CPU-intensive operations across many cores.

Task graph is too complex and crashes the scheduler. Very long chains of operations create massive task graphs. Use ddf.persist() at natural checkpoints to materialize intermediate results and reset the graph complexity. Alternatively, write to Parquet and re-read for a fresh start.
