# Pro Data Workspace
A comprehensive skill for setting up and managing data analysis workspaces — covering environment configuration, dataset loading, exploratory data analysis workflows, visualization pipelines, and reproducible analysis notebooks.
## When to Use This Skill
Choose Pro Data Workspace when you need to:
- Set up a complete data analysis environment from scratch
- Load, clean, and explore datasets systematically
- Build reproducible analysis pipelines with clear documentation
- Generate statistical summaries and visualizations
- Export analysis results in presentation-ready formats
Consider alternatives when:
- You need real-time data streaming (use a streaming pipeline skill)
- You're building ML models (use a machine learning skill)
- You need database administration (use a database management skill)
## Quick Start
```bash
# Set up a data analysis workspace
claude "Set up a Python data workspace for analyzing a CSV of e-commerce transactions. Include pandas, visualization, and export to Excel."
```
```python
# workspace_setup.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Configure workspace
WORKSPACE = Path("./analysis")
WORKSPACE.mkdir(exist_ok=True)
DATA_DIR = WORKSPACE / "data"
OUTPUT_DIR = WORKSPACE / "output"
DATA_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)

# Load and inspect data
df = pd.read_csv("transactions.csv", parse_dates=["order_date"])
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Date range: {df['order_date'].min()} to {df['order_date'].max()}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Quick EDA
summary = df.describe(include="all")
summary.to_excel(OUTPUT_DIR / "summary_statistics.xlsx")

# Revenue by category
revenue = df.groupby("category")["revenue"].sum().sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(10, 6))
revenue.plot(kind="bar", ax=ax, color="#2563EB")
ax.set_title("Revenue by Category")
ax.set_ylabel("Revenue ($)")
plt.tight_layout()
plt.savefig(OUTPUT_DIR / "revenue_by_category.png", dpi=150)
```
## Core Concepts

### Workspace Structure
| Directory | Purpose | Contents |
|---|---|---|
| `data/` | Raw and processed datasets | CSV, Parquet, JSON files |
| `notebooks/` | Jupyter analysis notebooks | `.ipynb` files |
| `scripts/` | Reusable analysis scripts | `.py` files |
| `output/` | Charts, reports, exports | PNG, XLSX, PDF files |
| `config/` | Environment and parameter files | `.yaml`, `.env` files |
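As a minimal sketch, the layout above can be created with a small stdlib-only helper (the `create_workspace` name is illustrative, not part of the skill):

```python
from pathlib import Path
import tempfile

def create_workspace(root: str) -> Path:
    """Create the standard workspace layout under root."""
    workspace = Path(root)
    for sub in ("data", "notebooks", "scripts", "output", "config"):
        (workspace / sub).mkdir(parents=True, exist_ok=True)
    return workspace

# Demonstrate against a temporary directory
ws = create_workspace(tempfile.mkdtemp() + "/analysis")
print(sorted(p.name for p in ws.iterdir()))
```

`mkdir(exist_ok=True)` makes the helper safe to rerun at the top of every script, so any entry point can bootstrap the workspace.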
### Data Loading Patterns
```python
# Multi-format data loading
import pandas as pd
from pathlib import Path

loaders = {
    ".csv": lambda f: pd.read_csv(f),
    ".xlsx": lambda f: pd.read_excel(f),
    ".json": lambda f: pd.read_json(f),
    ".parquet": lambda f: pd.read_parquet(f),
}

def load_data(filepath):
    ext = Path(filepath).suffix.lower()
    loader = loaders.get(ext)
    if not loader:
        raise ValueError(f"Unsupported format: {ext}")
    df = loader(filepath)
    print(f"Loaded {len(df)} rows, {len(df.columns)} columns from {filepath}")
    return df

# Data profiling helper
def profile(df):
    return pd.DataFrame({
        "dtype": df.dtypes,
        "non_null": df.count(),
        "null_pct": (df.isnull().sum() / len(df) * 100).round(1),
        "unique": df.nunique(),
        "sample": df.iloc[0],
    })
```
### Visualization Pipeline
```python
# Reusable chart configuration
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

def setup_style():
    sns.set_theme(style="whitegrid")
    plt.rcParams.update({
        "figure.figsize": (10, 6),
        "figure.dpi": 150,
        "font.size": 12,
        "axes.titlesize": 14,
        "axes.labelsize": 12,
    })

def save_chart(fig, name, output_dir="./output"):
    path = Path(output_dir) / f"{name}.png"
    fig.savefig(path, bbox_inches="tight", dpi=150)
    print(f"Saved: {path}")
    plt.close(fig)
```
## Configuration
| Parameter | Description | Example |
|---|---|---|
| `data_source` | Input file path or URL | `"./data/sales.csv"` |
| `date_column` | Column to parse as datetime | `"order_date"` |
| `output_format` | Export format for results | `"xlsx"` / `"csv"` |
| `chart_style` | Seaborn/Matplotlib theme | `"whitegrid"` |
| `dpi` | Chart resolution for exports | `150` |
| `workspace_dir` | Root directory for analysis files | `"./analysis"` |
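One way to wire these parameters into scripts is a defaults-plus-overrides loader. This sketch assumes a JSON config file rather than YAML to stay stdlib-only; `DEFAULTS` and `load_config` are illustrative names, not part of the skill:

```python
import json
from pathlib import Path

# Default values taken from the parameter table above
DEFAULTS = {
    "data_source": "./data/sales.csv",
    "date_column": "order_date",
    "output_format": "xlsx",
    "chart_style": "whitegrid",
    "dpi": 150,
    "workspace_dir": "./analysis",
}

def load_config(path=None):
    """Merge a JSON config file (if present) over the defaults."""
    config = dict(DEFAULTS)
    if path and Path(path).exists():
        config.update(json.loads(Path(path).read_text()))
    return config

config = load_config()
print(config["dpi"])
```

Keeping defaults in code and overrides in `config/` means a bare checkout still runs, while per-project settings stay out of version-controlled scripts.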
## Best Practices
- **Separate raw data from processed data** — Never modify source files. Load raw data, transform in memory, and save processed versions to a separate directory. This lets you rerun the analysis from scratch when requirements change.
- **Profile every dataset before analysis** — Run null counts, dtype checks, and value distributions before writing any analysis code. Five minutes of profiling saves hours of debugging malformed data downstream.
- **Use consistent naming for output files** — Name outputs with the analysis date and a descriptive tag: `2024-12-15_revenue_by_category.png`. When you generate dozens of charts, timestamps prevent confusion about which version is current.
- **Pin your dependencies with exact versions** — Create a `requirements.txt` with pinned versions (`pandas==2.1.4`, not `pandas>=2.0`). Analysis that can't be reproduced six months later has limited value for auditing or extending.
- **Document assumptions inline with code** — Add comments explaining why you filtered rows, chose specific date ranges, or excluded outliers. The code shows what you did; comments explain why, which is critical for peer review.
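To illustrate the last practice, here is a short sketch of inline assumption comments on a toy DataFrame (the column names and filtering rules are hypothetical, chosen only for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2023-11-30"]),
    "revenue": [120.0, -15.0, 80.0],
})

# Assumption: negative revenue rows are refunds and are excluded from
# sales totals; revisit this filter if refund analysis is ever needed.
sales = df[df["revenue"] > 0]

# Assumption: the reporting window is calendar year 2024 only; earlier
# orders belong to the prior period's report.
sales = sales[sales["order_date"] >= "2024-01-01"]

print(len(sales))
```

Each comment records the *why* behind a filter, so a reviewer six months later can audit the choices without reconstructing them from the data.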
## Common Issues
**Memory errors on large datasets** — Loading a 5GB CSV into pandas on a 16GB machine will fail. Use dtype specifications to reduce memory (e.g., `category` for string columns with few unique values), load in chunks with `chunksize`, or switch to Polars/DuckDB for larger-than-memory datasets.
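A sketch of the chunked approach, using a small in-memory CSV in place of a multi-gigabyte file (the column names are illustrative):

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a file too large to load at once.
csv_buffer = io.StringIO(
    "category,revenue\n" + "\n".join(f"cat{i % 3},{i}" for i in range(10))
)

totals = None
# Process the file in fixed-size chunks; declaring "category" as a
# categorical dtype cuts memory for low-cardinality string columns.
for chunk in pd.read_csv(csv_buffer, chunksize=4, dtype={"category": "category"}):
    part = chunk.groupby("category", observed=True)["revenue"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.sum())
```

Only one chunk is resident at a time, so peak memory is bounded by `chunksize` rather than file size; the per-chunk partial aggregates are combined at the end.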
Date parsing fails silently — Pandas may load dates as strings without error if the format doesn't match expectations. Always pass parse_dates=["col"] explicitly and verify with df["col"].dtype. Mixed date formats in a single column need pd.to_datetime(df["col"], format="mixed").
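A small sketch of the verify-then-parse pattern (`format="mixed"` requires pandas >= 2.0; the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2024-01-05", "05/02/2024", "2024-03-20"]})

# Mixed-format strings load silently as object dtype, not datetime;
# check before relying on date arithmetic.
assert df["order_date"].dtype == object

# format="mixed" infers the format per element instead of per column.
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")
print(df["order_date"].dtype)
```

The upfront dtype assertion turns a silent failure into a loud one at load time, which is where it is cheapest to fix.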
**Charts look different in exports vs. notebooks** — Matplotlib renders differently depending on the backend (inline notebook vs. file export). Always call `plt.tight_layout()` before saving, set explicit `figsize` and `dpi`, and preview the saved PNG file rather than relying on notebook rendering for final output.