Pro Exploratory Workspace
Boost productivity with this comprehensive skill for exploratory data analysis. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
A scientific computing skill for interactive exploratory data analysis (EDA) in scientific research. Pro Exploratory Workspace provides structured workflows for investigating datasets, generating hypotheses, identifying patterns, and creating visualizations that guide deeper analysis in any scientific domain.
When to Use This Skill
Choose Pro Exploratory Workspace when:
- Starting analysis of a new scientific dataset with unknown patterns
- Performing quality control and data validation before formal analysis
- Generating exploratory visualizations to guide hypothesis formation
- Building data profiling reports for collaborators
Consider alternatives when:
- You have a specific analysis plan (use the appropriate statistical tool)
- You need automated ML model building (use AutoML tools)
- You're working with domain-specific data (use domain-specific analysis tools)
- You need publication-ready figures (use matplotlib/seaborn with custom styling)
Quick Start
```bash
claude "Explore this gene expression dataset and identify interesting patterns"
```
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load and profile dataset
df = pd.read_csv("experiment_data.csv")

# Data overview
print("=== Dataset Profile ===")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nDescriptive stats:\n{df.describe()}")

# Distribution analysis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
numeric_cols = df.select_dtypes(include=[np.number]).columns[:6]
for ax, col in zip(axes.flat, numeric_cols):
    df[col].hist(bins=50, ax=ax, edgecolor="black", alpha=0.7)
    ax.set_title(col)
    ax.axvline(df[col].mean(), color="red", linestyle="--", label="mean")
    ax.axvline(df[col].median(), color="green", linestyle="--", label="median")
    ax.legend(fontsize=8)
plt.tight_layout()
plt.savefig("distributions.png", dpi=150)
```
Core Concepts
EDA Workflow
| Phase | Purpose | Tools |
|---|---|---|
| Profile | Understand data shape and types | df.info(), df.describe() |
| Clean | Handle missing data, outliers | dropna(), clip(), IQR filtering |
| Visualize | Discover patterns and distributions | histograms, scatter, heatmaps |
| Correlate | Find relationships between variables | Pearson, Spearman, mutual info |
| Cluster | Identify natural groupings | PCA, t-SNE, UMAP, k-means |
| Hypothesize | Formulate testable questions | Statistical summaries |
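As one concrete instance of the Clean phase, IQR filtering keeps values inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]. The `iqr_filter` helper and the conventional k=1.5 multiplier below are illustrative choices, not part of the skill's API:

```python
import pandas as pd

def iqr_filter(s, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] of a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

s = pd.Series([1, 2, 3, 4, 5, 100])
print(iqr_filter(s).tolist())  # → [1, 2, 3, 4, 5]
```

Raising k widens the fences and keeps more borderline points; lowering it is more aggressive.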
Correlation Analysis
```python
# Correlation matrix with significance
from scipy import stats

def correlation_with_pvalues(df):
    """Compute correlation matrix with p-values (pairwise-complete rows)."""
    cols = df.select_dtypes(include=[np.number]).columns
    n = len(cols)
    corr = np.zeros((n, n))
    pval = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                corr[i, j] = 1.0
                continue
            # Drop NaNs pairwise so both series stay aligned and equal-length
            pair = df[[cols[i], cols[j]]].dropna()
            r, p = stats.pearsonr(pair[cols[i]], pair[cols[j]])
            corr[i, j] = r
            pval[i, j] = p
    return (pd.DataFrame(corr, index=cols, columns=cols),
            pd.DataFrame(pval, index=cols, columns=cols))

corr, pvals = correlation_with_pvalues(df)

# Visualize
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.savefig("correlations.png", dpi=150)
```
Dimensionality Reduction
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap

# PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[numeric_cols].dropna())
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Variance explained: {pca.explained_variance_ratio_}")

# UMAP for nonlinear structure
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, s=10)
ax1.set_title("PCA")
ax2.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.5, s=10)
ax2.set_title("UMAP")
plt.savefig("dimensionality_reduction.png", dpi=150)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| figure_dpi | Resolution for saved figures | 150 |
| color_palette | Seaborn color scheme | viridis |
| outlier_method | IQR, z-score, or isolation forest | iqr |
| correlation_method | Pearson, Spearman, or Kendall | pearson |
| n_components | Components for PCA/UMAP | 2 |
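Assuming the skill consumes these settings from plain Python, the defaults in the table might be held in a dict like the following. The `EDA_CONFIG` name and key spellings are assumptions for illustration, not a documented interface:

```python
# Hypothetical defaults mirroring the configuration table above.
EDA_CONFIG = {
    "figure_dpi": 150,
    "color_palette": "viridis",
    "outlier_method": "iqr",          # "iqr", "zscore", or "isolation_forest"
    "correlation_method": "pearson",  # "pearson", "spearman", or "kendall"
    "n_components": 2,
}

# Validate choices before running an analysis pass
assert EDA_CONFIG["correlation_method"] in {"pearson", "spearman", "kendall"}
assert EDA_CONFIG["n_components"] >= 1
```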
Best Practices
- Profile before plotting. Run df.info(), df.describe(), and df.isnull().sum() before creating any visualizations. Understanding data types, ranges, and missingness patterns guides which visualizations are appropriate.
- Check for outliers before correlation analysis. A single extreme outlier can dominate Pearson correlation. Use Spearman rank correlation for robustness, or identify and handle outliers before computing correlations.
- Use multiple visualization types. Histograms show distributions, scatter plots show relationships, heatmaps show correlation structure, and violin plots show group comparisons. Each reveals different aspects of the data.
- Document hypotheses as you explore. Keep a running list of observations and hypotheses generated during EDA. This prevents losing insights and provides a structured starting point for formal analysis.
- Save exploration state for reproducibility. Record the exact steps of your exploratory analysis (code cells, parameters, decisions) so you can reproduce or extend the exploration later. EDA is iterative but should still be traceable.
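To see why the outlier advice above matters, compare Pearson and Spearman on synthetic data with one extreme point (the data here is fabricated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.1, size=50)   # strong positive relationship
x_out = np.append(x, 100.0)              # one extreme outlier
y_out = np.append(y, -100.0)

r_pearson, _ = stats.pearsonr(x_out, y_out)
r_spearman, _ = stats.spearmanr(x_out, y_out)
print(f"Pearson:  {r_pearson:.2f}")   # dragged negative by the single outlier
print(f"Spearman: {r_spearman:.2f}")  # rank-based, stays strongly positive
```

Spearman only sees the outlier as one displaced rank among 51, so the underlying monotone relationship survives.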
Common Issues
Visualizations are unreadable with too many variables. For high-dimensional data, select the top 20-30 variables by variance or mutual information before creating pairwise plots. Use dimensionality reduction (PCA, UMAP) for overview visualizations.
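The variance-based selection described above can be sketched in a few lines; `top_k_by_variance` is a hypothetical helper name, not part of the skill:

```python
import numpy as np
import pandas as pd

def top_k_by_variance(df, k=30):
    """Return the k numeric columns with the highest variance."""
    numeric = df.select_dtypes(include=[np.number])
    return numeric.var().nlargest(k).index.tolist()

df = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],     # zero variance
    "wide": [0.0, 10.0, -10.0, 5.0],  # largest variance
    "mild": [1.0, 2.0, 3.0, 4.0],
})
print(top_k_by_variance(df, k=2))  # → ['wide', 'mild']
```

Note that variance-based selection favors unscaled wide-range columns, so consider standardizing first if columns use different units.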
Missing data patterns distort analysis. Missing data is rarely random. Profile missingness patterns before removing rows or imputing values. Create a missing data heatmap to identify systematic patterns.
Correlation doesn't imply the relationship you expect. Spurious correlations are common in high-dimensional data. Verify strong correlations with domain knowledge, check for confounding variables, and use partial correlation to control for known confounders.
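One way to compute the partial correlation mentioned above is to regress the confounder out of both variables and correlate the residuals. This sketch assumes a single numeric confounder; `partial_corr` and the synthetic data are illustrative:

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Pearson correlation of x and y after regressing out confounder z."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)[0]

rng = np.random.default_rng(1)
z = rng.normal(size=500)                 # confounder drives both x and y
x = z + rng.normal(scale=0.5, size=500)
y = z + rng.normal(scale=0.5, size=500)

print(f"raw:     {stats.pearsonr(x, y)[0]:.2f}")  # spuriously high
print(f"partial: {partial_corr(x, y, z):.2f}")    # near zero once z is controlled
```

The raw correlation is strong only because both variables track z; the partial correlation correctly collapses toward zero.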