
Pro Exploratory Workspace

Boost productivity with this comprehensive exploratory data analysis workspace. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Pro Exploratory Workspace

A scientific computing skill for interactive exploratory data analysis (EDA) in scientific research. Pro Exploratory Workspace provides structured workflows for investigating datasets, generating hypotheses, identifying patterns, and creating visualizations that guide deeper analysis in any scientific domain.

When to Use This Skill

Choose Pro Exploratory Workspace when:

  • Starting analysis of a new scientific dataset with unknown patterns
  • Performing quality control and data validation before formal analysis
  • Generating exploratory visualizations to guide hypothesis formation
  • Building data profiling reports for collaborators

Consider alternatives when:

  • You have a specific analysis plan (use the appropriate statistical tool)
  • You need automated ML model building (use AutoML tools)
  • You're working with domain-specific data (use domain-specific analysis tools)
  • You need publication-ready figures (use matplotlib/seaborn with custom styling)

Quick Start

claude "Explore this gene expression dataset and identify interesting patterns"
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load and profile dataset df = pd.read_csv("experiment_data.csv") # Data overview print("=== Dataset Profile ===") print(f"Shape: {df.shape}") print(f"Columns: {list(df.columns)}") print(f"\nData types:\n{df.dtypes}") print(f"\nMissing values:\n{df.isnull().sum()}") print(f"\nDescriptive stats:\n{df.describe()}") # Distribution analysis fig, axes = plt.subplots(2, 3, figsize=(15, 10)) numeric_cols = df.select_dtypes(include=[np.number]).columns[:6] for ax, col in zip(axes.flat, numeric_cols): df[col].hist(bins=50, ax=ax, edgecolor="black", alpha=0.7) ax.set_title(col) ax.axvline(df[col].mean(), color="red", linestyle="--", label="mean") ax.axvline(df[col].median(), color="green", linestyle="--", label="median") ax.legend(fontsize=8) plt.tight_layout() plt.savefig("distributions.png", dpi=150)

Core Concepts

EDA Workflow

| Phase | Purpose | Tools |
|---|---|---|
| Profile | Understand data shape and types | df.info(), df.describe() |
| Clean | Handle missing data, outliers | dropna(), clip(), IQR filtering |
| Visualize | Discover patterns and distributions | histograms, scatter, heatmaps |
| Correlate | Find relationships between variables | Pearson, Spearman, mutual info |
| Cluster | Identify natural groupings | PCA, t-SNE, UMAP, k-means |
| Hypothesize | Formulate testable questions | Statistical summaries |
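
The Clean phase has no example in the Quick Start; below is a minimal sketch of IQR-based outlier filtering, assuming the `df` and `numeric_cols` defined there (the 1.5 multiplier is the conventional default, not a requirement).

```python
# IQR-based outlier filtering (sketch; assumes df and numeric_cols from the Quick Start)
def iqr_filter(frame, col, k=1.5):
    """Keep rows whose value in `col` lies within k * IQR of the quartiles."""
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[col].between(q1 - k * iqr, q3 + k * iqr)]

cleaned = df.copy()
for col in numeric_cols:
    cleaned = iqr_filter(cleaned, col)
print(f"Rows before/after IQR filtering: {len(df)} -> {len(cleaned)}")
```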

Correlation Analysis

```python
# Correlation matrix with significance
from scipy import stats

def correlation_with_pvalues(df):
    """Compute correlation matrix with p-values"""
    cols = df.select_dtypes(include=[np.number]).columns
    n = len(cols)
    corr = np.zeros((n, n))
    pval = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Drop rows where either column is missing so the two vectors stay aligned
            pair = df[[cols[i], cols[j]]].dropna()
            r, p = stats.pearsonr(pair[cols[i]], pair[cols[j]])
            corr[i, j] = r
            pval[i, j] = p
    return (pd.DataFrame(corr, index=cols, columns=cols),
            pd.DataFrame(pval, index=cols, columns=cols))

corr, pvals = correlation_with_pvalues(df)

# Visualize
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
            center=0, vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.savefig("correlations.png", dpi=150)
```
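
The two matrices returned above can also be flattened into a short list of the strongest, statistically significant pairs; one possible follow-up step is sketched below (the 0.7 and 0.01 thresholds are illustrative, not part of the skill).

```python
# Flatten the lower triangle into (variable, variable, r, p) rows (thresholds are illustrative)
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(lower).stack().rename("r").reset_index()
pairs["p"] = [pvals.loc[a, b] for a, b in zip(pairs["level_0"], pairs["level_1"])]
strong = pairs[(pairs["r"].abs() > 0.7) & (pairs["p"] < 0.01)]
print(strong.sort_values("r", key=abs, ascending=False).to_string(index=False))
```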

Dimensionality Reduction

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap

# PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[numeric_cols].dropna())

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Variance explained: {pca.explained_variance_ratio_}")

# UMAP for nonlinear structure
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, s=10)
ax1.set_title("PCA")
ax2.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.5, s=10)
ax2.set_title("UMAP")
plt.savefig("dimensionality_reduction.png", dpi=150)
```
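
The workflow table lists k-means among the clustering tools; one way to layer cluster labels onto the embeddings above might look like the sketch below (k=3 is an arbitrary choice for illustration).

```python
from sklearn.cluster import KMeans

# Cluster in the scaled feature space, then color both embeddings by cluster label
# (k=3 is arbitrary; inspect inertia or silhouette scores to choose k for real data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis", alpha=0.5, s=10)
ax1.set_title("PCA colored by k-means cluster")
ax2.scatter(X_umap[:, 0], X_umap[:, 1], c=labels, cmap="viridis", alpha=0.5, s=10)
ax2.set_title("UMAP colored by k-means cluster")
plt.savefig("clusters.png", dpi=150)
```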

Configuration

| Parameter | Description | Default |
|---|---|---|
| figure_dpi | Resolution for saved figures | 150 |
| color_palette | Seaborn color scheme | viridis |
| outlier_method | IQR, z-score, or isolation forest | iqr |
| correlation_method | Pearson, Spearman, or Kendall | pearson |
| n_components | Components for PCA/UMAP | 2 |
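
The skill does not document where these parameters live; a hypothetical way to carry them through an analysis is a plain dict whose keys mirror the table (this is a sketch, not a defined API of the skill).

```python
# Hypothetical config dict mirroring the table above (not a defined API of this skill)
config = {
    "figure_dpi": 150,
    "color_palette": "viridis",
    "outlier_method": "iqr",
    "correlation_method": "pearson",
    "n_components": 2,
}

sns.set_palette(config["color_palette"])
corr_matrix = df[numeric_cols].corr(method=config["correlation_method"])
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, cmap="RdBu_r", center=0)
plt.savefig("configured_correlations.png", dpi=config["figure_dpi"])
```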

Best Practices

  1. Profile before plotting. Run df.info(), df.describe(), and df.isnull().sum() before creating any visualizations. Understanding data types, ranges, and missingness patterns guides which visualizations are appropriate.

  2. Check for outliers before correlation analysis. A single extreme outlier can dominate Pearson correlation. Use Spearman rank correlation for robustness, or identify and handle outliers before computing correlations (a quick comparison sketch follows this list).

  3. Use multiple visualization types. Histograms show distributions, scatter plots show relationships, heatmaps show correlation structure, and violin plots show group comparisons. Each reveals different aspects of the data.

  4. Document hypotheses as you explore. Keep a running list of observations and hypotheses generated during EDA. This prevents losing insights and provides a structured starting point for formal analysis.

  5. Save exploration state for reproducibility. Record the exact steps of your exploratory analysis (code cells, parameters, decisions) so you can reproduce or extend the exploration later. EDA is iterative but should still be traceable.
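
As a quick illustration of practice 2, comparing Pearson and Spearman on the same pair of variables is a cheap robustness check; "var_a" and "var_b" below are placeholder column names.

```python
from scipy import stats

# A large gap between the two coefficients often signals outliers or nonlinearity
# ("var_a" and "var_b" are placeholder column names)
x, y = df["var_a"], df["var_b"]
valid = x.notna() & y.notna()
pearson_r, _ = stats.pearsonr(x[valid], y[valid])
spearman_rho, _ = stats.spearmanr(x[valid], y[valid])
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```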

Common Issues

Visualizations are unreadable with too many variables. For high-dimensional data, select the top 20-30 variables by variance or mutual information before creating pairwise plots. Use dimensionality reduction (PCA, UMAP) for overview visualizations.
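
One possible way to do that selection with plain pandas is sketched below (the cutoff of 20 and the plotted subset are arbitrary).

```python
# Rank numeric columns by variance and keep the top 20 before pairwise plotting
numeric = df.select_dtypes(include=[np.number])
top_cols = numeric.var().sort_values(ascending=False).head(20).index
print(f"Top-variance columns: {list(top_cols)}")

# Even 20 columns make a very large grid; plot a handful at a time
grid = sns.pairplot(numeric[top_cols[:5]].dropna())
grid.savefig("pairwise_top_variance.png", dpi=150)
```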

Missing data patterns distort analysis. Missing data is rarely random. Profile missingness patterns before removing rows or imputing values. Create a missing data heatmap to identify systematic patterns.
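
A minimal sketch of such a missingness heatmap, assuming the same df as in the earlier examples:

```python
# Missingness heatmap: dark cells mark missing entries, revealing systematic patterns
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="gray_r")
plt.title("Missing Data Pattern")
plt.xlabel("Column")
plt.ylabel("Row")
plt.tight_layout()
plt.savefig("missingness.png", dpi=150)

# Columns ranked by fraction missing
print(df.isnull().mean().sort_values(ascending=False).head(10))
```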

Correlation doesn't imply the relationship you expect. Spurious correlations are common in high-dimensional data. Verify strong correlations with domain knowledge, check for confounding variables, and use partial correlation to control for known confounders.
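
Partial correlation is not implemented in the examples above; a small sketch using least-squares residuals is shown below ("expr", "outcome", and "batch" are placeholder names for two variables and a known confounder).

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out a single confounder z."""
    x, y, z = map(np.asarray, (x, y, z))
    design = np.column_stack([z, np.ones_like(z)])
    # Residuals of x and y after removing the linear effect of z
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(res_x, res_y)

sub = df[["expr", "outcome", "batch"]].dropna()  # placeholder column names
r, p = partial_corr(sub["expr"], sub["outcome"], sub["batch"])
print(f"Partial r = {r:.3f} (p = {p:.3g})")
```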
