Pro Generate Workspace

A comprehensive skill for generating synthetic scientific datasets. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for automated data generation and synthetic dataset creation for testing, validation, and simulation purposes. Pro Generate Workspace helps you create realistic synthetic datasets that preserve statistical properties of real data while enabling reproducible benchmarking and pipeline testing.

When to Use This Skill

Choose Pro Generate Workspace when:

  • Creating synthetic datasets for pipeline testing and validation
  • Generating simulated experimental data with known ground truth
  • Building benchmark datasets for algorithm comparison
  • Producing privacy-preserving synthetic data from real data distributions

Consider alternatives when:

  • You need real experimental data (use public repositories)
  • You need data augmentation for ML training (use domain-specific augmentation)
  • You need synthetic images (use generative models)
  • You need simulated physical processes (use physics simulators)

Quick Start

```shell
claude "Generate a synthetic RNA-seq dataset with known differentially expressed genes"
```
```python
import numpy as np
import pandas as pd

def generate_rnaseq_dataset(n_genes=10000, n_samples=20, n_de_genes=500,
                            fold_change=2.0, seed=42):
    """Generate synthetic RNA-seq count data with known DE genes."""
    np.random.seed(seed)

    # Base expression levels (log-normal distribution)
    base_means = np.exp(np.random.normal(5, 2, n_genes))
    base_means = np.clip(base_means, 1, 50000)

    # Generate counts for control and treatment groups
    n_control = n_samples // 2
    n_treatment = n_samples - n_control
    counts = np.zeros((n_genes, n_samples))

    for i in range(n_genes):
        mean = base_means[i]
        dispersion = 0.1 + 10 / mean  # Higher dispersion for low counts
        # Control samples
        for j in range(n_control):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + mean * dispersion)
            )
        # Treatment samples (with fold change for DE genes)
        treatment_mean = mean * fold_change if i < n_de_genes else mean
        for j in range(n_control, n_samples):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + treatment_mean * dispersion)
            )

    # Create DataFrame
    gene_names = [f"Gene_{i:05d}" for i in range(n_genes)]
    sample_names = ([f"Control_{i}" for i in range(n_control)] +
                    [f"Treatment_{i}" for i in range(n_treatment)])
    df = pd.DataFrame(counts, index=gene_names, columns=sample_names)

    # Ground truth: the first n_de_genes genes are differentially expressed
    de_genes = gene_names[:n_de_genes]
    return df, de_genes

counts, true_de = generate_rnaseq_dataset()
print(f"Dataset: {counts.shape[0]} genes × {counts.shape[1]} samples")
print(f"True DE genes: {len(true_de)}")
```

Core Concepts

Synthetic Data Types

| Type | Method | Use Case |
|------|--------|----------|
| Parametric | Sample from known distributions | Hypothesis testing benchmarks |
| Permutation | Shuffle real data labels | Null distribution testing |
| Simulation | Model physical/biological processes | Process understanding |
| Generative | Learn and sample from real data | Privacy-preserving datasets |
| Augmentation | Transform existing data points | Training data expansion |
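For example, the permutation approach from the table can be sketched as shuffling group labels to build a null distribution of a test statistic (`values` and `labels` are hypothetical inputs, not part of this skill's API):

```python
import numpy as np

def permutation_null(values, labels, n_perm=1000, seed=0):
    """Build a null distribution of mean differences by shuffling labels."""
    rng = np.random.default_rng(seed)
    observed = values[labels == 1].mean() - values[labels == 0].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling labels breaks any real association with the values
        shuffled = rng.permutation(labels)
        null[i] = values[shuffled == 1].mean() - values[shuffled == 0].mean()
    # Two-sided empirical p-value against the permutation null
    p_value = (np.abs(null) >= np.abs(observed)).mean()
    return observed, null, p_value
```

Because the null is built from the data itself, no distributional assumptions are needed, which is why permutation is the standard choice for null-distribution testing.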

Domain-Specific Generators

```python
import numpy as np
import pandas as pd

# Genomics: Simulated variant data
def generate_vcf_data(n_variants=1000, n_samples=100):
    bases = np.array(["A", "C", "G", "T"])
    chroms = np.random.choice([f"chr{i}" for i in range(1, 23)], n_variants)
    positions = np.random.randint(1, 250_000_000, n_variants)
    ref = np.random.choice(bases, n_variants)
    # ALT must differ from REF, otherwise the record is not a variant
    alt = np.array([np.random.choice(bases[bases != r]) for r in ref])
    genotypes = np.random.choice(["0/0", "0/1", "1/1"],
                                 (n_variants, n_samples),
                                 p=[0.7, 0.25, 0.05])
    return pd.DataFrame({
        "CHROM": chroms,
        "POS": positions,
        "REF": ref,
        "ALT": alt,
        **{f"Sample_{i}": genotypes[:, i] for i in range(n_samples)}
    })

# Clinical: Simulated patient cohort
def generate_clinical_data(n_patients=500, seed=42):
    np.random.seed(seed)
    return pd.DataFrame({
        "patient_id": range(n_patients),
        "age": np.random.normal(55, 15, n_patients).clip(18, 95).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients),
        "bmi": np.random.normal(27, 5, n_patients).clip(15, 50).round(1),
        "treatment": np.random.choice(["Drug", "Placebo"], n_patients),
        "response": np.random.binomial(1, 0.3, n_patients),
        "survival_months": np.random.exponential(24, n_patients).round(1),
    })
```

Validation Framework

```python
import numpy as np
import pandas as pd
from scipy import stats

def validate_synthetic_data(real_df, synthetic_df):
    """Compare statistical properties of real and synthetic data."""
    report = {}
    for col in real_df.select_dtypes(include=[np.number]).columns:
        if col in synthetic_df.columns:
            ks_stat, ks_p = stats.ks_2samp(real_df[col].dropna(),
                                           synthetic_df[col].dropna())
            report[col] = {
                "real_mean": real_df[col].mean(),
                "synth_mean": synthetic_df[col].mean(),
                "ks_statistic": ks_stat,
                "ks_pvalue": ks_p,
                "distributions_similar": ks_p > 0.05,
            }
    return pd.DataFrame(report).T
```
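As a standalone illustration of what the KS comparison detects, here is a minimal sketch with hypothetical `age` and `bmi` columns, where the synthetic `bmi` is deliberately mis-specified (mean shifted from 27 to 30) and should be flagged as dissimilar:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
real = pd.DataFrame({"age": rng.normal(55, 15, 500),
                     "bmi": rng.normal(27, 5, 500)})
# Synthetic bmi is mis-specified: mean shifted from 27 to 30
synth = pd.DataFrame({"age": rng.normal(55, 15, 500),
                      "bmi": rng.normal(30, 5, 500)})

results = {}
for col in real.columns:
    ks_stat, ks_p = stats.ks_2samp(real[col], synth[col])
    results[col] = ks_p
    print(f"{col}: KS p-value = {ks_p:.3g}")
```

A well-matched column yields a large p-value, while even a modest mean shift at this sample size drives the p-value far below 0.05.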

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `n_samples` | Number of samples to generate | 100 |
| `n_features` | Number of features/genes | 10000 |
| `effect_size` | Magnitude of simulated effects | 2.0 |
| `noise_level` | Amount of random noise | 0.1 |
| `seed` | Random seed for reproducibility | 42 |

Best Practices

  1. Always set a random seed. Reproducible synthetic data requires fixed seeds. Document the seed value alongside your results so others can regenerate the exact same dataset.

  2. Match the statistical properties of real data. Fit distributions to your real data, then sample from those distributions. Uniform or normal distributions rarely match the characteristics of real experimental data.

  3. Include known ground truth for benchmarking. The main advantage of synthetic data is knowing the answer. Always generate and return the ground truth (true DE genes, true clusters, true labels) alongside the synthetic data.

  4. Validate synthetic data against real distributions. Use KS tests, QQ plots, and correlation structure comparisons to verify that synthetic data resembles real data. Poor synthetic data produces misleading benchmarks.

  5. Generate multiple datasets for robust evaluation. A single synthetic dataset may be unrepresentative. Generate 10-100 datasets with different seeds to evaluate methods across data variations and report aggregate performance.
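Practices 1 and 5 combine naturally: fix a base seed and sweep over it, then report aggregate performance. A minimal sketch, where `generate` and `evaluate` are placeholders for your own generator and metric:

```python
import numpy as np

def evaluate_over_seeds(generate, evaluate, seeds=range(10)):
    """Run an evaluation across datasets generated with different seeds.

    generate(seed) must return (data, ground_truth);
    evaluate(data, ground_truth) must return a scalar score.
    """
    scores = []
    for seed in seeds:
        data, truth = generate(seed=seed)
        scores.append(evaluate(data, truth))
    scores = np.asarray(scores)
    # Report aggregate performance across data variations
    return {"mean": scores.mean(), "std": scores.std(), "n": len(scores)}
```

Reporting the standard deviation alongside the mean makes it visible when a method's performance depends heavily on which synthetic dataset it happened to see.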

Common Issues

Synthetic data doesn't match real data distributions. Simple parametric models miss complex structure in real data (multimodality, correlations, outliers). Use more sophisticated generators: fit mixture models, preserve correlation structure, or use copula-based methods.
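One way to preserve correlation structure, as suggested above, is a Gaussian-copula-style sketch: estimate the mean and covariance of the real data and sample from a multivariate normal. This assumes roughly Gaussian marginals; mixture models or rank-based copulas are needed for multimodal or heavy-tailed data.

```python
import numpy as np

def sample_with_correlations(real, n_samples, seed=0):
    """Sample synthetic rows preserving the mean and covariance of
    `real` (an n_observations × n_features array)."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    # Multivariate normal reproduces pairwise correlations exactly in expectation
    return rng.multivariate_normal(mean, cov, size=n_samples)
```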

Pipeline works on synthetic data but fails on real data. Synthetic data is typically "cleaner" than real data. Add realistic noise: missing values, batch effects, outliers, and technical artifacts. Perfect synthetic data gives false confidence.
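Injecting the noise sources above can be sketched as follows (the rates and the 10x outlier factor are illustrative defaults, not tuned values):

```python
import numpy as np

def add_realistic_noise(data, missing_rate=0.05, outlier_rate=0.01,
                        batch_shift=0.5, seed=0):
    """Inject missing values, outliers, and a two-batch shift into a
    genes × samples numeric array."""
    rng = np.random.default_rng(seed)
    noisy = data.astype(float).copy()
    # Missing values: randomly mask entries
    mask = rng.random(noisy.shape) < missing_rate
    noisy[mask] = np.nan
    # Outliers: inflate a random subset of entries by 10x
    out = rng.random(noisy.shape) < outlier_rate
    noisy[out] *= 10
    # Batch effect: shift the second half of the samples (columns)
    half = noisy.shape[1] // 2
    noisy[:, half:] += batch_shift
    return noisy
```

Running a pipeline on both the clean and the noised version of the same dataset quickly reveals which steps silently assume complete, artifact-free input.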

Generated counts have unrealistic values. Negative binomial distribution parameters must match real RNA-seq count distributions. Fit dispersion-mean relationships from real data rather than using fixed parameters. Plot the generated distribution against real data to verify.
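Fitting the dispersion-mean relationship from real counts can be sketched with a per-gene method-of-moments estimate followed by a simple `a + b/mean` trend fit (loosely in the spirit of DESeq2-style dispersion trends; this is an illustrative sketch, not that package's estimator):

```python
import numpy as np

def fit_dispersion_trend(counts):
    """Estimate per-gene NB dispersions from a genes × samples count
    matrix and fit a dispersion ≈ a + b/mean trend."""
    means = counts.mean(axis=1)
    variances = counts.var(axis=1, ddof=1)
    # Method-of-moments for NB: var = mean + dispersion * mean^2
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = (variances - means) / means ** 2
    ok = (means > 1) & (disp > 0)
    # Least-squares fit of dispersion = a + b / mean
    X = np.column_stack([np.ones(ok.sum()), 1.0 / means[ok]])
    coef, *_ = np.linalg.lstsq(X, disp[ok], rcond=None)
    return coef[0], coef[1]  # a: asymptotic dispersion, b: low-count trend
```

The fitted `(a, b)` can then replace the fixed `dispersion = 0.1 + 10 / mean` rule in the Quick Start generator so that simulated counts track the mean-variance behavior of your actual data.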
