# Pro Generate Workspace

A comprehensive skill for generating synthetic scientific datasets. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
A scientific computing skill for automated data generation and synthetic dataset creation for testing, validation, and simulation purposes. Pro Generate Workspace helps you create realistic synthetic datasets that preserve statistical properties of real data while enabling reproducible benchmarking and pipeline testing.
## When to Use This Skill
Choose Pro Generate Workspace when:
- Creating synthetic datasets for pipeline testing and validation
- Generating simulated experimental data with known ground truth
- Building benchmark datasets for algorithm comparison
- Producing privacy-preserving synthetic data from real data distributions
Consider alternatives when:
- You need real experimental data (use public repositories)
- You need data augmentation for ML training (use domain-specific augmentation)
- You need synthetic images (use generative models)
- You need simulated physical processes (use physics simulators)
## Quick Start

```bash
claude "Generate a synthetic RNA-seq dataset with known differentially expressed genes"
```
```python
import numpy as np
import pandas as pd
from scipy import stats


def generate_rnaseq_dataset(n_genes=10000, n_samples=20, n_de_genes=500,
                            fold_change=2.0, seed=42):
    """Generate synthetic RNA-seq count data with known DE genes."""
    np.random.seed(seed)

    # Base expression levels (log-normal distribution)
    base_means = np.exp(np.random.normal(5, 2, n_genes))
    base_means = np.clip(base_means, 1, 50000)

    # Generate counts for control and treatment groups
    n_control = n_samples // 2
    n_treatment = n_samples - n_control
    counts = np.zeros((n_genes, n_samples))

    for i in range(n_genes):
        mean = base_means[i]
        dispersion = 0.1 + 10 / mean  # Higher dispersion for low counts
        # Control samples
        for j in range(n_control):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + mean * dispersion)
            )
        # Treatment samples (with fold change for DE genes)
        treatment_mean = mean * fold_change if i < n_de_genes else mean
        for j in range(n_control, n_samples):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + treatment_mean * dispersion)
            )

    # Create DataFrame
    gene_names = [f"Gene_{i:05d}" for i in range(n_genes)]
    sample_names = ([f"Control_{i}" for i in range(n_control)]
                    + [f"Treatment_{i}" for i in range(n_treatment)])
    df = pd.DataFrame(counts, index=gene_names, columns=sample_names)

    # Ground truth
    de_genes = gene_names[:n_de_genes]
    return df, de_genes


counts, true_de = generate_rnaseq_dataset()
print(f"Dataset: {counts.shape[0]} genes × {counts.shape[1]} samples")
print(f"True DE genes: {len(true_de)}")
```
## Core Concepts

### Synthetic Data Types
| Type | Method | Use Case |
|---|---|---|
| Parametric | Sample from known distributions | Hypothesis testing benchmarks |
| Permutation | Shuffle real data labels | Null distribution testing |
| Simulation | Model physical/biological process | Process understanding |
| Generative | Learn and sample from real data | Privacy-preserving datasets |
| Augmentation | Transform existing data points | Training data expansion |
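The permutation row above can be sketched concretely: shuffle group labels to build a null distribution of the test statistic, then compare the observed statistic against it. A minimal sketch, assuming a two-group difference-in-means statistic (the `permutation_null` helper and the example effect size are illustrative, not part of the skill):

```python
import numpy as np


def permutation_null(values, labels, n_perm=1000, seed=42):
    """Build a null distribution of group-mean differences by shuffling labels."""
    rng = np.random.default_rng(seed)
    observed = values[labels == 1].mean() - values[labels == 0].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(labels)  # break the label-value association
        null[i] = values[shuffled == 1].mean() - values[shuffled == 0].mean()
    # Two-sided empirical p-value
    p = (np.abs(null) >= np.abs(observed)).mean()
    return observed, null, p


# Example: two groups of 50 with a true mean shift of 1
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
labels = np.array([0] * 50 + [1] * 50)
obs, null, p = permutation_null(values, labels)
```

Because the labels are shuffled, the null distribution reflects what the statistic looks like when no real group effect exists, which is exactly the ground-truth guarantee the table describes.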
### Domain-Specific Generators
```python
import numpy as np
import pandas as pd


# Genomics: Simulated variant data
def generate_vcf_data(n_variants=1000, n_samples=100):
    chroms = np.random.choice([f"chr{i}" for i in range(1, 23)], n_variants)
    positions = np.random.randint(1, 250_000_000, n_variants)
    ref = np.random.choice(["A", "C", "G", "T"], n_variants)
    alt = np.random.choice(["A", "C", "G", "T"], n_variants)
    genotypes = np.random.choice(["0/0", "0/1", "1/1"], (n_variants, n_samples),
                                 p=[0.7, 0.25, 0.05])
    return pd.DataFrame({
        "CHROM": chroms,
        "POS": positions,
        "REF": ref,
        "ALT": alt,
        **{f"Sample_{i}": genotypes[:, i] for i in range(n_samples)},
    })


# Clinical: Simulated patient cohort
def generate_clinical_data(n_patients=500, seed=42):
    np.random.seed(seed)
    return pd.DataFrame({
        "patient_id": range(n_patients),
        "age": np.random.normal(55, 15, n_patients).clip(18, 95).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients),
        "bmi": np.random.normal(27, 5, n_patients).clip(15, 50).round(1),
        "treatment": np.random.choice(["Drug", "Placebo"], n_patients),
        "response": np.random.binomial(1, 0.3, n_patients),
        "survival_months": np.random.exponential(24, n_patients).round(1),
    })
```
### Validation Framework
```python
import numpy as np
import pandas as pd
from scipy import stats


def validate_synthetic_data(real_df, synthetic_df):
    """Compare statistical properties of real and synthetic data."""
    report = {}
    for col in real_df.select_dtypes(include=[np.number]).columns:
        if col in synthetic_df.columns:
            ks_stat, ks_p = stats.ks_2samp(real_df[col].dropna(),
                                           synthetic_df[col].dropna())
            report[col] = {
                "real_mean": real_df[col].mean(),
                "synth_mean": synthetic_df[col].mean(),
                "ks_statistic": ks_stat,
                "ks_pvalue": ks_p,
                "distributions_similar": ks_p > 0.05,
            }
    return pd.DataFrame(report).T
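A quick end-to-end sketch of the fit-then-validate loop: fit a distribution to a real column, sample a synthetic version from the fit, and compare the two with a KS test, the same per-column check `validate_synthetic_data` performs. The `bmi` column and the normal fit are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Stand-in for a real dataset; in practice load your actual measurements
real_df = pd.DataFrame({"bmi": rng.normal(27, 5, 500)})

# Fit a normal distribution to the real column, then sample a synthetic version
mu, sigma = real_df["bmi"].mean(), real_df["bmi"].std()
synthetic_df = pd.DataFrame({"bmi": rng.normal(mu, sigma, 500)})

# Two-sample KS test: small statistic / large p-value means similar distributions
ks_stat, ks_p = stats.ks_2samp(real_df["bmi"], synthetic_df["bmi"])
```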
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `n_samples` | Number of samples to generate | 100 |
| `n_features` | Number of features/genes | 10000 |
| `effect_size` | Magnitude of simulated effects | 2.0 |
| `noise_level` | Amount of random noise | 0.1 |
| `seed` | Random seed for reproducibility | 42 |
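One way these parameters might be wired into a simple parametric generator; the `config` dict and the signal construction below are a hypothetical sketch of how the knobs interact, not the skill's actual plumbing:

```python
import numpy as np

config = {"n_samples": 100, "n_features": 200, "effect_size": 2.0,
          "noise_level": 0.1, "seed": 42}

rng = np.random.default_rng(config["seed"])

# Half the features carry the simulated effect in the treatment group
signal = np.zeros(config["n_features"])
signal[: config["n_features"] // 2] = config["effect_size"]

control = rng.normal(0, 1, (config["n_samples"] // 2, config["n_features"]))
treatment = signal + rng.normal(0, 1, (config["n_samples"] // 2, config["n_features"]))
# noise_level adds measurement noise on top of the simulated effect
treatment += rng.normal(0, config["noise_level"], treatment.shape)
```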
## Best Practices

- **Always set a random seed.** Reproducible synthetic data requires fixed seeds. Document the seed value alongside your results so others can regenerate the exact same dataset.
- **Match the statistical properties of real data.** Fit distributions to your real data, then sample from those distributions. Uniform or normal distributions rarely match the characteristics of real experimental data.
- **Include known ground truth for benchmarking.** The main advantage of synthetic data is knowing the answer. Always generate and return the ground truth (true DE genes, true clusters, true labels) alongside the synthetic data.
- **Validate synthetic data against real distributions.** Use KS tests, Q-Q plots, and correlation structure comparisons to verify that synthetic data resembles real data. Poor synthetic data produces misleading benchmarks.
- **Generate multiple datasets for robust evaluation.** A single synthetic dataset may be unrepresentative. Generate 10-100 datasets with different seeds to evaluate methods across data variations and report aggregate performance.
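The multi-seed practice can be sketched as a loop: one dataset per seed, one score per dataset, then aggregate statistics across seeds. The `evaluate_method` placeholder stands in for whatever benchmark your pipeline actually runs:

```python
import numpy as np


def evaluate_method(dataset, rng):
    """Placeholder metric: a hypothetical stand-in for a real benchmark run."""
    return dataset.mean() + rng.normal(0, 0.01)


# One synthetic dataset per seed, then report aggregate performance
scores = []
for seed in range(20):
    rng = np.random.default_rng(seed)
    dataset = rng.normal(1.0, 0.5, size=1000)
    scores.append(evaluate_method(dataset, rng))

mean_score = float(np.mean(scores))
std_score = float(np.std(scores))
```

Reporting `mean_score ± std_score` across seeds exposes method variance that a single dataset would hide.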
## Common Issues
**Synthetic data doesn't match real data distributions.** Simple parametric models miss complex structure in real data (multimodality, correlations, outliers). Use more sophisticated generators: fit mixture models, preserve correlation structure, or use copula-based methods.
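A minimal sketch of the copula idea, assuming a Gaussian copula: estimate correlation on rank-transformed normal scores, sample correlated normals, then map each column back through the real data's empirical quantiles so marginals are preserved along with the correlation structure. The helper name and test data are illustrative:

```python
import numpy as np
from scipy import stats


def gaussian_copula_sample(real, n_samples, seed=42):
    """Sample synthetic rows preserving the correlation structure of `real`
    (n_obs x n_features) while matching each column's empirical marginal."""
    rng = np.random.default_rng(seed)
    # Rank-transform each column to normal scores, estimate correlation there
    normal_scores = stats.norm.ppf(
        (stats.rankdata(real, axis=0) - 0.5) / real.shape[0]
    )
    corr = np.corrcoef(normal_scores, rowvar=False)
    # Sample correlated normals, map back through empirical quantiles
    z = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_samples)
    u = stats.norm.cdf(z)
    return np.column_stack([
        np.quantile(real[:, j], u[:, j]) for j in range(real.shape[1])
    ])


# Example: preserve a 0.8 correlation between two features
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
synth = gaussian_copula_sample(real, 500)
r = float(np.corrcoef(synth, rowvar=False)[0, 1])
```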
**Pipeline works on synthetic data but fails on real data.** Synthetic data is typically "cleaner" than real data. Add realistic noise: missing values, batch effects, outliers, and technical artifacts. Perfect synthetic data gives false confidence.
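A sketch of deliberately degrading clean synthetic data with the artifacts mentioned above; the helper name, rates, and batch-shift scheme are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd


def add_realistic_noise(df, missing_rate=0.05, outlier_rate=0.01,
                        batch_shift=0.5, seed=42):
    """Degrade clean synthetic data with missing values, outliers, and a batch effect."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(dtype=float, copy=True)
    # Batch effect: shift the second half of the samples (columns) by a constant
    values[:, values.shape[1] // 2:] += batch_shift
    # Outliers: inflate a random subset of entries
    outlier_mask = rng.random(values.shape) < outlier_rate
    values[outlier_mask] *= 10
    # Missing values scattered uniformly at random
    missing_mask = rng.random(values.shape) < missing_rate
    values[missing_mask] = np.nan
    return pd.DataFrame(values, index=df.index, columns=df.columns)


clean = pd.DataFrame(np.ones((100, 10)))
noisy = add_realistic_noise(clean)
```

A pipeline that survives `noisy` (handles NaNs, is robust to the batch shift) is far more likely to survive real data than one tuned on `clean`.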
**Generated counts have unrealistic values.** Negative binomial distribution parameters must match real RNA-seq count distributions. Fit dispersion-mean relationships from real data rather than using fixed parameters. Plot the generated distribution against real data to verify.
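A sketch of fitting a dispersion-mean trend from a count matrix, using per-gene method-of-moments estimates (var = mean + dispersion · mean²) and a trend of the form a + b/mean, similar in spirit to the parametric trend DESeq2 fits. The helper and the simulated input are assumptions for illustration:

```python
import numpy as np


def fit_dispersion_trend(counts):
    """Fit dispersion ~ a + b/mean from a genes-x-samples count matrix."""
    means = counts.mean(axis=1)
    variances = counts.var(axis=1, ddof=1)
    keep = means > 0
    # Method-of-moments per-gene dispersion: (var - mean) / mean^2, floored at ~0
    disp = np.maximum((variances[keep] - means[keep]) / means[keep] ** 2, 1e-8)
    # Least-squares fit of dispersion against 1/mean
    X = np.column_stack([np.ones(keep.sum()), 1.0 / means[keep]])
    a, b = np.linalg.lstsq(X, disp, rcond=None)[0]
    return a, b


# Simulated "real" counts with known dispersion 0.2 to check recovery
rng = np.random.default_rng(42)
means = np.exp(rng.normal(5, 1, 500))
true_disp = 0.2
counts = rng.negative_binomial(1 / true_disp,
                               1 / (1 + means[:, None] * true_disp),
                               size=(500, 50))
a, b = fit_dispersion_trend(counts)
```

Feeding the fitted `(a, b)` back into a generator (dispersion = a + b/mean per gene) produces counts whose mean-variance relationship tracks the real data instead of a fixed guess.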