Pro Generate Workspace

A comprehensive skill for generating synthetic scientific datasets. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for automated data generation and synthetic dataset creation for testing, validation, and simulation purposes. Pro Generate Workspace helps you create realistic synthetic datasets that preserve statistical properties of real data while enabling reproducible benchmarking and pipeline testing.

When to Use This Skill

Choose Pro Generate Workspace when:

  • Creating synthetic datasets for pipeline testing and validation
  • Generating simulated experimental data with known ground truth
  • Building benchmark datasets for algorithm comparison
  • Producing privacy-preserving synthetic data from real data distributions

Consider alternatives when:

  • You need real experimental data (use public repositories)
  • You need data augmentation for ML training (use domain-specific augmentation)
  • You need synthetic images (use generative models)
  • You need simulated physical processes (use physics simulators)

Quick Start

```shell
claude "Generate a synthetic RNA-seq dataset with known differentially expressed genes"
```
```python
import numpy as np
import pandas as pd

def generate_rnaseq_dataset(n_genes=10000, n_samples=20, n_de_genes=500,
                            fold_change=2.0, seed=42):
    """Generate synthetic RNA-seq count data with known DE genes."""
    np.random.seed(seed)

    # Base expression levels (log-normal distribution)
    base_means = np.exp(np.random.normal(5, 2, n_genes))
    base_means = np.clip(base_means, 1, 50000)

    # Generate counts for control and treatment groups
    n_control = n_samples // 2
    n_treatment = n_samples - n_control
    counts = np.zeros((n_genes, n_samples))

    for i in range(n_genes):
        mean = base_means[i]
        dispersion = 0.1 + 10 / mean  # Higher dispersion for low counts
        # Control samples
        for j in range(n_control):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + mean * dispersion)
            )
        # Treatment samples (with fold change for DE genes)
        treatment_mean = mean * fold_change if i < n_de_genes else mean
        for j in range(n_control, n_samples):
            counts[i, j] = np.random.negative_binomial(
                1 / dispersion, 1 / (1 + treatment_mean * dispersion)
            )

    # Create DataFrame
    gene_names = [f"Gene_{i:05d}" for i in range(n_genes)]
    sample_names = ([f"Control_{i}" for i in range(n_control)] +
                    [f"Treatment_{i}" for i in range(n_treatment)])
    df = pd.DataFrame(counts, index=gene_names, columns=sample_names)

    # Ground truth: the first n_de_genes genes are differentially expressed
    de_genes = gene_names[:n_de_genes]
    return df, de_genes

counts, true_de = generate_rnaseq_dataset()
print(f"Dataset: {counts.shape[0]} genes × {counts.shape[1]} samples")
print(f"True DE genes: {len(true_de)}")
```

Core Concepts

Synthetic Data Types

| Type | Method | Use Case |
|------|--------|----------|
| Parametric | Sample from known distributions | Hypothesis testing benchmarks |
| Permutation | Shuffle real data labels | Null distribution testing |
| Simulation | Model physical/biological processes | Process understanding |
| Generative | Learn and sample from real data | Privacy-preserving datasets |
| Augmentation | Transform existing data points | Training data expansion |
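For example, the permutation approach from the table can be sketched as shuffling group labels to build a null distribution of a test statistic (`values` and `labels` are hypothetical inputs, not part of this skill's API):

```python
import numpy as np

def permutation_null(values, labels, n_perm=1000, seed=0):
    """Build a null distribution of mean differences by shuffling labels."""
    rng = np.random.default_rng(seed)
    observed = values[labels == 1].mean() - values[labels == 0].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling labels breaks any real association with the values
        shuffled = rng.permutation(labels)
        null[i] = values[shuffled == 1].mean() - values[shuffled == 0].mean()
    # Two-sided empirical p-value against the permutation null
    p_value = (np.abs(null) >= np.abs(observed)).mean()
    return observed, null, p_value
```

Because the null is built from the data itself, no distributional assumptions are needed, which is why permutation is the standard choice for null-distribution testing.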

Domain-Specific Generators

```python
import numpy as np
import pandas as pd

# Genomics: Simulated variant data
def generate_vcf_data(n_variants=1000, n_samples=100):
    bases = np.array(["A", "C", "G", "T"])
    chroms = np.random.choice([f"chr{i}" for i in range(1, 23)], n_variants)
    positions = np.random.randint(1, 250_000_000, n_variants)
    ref = np.random.choice(bases, n_variants)
    # ALT must differ from REF, otherwise the record is not a variant
    alt = np.array([np.random.choice(bases[bases != r]) for r in ref])
    genotypes = np.random.choice(["0/0", "0/1", "1/1"],
                                 (n_variants, n_samples),
                                 p=[0.7, 0.25, 0.05])
    return pd.DataFrame({
        "CHROM": chroms,
        "POS": positions,
        "REF": ref,
        "ALT": alt,
        **{f"Sample_{i}": genotypes[:, i] for i in range(n_samples)}
    })

# Clinical: Simulated patient cohort
def generate_clinical_data(n_patients=500, seed=42):
    np.random.seed(seed)
    return pd.DataFrame({
        "patient_id": range(n_patients),
        "age": np.random.normal(55, 15, n_patients).clip(18, 95).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients),
        "bmi": np.random.normal(27, 5, n_patients).clip(15, 50).round(1),
        "treatment": np.random.choice(["Drug", "Placebo"], n_patients),
        "response": np.random.binomial(1, 0.3, n_patients),
        "survival_months": np.random.exponential(24, n_patients).round(1),
    })
```

Validation Framework

```python
import numpy as np
import pandas as pd
from scipy import stats

def validate_synthetic_data(real_df, synthetic_df):
    """Compare statistical properties of real and synthetic data."""
    report = {}
    for col in real_df.select_dtypes(include=[np.number]).columns:
        if col in synthetic_df.columns:
            ks_stat, ks_p = stats.ks_2samp(real_df[col].dropna(),
                                           synthetic_df[col].dropna())
            report[col] = {
                "real_mean": real_df[col].mean(),
                "synth_mean": synthetic_df[col].mean(),
                "ks_statistic": ks_stat,
                "ks_pvalue": ks_p,
                "distributions_similar": ks_p > 0.05,
            }
    return pd.DataFrame(report).T
```
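As a standalone illustration of what the KS comparison detects, here is a minimal sketch with hypothetical `age` and `bmi` columns, where the synthetic `bmi` is deliberately mis-specified (mean shifted from 27 to 30) and should be flagged as dissimilar:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
real = pd.DataFrame({"age": rng.normal(55, 15, 500),
                     "bmi": rng.normal(27, 5, 500)})
# Synthetic bmi is mis-specified: mean shifted from 27 to 30
synth = pd.DataFrame({"age": rng.normal(55, 15, 500),
                      "bmi": rng.normal(30, 5, 500)})

results = {}
for col in real.columns:
    ks_stat, ks_p = stats.ks_2samp(real[col], synth[col])
    results[col] = ks_p
    print(f"{col}: KS p-value = {ks_p:.3g}")
```

A well-matched column yields a large p-value, while even a modest mean shift at this sample size drives the p-value far below 0.05.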

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `n_samples` | Number of samples to generate | 100 |
| `n_features` | Number of features/genes | 10000 |
| `effect_size` | Magnitude of simulated effects | 2.0 |
| `noise_level` | Amount of random noise | 0.1 |
| `seed` | Random seed for reproducibility | 42 |

Best Practices

  1. Always set a random seed. Reproducible synthetic data requires fixed seeds. Document the seed value alongside your results so others can regenerate the exact same dataset.

  2. Match the statistical properties of real data. Fit distributions to your real data, then sample from those distributions. Uniform or normal distributions rarely match the characteristics of real experimental data.

  3. Include known ground truth for benchmarking. The main advantage of synthetic data is knowing the answer. Always generate and return the ground truth (true DE genes, true clusters, true labels) alongside the synthetic data.

  4. Validate synthetic data against real distributions. Use KS tests, QQ plots, and correlation structure comparisons to verify that synthetic data resembles real data. Poor synthetic data produces misleading benchmarks.

  5. Generate multiple datasets for robust evaluation. A single synthetic dataset may be unrepresentative. Generate 10-100 datasets with different seeds to evaluate methods across data variations and report aggregate performance.
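Practices 1 and 5 combine naturally: fix a base seed and sweep over it, then report aggregate performance. A minimal sketch, where `generate` and `evaluate` are placeholders for your own generator and metric:

```python
import numpy as np

def evaluate_over_seeds(generate, evaluate, seeds=range(10)):
    """Run an evaluation across datasets generated with different seeds.

    generate(seed) must return (data, ground_truth);
    evaluate(data, ground_truth) must return a scalar score.
    """
    scores = []
    for seed in seeds:
        data, truth = generate(seed=seed)
        scores.append(evaluate(data, truth))
    scores = np.asarray(scores)
    # Report aggregate performance across data variations
    return {"mean": scores.mean(), "std": scores.std(), "n": len(scores)}
```

Reporting the standard deviation alongside the mean makes it visible when a method's performance depends heavily on which synthetic dataset it happened to see.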

Common Issues

Synthetic data doesn't match real data distributions. Simple parametric models miss complex structure in real data (multimodality, correlations, outliers). Use more sophisticated generators: fit mixture models, preserve correlation structure, or use copula-based methods.
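One way to preserve correlation structure, as suggested above, is a Gaussian-copula-style sketch: estimate the mean and covariance of the real data and sample from a multivariate normal. This assumes roughly Gaussian marginals; mixture models or rank-based copulas are needed for multimodal or heavy-tailed data.

```python
import numpy as np

def sample_with_correlations(real, n_samples, seed=0):
    """Sample synthetic rows preserving the mean and covariance of
    `real` (an n_observations × n_features array)."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    # Multivariate normal reproduces pairwise correlations exactly in expectation
    return rng.multivariate_normal(mean, cov, size=n_samples)
```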

Pipeline works on synthetic data but fails on real data. Synthetic data is typically "cleaner" than real data. Add realistic noise: missing values, batch effects, outliers, and technical artifacts. Perfect synthetic data gives false confidence.
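Injecting the noise sources above can be sketched as follows (the rates and the 10x outlier factor are illustrative defaults, not tuned values):

```python
import numpy as np

def add_realistic_noise(data, missing_rate=0.05, outlier_rate=0.01,
                        batch_shift=0.5, seed=0):
    """Inject missing values, outliers, and a two-batch shift into a
    genes × samples numeric array."""
    rng = np.random.default_rng(seed)
    noisy = data.astype(float).copy()
    # Missing values: randomly mask entries
    mask = rng.random(noisy.shape) < missing_rate
    noisy[mask] = np.nan
    # Outliers: inflate a random subset of entries by 10x
    out = rng.random(noisy.shape) < outlier_rate
    noisy[out] *= 10
    # Batch effect: shift the second half of the samples (columns)
    half = noisy.shape[1] // 2
    noisy[:, half:] += batch_shift
    return noisy
```

Running a pipeline on both the clean and the noised version of the same dataset quickly reveals which steps silently assume complete, artifact-free input.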

Generated counts have unrealistic values. Negative binomial distribution parameters must match real RNA-seq count distributions. Fit dispersion-mean relationships from real data rather than using fixed parameters. Plot the generated distribution against real data to verify.
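Fitting the dispersion-mean relationship from real counts can be sketched with a per-gene method-of-moments estimate followed by a simple `a + b/mean` trend fit (loosely in the spirit of DESeq2-style dispersion trends; this is an illustrative sketch, not that package's estimator):

```python
import numpy as np

def fit_dispersion_trend(counts):
    """Estimate per-gene NB dispersions from a genes × samples count
    matrix and fit a dispersion ≈ a + b/mean trend."""
    means = counts.mean(axis=1)
    variances = counts.var(axis=1, ddof=1)
    # Method-of-moments for NB: var = mean + dispersion * mean^2
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = (variances - means) / means ** 2
    ok = (means > 1) & (disp > 0)
    # Least-squares fit of dispersion = a + b / mean
    X = np.column_stack([np.ones(ok.sum()), 1.0 / means[ok]])
    coef, *_ = np.linalg.lstsq(X, disp[ok], rcond=None)
    return coef[0], coef[1]  # a: asymptotic dispersion, b: low-count trend
```

The fitted `(a, b)` can then replace the fixed `dispersion = 0.1 + 10 / mean` rule in the Quick Start generator so that simulated counts track the mean-variance behavior of your actual data.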
