Pro Pydeseq2
Streamline your workflow with this differential, gene, expression, analysis. Includes structured workflows, validation checks, and reusable patterns for scientific.
Pro PyDESeq2
Perform differential gene expression analysis using PyDESeq2, the Python implementation of the DESeq2 statistical framework. This skill covers count data normalization, single-factor and multi-factor experimental designs, result visualization, and gene set enrichment workflows for RNA-seq data.
When to Use This Skill
Choose Pro PyDESeq2 when you need to:
- Identify differentially expressed genes from bulk RNA-seq count matrices
- Perform multi-factor experimental design analysis (e.g., treatment × time)
- Generate MA plots, volcano plots, and heatmaps of expression changes
- Integrate DEG results with pathway and gene ontology enrichment analysis
Consider alternatives when:
- You need single-cell differential expression (use scanpy or scvi-tools)
- You need the original R DESeq2 with full feature parity (use R/Bioconductor)
- You need splice variant-level analysis (use DEXSeq or rMATS)
Quick Start
pip install pydeseq2 pandas matplotlib
import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # Load count matrix and sample metadata counts = pd.read_csv("counts.csv", index_col=0) # genes × samples metadata = pd.read_csv("metadata.csv", index_col=0) # samples × conditions # Create DESeq dataset dds = DeseqDataSet( counts=counts.T, # PyDESeq2 expects samples × genes metadata=metadata, design_factors="condition" ) # Run DESeq2 pipeline dds.deseq2() # Extract results for a specific contrast stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"]) stat_res.summary() # Get results as DataFrame results = stat_res.results_df sig_genes = results[ (results["padj"] < 0.05) & (abs(results["log2FoldChange"]) > 1) ] print(f"Significant DEGs: {len(sig_genes)}") print(sig_genes.sort_values("padj").head(10))
Core Concepts
Analysis Workflow
| Step | Function | Output |
|---|---|---|
| Data loading | DeseqDataSet() | Structured count + metadata object |
| Size factor estimation | dds.fit_size_factors() | Normalized library sizes |
| Dispersion estimation | dds.fit_genewise_dispersions() | Per-gene variance estimates |
| Statistical testing | DeseqStats() | Wald test p-values |
| Multiple testing | stat_res.summary() | BH-adjusted p-values |
| Result extraction | stat_res.results_df | Gene-level results table |
Multi-Factor Design
from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # Multi-factor design: treatment + batch metadata = pd.DataFrame({ "treatment": ["control", "control", "treated", "treated", "control", "control", "treated", "treated"], "batch": ["A", "A", "A", "A", "B", "B", "B", "B"], }, index=counts.columns) # Include batch as covariate dds = DeseqDataSet( counts=counts.T, metadata=metadata, design_factors=["batch", "treatment"] ) dds.deseq2() # Test treatment effect while controlling for batch stat_res = DeseqStats( dds, contrast=["treatment", "treated", "control"] ) stat_res.summary() results = stat_res.results_df
Visualization
import matplotlib.pyplot as plt import numpy as np def volcano_plot(results, padj_cutoff=0.05, lfc_cutoff=1.0): """Create a publication-quality volcano plot.""" fig, ax = plt.subplots(figsize=(8, 6)) # Classify genes sig_up = results[(results["padj"] < padj_cutoff) & (results["log2FoldChange"] > lfc_cutoff)] sig_down = results[(results["padj"] < padj_cutoff) & (results["log2FoldChange"] < -lfc_cutoff)] ns = results[~results.index.isin(sig_up.index.union(sig_down.index))] # Plot ax.scatter(ns["log2FoldChange"], -np.log10(ns["padj"]), c="gray", alpha=0.3, s=5, label="NS") ax.scatter(sig_up["log2FoldChange"], -np.log10(sig_up["padj"]), c="red", alpha=0.6, s=10, label=f"Up ({len(sig_up)})") ax.scatter(sig_down["log2FoldChange"], -np.log10(sig_down["padj"]), c="blue", alpha=0.6, s=10, label=f"Down ({len(sig_down)})") ax.axhline(-np.log10(padj_cutoff), ls="--", c="gray", lw=0.5) ax.axvline(lfc_cutoff, ls="--", c="gray", lw=0.5) ax.axvline(-lfc_cutoff, ls="--", c="gray", lw=0.5) ax.set_xlabel("log₂ Fold Change") ax.set_ylabel("-log₁₀ adjusted p-value") ax.legend() fig.tight_layout() fig.savefig("volcano_plot.png", dpi=300) volcano_plot(results)
Configuration
| Parameter | Description | Default |
|---|---|---|
design_factors | Experimental design variables | Required |
contrast | Comparison groups [factor, numerator, denominator] | Required |
refit_cooks | Refit outliers flagged by Cook's distance | true |
alpha | Significance threshold for results summary | 0.05 |
lfc_threshold | Log2 fold change threshold for testing | 0 |
n_cpus | Number of CPU cores for parallelization | 1 |
Best Practices
-
Start with raw counts, not normalized data — PyDESeq2 performs its own normalization (median-of-ratios method). Providing TPM, FPKM, or pre-normalized data produces incorrect results. Use raw integer counts from featureCounts, HTSeq-count, or Salmon (with
tximport-style length correction). -
Filter lowly expressed genes before analysis — Remove genes with fewer than 10 total counts across all samples. Low-count genes have unreliable fold-change estimates and waste multiple testing corrections, reducing power to detect real DEGs.
-
Include batch effects in the design formula — If samples were processed in multiple batches (sequencing lanes, extraction dates), include batch as a covariate:
design_factors=["batch", "treatment"]. This removes batch variation from the treatment effect estimate. -
Use independent filtering on baseMean — Genes with very low mean expression across conditions have large fold changes by chance. PyDESeq2 applies independent filtering automatically, but verify that the baseMean cutoff is appropriate for your dataset.
-
Validate top hits with orthogonal methods — Confirm DEGs from the top of your list with qPCR, Western blot, or independent datasets. Statistical significance doesn't guarantee biological significance, and false positives at padj < 0.05 still occur at a 5% rate.
Common Issues
All adjusted p-values are NA or 1.0 — This indicates insufficient replication or extreme outliers. DESeq2 requires at least 2 replicates per condition (3+ recommended). Check for outlier samples using PCA and Cook's distance, and remove any samples with quality issues before re-running.
Count matrix dimension error — PyDESeq2 expects samples as rows and genes as columns (opposite of many bioinformatics tools). If you get dimension mismatch errors, transpose your count matrix with counts.T before creating the DeseqDataSet.
Fold changes seem too large for lowly expressed genes — Small counts produce extreme fold change estimates (e.g., 0 vs 3 reads → log2FC of infinity). PyDESeq2 applies shrinkage, but verify by filtering results on baseMean > 10. Ignore DEGs with very low baseMean regardless of their p-value or fold change.
Reviews
No reviews yet. Be the first to review this template!
Similar Templates
Full-Stack Code Reviewer
Comprehensive code review skill that checks for security vulnerabilities, performance issues, accessibility, and best practices across frontend and backend code.
Test Suite Generator
Generates comprehensive test suites with unit tests, integration tests, and edge cases. Supports Jest, Vitest, Pytest, and Go testing.
Pro Architecture Workspace
Battle-tested skill for architectural, decision, making, framework. Includes structured workflows, validation checks, and reusable patterns for development.