Pro PyDESeq2

Perform differential gene expression analysis using PyDESeq2, the Python implementation of the DESeq2 statistical framework. This skill covers count data normalization, single-factor and multi-factor experimental designs, result visualization, and gene set enrichment workflows for RNA-seq data.

When to Use This Skill

Choose Pro PyDESeq2 when you need to:

Identify differentially expressed genes from bulk RNA-seq count matrices
Perform multi-factor experimental design analysis (e.g., treatment × time)
Generate MA plots, volcano plots, and heatmaps of expression changes
Integrate DEG results with pathway and gene ontology enrichment analysis

Consider alternatives when:

You need single-cell differential expression (use scanpy or scvi-tools)
You need the original R DESeq2 with full feature parity (use R/Bioconductor)
You need splice variant-level analysis (use DEXSeq or rMATS)

Quick Start


pip install pydeseq2 pandas matplotlib


import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Load count matrix and sample metadata
counts = pd.read_csv("counts.csv", index_col=0)  # genes × samples
metadata = pd.read_csv("metadata.csv", index_col=0)  # samples × conditions

# Create DESeq dataset
dds = DeseqDataSet(
    counts=counts.T,  # PyDESeq2 expects samples × genes
    metadata=metadata,
    design_factors="condition"
)

# Run DESeq2 pipeline
dds.deseq2()

# Extract results for a specific contrast
stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"])
stat_res.summary()

# Get results as DataFrame
results = stat_res.results_df
sig_genes = results[
    (results["padj"] < 0.05) & (abs(results["log2FoldChange"]) > 1)
]
print(f"Significant DEGs: {len(sig_genes)}")
print(sig_genes.sort_values("padj").head(10))

Core Concepts

Analysis Workflow

Step	Function	Output
Data loading	`DeseqDataSet()`	Structured count + metadata object
Size factor estimation	`dds.fit_size_factors()`	Normalized library sizes
Dispersion estimation	`dds.fit_genewise_dispersions()`	Per-gene variance estimates
Statistical testing	`DeseqStats()`	Wald test p-values
Multiple testing	`stat_res.summary()`	BH-adjusted p-values
Result extraction	`stat_res.results_df`	Gene-level results table

Multi-Factor Design


from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Multi-factor design: treatment + batch
metadata = pd.DataFrame({
    "treatment": ["control", "control", "treated", "treated",
                  "control", "control", "treated", "treated"],
    "batch": ["A", "A", "A", "A", "B", "B", "B", "B"],
}, index=counts.columns)

# Include batch as covariate
dds = DeseqDataSet(
    counts=counts.T,
    metadata=metadata,
    design_factors=["batch", "treatment"]
)
dds.deseq2()

# Test treatment effect while controlling for batch
stat_res = DeseqStats(
    dds, contrast=["treatment", "treated", "control"]
)
stat_res.summary()
results = stat_res.results_df

Visualization


import matplotlib.pyplot as plt
import numpy as np

def volcano_plot(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Create a publication-quality volcano plot."""
    fig, ax = plt.subplots(figsize=(8, 6))

    # Classify genes
    sig_up = results[(results["padj"] < padj_cutoff) &
                     (results["log2FoldChange"] > lfc_cutoff)]
    sig_down = results[(results["padj"] < padj_cutoff) &
                       (results["log2FoldChange"] < -lfc_cutoff)]
    ns = results[~results.index.isin(sig_up.index.union(sig_down.index))]

    # Plot
    ax.scatter(ns["log2FoldChange"], -np.log10(ns["padj"]),
               c="gray", alpha=0.3, s=5, label="NS")
    ax.scatter(sig_up["log2FoldChange"], -np.log10(sig_up["padj"]),
               c="red", alpha=0.6, s=10, label=f"Up ({len(sig_up)})")
    ax.scatter(sig_down["log2FoldChange"], -np.log10(sig_down["padj"]),
               c="blue", alpha=0.6, s=10, label=f"Down ({len(sig_down)})")

    ax.axhline(-np.log10(padj_cutoff), ls="--", c="gray", lw=0.5)
    ax.axvline(lfc_cutoff, ls="--", c="gray", lw=0.5)
    ax.axvline(-lfc_cutoff, ls="--", c="gray", lw=0.5)

    ax.set_xlabel("log₂ Fold Change")
    ax.set_ylabel("-log₁₀ adjusted p-value")
    ax.legend()
    fig.tight_layout()
    fig.savefig("volcano_plot.png", dpi=300)

volcano_plot(results)

Configuration

Parameter	Description	Default
`design_factors`	Experimental design variables	Required
`contrast`	Comparison groups [factor, numerator, denominator]	Required
`refit_cooks`	Refit outliers flagged by Cook's distance	`true`
`alpha`	Significance threshold for results summary	`0.05`
`lfc_threshold`	Log2 fold change threshold for testing	`0`
`n_cpus`	Number of CPU cores for parallelization	`1`

Best Practices

Start with raw counts, not normalized data — PyDESeq2 performs its own normalization (median-of-ratios method). Providing TPM, FPKM, or pre-normalized data produces incorrect results. Use raw integer counts from featureCounts, HTSeq-count, or Salmon (with tximport-style length correction).
Filter lowly expressed genes before analysis — Remove genes with fewer than 10 total counts across all samples. Low-count genes have unreliable fold-change estimates and waste multiple testing corrections, reducing power to detect real DEGs.
Include batch effects in the design formula — If samples were processed in multiple batches (sequencing lanes, extraction dates), include batch as a covariate: design_factors=["batch", "treatment"]. This removes batch variation from the treatment effect estimate.
Use independent filtering on baseMean — Genes with very low mean expression across conditions have large fold changes by chance. PyDESeq2 applies independent filtering automatically, but verify that the baseMean cutoff is appropriate for your dataset.
Validate top hits with orthogonal methods — Confirm DEGs from the top of your list with qPCR, Western blot, or independent datasets. Statistical significance doesn't guarantee biological significance, and false positives at padj < 0.05 still occur at a 5% rate.

Common Issues

All adjusted p-values are NA or 1.0 — This indicates insufficient replication or extreme outliers. DESeq2 requires at least 2 replicates per condition (3+ recommended). Check for outlier samples using PCA and Cook's distance, and remove any samples with quality issues before re-running.

Count matrix dimension error — PyDESeq2 expects samples as rows and genes as columns (opposite of many bioinformatics tools). If you get dimension mismatch errors, transpose your count matrix with counts.T before creating the DeseqDataSet.

Fold changes seem too large for lowly expressed genes — Small counts produce extreme fold change estimates (e.g., 0 vs 3 reads → log2FC of infinity). PyDESeq2 applies shrinkage, but verify by filtering results on baseMean > 10. Ignore DEGs with very low baseMean regardless of their p-value or fold change.

⚠️ Loading Issue

Pro Pydeseq2

Pro PyDESeq2

When to Use This Skill

Quick Start

Core Concepts

Analysis Workflow

Multi-Factor Design

Visualization

Configuration

Best Practices

Common Issues

Reviews

Write a review

Similar Templates

Full-Stack Code Reviewer

Test Suite Generator

Pro Architecture Workspace