P

Pro Pydeseq2

Streamline your workflow with this differential, gene, expression, analysis. Includes structured workflows, validation checks, and reusable patterns for scientific.

SkillClipticsscientificv1.0.0MIT
0 views0 copies

Pro PyDESeq2

Perform differential gene expression analysis using PyDESeq2, the Python implementation of the DESeq2 statistical framework. This skill covers count data normalization, single-factor and multi-factor experimental designs, result visualization, and gene set enrichment workflows for RNA-seq data.

When to Use This Skill

Choose Pro PyDESeq2 when you need to:

  • Identify differentially expressed genes from bulk RNA-seq count matrices
  • Perform multi-factor experimental design analysis (e.g., treatment × time)
  • Generate MA plots, volcano plots, and heatmaps of expression changes
  • Integrate DEG results with pathway and gene ontology enrichment analysis

Consider alternatives when:

  • You need single-cell differential expression (use scanpy or scvi-tools)
  • You need the original R DESeq2 with full feature parity (use R/Bioconductor)
  • You need splice variant-level analysis (use DEXSeq or rMATS)

Quick Start

pip install pydeseq2 pandas matplotlib
import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # Load count matrix and sample metadata counts = pd.read_csv("counts.csv", index_col=0) # genes × samples metadata = pd.read_csv("metadata.csv", index_col=0) # samples × conditions # Create DESeq dataset dds = DeseqDataSet( counts=counts.T, # PyDESeq2 expects samples × genes metadata=metadata, design_factors="condition" ) # Run DESeq2 pipeline dds.deseq2() # Extract results for a specific contrast stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"]) stat_res.summary() # Get results as DataFrame results = stat_res.results_df sig_genes = results[ (results["padj"] < 0.05) & (abs(results["log2FoldChange"]) > 1) ] print(f"Significant DEGs: {len(sig_genes)}") print(sig_genes.sort_values("padj").head(10))

Core Concepts

Analysis Workflow

StepFunctionOutput
Data loadingDeseqDataSet()Structured count + metadata object
Size factor estimationdds.fit_size_factors()Normalized library sizes
Dispersion estimationdds.fit_genewise_dispersions()Per-gene variance estimates
Statistical testingDeseqStats()Wald test p-values
Multiple testingstat_res.summary()BH-adjusted p-values
Result extractionstat_res.results_dfGene-level results table

Multi-Factor Design

from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # Multi-factor design: treatment + batch metadata = pd.DataFrame({ "treatment": ["control", "control", "treated", "treated", "control", "control", "treated", "treated"], "batch": ["A", "A", "A", "A", "B", "B", "B", "B"], }, index=counts.columns) # Include batch as covariate dds = DeseqDataSet( counts=counts.T, metadata=metadata, design_factors=["batch", "treatment"] ) dds.deseq2() # Test treatment effect while controlling for batch stat_res = DeseqStats( dds, contrast=["treatment", "treated", "control"] ) stat_res.summary() results = stat_res.results_df

Visualization

import matplotlib.pyplot as plt import numpy as np def volcano_plot(results, padj_cutoff=0.05, lfc_cutoff=1.0): """Create a publication-quality volcano plot.""" fig, ax = plt.subplots(figsize=(8, 6)) # Classify genes sig_up = results[(results["padj"] < padj_cutoff) & (results["log2FoldChange"] > lfc_cutoff)] sig_down = results[(results["padj"] < padj_cutoff) & (results["log2FoldChange"] < -lfc_cutoff)] ns = results[~results.index.isin(sig_up.index.union(sig_down.index))] # Plot ax.scatter(ns["log2FoldChange"], -np.log10(ns["padj"]), c="gray", alpha=0.3, s=5, label="NS") ax.scatter(sig_up["log2FoldChange"], -np.log10(sig_up["padj"]), c="red", alpha=0.6, s=10, label=f"Up ({len(sig_up)})") ax.scatter(sig_down["log2FoldChange"], -np.log10(sig_down["padj"]), c="blue", alpha=0.6, s=10, label=f"Down ({len(sig_down)})") ax.axhline(-np.log10(padj_cutoff), ls="--", c="gray", lw=0.5) ax.axvline(lfc_cutoff, ls="--", c="gray", lw=0.5) ax.axvline(-lfc_cutoff, ls="--", c="gray", lw=0.5) ax.set_xlabel("log₂ Fold Change") ax.set_ylabel("-log₁₀ adjusted p-value") ax.legend() fig.tight_layout() fig.savefig("volcano_plot.png", dpi=300) volcano_plot(results)

Configuration

ParameterDescriptionDefault
design_factorsExperimental design variablesRequired
contrastComparison groups [factor, numerator, denominator]Required
refit_cooksRefit outliers flagged by Cook's distancetrue
alphaSignificance threshold for results summary0.05
lfc_thresholdLog2 fold change threshold for testing0
n_cpusNumber of CPU cores for parallelization1

Best Practices

  1. Start with raw counts, not normalized data — PyDESeq2 performs its own normalization (median-of-ratios method). Providing TPM, FPKM, or pre-normalized data produces incorrect results. Use raw integer counts from featureCounts, HTSeq-count, or Salmon (with tximport-style length correction).

  2. Filter lowly expressed genes before analysis — Remove genes with fewer than 10 total counts across all samples. Low-count genes have unreliable fold-change estimates and waste multiple testing corrections, reducing power to detect real DEGs.

  3. Include batch effects in the design formula — If samples were processed in multiple batches (sequencing lanes, extraction dates), include batch as a covariate: design_factors=["batch", "treatment"]. This removes batch variation from the treatment effect estimate.

  4. Use independent filtering on baseMean — Genes with very low mean expression across conditions have large fold changes by chance. PyDESeq2 applies independent filtering automatically, but verify that the baseMean cutoff is appropriate for your dataset.

  5. Validate top hits with orthogonal methods — Confirm DEGs from the top of your list with qPCR, Western blot, or independent datasets. Statistical significance doesn't guarantee biological significance, and false positives at padj < 0.05 still occur at a 5% rate.

Common Issues

All adjusted p-values are NA or 1.0 — This indicates insufficient replication or extreme outliers. DESeq2 requires at least 2 replicates per condition (3+ recommended). Check for outlier samples using PCA and Cook's distance, and remove any samples with quality issues before re-running.

Count matrix dimension error — PyDESeq2 expects samples as rows and genes as columns (opposite of many bioinformatics tools). If you get dimension mismatch errors, transpose your count matrix with counts.T before creating the DeseqDataSet.

Fold changes seem too large for lowly expressed genes — Small counts produce extreme fold change estimates (e.g., 0 vs 3 reads → log2FC of infinity). PyDESeq2 applies shrinkage, but verify by filtering results on baseMean > 10. Ignore DEGs with very low baseMean regardless of their p-value or fold change.

Community

Reviews

Write a review

No reviews yet. Be the first to review this template!

Similar Templates