Scanpy Elite

Analyze single-cell RNA sequencing data using Scanpy, Python's leading single-cell analysis toolkit. This skill covers data loading, quality control, normalization, dimensionality reduction, clustering, differential expression, and trajectory analysis for scRNA-seq experiments.

When to Use This Skill

Choose Scanpy Elite when you need to:

Process and analyze scRNA-seq data from 10x Genomics, Smart-seq2, or other platforms
Perform cell clustering, marker gene identification, and cell type annotation
Run trajectory and pseudotime analysis on developmental or differentiation data
Integrate multiple single-cell datasets for batch correction and atlas building

Consider alternatives when:

You need deep generative models for scRNA-seq (use scvi-tools)
You need spatial transcriptomics analysis (use Squidpy or SpatialDE)
You need R-based analysis (use Seurat)

Quick Start


pip install scanpy leidenalg


import scanpy as sc

# Load 10x Genomics data
adata = sc.read_10x_mtx("filtered_gene_bc_matrices/hg19/")
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

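# Quality control
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.var["mt"] = adata.var_names.str.startswith("MT-")  # human mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[adata.obs.pct_counts_mt < 20].copy()  # copy so later steps do not act on a view

# Standard preprocessing
adata.layers["counts"] = adata.X.copy()  # keep the raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # freeze log-normalized values for all genes (used for marker testing)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()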

# Dimensionality reduction and clustering
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

# Visualize
sc.pl.umap(adata, color="leiden", save="_clusters.png")

Core Concepts

Analysis Pipeline

Step	Function	Purpose
QC filtering	`sc.pp.filter_cells/genes`	Remove low-quality cells/genes
Normalization	`sc.pp.normalize_total`	Library size normalization
Log transform	`sc.pp.log1p`	Stabilize variance
HVG selection	`sc.pp.highly_variable_genes`	Feature selection
Scaling	`sc.pp.scale`	Zero-center, clip outliers
PCA	`sc.tl.pca`	Linear dimensionality reduction
Neighbors	`sc.pp.neighbors`	Build kNN graph
Clustering	`sc.tl.leiden`	Community detection
UMAP	`sc.tl.umap`	2D visualization
DEG analysis	`sc.tl.rank_genes_groups`	Marker gene identification
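
To make the normalization and log-transform rows concrete, here is a minimal NumPy sketch of the arithmetic those two steps perform on a small dense count matrix. It is an illustration under simplifying assumptions, not Scanpy's implementation: the helper name `normalize_and_log` is hypothetical, and real data is usually stored as a sparse matrix.

```python
import numpy as np

def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# Two toy cells with the same composition but a 10x difference in sequencing depth
counts = np.array([[1, 3, 0],
                   [10, 30, 0]])
logged = normalize_and_log(counts)
```

After normalization the two rows are identical, which is the point of library size normalization: differences in sequencing depth no longer look like differences in expression.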

Marker Gene Analysis and Cell Type Annotation


import scanpy as sc

# Find marker genes for each cluster.
# rank_genes_groups expects log-normalized values and reads adata.raw when it is set.
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

# View top markers per cluster
sc.pl.rank_genes_groups(adata, n_genes=10, save="_markers.png")

# Get results for all clusters as one DataFrame
markers = sc.get.rank_genes_groups_df(adata, group=None)
top_markers = markers.groupby("group", observed=True).head(5)
print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]])

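# Manual annotation based on known markers.
# The mapping below is illustrative (PBMC-like data); cluster numbers differ between runs,
# so check each cluster's markers first. Clusters missing from the dict become NaN.
cluster_annotations = {
    "0": "CD4 T cells",       # CD3D, IL7R
    "1": "CD14 Monocytes",    # CD14, LYZ
    "2": "B cells",           # CD79A, MS4A1
    "3": "CD8 T cells",       # CD8A, GZMB
    "4": "NK cells",          # NKG7, GNLY
    "5": "Dendritic cells",   # FCER1A, CST3
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_annotations)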

sc.pl.umap(adata, color="cell_type", save="_celltypes.png")
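
As a complement to manual annotation, the sketch below scores whole marker panels per cluster instead of relying on a single gene. It assumes you already have a clusters-by-genes table of mean log-normalized expression (here a toy `mean_expr` with made-up values); the helper `score_marker_panels` and the panel names are hypothetical and only illustrate the idea.

```python
import pandas as pd

def score_marker_panels(mean_expr, panels):
    """Return a clusters x cell-types table holding the mean expression of each panel."""
    scores = {}
    for cell_type, genes in panels.items():
        present = [g for g in genes if g in mean_expr.columns]
        # A panel with no measured genes gets NaN rather than a misleading zero
        scores[cell_type] = mean_expr[present].mean(axis=1) if present else float("nan")
    return pd.DataFrame(scores, index=mean_expr.index)

# Toy per-cluster mean expression (values invented for illustration)
mean_expr = pd.DataFrame(
    {"CD3D": [2.1, 0.1], "IL7R": [1.8, 0.2], "CD14": [0.1, 2.5], "LYZ": [0.3, 3.0]},
    index=["0", "1"],
)
panels = {"T cells": ["CD3D", "IL7R"], "Monocytes": ["CD14", "LYZ"]}

scores = score_marker_panels(mean_expr, panels)
best = scores.idxmax(axis=1)  # highest-scoring panel per cluster
```

Treat `best` as a suggestion to review, not a final label. A visual check such as `sc.pl.dotplot(adata, panels, groupby="leiden")` shows whether every gene in a panel supports the call.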

Dataset Integration


import anndata as ad
import scanpy as sc

def integrate_datasets(adata_list, batch_key="batch"):
    """Integrate multiple scRNA-seq datasets with batch correction.

    Assumes each AnnData holds raw counts in .X and that harmonypy is installed.
    """
    # Concatenate (shared genes only); index_unique keeps cell barcodes unique across datasets
    adata = ad.concat(adata_list, label=batch_key, index_unique="-")

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, batch_key=batch_key, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()

    # Batch correction with Harmony (writes adata.obsm["X_pca_harmony"])
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata)
    sc.external.pp.harmony_integrate(adata, key=batch_key)

    # Use corrected embeddings for downstream analysis
    sc.pp.neighbors(adata, use_rep="X_pca_harmony")
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=0.5)

    return adata

# Integrate two datasets
adata1 = sc.read_h5ad("sample1.h5ad")
adata2 = sc.read_h5ad("sample2.h5ad")
integrated = integrate_datasets([adata1, adata2])
sc.pl.umap(integrated, color=["batch", "leiden"], save="_integrated.png")

Configuration

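Parameter	Description	Typical value
`min_genes`	Minimum genes detected per cell (`sc.pp.filter_cells`)	`200`
`min_cells`	Minimum cells expressing a gene (`sc.pp.filter_genes`)	`3`
`n_top_genes`	Number of highly variable genes (`sc.pp.highly_variable_genes`)	`2000`
`n_neighbors`	k for kNN graph construction (`sc.pp.neighbors`)	`15`
`n_pcs`	Number of principal components used for the kNN graph (`sc.pp.neighbors`)	`50`
`resolution`	Leiden clustering resolution (`sc.tl.leiden`)	`1.0` (the Quick Start uses `0.5`)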

Best Practices

Always perform quality control before analysis: Filter out cells with too few detected genes (<200), too many detected genes (>5000, potential doublets), or a high mitochondrial percentage (>20%, often a sign of dying cells). These cutoffs are starting points; plot the QC distributions to choose thresholds appropriate for your dataset.
Keep the raw counts: Before normalization, store the raw counts in a layer with adata.layers["counts"] = adata.X.copy(). After log-normalization, and before HVG selection and scaling, set adata.raw = adata to freeze log-normalized values for all genes. sc.tl.rank_genes_groups expects log-normalized data and reads adata.raw when it is set, while count-based models (for example scvi-tools) need the raw counts layer.
Choose resolution based on biological expectation: Leiden resolution controls cluster granularity. Higher resolution (1.0-2.0) produces more clusters. Start with 0.5-1.0 and adjust based on whether clusters have distinct marker genes. Over-clustering splits real populations; under-clustering merges distinct types.
Use multiple marker genes for cell type annotation: Never annotate cell types based on a single marker gene. Use panels of 3-5 known markers per cell type and verify with dot plots showing expression across clusters. Ambiguous clusters may represent transitional states or doublets.
Apply batch correction when integrating datasets: Different samples, donors, or sequencing runs introduce batch effects that create artificial separation. Use Harmony, Scanorama, or BBKNN for integration. Always verify that batch-corrected clusters contain cells from multiple batches.
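
The QC thresholds from the first practice can be sketched as a single boolean mask. This is a minimal example, assuming an `obs` table with the `n_genes_by_counts` and `pct_counts_mt` columns that `sc.pp.calculate_qc_metrics` writes when `qc_vars=["mt"]`; the helper `qc_mask` and the toy values are hypothetical.

```python
import pandas as pd

def qc_mask(obs, min_genes=200, max_genes=5000, max_pct_mt=20.0):
    """Boolean mask of cells that pass all three QC thresholds."""
    n_genes = obs["n_genes_by_counts"]
    return (n_genes >= min_genes) & (n_genes <= max_genes) & (obs["pct_counts_mt"] < max_pct_mt)

# Toy per-cell QC metrics (values invented for illustration)
obs = pd.DataFrame(
    {
        "n_genes_by_counts": [150, 1200, 6400, 2300],
        "pct_counts_mt": [3.0, 4.5, 2.0, 35.0],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)
keep = qc_mask(obs)  # only cell_b passes every filter
```

With a real dataset the same mask would be applied as `adata = adata[qc_mask(adata.obs)].copy()`. The default cutoffs mirror the values quoted above and should be tuned per dataset.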

Common Issues

Clusters driven by batch effects instead of biology: Plot UMAP colored by batch/sample to check. If clusters segregate by batch, apply batch correction. If correction over-merges distinct cell types, reduce correction strength or use a different method.

Too few or too many clusters: Adjust the Leiden resolution parameter. Also check that PCA captures enough variance (plot the explained variance ratio) and that n_neighbors suits your dataset size (larger datasets can often support a larger value).

Marker genes not specific to clusters: Low-specificity markers (expressed in many clusters) indicate over-clustering or a biological continuum. Use sc.tl.dendrogram to identify which clusters are most similar, merge those clusters, then re-annotate.
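
A numeric check can back up the visual inspection. The sketch below tabulates the batch composition of each cluster from an `obs`-like table and flags clusters dominated by one batch. The helper names, the toy labels, and the 0.9 cutoff are assumptions chosen for illustration, not Scanpy functions or recommended defaults.

```python
import pandas as pd

def batch_composition(obs, cluster_key="leiden", batch_key="batch"):
    """Fraction of each cluster's cells that come from each batch (rows sum to 1)."""
    return pd.crosstab(obs[cluster_key], obs[batch_key], normalize="index")

def batch_dominated(composition, max_fraction=0.9):
    """Clusters where a single batch contributes more than `max_fraction` of the cells."""
    return composition.index[composition.max(axis=1) > max_fraction].tolist()

# Toy labels: cluster "0" is mixed, cluster "1" comes entirely from batch "a"
obs = pd.DataFrame(
    {
        "leiden": ["0", "0", "0", "0", "1", "1", "1", "1"],
        "batch": ["a", "b", "a", "b", "a", "a", "a", "a"],
    }
)
composition = batch_composition(obs)
suspect = batch_dominated(composition)
```

A flagged cluster is not automatically an artifact: a cell type that is genuinely present in only one sample will also show up here, so check its marker genes before deciding.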
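
One rough way to read the explained variance ratio is to count how many leading components are needed to reach a chosen share of the variance captured by the computed PCs. After `sc.tl.pca`, the ratios are stored in `adata.uns["pca"]["variance_ratio"]`, and `sc.pl.pca_variance_ratio(adata)` plots them. The helper below and its 0.8 threshold are hypothetical; inspecting the elbow of the plot remains a reasonable alternative.

```python
import numpy as np

def n_pcs_for_variance(variance_ratio, threshold=0.8):
    """Smallest number of leading PCs whose share of the computed variance reaches `threshold`."""
    ratios = np.asarray(variance_ratio, dtype=float)
    cumulative = np.cumsum(ratios) / ratios.sum()
    # searchsorted returns the first index where cumulative >= threshold
    return min(int(np.searchsorted(cumulative, threshold)) + 1, ratios.size)

# Toy ratios (invented); with real data pass adata.uns["pca"]["variance_ratio"]
variance_ratio = [0.4, 0.3, 0.15, 0.1, 0.05]
n_pcs = n_pcs_for_variance(variance_ratio)
```

The result can then be passed to `sc.pp.neighbors(adata, n_pcs=n_pcs)`. Note that the share is relative to the components that were computed, not to the total variance in the data.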
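
Once similar clusters have been identified, merging them is a relabeling step. The sketch below folds selected clusters into a target cluster on a pandas Series of labels; the helper `merge_clusters` and the example mapping are hypothetical.

```python
import pandas as pd

def merge_clusters(labels, merge_map):
    """Relabel clusters: each key in `merge_map` is folded into its target cluster."""
    merged = labels.astype(str).map(lambda c: merge_map.get(c, c))
    return merged.astype("category")

# Toy cluster labels; cluster "3" is judged to be a split of cluster "1"
leiden = pd.Series(["0", "1", "2", "3", "3", "1"], dtype="category")
merged = merge_clusters(leiden, {"3": "1"})
```

With an AnnData object, the result could be stored as `adata.obs["leiden_merged"] = merge_clusters(adata.obs["leiden"], {"3": "1"})`, after which `sc.tl.rank_genes_groups` can be rerun with `groupby="leiden_merged"` to check that the merged cluster has cleaner markers.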
