Scanpy Elite
All-in-one skill covering single-cell analysis and data loading. Includes structured workflows, validation checks, and reusable patterns for scientific analysis.
Analyze single-cell RNA sequencing data using Scanpy, Python's leading single-cell analysis toolkit. This skill covers data loading, quality control, normalization, dimensionality reduction, clustering, differential expression, and trajectory analysis for scRNA-seq experiments.
When to Use This Skill
Choose Scanpy Elite when you need to:
- Process and analyze scRNA-seq data from 10x Genomics, Smart-seq2, or other platforms
- Perform cell clustering, marker gene identification, and cell type annotation
- Run trajectory and pseudotime analysis on developmental or differentiation data
- Integrate multiple single-cell datasets for batch correction and atlas building
Consider alternatives when:
- You need deep generative models for scRNA-seq (use scvi-tools)
- You need spatial transcriptomics analysis (use Squidpy or SpatialDE)
- You need R-based analysis (use Seurat)
Quick Start
pip install scanpy leidenalg
    import scanpy as sc

    # Load 10x Genomics data
    adata = sc.read_10x_mtx("filtered_gene_bc_matrices/hg19/")
    print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

    # Quality control
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
    # .copy() turns the filtered view into an independent AnnData object
    adata = adata[adata.obs.pct_counts_mt < 20].copy()

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()

    # Dimensionality reduction and clustering
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=0.5)

    # Visualize
    sc.pl.umap(adata, color="leiden", save="_clusters.png")
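The Quick Start filters on a minimum gene count and on mitochondrial percentage. A minimal sketch of how an upper gene bound for likely doublets could be combined with those two checks is shown here. The function name `qc_pass_mask` and the `max_genes` default of 5000 are assumptions for illustration and not part of Scanpy; the thresholds should be chosen from your own QC distributions.

```python
from __future__ import annotations

from typing import Sequence


def qc_pass_mask(
    n_genes: Sequence[float],
    pct_mt: Sequence[float],
    min_genes: int = 200,
    max_genes: int = 5000,  # assumed upper bound for likely doublets
    max_pct_mt: float = 20.0,
) -> list[bool]:
    """Return True for each cell that passes all three QC thresholds."""
    if len(n_genes) != len(pct_mt):
        raise ValueError("n_genes and pct_mt must have the same length")
    return [
        min_genes <= g <= max_genes and m < max_pct_mt
        for g, m in zip(n_genes, pct_mt)
    ]


# Toy per-cell metrics: genes detected and percent mitochondrial counts
mask = qc_pass_mask([150, 1200, 6500, 900], [5.0, 3.2, 4.1, 35.0])
print(mask)  # [False, True, False, False]
```

With real data, the `n_genes_by_counts` and `pct_counts_mt` columns that `sc.pp.calculate_qc_metrics` writes to `adata.obs` can be passed in, and the result converted to a NumPy boolean array before indexing the AnnData object.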
Core Concepts
Analysis Pipeline
| Step | Function | Purpose |
|---|---|---|
| QC filtering | sc.pp.filter_cells/genes | Remove low-quality cells/genes |
| Normalization | sc.pp.normalize_total | Library size normalization |
| Log transform | sc.pp.log1p | Stabilize variance |
| HVG selection | sc.pp.highly_variable_genes | Feature selection |
| Scaling | sc.pp.scale | Zero-center, clip outliers |
| PCA | sc.tl.pca | Linear dimensionality reduction |
| Neighbors | sc.pp.neighbors | Build kNN graph |
| Clustering | sc.tl.leiden | Community detection |
| UMAP | sc.tl.umap | 2D visualization |
| DEG analysis | sc.tl.rank_genes_groups | Marker gene identification |
Marker Gene Analysis and Cell Type Annotation
    import scanpy as sc

    # Find marker genes for each cluster
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

    # View top markers per cluster
    sc.pl.rank_genes_groups(adata, n_genes=10, save="_markers.png")

    # Get results as DataFrame
    markers = sc.get.rank_genes_groups_df(adata, group=None)
    top_markers = markers.groupby("group").head(5)
    print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]])

    # Manual annotation based on known markers
    cluster_annotations = {
        "0": "CD4 T cells",      # CD3D, IL7R
        "1": "CD14 Monocytes",   # CD14, LYZ
        "2": "B cells",          # CD79A, MS4A1
        "3": "CD8 T cells",      # CD8A, GZMB
        "4": "NK cells",         # NKG7, GNLY
        "5": "Dendritic cells",  # FCER1A, CST3
    }
    adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_annotations)
    sc.pl.umap(adata, color="cell_type", save="_celltypes.png")
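Mapping cluster IDs through a dictionary leaves any cluster that is missing from the dictionary as a missing value in `cell_type`. The sketch below, which uses a hypothetical helper name `unannotated_clusters`, lists the cluster IDs that still need a label before the mapping is applied.

```python
from __future__ import annotations

from typing import Iterable, Mapping


def unannotated_clusters(
    cluster_labels: Iterable[str], annotations: Mapping[str, str]
) -> list[str]:
    """Return cluster IDs present in the data but missing from the mapping."""
    missing = {str(label) for label in cluster_labels} - set(annotations)
    # Numeric IDs (as Leiden produces) sort by value; other labels sort after them
    return sorted(
        missing,
        key=lambda c: (not c.isdecimal(), int(c) if c.isdecimal() else 0, c),
    )


# Toy example: eight clusters in the data, six of them annotated
annotations = {
    "0": "CD4 T cells",
    "1": "CD14 Monocytes",
    "2": "B cells",
    "3": "CD8 T cells",
    "4": "NK cells",
    "5": "Dendritic cells",
}
labels = ["0", "1", "2", "3", "4", "5", "6", "7", "7", "0"]
print(unannotated_clusters(labels, annotations))  # ['6', '7']
```

In practice the first argument would be `adata.obs["leiden"]`; an empty result means every cluster received an annotation.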
Dataset Integration
    import scanpy as sc


    def integrate_datasets(adata_list, batch_key="batch"):
        """Integrate multiple scRNA-seq datasets with batch correction."""
        # Concatenate; the dataset of origin is recorded in adata.obs[batch_key]
        adata = sc.concat(adata_list, label=batch_key)

        # Standard preprocessing
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        sc.pp.highly_variable_genes(adata, batch_key=batch_key, n_top_genes=2000)
        # .copy() turns the subset view into an independent AnnData object
        adata = adata[:, adata.var.highly_variable].copy()

        # Batch correction with Harmony (requires the harmonypy package)
        sc.pp.scale(adata, max_value=10)
        sc.tl.pca(adata)
        sc.external.pp.harmony_integrate(adata, key=batch_key)

        # Use corrected embeddings for downstream analysis
        sc.pp.neighbors(adata, use_rep="X_pca_harmony")
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=0.5)
        return adata


    # Integrate two datasets
    adata1 = sc.read_h5ad("sample1.h5ad")
    adata2 = sc.read_h5ad("sample2.h5ad")
    integrated = integrate_datasets([adata1, adata2])
    sc.pl.umap(integrated, color=["batch", "leiden"], save="_integrated.png")
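After integration it helps to check numerically, not only on the UMAP, that clusters contain cells from more than one batch. The following is a rough sketch of such a check using only the standard library. The helper names `batch_fractions` and `single_batch_clusters` and the 0.9 cutoff are assumptions chosen for illustration.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from typing import Iterable


def batch_fractions(
    clusters: Iterable[str], batches: Iterable[str]
) -> dict[str, dict[str, float]]:
    """For each cluster, return the fraction of its cells that come from each batch."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        counts[cluster][batch] += 1
    return {
        cluster: {b: n / sum(ctr.values()) for b, n in ctr.items()}
        for cluster, ctr in counts.items()
    }


def single_batch_clusters(
    fractions: dict[str, dict[str, float]], max_fraction: float = 0.9
) -> list[str]:
    """List clusters where one batch contributes more than max_fraction of cells."""
    return sorted(c for c, f in fractions.items() if max(f.values()) > max_fraction)


# Toy example: cluster "0" is mixed, cluster "1" comes from batch "a" only
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
batches = ["a", "b", "a", "b", "a", "a", "a", "a"]
fractions = batch_fractions(clusters, batches)
print(fractions["0"])                    # {'a': 0.5, 'b': 0.5}
print(single_batch_clusters(fractions))  # ['1']
```

With the integrated object, the inputs would be `integrated.obs["leiden"]` and `integrated.obs["batch"]`. A flagged cluster is not necessarily an artifact, since a cell type can be genuinely present in only one sample, so flagged clusters are candidates for inspection and not automatic failures.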
Configuration
| Parameter | Description | Default |
|---|---|---|
| min_genes | Minimum genes per cell for QC | 200 |
| min_cells | Minimum cells per gene for QC | 3 |
| n_top_genes | Number of highly variable genes | 2000 |
| n_neighbors | k for kNN graph construction | 15 |
| n_pcs | Number of principal components | 50 |
| resolution | Leiden clustering resolution | 1.0 |
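One way to keep these parameters together is a small frozen dataclass that is validated once and then unpacked into the relevant calls. This is a sketch under assumptions: `PipelineConfig` and its helper methods are hypothetical names and not part of Scanpy, and the defaults simply mirror the table above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the parameters listed in the table."""

    min_genes: int = 200
    min_cells: int = 3
    n_top_genes: int = 2000
    n_neighbors: int = 15
    n_pcs: int = 50
    resolution: float = 1.0

    def __post_init__(self) -> None:
        # Reject values that cannot be valid for any dataset
        for name in ("min_genes", "min_cells", "n_top_genes", "n_neighbors", "n_pcs"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        if self.resolution <= 0:
            raise ValueError("resolution must be positive")

    def neighbors_kwargs(self) -> dict:
        """Keyword arguments intended for sc.pp.neighbors."""
        return {"n_neighbors": self.n_neighbors, "n_pcs": self.n_pcs}

    def leiden_kwargs(self) -> dict:
        """Keyword arguments intended for sc.tl.leiden."""
        return {"resolution": self.resolution}


cfg = PipelineConfig()
coarse = replace(cfg, resolution=0.5)  # same settings, coarser clustering
print(coarse.neighbors_kwargs())  # {'n_neighbors': 15, 'n_pcs': 50}
print(coarse.leiden_kwargs())     # {'resolution': 0.5}
```

The dictionaries would then be unpacked at the call site, for example `sc.pp.neighbors(adata, **cfg.neighbors_kwargs())`. Freezing the dataclass means each variant of a run is a separate, explicit object.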
Best Practices
- Always perform quality control before analysis: Filter cells with too few genes (<200), too many genes (>5000, potential doublets), or a high mitochondrial percentage (>20%, indicating dying cells). Plot QC distributions to choose thresholds appropriate for your dataset.
- Save the raw counts: Store raw counts with adata.layers["counts"] = adata.X.copy() before normalization. This preserves the original data for count-based differential expression methods (for example pseudobulk approaches), while normalized data is used for clustering and visualization. sc.tl.rank_genes_groups instead expects log-normalized values and reads adata.raw by default when it is set, so avoid placing unnormalized counts in adata.raw.
- Choose resolution based on biological expectation: Leiden resolution controls cluster granularity. Higher resolution (1.0-2.0) produces more clusters. Start with 0.5-1.0 and adjust based on whether clusters have distinct marker genes. Over-clustering splits real populations; under-clustering merges distinct types.
- Use multiple marker genes for cell type annotation: Never annotate cell types based on a single marker gene. Use panels of 3-5 known markers per cell type and verify with dot plots showing expression across clusters. Ambiguous clusters may represent transitional states or doublets.
- Apply batch correction when integrating datasets: Different samples, donors, or sequencing runs introduce batch effects that create artificial separation. Use Harmony, Scanorama, or BBKNN for integration. Always verify that batch-corrected clusters contain cells from multiple batches.
Common Issues
Clusters driven by batch effects instead of biology: Plot the UMAP colored by batch or sample to check. If clusters segregate by batch, apply batch correction. If correction over-merges distinct cell types, reduce the correction strength or use a different method.
Too few or too many clusters: Adjust the Leiden resolution parameter. Also check that PCA captures enough variance (plot the explained variance ratio) and that n_neighbors is appropriate for your dataset size (larger datasets can generally support more neighbors).
Marker genes not specific to clusters: Low-specificity markers (expressed in many clusters) indicate over-clustering or a biological continuum. Use sc.tl.dendrogram to identify which clusters are most similar, merge them, then re-annotate.
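For the explained-variance check, Scanpy stores the per-component ratios in `adata.uns["pca"]["variance_ratio"]` after `sc.tl.pca`, and `sc.pl.pca_variance_ratio` plots them. A minimal sketch of turning those ratios into a component count is given below. The function name `n_pcs_for_variance` and the 0.8 target are assumptions for illustration; there is no universally correct cutoff.

```python
from __future__ import annotations

from itertools import accumulate
from typing import Optional, Sequence


def n_pcs_for_variance(
    variance_ratio: Sequence[float], target: float = 0.8
) -> Optional[int]:
    """Return the smallest number of leading PCs whose cumulative ratio reaches target.

    Returns None when the computed components never reach the target.
    """
    for k, cumulative in enumerate(accumulate(variance_ratio), start=1):
        if cumulative >= target:
            return k
    return None


# Toy ratios; cumulative sums are 0.5, 0.75, 0.875, 0.9375
ratios = [0.5, 0.25, 0.125, 0.0625]
print(n_pcs_for_variance(ratios, target=0.8))   # 3
print(n_pcs_for_variance(ratios, target=0.99))  # None
```

A `None` result suggests that the computed components do not reach the chosen target, which can be a prompt to compute more components or to reconsider the target.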