Scanpy Elite

All-in-one skill covering single-cell analysis and data loading. Includes structured workflows, validation checks, and reusable patterns for scientific analysis.

Skill · Cliptics · scientific · v1.0.0 · MIT

Analyze single-cell RNA sequencing data using Scanpy, a widely used Python toolkit for single-cell analysis. This skill covers data loading, quality control, normalization, dimensionality reduction, clustering, differential expression, and trajectory analysis for scRNA-seq experiments.

When to Use This Skill

Choose Scanpy Elite when you need to:

  • Process and analyze scRNA-seq data from 10x Genomics, Smart-seq2, or other platforms
  • Perform cell clustering, marker gene identification, and cell type annotation
  • Run trajectory and pseudotime analysis on developmental or differentiation data
  • Integrate multiple single-cell datasets for batch correction and atlas building

Consider alternatives when:

  • You need deep generative models for scRNA-seq (use scvi-tools)
  • You need spatial transcriptomics analysis (use Squidpy or SpatialDE)
  • You need R-based analysis (use Seurat)

Quick Start

    pip install scanpy leidenalg

    import scanpy as sc

    # Load 10x Genomics data
    adata = sc.read_10x_mtx("filtered_gene_bc_matrices/hg19/")
    print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

    # Quality control
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")  # human mitochondrial genes
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
    # Copy after subsetting so later steps modify real data, not a view
    adata = adata[adata.obs.pct_counts_mt < 20].copy()

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    # Keep log-normalized values for all genes; rank_genes_groups uses .raw when set
    adata.raw = adata.copy()
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable].copy()

    # Dimensionality reduction and clustering
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata)
    sc.pp.neighbors(adata, n_neighbors=15)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=0.5)

    # Visualize
    sc.pl.umap(adata, color="leiden", save="_clusters.png")
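The fixed cutoffs in the Quick Start (min_genes=200, pct_counts_mt < 20) are starting points rather than universal values. One common alternative is to flag cells whose QC metric lies far from the median, measured in median absolute deviations (MAD). The sketch below uses NumPy only; the function name `mad_outliers`, the 5-MAD cutoff, and the sample values are illustrative assumptions, not part of Scanpy.

```python
import numpy as np

def mad_outliers(values, n_mads=5.0):
    """Return a boolean mask: True where a value is more than n_mads MADs from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

# Hypothetical per-cell mitochondrial percentages,
# standing in for what adata.obs["pct_counts_mt"] would hold
pct_mt = np.array([2.1, 3.4, 2.8, 3.0, 45.0, 2.5, 3.9, 2.2])
mask = mad_outliers(pct_mt)
print(mask)  # only the 45.0 entry is flagged
```

With an AnnData object, the same mask could be applied as `adata = adata[~mad_outliers(adata.obs["pct_counts_mt"])].copy()`. If the MAD is zero (most cells share one value), every cell that differs from the median is flagged, so inspect the distribution before filtering.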

Core Concepts

Analysis Pipeline

| Step | Function | Purpose |
|---|---|---|
| QC filtering | sc.pp.filter_cells / sc.pp.filter_genes | Remove low-quality cells/genes |
| Normalization | sc.pp.normalize_total | Library size normalization |
| Log transform | sc.pp.log1p | Stabilize variance |
| HVG selection | sc.pp.highly_variable_genes | Feature selection |
| Scaling | sc.pp.scale | Zero-center, clip outliers |
| PCA | sc.tl.pca | Linear dimensionality reduction |
| Neighbors | sc.pp.neighbors | Build kNN graph |
| Clustering | sc.tl.leiden | Community detection |
| UMAP | sc.tl.umap | 2D visualization |
| DEG analysis | sc.tl.rank_genes_groups | Marker gene identification |
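To make the normalization and log-transform rows concrete, here is a minimal NumPy sketch of the arithmetic behind library size normalization followed by log1p. It assumes a small dense cells-by-genes count matrix; Scanpy itself works on AnnData objects (often sparse) and offers more options, and the helper name `normalize_total_log1p` is hypothetical.

```python
import numpy as np

def normalize_total_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# Two hypothetical cells with very different sequencing depth
counts = np.array([[10, 0, 90],
                   [1, 1, 2]])
logged = normalize_total_log1p(counts)

# Undoing the log shows that both cells now have the same total
row_sums = np.expm1(logged).sum(axis=1)
print(row_sums)
```

After this step, differences between cells reflect relative expression rather than sequencing depth, and zero counts stay at zero.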

Marker Gene Analysis and Cell Type Annotation

    import scanpy as sc

    # Find marker genes for each cluster
    # (expects log-normalized data; uses adata.raw when it is set)
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")

    # View top markers per cluster
    sc.pl.rank_genes_groups(adata, n_genes=10, save="_markers.png")

    # Get results as a DataFrame (group=None returns all clusters)
    markers = sc.get.rank_genes_groups_df(adata, group=None)
    top_markers = markers.groupby("group").head(5)
    print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]])

    # Manual annotation based on known markers
    # (example labels; check them against the markers found in your own clusters)
    cluster_annotations = {
        "0": "CD4 T cells",      # CD3D, IL7R
        "1": "CD14 Monocytes",   # CD14, LYZ
        "2": "B cells",          # CD79A, MS4A1
        "3": "CD8 T cells",      # CD8A, GZMB
        "4": "NK cells",         # NKG7, GNLY
        "5": "Dendritic cells",  # FCER1A, CST3
    }
    # Clusters missing from the dict become NaN in the new column
    adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_annotations).astype("category")
    sc.pl.umap(adata, color="cell_type", save="_celltypes.png")
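Mapping cluster labels through a dict leaves any cluster without an entry as NaN, which is easy to miss when a different resolution produces more clusters than the dict covers. A small plain-Python check, with the hypothetical helper name `unannotated_clusters`, can catch this before plotting:

```python
def unannotated_clusters(cluster_labels, annotations):
    """Return cluster labels with no entry in the annotation dict (sorted as strings)."""
    return sorted(set(cluster_labels) - set(annotations))

# Hypothetical labels, standing in for adata.obs["leiden"]
labels = ["0", "1", "2", "6", "1", "0"]
annotations = {"0": "CD4 T cells", "1": "CD14 Monocytes", "2": "B cells"}

missing = unannotated_clusters(labels, annotations)
print(missing)  # cluster "6" has no annotation
```

With real data the call would be `unannotated_clusters(adata.obs["leiden"], cluster_annotations)`; an empty list means every cluster received a label.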

Dataset Integration

    import anndata as ad
    import scanpy as sc
    import scanpy.external as sce  # harmony_integrate requires the harmonypy package

    def integrate_datasets(adata_list, batch_key="batch"):
        """Integrate multiple scRNA-seq datasets with batch correction.

        Assumes each AnnData in adata_list holds raw counts in .X.
        """
        # Concatenate; index_unique keeps cell barcodes unique across batches
        adata = ad.concat(adata_list, label=batch_key, index_unique="-")

        # Standard preprocessing
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        sc.pp.highly_variable_genes(adata, batch_key=batch_key, n_top_genes=2000)
        adata = adata[:, adata.var.highly_variable].copy()

        # Batch correction with Harmony (writes adata.obsm["X_pca_harmony"])
        sc.pp.scale(adata, max_value=10)
        sc.tl.pca(adata)
        sce.pp.harmony_integrate(adata, key=batch_key)

        # Use corrected embeddings for downstream analysis
        sc.pp.neighbors(adata, use_rep="X_pca_harmony")
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=0.5)
        return adata

    # Integrate two datasets
    adata1 = sc.read_h5ad("sample1.h5ad")
    adata2 = sc.read_h5ad("sample2.h5ad")
    integrated = integrate_datasets([adata1, adata2])
    sc.pl.umap(integrated, color=["batch", "leiden"], save="_integrated.png")

Configuration

| Parameter | Description | Default |
|---|---|---|
| min_genes | Minimum genes per cell for QC | 200 |
| min_cells | Minimum cells per gene for QC | 3 |
| n_top_genes | Number of highly variable genes | 2000 |
| n_neighbors | k for kNN graph construction | 15 |
| n_pcs | Number of principal components | 50 |
| resolution | Leiden clustering resolution | 1.0 |
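One way to keep these parameters together is a small configuration object that is passed through the pipeline. The sketch below is an assumption about how such a config might be organized; `PipelineConfig` is a hypothetical class, not part of Scanpy, and its defaults simply mirror the table above.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the parameters listed in the table."""
    min_genes: int = 200
    min_cells: int = 3
    n_top_genes: int = 2000
    n_neighbors: int = 15
    n_pcs: int = 50
    resolution: float = 1.0

    def __post_init__(self):
        # Every parameter in the table must be strictly positive
        for name, value in asdict(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")

# Override only what differs from the defaults
config = PipelineConfig(resolution=0.5)
print(asdict(config))
```

The fields can then be forwarded to the matching calls, for example `sc.pp.filter_cells(adata, min_genes=config.min_genes)` and `sc.tl.leiden(adata, resolution=config.resolution)`. Freezing the dataclass keeps a run's settings from being changed halfway through an analysis.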

Best Practices

  1. Always perform quality control before analysis: Filter out cells with too few detected genes (<200), too many genes (>5000, potential doublets), or a high mitochondrial percentage (>20%, often a sign of dying cells). Treat these numbers as starting points and plot the QC distributions to choose thresholds appropriate for your dataset.

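  2. Preserve unprocessed data before it is overwritten: Store raw counts with adata.layers["counts"] = adata.X.copy() before normalization; count-based methods such as pseudobulk differential expression need them. After the log transform, set adata.raw = adata.copy() to freeze log-normalized values for all genes. sc.tl.rank_genes_groups expects logarithmized data and uses adata.raw by default when it is set, while the scaled highly variable genes drive clustering and visualization.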

  3. Choose resolution based on biological expectation — Leiden resolution controls cluster granularity. Higher resolution (1.0-2.0) produces more clusters. Start with 0.5-1.0 and adjust based on whether clusters have distinct marker genes. Over-clustering splits real populations; under-clustering merges distinct types.

  4. Use multiple marker genes for cell type annotation — Never annotate cell types based on a single marker gene. Use panels of 3-5 known markers per cell type and verify with dot plots showing expression across clusters. Ambiguous clusters may represent transitional states or doublets.

  5. Apply batch correction when integrating datasets: Different samples, donors, or sequencing runs introduce batch effects that create artificial separation. Use Harmony, Scanorama, or BBKNN for integration. Verify that batch-corrected clusters contain cells from multiple batches, keeping in mind that a cell type present in only one sample will legitimately form a single-batch cluster.
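The check described in item 5 can be sketched in plain Python by computing, for each cluster, the fraction of cells contributed by each batch. The function names and the 0.9 cutoff are illustrative assumptions; with real data the inputs would be adata.obs["leiden"] and adata.obs["batch"].

```python
from collections import Counter, defaultdict

def batch_fractions(clusters, batches):
    """Return {cluster: {batch: fraction of the cluster's cells from that batch}}."""
    counts = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        counts[cluster][batch] += 1
    return {
        cluster: {b: n / sum(c.values()) for b, n in c.items()}
        for cluster, c in counts.items()
    }

def single_batch_clusters(clusters, batches, max_fraction=0.9):
    """Return clusters in which one batch contributes more than max_fraction of cells."""
    fractions = batch_fractions(clusters, batches)
    return sorted(c for c, f in fractions.items() if max(f.values()) > max_fraction)

# Hypothetical labels for eight cells
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
batches = ["a", "b", "a", "b", "a", "a", "a", "a"]

fractions = batch_fractions(clusters, batches)
flagged = single_batch_clusters(clusters, batches)
print(flagged)  # cluster "1" comes entirely from batch "a"
```

A flagged cluster is a prompt to look closer, not proof of a batch effect: it may be a cell type that exists in only one sample.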

Common Issues

Clusters driven by batch effects instead of biology — Plot UMAP colored by batch/sample to check. If clusters segregate by batch, apply batch correction. If correction over-merges distinct cell types, reduce correction strength or use a different method.

Too few or too many clusters: Adjust the Leiden resolution parameter first. Also check that PCA captures enough variance (plot the explained variance ratio) and that n_neighbors suits your dataset size (larger datasets can usually support more neighbors).
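To illustrate what "captures enough variance" means, the sketch below computes the explained variance ratio with a NumPy SVD and counts how many components are needed to reach a chosen cumulative threshold. The function names and thresholds are assumptions for illustration. In Scanpy, the per-component ratio is stored in adata.uns["pca"]["variance_ratio"] after sc.tl.pca and can be plotted with sc.pl.pca_variance_ratio.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component of X."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def n_components_for(ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratio)
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))  # clamp if the threshold is never reached

# Tiny hypothetical matrix: the first axis carries most of the spread
X = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
ratio = explained_variance_ratio(X)
print(ratio)
print(n_components_for(ratio, 0.8), n_components_for(ratio, 0.95))
```

When only the top components are computed, as sc.tl.pca does, the cumulative ratio may never reach the threshold; the clamp then returns all available components, which suggests computing more.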

Marker genes not specific to clusters: Low-specificity markers (expressed in many clusters) indicate over-clustering or a biological continuum. Use sc.tl.dendrogram to identify which clusters are most similar, merge them, and then re-annotate.
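A simple way to quantify specificity is to compare the fraction of cells expressing a gene inside a cluster with the fraction outside it. The NumPy sketch below uses made-up values and a hypothetical helper name; sc.tl.rank_genes_groups can report comparable fractions when called with pts=True.

```python
import numpy as np

def expression_fractions(expr, in_cluster):
    """Return (fraction of cluster cells, fraction of other cells) with expression > 0."""
    expr = np.asarray(expr, dtype=float)
    in_cluster = np.asarray(in_cluster, dtype=bool)
    detected = expr > 0
    return float(detected[in_cluster].mean()), float(detected[~in_cluster].mean())

# Hypothetical expression of one gene across eight cells; the first four are in the cluster
expr = np.array([5, 3, 0, 4, 0, 0, 1, 0])
in_cluster = np.array([True, True, True, True, False, False, False, False])

frac_in, frac_out = expression_fractions(expr, in_cluster)
print(frac_in, frac_out)
```

A gene detected in most cells of the cluster and few cells elsewhere is a stronger marker than one with similar fractions on both sides, whatever its fold change.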
