Ultimate Scvi Tools
Analyze single-cell multi-omics data using scvi-tools, a framework for probabilistic deep generative models built on PyTorch. This skill covers variational autoencoders for scRNA-seq, data integration across batches and modalities, differential expression testing, and reference-query cell type annotation.
When to Use This Skill
Choose Ultimate Scvi Tools when you need to:
- Integrate single-cell RNA-seq datasets across batches, donors, or technologies
- Perform probabilistic differential expression with a Bayesian statistical framework
- Annotate cell types by transferring labels from reference to query datasets
- Analyze multi-modal data (RNA + ATAC, RNA + protein with CITE-seq)
Consider alternatives when:
- You need basic scRNA-seq analysis without integration (use Scanpy)
- You need trajectory analysis or RNA velocity (use scVelo or CellRank)
- You need spatial transcriptomics analysis (use Squidpy or SpatialData)
Quick Start
```shell
pip install scvi-tools scanpy
```

```python
import scvi
import scanpy as sc

# Load example dataset
adata = sc.datasets.pbmc3k_processed()

# Prepare for scVI. scVI needs raw counts; note that pbmc3k_processed stores
# already-processed values in .X, so in a real analysis load raw counts here.
adata.layers["counts"] = adata.X.copy()
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, flavor="seurat_v3", layer="counts", subset=True
)

# Setup and train scVI model
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["louvain"],  # use batch_key="..." if a batch column is available
)
model = scvi.model.SCVI(adata, n_latent=10, n_layers=2)
model.train(max_epochs=100, early_stopping=True)

# Get latent representation
adata.obsm["X_scVI"] = model.get_latent_representation()

# Use the scVI latent space for downstream analysis
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["louvain"], save="_scvi.pdf")

print(f"Epochs trained: {model.history['elbo_train'].shape[0]}")
print(f"Latent dimensions: {adata.obsm['X_scVI'].shape}")
```
Core Concepts
Available Models
| Model | Class | Data Type | Purpose |
|---|---|---|---|
| scVI | scvi.model.SCVI | scRNA-seq | Batch correction, dimensionality reduction |
| scANVI | scvi.model.SCANVI | scRNA-seq + labels | Semi-supervised cell annotation |
| totalVI | scvi.model.TOTALVI | CITE-seq (RNA + protein) | Multi-modal integration |
| MultiVI | scvi.model.MULTIVI | RNA + ATAC | Multi-omic integration |
| SOLO | scvi.external.SOLO | scRNA-seq | Doublet detection |
| DestVI | scvi.model.DestVI | Spatial + scRNA-seq | Spatial deconvolution |
| PeakVI | scvi.model.PEAKVI | scATAC-seq | Chromatin accessibility |
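The SOLO doublet detector listed above builds on an already-trained scVI model rather than raw data. A minimal sketch, assuming `model` and `adata` from the Quick Start above; `restrict_to_batch` and the `doublet`/`singlet` probability columns follow SOLO's API:

```python
import scvi

# SOLO is trained on top of an existing scVI model (`model` from the Quick
# Start). For multi-batch data, train one SOLO per batch by passing
# restrict_to_batch=... to from_scvi_model.
solo = scvi.external.SOLO.from_scvi_model(model)
solo.train()

# predict() returns a cells x {doublet, singlet} probability DataFrame
preds = solo.predict()
adata.obs["is_doublet"] = (preds["doublet"] > preds["singlet"]).values

# Drop predicted doublets before downstream analysis
adata = adata[~adata.obs["is_doublet"]].copy()
```

Running doublet removal before integration keeps artificial "hybrid" cells from blurring cluster boundaries in the latent space.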
Batch Integration + Differential Expression
```python
import scvi
import scanpy as sc

# Load multi-batch dataset
adata = sc.read_h5ad("multi_batch_data.h5ad")

# Setup with batch key
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",  # column used for batch correction
    categorical_covariate_keys=["sample_id"],
    continuous_covariate_keys=["percent_mito"],
)

# Train with batch correction
model = scvi.model.SCVI(
    adata,
    n_latent=30,
    n_layers=2,
    gene_likelihood="zinb",   # zero-inflated negative binomial
    dispersion="gene-batch",  # per-gene, per-batch dispersion
)
model.train(max_epochs=200, early_stopping=True, batch_size=256)

# Differential expression between cell types
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    delta=0.25,  # log fold-change threshold for significance
)

# Filter significant DE genes
significant = de_results[
    de_results["is_de_fdr_0.05"] & (de_results["lfc_mean"].abs() > 0.5)
].sort_values("bayes_factor", ascending=False)

print(f"Significant DE genes: {len(significant)}")
print(significant.head(10)[["lfc_mean", "lfc_std", "bayes_factor"]])
```
Reference-Query Annotation (scANVI)
```python
import scvi
import scanpy as sc

# Train on reference (labeled) data
ref_adata = sc.read_h5ad("reference.h5ad")
scvi.model.SCVI.setup_anndata(ref_adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(ref_adata, n_latent=30)
vae.train(max_epochs=100)

# Convert to scANVI for semi-supervised learning
scanvi = scvi.model.SCANVI.from_scvi_model(
    vae,
    unlabeled_category="Unknown",
    labels_key="cell_type",  # known labels in the reference
)
scanvi.train(max_epochs=50)

# Annotate query data: align query genes/fields to the reference model,
# then fine-tune on the query (scArches-style surgery)
query_adata = sc.read_h5ad("query.h5ad")
scvi.model.SCANVI.prepare_query_anndata(query_adata, scanvi)
scanvi_query = scvi.model.SCANVI.load_query_data(query_adata, scanvi)
scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Transfer labels
query_adata.obs["predicted_cell_type"] = scanvi_query.predict()
print(query_adata.obs["predicted_cell_type"].value_counts())
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `n_latent` | Latent space dimensions | 10 |
| `n_layers` | Neural network depth | 1 |
| `n_hidden` | Hidden layer width | 128 |
| `gene_likelihood` | Count distribution (`zinb`, `nb`, `poisson`) | `"zinb"` |
| `dispersion` | Dispersion parameter sharing | `"gene"` |
| `max_epochs` | Maximum training epochs | 400 |
| `batch_size` | Training mini-batch size | 128 |
| `early_stopping` | Stop when validation loss plateaus | `False` |
| `lr` | Learning rate | 0.001 |
Best Practices
- Use raw counts as input, not normalized data — scVI models learn the data-generating process from raw counts using negative binomial or zero-inflated distributions. Providing log-normalized or scaled data violates model assumptions and produces poor results. Store counts in `adata.layers["counts"]` and set `layer="counts"` during setup.
- Select highly variable genes before training — scVI works best with 2,000-5,000 highly variable genes. Run `scanpy.pp.highly_variable_genes(flavor='seurat_v3', layer='counts')` before setup. Using all genes wastes computation on uninformative features and may hurt integration quality.
- Tune `n_latent` based on dataset complexity — Use 10 for simple datasets (<5 cell types), 30 for complex tissues (>15 cell types), and 50+ for whole-organism atlases. Too few dimensions lose biological signal; too many capture noise. Evaluate by checking whether known cell types separate in the latent UMAP.
- Enable early stopping for reliable training — Set `early_stopping=True` to stop automatically when validation loss plateaus. Without it, models often overtrain and the latent space memorizes batch-specific artifacts rather than learning shared biology.
- Use Bayesian differential expression over Wilcoxon — scVI's `differential_expression()` produces calibrated Bayes factors and accounts for uncertainty in expression estimates, unlike the Wilcoxon rank-sum test, which treats point estimates as ground truth. The `delta` parameter controls the minimum fold change for biological significance.
Common Issues
Training loss is NaN or exploding — This usually indicates extreme values in the count matrix. Check for negative values, non-integer counts, or extremely large counts (>10,000). Filter cells with `sc.pp.filter_cells(adata, min_counts=100)` and genes with `sc.pp.filter_genes(adata, min_cells=3)` before training.
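These failure modes can be caught with a quick sanity check before training. The sketch below uses only NumPy/SciPy and assumes counts live in `adata.layers["counts"]` as in the examples above; the helper name `check_counts` is illustrative:

```python
import numpy as np
import scipy.sparse as sp

def check_counts(X):
    """Verify a matrix looks like raw counts: non-negative, integer, sane maximum."""
    data = X.data if sp.issparse(X) else np.asarray(X)
    problems = []
    if data.min() < 0:
        problems.append("negative values (scaled data?)")
    if not np.allclose(data, np.round(data)):
        problems.append("non-integer values (normalized data?)")
    if data.max() > 10_000:
        problems.append(f"extreme count {data.max():.0f} (check for outlier cells)")
    return problems

# Example: a valid count matrix passes, a log-normalized one does not
counts = sp.csr_matrix(np.array([[0, 3, 120], [5, 0, 2]]))
print(check_counts(counts))                      # []
print(check_counts(np.log1p(counts.toarray())))  # flags non-integer values
```

Run it as `check_counts(adata.layers["counts"])` and fix any reported problems before calling `model.train()`.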
Batch correction removes biological variation — Over-correction happens when biological and batch effects are confounded (e.g., one batch has only T cells). Include known biological covariates in `categorical_covariate_keys` to preserve them, or stratify batches so each contains multiple cell types.
scANVI predictions are mostly "Unknown" — The model lacks confidence to assign labels. Increase training epochs, ensure reference data covers the cell types present in the query, and check that reference labels are accurate. Low-quality reference annotations propagate errors to query predictions.
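When predictions look unreliable, it helps to inspect per-cell probabilities: scANVI's `predict(soft=True)` returns a cells × labels probability DataFrame. The sketch below applies a confidence threshold to such a frame; the toy probabilities and the `"Unresolved"` label are illustrative stand-ins for real model output:

```python
import pandas as pd

# Illustrative stand-in for `scanvi_query.predict(soft=True)`:
# rows = cells, columns = candidate labels, values = probabilities
probs = pd.DataFrame(
    {
        "T cells": [0.95, 0.40, 0.10],
        "B cells": [0.03, 0.35, 0.85],
        "Unknown": [0.02, 0.25, 0.05],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Keep the argmax label only when the model is confident enough
confidence = probs.max(axis=1)
labels = probs.idxmax(axis=1).where(confidence >= 0.8, other="Unresolved")
print(labels.to_dict())  # cell_2 falls below the 0.8 threshold
```

Cells marked `"Unresolved"` can then be revisited with more training epochs or a broader reference, rather than silently accepting low-confidence calls.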