Ultimate Scvi Tools

Streamline your single-cell analysis workflow with this skill. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Analyze single-cell multi-omics data using scvi-tools, a framework for probabilistic deep generative models built on PyTorch. This skill covers variational autoencoders for scRNA-seq, data integration across batches and modalities, differential expression testing, and reference-query cell type annotation.

When to Use This Skill

Choose Ultimate Scvi Tools when you need to:

  • Integrate single-cell RNA-seq datasets across batches, donors, or technologies
  • Perform probabilistic differential expression with Bayesian statistical framework
  • Annotate cell types by transferring labels from reference to query datasets
  • Analyze multi-modal data (RNA + ATAC, RNA + protein with CITE-seq)

Consider alternatives when:

  • You need basic scRNA-seq analysis without integration (use Scanpy)
  • You need trajectory analysis or RNA velocity (use scVelo or CellRank)
  • You need spatial transcriptomics analysis (use Squidpy or SpatialData)

Quick Start

pip install scvi-tools scanpy
```python
import scvi
import scanpy as sc

# Load an example dataset with raw counts (scVI must not be given
# log-normalized or scaled data)
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()  # scVI needs raw counts

# Select highly variable genes on the counts layer
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, flavor="seurat_v3", layer="counts", subset=True
)

# Set up and train the scVI model (add batch_key="..." if batches exist)
scvi.model.SCVI.setup_anndata(adata, layer="counts")
model = scvi.model.SCVI(adata, n_latent=10, n_layers=2)
model.train(max_epochs=100, early_stopping=True)

# Use the scVI latent space for downstream analysis
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["leiden"], save="_scvi.pdf")

print(f"Epochs trained: {len(model.history['elbo_train'])}")
print(f"Latent dimensions: {latent.shape}")
```

Core Concepts

Available Models

| Model | Class | Data Type | Purpose |
|---|---|---|---|
| scVI | scvi.model.SCVI | scRNA-seq | Batch correction, dimensionality reduction |
| scANVI | scvi.model.SCANVI | scRNA-seq + labels | Semi-supervised cell annotation |
| totalVI | scvi.model.TOTALVI | CITE-seq (RNA + protein) | Multi-modal integration |
| MultiVI | scvi.model.MULTIVI | RNA + ATAC | Multi-omic integration |
| SOLO | scvi.external.SOLO | scRNA-seq | Doublet detection |
| DestVI | scvi.model.DestVI | Spatial + scRNA-seq | Spatial deconvolution |
| PeakVI | scvi.model.PEAKVI | scATAC-seq | Chromatin accessibility |

Batch Integration + Differential Expression

```python
import scvi
import scanpy as sc

# Load a multi-batch dataset
adata = sc.read_h5ad("multi_batch_data.h5ad")

# Register the data with a batch key and extra covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",  # column used for batch correction
    categorical_covariate_keys=["sample_id"],
    continuous_covariate_keys=["percent_mito"],
)

# Train with batch correction
model = scvi.model.SCVI(
    adata,
    n_latent=30,
    n_layers=2,
    gene_likelihood="zinb",   # zero-inflated negative binomial
    dispersion="gene-batch",  # per-gene, per-batch dispersion
)
model.train(max_epochs=200, early_stopping=True, batch_size=256)

# Differential expression between cell types
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    delta=0.25,  # minimum log fold change treated as meaningful
)

# Filter significant DE genes
significant = de_results[
    de_results["is_de_fdr_0.05"] & (de_results["lfc_mean"].abs() > 0.5)
].sort_values("bayes_factor", ascending=False)

print(f"Significant DE genes: {len(significant)}")
print(significant.head(10)[["lfc_mean", "lfc_std", "bayes_factor"]])
```

Reference-Query Annotation (scANVI)

```python
import scvi
import scanpy as sc

# Train scVI on the labeled reference data
ref_adata = sc.read_h5ad("reference.h5ad")
scvi.model.SCVI.setup_anndata(ref_adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(ref_adata, n_latent=30)
vae.train(max_epochs=100)

# Convert to scANVI for semi-supervised learning
scanvi = scvi.model.SCANVI.from_scvi_model(
    vae,
    unlabeled_category="Unknown",
    labels_key="cell_type",  # known labels in the reference
)
scanvi.train(max_epochs=50)

# Map the query data onto the trained reference model
query_adata = sc.read_h5ad("query.h5ad")
scvi.model.SCANVI.prepare_query_anndata(query_adata, scanvi)
scanvi_query = scvi.model.SCANVI.load_query_data(query_adata, scanvi)
scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Transfer labels to the query cells
query_adata.obs["predicted_cell_type"] = scanvi_query.predict()
print(query_adata.obs["predicted_cell_type"].value_counts())
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| n_latent | Latent space dimensions | 10 |
| n_layers | Neural network depth | 1 |
| n_hidden | Hidden layer width | 128 |
| gene_likelihood | Count distribution (zinb, nb, poisson) | "zinb" |
| dispersion | Dispersion parameter sharing | "gene" |
| max_epochs | Maximum training epochs | 400 |
| batch_size | Training mini-batch size | 128 |
| early_stopping | Stop when validation loss plateaus | False |
| lr | Learning rate | 0.001 |
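For intuition about the gene_likelihood options, the sketch below evaluates the zero-inflated negative binomial distribution that "zinb" refers to: a dropout probability pi inflates the mass at zero on top of a negative binomial with mean mu and inverse dispersion theta. This is standalone NumPy/SciPy code, not scvi-tools internals; the function and parameter names are illustrative.

```python
import numpy as np
from scipy import stats

def zinb_pmf(k, mu, theta, pi):
    """P(X = k) under a zero-inflated negative binomial.

    mu: mean of the NB component, theta: inverse dispersion,
    pi: probability of a structural (dropout) zero.
    """
    # SciPy parameterizes NB by (n, p); convert from (mu, theta)
    p = theta / (theta + mu)
    nb = stats.nbinom.pmf(k, theta, p)
    # Zeros come from dropout OR the NB component; nonzeros only from NB
    return np.where(k == 0, pi + (1 - pi) * nb, (1 - pi) * nb)

ks = np.arange(0, 2000)
probs = zinb_pmf(ks, mu=5.0, theta=2.0, pi=0.3)
print(probs[0])     # mass at zero is inflated above the plain NB value
print(probs.sum())  # ≈ 1 over a wide enough support
```

The zero inflation is what lets the model separate technical dropouts from genuinely low expression.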

Best Practices

  1. Use raw counts as input, not normalized data — scVI models learn the data-generating process from raw counts using negative binomial or zero-inflated distributions. Providing log-normalized or scaled data violates model assumptions and produces poor results. Store counts in adata.layers["counts"] and set layer="counts" during setup.

  2. Select highly variable genes before training — scVI works best with 2,000-5,000 highly variable genes. Use scanpy.pp.highly_variable_genes(flavor='seurat_v3', layer='counts') before setup. Using all genes wastes computation on uninformative features and may hurt integration quality.

  3. Tune n_latent based on dataset complexity — Use 10 for simple datasets (<5 cell types), 30 for complex tissues (>15 cell types), and 50+ for whole-organism atlases. Too few dimensions lose biological signal; too many capture noise. Evaluate by checking if known cell types separate in the latent UMAP.

  4. Enable early stopping for reliable training — Set early_stopping=True to automatically stop when validation loss plateaus. Without it, models often overtrain and the latent space memorizes batch-specific artifacts rather than learning shared biology.

  5. Use Bayesian differential expression over Wilcoxon — scVI's differential_expression() produces calibrated Bayes factors and accounts for uncertainty in expression estimates, unlike Wilcoxon rank-sum which treats point estimates as ground truth. The delta parameter controls the minimum fold change for biological significance.
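To illustrate point 5, here is a toy NumPy sketch of how a posterior probability of differential expression and a log-odds Bayes factor can be derived from posterior log-fold-change samples, with delta as the effect-size threshold. The simulated samples and the exact parameterization are illustrative, not scvi-tools internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior samples of the log fold change for one gene
lfc_samples = rng.normal(loc=1.0, scale=0.4, size=5000)

delta = 0.25
# Posterior probability that the effect exceeds delta in magnitude
p_de = np.mean(np.abs(lfc_samples) > delta)

# Bayes factor on the natural-log-odds scale (epsilon avoids log(0))
eps = 1e-8
bayes_factor = np.log(p_de + eps) - np.log(1 - p_de + eps)

print(f"P(|lfc| > {delta}) = {p_de:.3f}, ln(BF) = {bayes_factor:.2f}")
```

Because the whole posterior is used, a gene with a large but highly uncertain fold change gets a lower Bayes factor than one with a smaller but tightly estimated effect.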

Common Issues

Training loss is NaN or exploding — This usually indicates extreme values in the count matrix. Check for negative values, non-integer counts, or extremely large counts (>10,000). Filter cells with sc.pp.filter_cells(adata, min_counts=100) and genes with sc.pp.filter_genes(adata, min_cells=3) before training.
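A quick pre-flight check along these lines can catch the problem before training. The helper below is hypothetical (not part of scvi-tools) and works on dense or sparse matrices:

```python
import numpy as np
from scipy import sparse

def check_counts(X, max_count=50_000):
    """Flag common causes of unstable scVI training in a count matrix."""
    data = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    problems = []
    if (data < 0).any():
        problems.append("negative values (is this scaled data?)")
    if not np.allclose(data, np.round(data)):
        problems.append("non-integer values (is this normalized data?)")
    if data.max() > max_count:
        problems.append(f"extreme counts > {max_count}")
    return problems

# Demo: a random sparse matrix with non-integer entries gets flagged
X = sparse.random(100, 50, density=0.1, random_state=0) * 10
print(check_counts(X))
```

Run it on adata.layers["counts"] before setup_anndata; an empty list means the matrix passes these basic checks.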

Batch correction removes biological variation — Over-correction happens when biological and batch effects are confounded (e.g., one batch has only T cells). Include known biological covariates in categorical_covariate_keys to preserve them, or stratify batches so each contains multiple cell types.
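One way to detect such confounding before training is to cross-tabulate batch against cell type in the per-cell metadata. The column names and values below are illustrative:

```python
import pandas as pd

# Hypothetical per-cell metadata (normally adata.obs)
obs = pd.DataFrame({
    "batch":     ["b1"] * 4 + ["b2"] * 4,
    "cell_type": ["T"] * 4 + ["B", "B", "B", "T"],
})

# Cross-tabulate batch against cell type to spot confounding
table = pd.crosstab(obs["batch"], obs["cell_type"])
print(table)

# A batch whose cells are (almost) all one type is a red flag
dominant = table.max(axis=1) / table.sum(axis=1)
print(dominant[dominant > 0.9])  # here b1 is 100% T cells
```

If a batch is dominated by a single cell type, batch correction cannot distinguish the batch effect from the biology of that type.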

scANVI predictions are mostly "Unknown" — The model lacks confidence to assign labels. Increase training epochs, ensure reference data covers the cell types present in the query, and check that reference labels are accurate. Low-quality reference annotations propagate errors to query predictions.
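If you want an explicit confidence cutoff of your own, scANVI can also return per-class probabilities via predict(soft=True); the sketch below applies a threshold to a hypothetical probability matrix to show the idea (the labels, values, and 0.7 cutoff are illustrative):

```python
import numpy as np

# Hypothetical soft label probabilities for 4 query cells over 3 types
# (each row sums to 1, as a soft prediction would)
labels = ["T cells", "B cells", "NK cells"]
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
])

threshold = 0.7  # keep a label only when the model is confident
best = probs.argmax(axis=1)
calls = [labels[i] if probs[r, i] >= threshold else "Unknown"
         for r, i in enumerate(best)]
print(calls)  # ['T cells', 'Unknown', 'B cells', 'Unknown']
```

Cells falling below the threshold can then be inspected manually rather than silently mislabeled.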
