Ultimate Scvi Tools

Streamline your single-cell analysis workflow with this skill. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Analyze single-cell multi-omics data using scvi-tools, a framework for probabilistic deep generative models built on PyTorch. This skill covers variational autoencoders for scRNA-seq, data integration across batches and modalities, differential expression testing, and reference-query cell type annotation.

When to Use This Skill

Choose Ultimate Scvi Tools when you need to:

  • Integrate single-cell RNA-seq datasets across batches, donors, or technologies
  • Perform probabilistic differential expression with Bayesian statistical framework
  • Annotate cell types by transferring labels from reference to query datasets
  • Analyze multi-modal data (RNA + ATAC, RNA + protein with CITE-seq)

Consider alternatives when:

  • You need basic scRNA-seq analysis without integration (use Scanpy)
  • You need trajectory analysis or RNA velocity (use scVelo or CellRank)
  • You need spatial transcriptomics analysis (use Squidpy or SpatialData)

Quick Start

pip install scvi-tools scanpy
```python
import scvi
import scanpy as sc

# Load an example dataset with raw counts (scVI must not be given
# log-normalized or scaled data)
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells=3)
adata.layers["counts"] = adata.X.copy()  # scVI needs raw counts

# Select highly variable genes on the counts layer
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000, flavor="seurat_v3", layer="counts", subset=True
)

# Set up and train the scVI model (add batch_key="..." if batches exist)
scvi.model.SCVI.setup_anndata(adata, layer="counts")
model = scvi.model.SCVI(adata, n_latent=10, n_layers=2)
model.train(max_epochs=100, early_stopping=True)

# Use the scVI latent space for downstream analysis
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["leiden"], save="_scvi.pdf")

print(f"Epochs trained: {len(model.history['elbo_train'])}")
print(f"Latent dimensions: {latent.shape}")
```

Core Concepts

Available Models

| Model | Class | Data Type | Purpose |
|---|---|---|---|
| scVI | scvi.model.SCVI | scRNA-seq | Batch correction, dimensionality reduction |
| scANVI | scvi.model.SCANVI | scRNA-seq + labels | Semi-supervised cell annotation |
| totalVI | scvi.model.TOTALVI | CITE-seq (RNA + protein) | Multi-modal integration |
| MultiVI | scvi.model.MULTIVI | RNA + ATAC | Multi-omic integration |
| SOLO | scvi.external.SOLO | scRNA-seq | Doublet detection |
| DestVI | scvi.model.DestVI | Spatial + scRNA-seq | Spatial deconvolution |
| PeakVI | scvi.model.PEAKVI | scATAC-seq | Chromatin accessibility |

Batch Integration + Differential Expression

```python
import scvi
import scanpy as sc

# Load a multi-batch dataset
adata = sc.read_h5ad("multi_batch_data.h5ad")

# Register the data with a batch key and extra covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",  # column used for batch correction
    categorical_covariate_keys=["sample_id"],
    continuous_covariate_keys=["percent_mito"],
)

# Train with batch correction
model = scvi.model.SCVI(
    adata,
    n_latent=30,
    n_layers=2,
    gene_likelihood="zinb",   # zero-inflated negative binomial
    dispersion="gene-batch",  # per-gene, per-batch dispersion
)
model.train(max_epochs=200, early_stopping=True, batch_size=256)

# Differential expression between cell types
de_results = model.differential_expression(
    groupby="cell_type",
    group1="T cells",
    group2="B cells",
    delta=0.25,  # minimum log fold change treated as meaningful
)

# Filter significant DE genes
significant = de_results[
    de_results["is_de_fdr_0.05"] & (de_results["lfc_mean"].abs() > 0.5)
].sort_values("bayes_factor", ascending=False)

print(f"Significant DE genes: {len(significant)}")
print(significant.head(10)[["lfc_mean", "lfc_std", "bayes_factor"]])
```

Reference-Query Annotation (scANVI)

```python
import scvi
import scanpy as sc

# Train scVI on the labeled reference data
ref_adata = sc.read_h5ad("reference.h5ad")
scvi.model.SCVI.setup_anndata(ref_adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(ref_adata, n_latent=30)
vae.train(max_epochs=100)

# Convert to scANVI for semi-supervised learning
scanvi = scvi.model.SCANVI.from_scvi_model(
    vae,
    unlabeled_category="Unknown",
    labels_key="cell_type",  # known labels in the reference
)
scanvi.train(max_epochs=50)

# Map the query data onto the trained reference model
query_adata = sc.read_h5ad("query.h5ad")
scvi.model.SCANVI.prepare_query_anndata(query_adata, scanvi)
scanvi_query = scvi.model.SCANVI.load_query_data(query_adata, scanvi)
scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Transfer labels to the query cells
query_adata.obs["predicted_cell_type"] = scanvi_query.predict()
print(query_adata.obs["predicted_cell_type"].value_counts())
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| n_latent | Latent space dimensions | 10 |
| n_layers | Neural network depth | 1 |
| n_hidden | Hidden layer width | 128 |
| gene_likelihood | Count distribution (zinb, nb, poisson) | "zinb" |
| dispersion | Dispersion parameter sharing | "gene" |
| max_epochs | Maximum training epochs | 400 |
| batch_size | Training mini-batch size | 128 |
| early_stopping | Stop when validation loss plateaus | False |
| lr | Learning rate | 0.001 |
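For intuition about the gene_likelihood options, the sketch below evaluates the zero-inflated negative binomial distribution that "zinb" refers to: a dropout probability pi inflates the mass at zero on top of a negative binomial with mean mu and inverse dispersion theta. This is standalone NumPy/SciPy code, not scvi-tools internals; the function and parameter names are illustrative.

```python
import numpy as np
from scipy import stats

def zinb_pmf(k, mu, theta, pi):
    """P(X = k) under a zero-inflated negative binomial.

    mu: mean of the NB component, theta: inverse dispersion,
    pi: probability of a structural (dropout) zero.
    """
    # SciPy parameterizes NB by (n, p); convert from (mu, theta)
    p = theta / (theta + mu)
    nb = stats.nbinom.pmf(k, theta, p)
    # Zeros come from dropout OR the NB component; nonzeros only from NB
    return np.where(k == 0, pi + (1 - pi) * nb, (1 - pi) * nb)

ks = np.arange(0, 2000)
probs = zinb_pmf(ks, mu=5.0, theta=2.0, pi=0.3)
print(probs[0])     # mass at zero is inflated above the plain NB value
print(probs.sum())  # ≈ 1 over a wide enough support
```

The zero inflation is what lets the model separate technical dropouts from genuinely low expression.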

Best Practices

  1. Use raw counts as input, not normalized data — scVI models learn the data-generating process from raw counts using negative binomial or zero-inflated distributions. Providing log-normalized or scaled data violates model assumptions and produces poor results. Store counts in adata.layers["counts"] and set layer="counts" during setup.

  2. Select highly variable genes before training — scVI works best with 2,000-5,000 highly variable genes. Use scanpy.pp.highly_variable_genes(flavor='seurat_v3', layer='counts') before setup. Using all genes wastes computation on uninformative features and may hurt integration quality.

  3. Tune n_latent based on dataset complexity — Use 10 for simple datasets (<5 cell types), 30 for complex tissues (>15 cell types), and 50+ for whole-organism atlases. Too few dimensions lose biological signal; too many capture noise. Evaluate by checking if known cell types separate in the latent UMAP.

  4. Enable early stopping for reliable training — Set early_stopping=True to automatically stop when validation loss plateaus. Without it, models often overtrain and the latent space memorizes batch-specific artifacts rather than learning shared biology.

  5. Use Bayesian differential expression over Wilcoxon — scVI's differential_expression() produces calibrated Bayes factors and accounts for uncertainty in expression estimates, unlike Wilcoxon rank-sum which treats point estimates as ground truth. The delta parameter controls the minimum fold change for biological significance.
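To illustrate point 5, here is a toy NumPy sketch of how a posterior probability of differential expression and a log-odds Bayes factor can be derived from posterior log-fold-change samples, with delta as the effect-size threshold. The simulated samples and the exact parameterization are illustrative, not scvi-tools internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior samples of the log fold change for one gene
lfc_samples = rng.normal(loc=1.0, scale=0.4, size=5000)

delta = 0.25
# Posterior probability that the effect exceeds delta in magnitude
p_de = np.mean(np.abs(lfc_samples) > delta)

# Bayes factor on the natural-log-odds scale (epsilon avoids log(0))
eps = 1e-8
bayes_factor = np.log(p_de + eps) - np.log(1 - p_de + eps)

print(f"P(|lfc| > {delta}) = {p_de:.3f}, ln(BF) = {bayes_factor:.2f}")
```

Because the whole posterior is used, a gene with a large but highly uncertain fold change gets a lower Bayes factor than one with a smaller but tightly estimated effect.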

Common Issues

Training loss is NaN or exploding — This usually indicates extreme values in the count matrix. Check for negative values, non-integer counts, or extremely large counts (>10,000). Filter cells with sc.pp.filter_cells(adata, min_counts=100) and genes with sc.pp.filter_genes(adata, min_cells=3) before training.
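A quick pre-flight check along these lines can catch the problem before training. The helper below is hypothetical (not part of scvi-tools) and works on dense or sparse matrices:

```python
import numpy as np
from scipy import sparse

def check_counts(X, max_count=50_000):
    """Flag common causes of unstable scVI training in a count matrix."""
    data = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    problems = []
    if (data < 0).any():
        problems.append("negative values (is this scaled data?)")
    if not np.allclose(data, np.round(data)):
        problems.append("non-integer values (is this normalized data?)")
    if data.max() > max_count:
        problems.append(f"extreme counts > {max_count}")
    return problems

# Demo: a random sparse matrix with non-integer entries gets flagged
X = sparse.random(100, 50, density=0.1, random_state=0) * 10
print(check_counts(X))
```

Run it on adata.layers["counts"] before setup_anndata; an empty list means the matrix passes these basic checks.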

Batch correction removes biological variation — Over-correction happens when biological and batch effects are confounded (e.g., one batch has only T cells). Include known biological covariates in categorical_covariate_keys to preserve them, or stratify batches so each contains multiple cell types.
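One way to detect such confounding before training is to cross-tabulate batch against cell type in the per-cell metadata. The column names and values below are illustrative:

```python
import pandas as pd

# Hypothetical per-cell metadata (normally adata.obs)
obs = pd.DataFrame({
    "batch":     ["b1"] * 4 + ["b2"] * 4,
    "cell_type": ["T"] * 4 + ["B", "B", "B", "T"],
})

# Cross-tabulate batch against cell type to spot confounding
table = pd.crosstab(obs["batch"], obs["cell_type"])
print(table)

# A batch whose cells are (almost) all one type is a red flag
dominant = table.max(axis=1) / table.sum(axis=1)
print(dominant[dominant > 0.9])  # here b1 is 100% T cells
```

If a batch is dominated by a single cell type, batch correction cannot distinguish the batch effect from the biology of that type.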

scANVI predictions are mostly "Unknown" — The model lacks confidence to assign labels. Increase training epochs, ensure reference data covers the cell types present in the query, and check that reference labels are accurate. Low-quality reference annotations propagate errors to query predictions.
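If you want an explicit confidence cutoff of your own, scANVI can also return per-class probabilities via predict(soft=True); the sketch below applies a threshold to a hypothetical probability matrix to show the idea (the labels, values, and 0.7 cutoff are illustrative):

```python
import numpy as np

# Hypothetical soft label probabilities for 4 query cells over 3 types
# (each row sums to 1, as a soft prediction would)
labels = ["T cells", "B cells", "NK cells"]
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
])

threshold = 0.7  # keep a label only when the model is confident
best = probs.argmax(axis=1)
calls = [labels[i] if probs[r, i] >= threshold else "Unknown"
         for r, i in enumerate(best)]
print(calls)  # ['T cells', 'Unknown', 'B cells', 'Unknown']
```

Cells falling below the threshold can then be inspected manually rather than silently mislabeled.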
