Pro AnnData Workspace

Streamline your workflow with this skill. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Pro AnnData Workspace

A scientific computing skill for working with AnnData — the standard Python data structure for annotated single-cell genomics data. Pro AnnData Workspace helps you create, manipulate, and analyze AnnData objects for scRNA-seq, scATAC-seq, and other single-cell assay data, with tight integration into the Scanpy ecosystem.

When to Use This Skill

Choose Pro AnnData Workspace when:

  • Loading, creating, or converting single-cell datasets to AnnData format
  • Manipulating observation (cell) and variable (gene) annotations
  • Subsetting, concatenating, or transforming AnnData objects
  • Preparing data for downstream analysis with Scanpy or other tools

Consider alternatives when:

  • You need full single-cell analysis pipelines (use Scanpy directly)
  • You're working with bulk RNA-seq (use DESeq2 or edgeR)
  • You need spatial transcriptomics (use Squidpy or SpatialData)
  • You're doing general data manipulation without genomics context (use pandas)

Quick Start

```
claude "Create an AnnData object from my single-cell count matrix"
```

```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Create AnnData from scratch
n_cells, n_genes = 1000, 2000
counts = csr_matrix(np.random.poisson(0.5, (n_cells, n_genes)))

adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame({
        "cell_type": np.random.choice(["T cell", "B cell", "Monocyte"], n_cells),
        "sample": np.random.choice(["patient_1", "patient_2"], n_cells),
    }, index=[f"cell_{i}" for i in range(n_cells)]),
    var=pd.DataFrame({
        "gene_name": [f"Gene_{i}" for i in range(n_genes)],
        "highly_variable": np.random.choice([True, False], n_genes),
    }, index=[f"ENSG{i:011d}" for i in range(n_genes)]),
)

print(adata)
# AnnData object with n_obs × n_vars = 1000 × 2000
#     obs: 'cell_type', 'sample'
#     var: 'gene_name', 'highly_variable'
```

Core Concepts

AnnData Structure

| Component | Attribute | Description |
| --- | --- | --- |
| Expression matrix | `adata.X` | Cells × genes count matrix (sparse or dense) |
| Observations | `adata.obs` | Cell-level metadata (DataFrame) |
| Variables | `adata.var` | Gene-level metadata (DataFrame) |
| Unstructured | `adata.uns` | Arbitrary metadata (dict) |
| Obsm | `adata.obsm` | Cell embeddings (UMAP, PCA) |
| Varm | `adata.varm` | Gene embeddings |
| Layers | `adata.layers` | Alternative matrices (raw, normalized) |
| Obsp | `adata.obsp` | Cell–cell graphs (neighbors) |

Common Operations

```python
# Reading data
adata = ad.read_h5ad("data.h5ad")
adata = ad.read_csv("counts.csv")
adata = ad.read_loom("data.loom")

# Subsetting
t_cells = adata[adata.obs["cell_type"] == "T cell"]
top_genes = adata[:, adata.var["highly_variable"]]

# Concatenating multiple datasets
combined = ad.concat([adata1, adata2], label="batch", keys=["sample1", "sample2"])

# Storing layers
adata.layers["raw"] = adata.X.copy()
adata.layers["normalized"] = normalize(adata.X)

# Saving
adata.write_h5ad("processed.h5ad")
```

Integration with Scanpy

```python
import scanpy as sc

# Standard preprocessing pipeline
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

sc.pp.pca(adata)        # Stored in adata.obsm["X_pca"]
sc.pp.neighbors(adata)  # Stored in adata.obsp["connectivities"]
sc.tl.umap(adata)       # Stored in adata.obsm["X_umap"]
sc.tl.leiden(adata)     # Stored in adata.obs["leiden"]
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `backed_mode` | Load data lazily from disk | None (in-memory) |
| `compression` | H5AD compression (`gzip`, `lzf`) | None |
| `dtype` | Matrix data type | `float32` |
| `sparse_format` | CSR or CSC sparse matrix | `csr` |
| `file_format` | `h5ad`, `loom`, or `zarr` | `h5ad` |

Best Practices

  1. Store raw counts in a layer. Before normalizing or transforming, save the raw count matrix with adata.layers["counts"] = adata.X.copy(). Many downstream tools need raw counts, and you can't reverse a log transformation without losing precision.

  2. Use sparse matrices for large datasets. Single-cell count matrices are typically >95% zeros. Use scipy.sparse.csr_matrix to reduce memory by 10-50x. Most AnnData operations and Scanpy functions work seamlessly with sparse matrices.
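To see the savings concretely, here is a quick memory comparison using only NumPy and SciPy (the 1000 × 2000 shape and 0.05 Poisson rate are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A typical single-cell matrix is mostly zeros
dense = np.random.poisson(0.05, (1000, 2000)).astype(np.float32)
sparse = csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB")
```

With roughly 5% nonzeros, the CSR representation stores only the nonzero values plus their indices, giving about a 10x reduction here; real datasets with >95% zeros see similar or better ratios.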

  3. Set meaningful index values. Use cell barcodes as adata.obs.index and gene symbols or Ensembl IDs as adata.var.index. This makes subsetting intuitive: adata[:, "TP53"] to get a specific gene.

  4. Use adata.uns for experiment metadata. Store batch information, processing parameters, and analysis provenance in adata.uns. This keeps metadata with the data and makes analyses reproducible.

  5. Write to .h5ad with compression for archiving. Use adata.write_h5ad("data.h5ad", compression="gzip") for long-term storage. The compressed file is typically 5-10x smaller than uncompressed, with minimal read-time overhead.

Common Issues

Memory error when loading large datasets. Use backed mode: adata = ad.read_h5ad("data.h5ad", backed="r") to load data lazily. This reads only the metadata into memory and accesses the matrix on demand. Note that backed mode doesn't support all operations — convert to in-memory for complex analyses.

Concatenation produces unexpected NaN values. When concatenating AnnData objects with different gene sets, missing genes are filled with NaN or zeros depending on the fill_value parameter. Use ad.concat(..., join="inner") to keep only shared genes, or join="outer" with explicit fill_value=0.

Index duplication error after subsetting or concatenation. AnnData requires unique indices. After concatenation, use adata.obs_names_make_unique() and adata.var_names_make_unique(). For meaningful uniqueness, include batch identifiers in cell barcodes before concatenating.
