Pro AnnData Workspace
A scientific computing skill for working with AnnData — the standard Python data structure for annotated single-cell genomics data. Pro AnnData Workspace helps you create, manipulate, and analyze AnnData objects for scRNA-seq, scATAC-seq, and other single-cell assay data, with tight integration into the Scanpy ecosystem.
When to Use This Skill
Choose Pro AnnData Workspace when:
- Loading, creating, or converting single-cell datasets to AnnData format
- Manipulating observation (cell) and variable (gene) annotations
- Subsetting, concatenating, or transforming AnnData objects
- Preparing data for downstream analysis with Scanpy or other tools
Consider alternatives when:
- You need full single-cell analysis pipelines (use Scanpy directly)
- You're working with bulk RNA-seq (use DESeq2 or edgeR)
- You need spatial transcriptomics (use Squidpy or SpatialData)
- You're doing general data manipulation without genomics context (use pandas)
Quick Start
```bash
claude "Create an AnnData object from my single-cell count matrix"
```

```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Create AnnData from scratch
n_cells, n_genes = 1000, 2000
counts = csr_matrix(np.random.poisson(0.5, (n_cells, n_genes)))

adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame({
        "cell_type": np.random.choice(["T cell", "B cell", "Monocyte"], n_cells),
        "sample": np.random.choice(["patient_1", "patient_2"], n_cells),
    }, index=[f"cell_{i}" for i in range(n_cells)]),
    var=pd.DataFrame({
        "gene_name": [f"Gene_{i}" for i in range(n_genes)],
        "highly_variable": np.random.choice([True, False], n_genes),
    }, index=[f"ENSG{i:011d}" for i in range(n_genes)]),
)

print(adata)
# AnnData object with n_obs × n_vars = 1000 × 2000
#     obs: 'cell_type', 'sample'
#     var: 'gene_name', 'highly_variable'
```
Core Concepts
AnnData Structure
| Component | Attribute | Description |
|---|---|---|
| Expression matrix | adata.X | Cells × genes count matrix (sparse or dense) |
| Observations | adata.obs | Cell-level metadata (DataFrame) |
| Variables | adata.var | Gene-level metadata (DataFrame) |
| Unstructured | adata.uns | Arbitrary metadata (dict) |
| Obsm | adata.obsm | Cell embeddings (UMAP, PCA) |
| Varm | adata.varm | Gene embeddings |
| Layers | adata.layers | Alternative matrices (raw, normalized) |
| Obsp | adata.obsp | Cell-cell graphs (neighbors) |
Common Operations
```python
# Reading data
adata = ad.read_h5ad("data.h5ad")
adata = ad.read_csv("counts.csv")
adata = ad.read_loom("data.loom")

# Subsetting
t_cells = adata[adata.obs["cell_type"] == "T cell"]
top_genes = adata[:, adata.var["highly_variable"]]

# Concatenating multiple datasets
combined = ad.concat([adata1, adata2], label="batch", keys=["sample1", "sample2"])

# Storing layers
adata.layers["raw"] = adata.X.copy()
adata.layers["normalized"] = normalize(adata.X)

# Saving
adata.write_h5ad("processed.h5ad")
```
Integration with Scanpy
```python
import scanpy as sc

# Standard preprocessing pipeline
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

sc.pp.pca(adata)        # Stored in adata.obsm["X_pca"]
sc.pp.neighbors(adata)  # Stored in adata.obsp["connectivities"]
sc.tl.umap(adata)       # Stored in adata.obsm["X_umap"]
sc.tl.leiden(adata)     # Stored in adata.obs["leiden"]
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| backed_mode | Load data lazily from disk | None (in-memory) |
| compression | H5AD compression (gzip, lzf) | None |
| dtype | Matrix data type | float32 |
| sparse_format | CSR or CSC sparse matrix | csr |
| file_format | h5ad, loom, or zarr | h5ad |
Best Practices
- Store raw counts in a layer. Before normalizing or transforming, save the raw count matrix with `adata.layers["counts"] = adata.X.copy()`. Many downstream tools need raw counts, and you can't reverse a log transformation without losing precision.
- Use sparse matrices for large datasets. Single-cell count matrices are typically >95% zeros. Use `scipy.sparse.csr_matrix` to reduce memory by 10-50x. Most AnnData operations and Scanpy functions work seamlessly with sparse matrices.
- Set meaningful index values. Use cell barcodes as `adata.obs.index` and gene symbols or Ensembl IDs as `adata.var.index`. This makes subsetting intuitive: `adata[:, "TP53"]` to get a specific gene.
- Use `adata.uns` for experiment metadata. Store batch information, processing parameters, and analysis provenance in `adata.uns`. This keeps metadata with the data and makes analyses reproducible.
- Write to `.h5ad` with compression for archiving. Use `adata.write_h5ad("data.h5ad", compression="gzip")` for long-term storage. The compressed file is typically 5-10x smaller than uncompressed, with minimal read-time overhead.
Common Issues
Memory error when loading large datasets. Use backed mode: adata = ad.read_h5ad("data.h5ad", backed="r") to load data lazily. This reads only the metadata into memory and accesses the matrix on demand. Note that backed mode doesn't support all operations — convert to in-memory for complex analyses.
Concatenation produces unexpected NaN values. When concatenating AnnData objects with different gene sets, missing genes are filled with NaN or zeros depending on the fill_value parameter. Use ad.concat(..., join="inner") to keep only shared genes, or join="outer" with explicit fill_value=0.
Index duplication error after subsetting or concatenation. AnnData requires unique indices. After concatenation, use adata.obs_names_make_unique() and adata.var_names_make_unique(). For meaningful uniqueness, include batch identifiers in cell barcodes before concatenating.