Pro AnnData Workspace

Streamline your workflow with this skill. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Pro AnnData Workspace

A scientific computing skill for working with AnnData — the standard Python data structure for annotated single-cell genomics data. Pro AnnData Workspace helps you create, manipulate, and analyze AnnData objects for scRNA-seq, scATAC-seq, and other single-cell assay data, with tight integration into the Scanpy ecosystem.

When to Use This Skill

Choose Pro AnnData Workspace when:

  • Loading, creating, or converting single-cell datasets to AnnData format
  • Manipulating observation (cell) and variable (gene) annotations
  • Subsetting, concatenating, or transforming AnnData objects
  • Preparing data for downstream analysis with Scanpy or other tools

Consider alternatives when:

  • You need full single-cell analysis pipelines (use Scanpy directly)
  • You're working with bulk RNA-seq (use DESeq2 or edgeR)
  • You need spatial transcriptomics (use Squidpy or SpatialData)
  • You're doing general data manipulation without genomics context (use pandas)

Quick Start

```
claude "Create an AnnData object from my single-cell count matrix"
```

```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Create AnnData from scratch
n_cells, n_genes = 1000, 2000
counts = csr_matrix(np.random.poisson(0.5, (n_cells, n_genes)))

adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame({
        "cell_type": np.random.choice(["T cell", "B cell", "Monocyte"], n_cells),
        "sample": np.random.choice(["patient_1", "patient_2"], n_cells),
    }, index=[f"cell_{i}" for i in range(n_cells)]),
    var=pd.DataFrame({
        "gene_name": [f"Gene_{i}" for i in range(n_genes)],
        "highly_variable": np.random.choice([True, False], n_genes),
    }, index=[f"ENSG{i:011d}" for i in range(n_genes)]),
)

print(adata)
# AnnData object with n_obs × n_vars = 1000 × 2000
#     obs: 'cell_type', 'sample'
#     var: 'gene_name', 'highly_variable'
```

Core Concepts

AnnData Structure

| Component | Attribute | Description |
| --- | --- | --- |
| Expression matrix | `adata.X` | Cells × genes count matrix (sparse or dense) |
| Observations | `adata.obs` | Cell-level metadata (DataFrame) |
| Variables | `adata.var` | Gene-level metadata (DataFrame) |
| Unstructured | `adata.uns` | Arbitrary metadata (dict) |
| Obsm | `adata.obsm` | Cell embeddings (UMAP, PCA) |
| Varm | `adata.varm` | Gene embeddings |
| Layers | `adata.layers` | Alternative matrices (raw, normalized) |
| Obsp | `adata.obsp` | Cell–cell graphs (neighbors) |

Common Operations

```python
# Reading data
adata = ad.read_h5ad("data.h5ad")
adata = ad.read_csv("counts.csv")
adata = ad.read_loom("data.loom")

# Subsetting
t_cells = adata[adata.obs["cell_type"] == "T cell"]
top_genes = adata[:, adata.var["highly_variable"]]

# Concatenating multiple datasets
combined = ad.concat([adata1, adata2], label="batch", keys=["sample1", "sample2"])

# Storing layers
adata.layers["raw"] = adata.X.copy()
adata.layers["normalized"] = normalize(adata.X)

# Saving
adata.write_h5ad("processed.h5ad")
```

Integration with Scanpy

```python
import scanpy as sc

# Standard preprocessing pipeline
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

sc.pp.pca(adata)        # Stored in adata.obsm["X_pca"]
sc.pp.neighbors(adata)  # Stored in adata.obsp["connectivities"]
sc.tl.umap(adata)       # Stored in adata.obsm["X_umap"]
sc.tl.leiden(adata)     # Stored in adata.obs["leiden"]
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `backed_mode` | Load data lazily from disk | None (in-memory) |
| `compression` | H5AD compression (`gzip`, `lzf`) | None |
| `dtype` | Matrix data type | `float32` |
| `sparse_format` | CSR or CSC sparse matrix | `csr` |
| `file_format` | `h5ad`, `loom`, or `zarr` | `h5ad` |

Best Practices

  1. Store raw counts in a layer. Before normalizing or transforming, save the raw count matrix with adata.layers["counts"] = adata.X.copy(). Many downstream tools need raw counts, and you can't reverse a log transformation without losing precision.

  2. Use sparse matrices for large datasets. Single-cell count matrices are typically >95% zeros. Use scipy.sparse.csr_matrix to reduce memory by 10-50x. Most AnnData operations and Scanpy functions work seamlessly with sparse matrices.
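To see the savings concretely, here is a quick memory comparison using only NumPy and SciPy (the 1000 × 2000 shape and 0.05 Poisson rate are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A typical single-cell matrix is mostly zeros
dense = np.random.poisson(0.05, (1000, 2000)).astype(np.float32)
sparse = csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB")
```

With roughly 5% nonzeros, the CSR representation stores only the nonzero values plus their indices, giving about a 10x reduction here; real datasets with >95% zeros see similar or better ratios.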

  3. Set meaningful index values. Use cell barcodes as adata.obs.index and gene symbols or Ensembl IDs as adata.var.index. This makes subsetting intuitive: adata[:, "TP53"] to get a specific gene.

  4. Use adata.uns for experiment metadata. Store batch information, processing parameters, and analysis provenance in adata.uns. This keeps metadata with the data and makes analyses reproducible.

  5. Write to .h5ad with compression for archiving. Use adata.write_h5ad("data.h5ad", compression="gzip") for long-term storage. The compressed file is typically 5-10x smaller than uncompressed, with minimal read-time overhead.

Common Issues

Memory error when loading large datasets. Use backed mode: adata = ad.read_h5ad("data.h5ad", backed="r") to load data lazily. This reads only the metadata into memory and accesses the matrix on demand. Note that backed mode doesn't support all operations — convert to in-memory for complex analyses.

Concatenation produces unexpected NaN values. When concatenating AnnData objects with different gene sets, missing genes are filled with NaN or zeros depending on the fill_value parameter. Use ad.concat(..., join="inner") to keep only shared genes, or join="outer" with explicit fill_value=0.

Index duplication error after subsetting or concatenation. AnnData requires unique indices. After concatenation, use adata.obs_names_make_unique() and adata.var_names_make_unique(). For meaningful uniqueness, include batch identifiers in cell barcodes before concatenating.
