Ultimate CellxGene Framework

A scientific computing skill for working with CellxGene (Cell by Gene) — the interactive single-cell data explorer and data portal maintained by the Chan Zuckerberg Initiative. Ultimate CellxGene Framework helps you discover, download, and analyze published single-cell datasets through the CellxGene Census API and Discover portal.

When to Use This Skill

Choose Ultimate CellxGene Framework when:

Searching the CellxGene Discover portal for published single-cell datasets
Downloading annotated scRNA-seq data via the CellxGene Census API
Querying across hundreds of datasets for specific cell types or genes
Building reference atlases from curated single-cell data

Consider alternatives when:

You need to analyze your own unpublished data (use Scanpy directly)
You need bulk RNA-seq data (use GEO or ArrayExpress)
You want the CellxGene visualization tool locally (use cellxgene launch)
You need spatial transcriptomics data (use Squidpy or SpatialData)

Quick Start


claude "Download human lung T cell data from CellxGene Census"


import cellxgene_census

# Open the Census (with no arguments this opens the "stable" release)
census = cellxgene_census.open_soma()

# Query for human lung T cells
adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "tissue_general == 'lung' and "
        "cell_type == 'T cell'"
    ),
    column_names={
        "obs": ["cell_type", "tissue", "disease", "donor_id",
                "dataset_id", "assay"],
        "var": ["feature_name", "feature_id"]
    }
)

print(f"Cells: {adata.n_obs}")
print(f"Genes: {adata.n_vars}")
print(f"Datasets: {adata.obs['dataset_id'].nunique()}")
# cell_type is fixed to 'T cell' by the filter, so break the cells down by tissue instead
print(f"Tissues:\n{adata.obs['tissue'].value_counts()}")

census.close()

Core Concepts

CellxGene Census API

Function	Purpose	Returns
`open_soma()`	Connect to Census database	Census connection
`get_anndata()`	Query and download data	AnnData object
`get_obs()`	Query cell metadata only	DataFrame
`get_var()`	Query gene metadata only	DataFrame
`get_presence_matrix()`	Which genes in which datasets	Sparse matrix

Query Filters


import cellxgene_census

census = cellxgene_census.open_soma()

# Filter by tissue and disease
disease_cells = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "disease != 'normal' and "
        "tissue_general == 'brain'"
    )
)

# Filter by specific cell type and assay
specific = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "cell_type == 'hepatocyte' and "
        "assay == '10x 3\\' v3'"
    ),
    var_value_filter="feature_name in ['ALB', 'AFP', 'HNF4A', 'CYP3A4']"
)

# Get metadata without expression data (fast)
obs_df = cellxgene_census.get_obs(
    census,
    organism="Homo sapiens",
    value_filter="tissue_general == 'heart'",
    column_names=["cell_type", "disease", "donor_id"]
)
print(obs_df["cell_type"].value_counts())

census.close()
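
Filter strings like the ones above are easy to get wrong once a value contains a quote (the assay name 10x 3' v3 is the usual offender). The following is a minimal sketch of a helper that assembles them; build_value_filter is a hypothetical name, not part of cellxgene_census. It assumes the Census filter parser accepts any Python-style string literal, including double-quoted ones, and it only covers equality and membership tests.

```python
import ast


def build_value_filter(**conditions):
    """Join column conditions into one filter expression with `and`.

    A scalar value becomes `column == value`; a list or tuple becomes
    `column in [...]`. repr() supplies the quoting, so a value that
    contains a single quote needs no manual escaping.
    Hypothetical helper: not part of the cellxgene_census package.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    expression = " and ".join(clauses)
    # Sanity check: the result must parse as a single Python expression.
    # Assumption: the Census parser accepts the same literal syntax.
    ast.parse(expression, mode="eval")
    return expression


obs_filter = build_value_filter(cell_type="hepatocyte", assay="10x 3' v3")
var_filter = build_value_filter(feature_name=["ALB", "AFP", "HNF4A", "CYP3A4"])
print(obs_filter)
print(var_filter)
```

The resulting strings can be passed as obs_value_filter and var_value_filter. Conditions that need other operators (such as disease != 'normal') still have to be written by hand.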

Building Reference Atlases


import cellxgene_census
import scanpy as sc

census = cellxgene_census.open_soma()

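# Download a tissue-specific reference atlas.
# is_primary_data == True keeps a single copy of cells that appear in several datasets.
kidney_atlas = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "tissue_general == 'kidney' and disease == 'normal' "
        "and is_primary_data == True"
    ),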
    column_names={
        "obs": ["cell_type", "tissue", "donor_id", "dataset_id",
                "assay", "sex", "development_stage"],
    }
)

# Standard analysis pipeline on downloaded data
sc.pp.filter_cells(kidney_atlas, min_genes=200)
sc.pp.normalize_total(kidney_atlas, target_sum=1e4)
sc.pp.log1p(kidney_atlas)
sc.pp.highly_variable_genes(kidney_atlas, n_top_genes=3000)
sc.pp.pca(kidney_atlas)
sc.pp.neighbors(kidney_atlas)
sc.tl.umap(kidney_atlas)
sc.tl.leiden(kidney_atlas, resolution=0.5)

print(f"Atlas: {kidney_atlas.n_obs} cells, "
      f"{kidney_atlas.obs['cell_type'].nunique()} cell types")

census.close()

Configuration

Parameter	Description	Default
`census_version`	Specific Census release (passed to `open_soma()`)	`"stable"`
`organism`	Homo sapiens or Mus musculus	Required
`obs_value_filter`	Cell-level filter expression	None (all cells)
`var_value_filter`	Gene-level filter expression	None (all genes)
`column_names`	Metadata columns to include (recent releases also accept `obs_column_names` and `var_column_names`)	All columns
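
One way to keep these parameters together is a small settings object that a script builds once and reuses. This is a sketch only: CensusQuery and its two methods are hypothetical names, and the release tag "2023-12-15" is used purely as an example value. It relies on census_version belonging to open_soma() while the remaining parameters belong to get_anndata().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CensusQuery:
    """Hypothetical container for the parameters in the table above."""

    organism: str                            # required
    census_version: str                      # set explicitly, never implied
    obs_value_filter: Optional[str] = None   # None selects all cells
    var_value_filter: Optional[str] = None   # None selects all genes
    column_names: Optional[dict] = None      # None returns all columns

    def open_kwargs(self):
        """Keyword arguments intended for open_soma()."""
        return {"census_version": self.census_version}

    def anndata_kwargs(self):
        """Keyword arguments intended for get_anndata(); unset options are dropped."""
        candidates = {
            "organism": self.organism,
            "obs_value_filter": self.obs_value_filter,
            "var_value_filter": self.var_value_filter,
            "column_names": self.column_names,
        }
        return {key: value for key, value in candidates.items() if value is not None}


query = CensusQuery(
    organism="Homo sapiens",
    census_version="2023-12-15",  # example tag, an assumption for illustration
    obs_value_filter="tissue_general == 'heart'",
)
print(query.open_kwargs())
print(query.anndata_kwargs())
```

With a connection opened via cellxgene_census.open_soma(**query.open_kwargs()), the query itself would then be cellxgene_census.get_anndata(census, **query.anndata_kwargs()).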

Best Practices

Start with metadata queries before downloading expression data. Use get_obs() to explore available tissues, cell types, and diseases before downloading the full expression matrix. This prevents downloading gigabytes of data you don't need.
Filter to specific genes when possible. If you only need a handful of marker genes, use var_value_filter to retrieve only those genes. This dramatically reduces download size and memory usage compared to fetching the full transcriptome.
Account for batch effects across datasets. CellxGene Census aggregates data from many studies with different protocols. Use batch correction methods (Harmony, scVI, BBKNN) when combining cells from multiple dataset_id values. Without correction, clusters often reflect the dataset of origin rather than biology.
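Use the census_version parameter for reproducibility. Pin a specific Census release in your analysis scripts. The default "stable" alias moves to each new long-term-supported release, and "latest" follows the most recent build, so results obtained through either alias can change as datasets are added. For publications, always specify the exact release tag (a date string such as "2023-12-15").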
Close the Census connection when done. Always call census.close() after your queries complete, or open the Census in a with block so it closes automatically. The Census uses the SOMA (Stack of Matrices, Annotated) format, which maintains active connections. Unclosed connections may leak resources.
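
Three of these practices (metadata first, a pinned release, a connection that always closes) fit together in a few lines. The sketch below is illustrative: is_pinned and summarize_tissue are hypothetical helpers, and the check assumes release tags are date strings of the form YYYY-MM-DD.

```python
import re

# Dated tags name one fixed release; aliases such as "stable" move over time.
# Assumption: release tags follow the YYYY-MM-DD pattern.
_RELEASE_TAG = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_pinned(census_version):
    """Return True for a dated release tag, False for a moving alias."""
    return _RELEASE_TAG.fullmatch(census_version) is not None


def summarize_tissue(census_version, tissue, organism="Homo sapiens"):
    """Count cells per cell type for one tissue, using metadata only."""
    if not is_pinned(census_version):
        raise ValueError(f"{census_version!r} is a moving alias; pin a dated release")
    # Imported here so the version check works without the package installed.
    import cellxgene_census

    # The with block closes the connection even if the query raises.
    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = cellxgene_census.get_obs(
            census,
            organism=organism,
            value_filter=f"tissue_general == {tissue!r}",
            column_names=["cell_type", "disease", "dataset_id"],
        )
    return obs["cell_type"].value_counts()


print(is_pinned("2023-12-15"), is_pinned("stable"))
```

Calling summarize_tissue("2023-12-15", "heart") would return the per-cell-type counts without transferring any expression values.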

Common Issues

Download is extremely slow or runs out of memory. The query probably matches too many cells. Add more specific filters (tissue, cell type, disease, or assay) to reduce the query size. For large-scale analyses, use the TileDB-SOMA API directly, which supports out-of-core processing.
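
Before committing to a large download, it can help to estimate how big the expression matrix will be from metadata alone. The sketch below rests on several assumptions: that the Census obs table exposes a per-cell nnz column (the number of non-zero genes), that the matrix arrives as a CSR sparse matrix with 4-byte values and 4-byte indices, and that metadata overhead is negligible. Both function names are hypothetical.

```python
def estimate_csr_bytes(n_cells, total_nnz, value_bytes=4, index_bytes=4):
    """Rough size of a CSR matrix: values + column indices + row pointers."""
    return total_nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes


def estimate_query_bytes(census, obs_filter, organism="Homo sapiens"):
    """Fetch only the nnz column for the matching cells and estimate the size of X.

    Assumption: the obs table has an `nnz` column; check your release's schema.
    """
    # Imported here so the arithmetic above works without the package installed.
    import cellxgene_census

    obs = cellxgene_census.get_obs(
        census,
        organism=organism,
        value_filter=obs_filter,
        column_names=["nnz"],
    )
    return estimate_csr_bytes(len(obs), int(obs["nnz"].sum()))


# Toy numbers: 100,000 cells averaging 2,000 detected genes each.
size = estimate_csr_bytes(n_cells=100_000, total_nnz=200_000_000)
print(f"{size / 1e9:.2f} GB")
```

If the estimate approaches available memory, tighten the filters or restrict the genes with var_value_filter before calling get_anndata().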

Cell type annotations don't match expected labels. CellxGene uses a controlled vocabulary (Cell Ontology) that may differ from labels used in individual papers. Check the available cell types with get_obs() first and use the exact ontology terms in your filters.
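
A quick way to find the exact label is to list the distinct cell_type values for a tissue and search them for a keyword. This is a minimal sketch; match_labels and find_cell_types are hypothetical helpers, and the candidate labels in the usage lines are illustrative rather than fetched from the Census.

```python
def match_labels(labels, keyword):
    """Return the distinct labels that contain `keyword`, ignoring case."""
    needle = keyword.casefold()
    return sorted({label for label in labels if needle in label.casefold()})


def find_cell_types(census, keyword, tissue, organism="Homo sapiens"):
    """List the cell_type labels present in one tissue that match a keyword."""
    # Imported here so match_labels works without the package installed.
    import cellxgene_census

    obs = cellxgene_census.get_obs(
        census,
        organism=organism,
        value_filter=f"tissue_general == {tissue!r}",
        column_names=["cell_type"],
    )
    return match_labels(obs["cell_type"].unique(), keyword)


# Illustrative labels only; real values come from get_obs().
candidates = ["T cell", "CD4-positive, alpha-beta T cell", "B cell", "regulatory T cell"]
print(match_labels(candidates, "t cell"))
```

Copy the matching label into the filter exactly as printed, including capitalization and punctuation.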

Expression values seem unnormalized. CellxGene Census provides raw counts by default, not normalized values. Apply your own normalization pipeline (e.g., sc.pp.normalize_total() + sc.pp.log1p()) after downloading. This is by design — different analyses require different normalization strategies.
