Ultimate CellxGene Framework
Production-ready skill for querying single-cell data through the CellxGene Census. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
A scientific computing skill for working with CellxGene (Cell by Gene) — the interactive single-cell data explorer and data portal maintained by the Chan Zuckerberg Initiative. Ultimate CellxGene Framework helps you discover, download, and analyze published single-cell datasets through the CellxGene Census API and Discover portal.
When to Use This Skill
Choose Ultimate CellxGene Framework when:
- Searching the CellxGene Discover portal for published single-cell datasets
- Downloading annotated scRNA-seq data via the CellxGene Census API
- Querying across hundreds of datasets for specific cell types or genes
- Building reference atlases from curated single-cell data
Consider alternatives when:
- You need to analyze your own unpublished data (use Scanpy directly)
- You need bulk RNA-seq data (use GEO or ArrayExpress)
- You want the CellxGene visualization tool locally (use `cellxgene launch`)
- You need spatial transcriptomics data (use Squidpy or SpatialData)
Quick Start
claude "Download human lung T cell data from CellxGene Census"
```python
import cellxgene_census

# Open the Census (default: the current stable release)
census = cellxgene_census.open_soma()

# Query for human lung T cells
adata = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "tissue_general == 'lung' and "
        "cell_type == 'T cell'"
    ),
    column_names={
        "obs": ["cell_type", "tissue", "disease", "donor_id", "dataset_id", "assay"],
        "var": ["feature_name", "feature_id"],
    },
)

print(f"Cells: {adata.n_obs}")
print(f"Genes: {adata.n_vars}")
print(f"Datasets: {adata.obs['dataset_id'].nunique()}")
print(f"Cell types: {adata.obs['cell_type'].value_counts()}")

census.close()
```
Core Concepts
CellxGene Census API
| Function | Purpose | Returns |
|---|---|---|
| `open_soma()` | Connect to the Census database | Census connection |
| `get_anndata()` | Query and download data | AnnData object |
| `get_obs()` | Query cell metadata only | DataFrame |
| `get_var()` | Query gene metadata only | DataFrame |
| `get_presence_matrix()` | Which genes are present in which datasets | Sparse matrix |
Query Filters
```python
import cellxgene_census

census = cellxgene_census.open_soma()

# Filter by tissue and disease
disease_cells = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "disease != 'normal' and "
        "tissue_general == 'brain'"
    ),
)

# Filter by specific cell type and assay
specific = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter=(
        "cell_type == 'hepatocyte' and "
        "assay == '10x 3\\' v3'"
    ),
    var_value_filter="feature_name in ['ALB', 'AFP', 'HNF4A', 'CYP3A4']",
)

# Get metadata without expression data (fast)
obs_df = cellxgene_census.get_obs(
    census,
    organism="Homo sapiens",
    value_filter="tissue_general == 'heart'",
    column_names=["cell_type", "disease", "donor_id"],
)
print(obs_df["cell_type"].value_counts())

census.close()
```
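Filter strings like the ones above are easy to get wrong when a value contains a quote (for example the assay name `10x 3' v3`). The following is a minimal sketch of a helper that assembles such strings; `build_value_filter` is a hypothetical name, not part of `cellxgene_census`, and it only covers equality, list membership, and boolean columns joined with `and`.

```python
def build_value_filter(**conditions):
    """Compose a value filter string from keyword conditions (hypothetical helper).

    A string becomes `column == 'value'`, a list or tuple becomes
    `column in ['a', 'b']`, and a bool becomes `column == True/False`.
    All clauses are joined with `and`.
    """
    def quote(value):
        # Escape backslashes and single quotes so values such as "10x 3' v3" stay valid.
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"

    clauses = []
    for column, value in conditions.items():
        if isinstance(value, bool):
            clauses.append(f"{column} == {value}")
        elif isinstance(value, (list, tuple)):
            items = ", ".join(quote(v) for v in value)
            clauses.append(f"{column} in [{items}]")
        else:
            clauses.append(f"{column} == {quote(value)}")
    return " and ".join(clauses)
```

The result can be passed as `obs_value_filter`, `var_value_filter`, or `value_filter`, for instance `build_value_filter(cell_type="hepatocyte", assay="10x 3' v3")`. Conditions that need `!=` or `or` still have to be written by hand.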
Building Reference Atlases
```python
import cellxgene_census
import scanpy as sc

census = cellxgene_census.open_soma()

# Download a tissue-specific reference atlas
kidney_atlas = cellxgene_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general == 'kidney' and disease == 'normal'",
    column_names={
        "obs": ["cell_type", "tissue", "donor_id", "dataset_id",
                "assay", "sex", "development_stage"],
    },
)

# Standard analysis pipeline on downloaded data
sc.pp.filter_cells(kidney_atlas, min_genes=200)
sc.pp.normalize_total(kidney_atlas, target_sum=1e4)
sc.pp.log1p(kidney_atlas)
sc.pp.highly_variable_genes(kidney_atlas, n_top_genes=3000)
sc.pp.pca(kidney_atlas)
sc.pp.neighbors(kidney_atlas)
sc.tl.umap(kidney_atlas)
sc.tl.leiden(kidney_atlas, resolution=0.5)

print(f"Atlas: {kidney_atlas.n_obs} cells, "
      f"{kidney_atlas.obs['cell_type'].nunique()} cell types")

census.close()
```
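The pipeline above clusters cells from many datasets without any batch correction. One possible follow-up, sketched here under the assumption that Harmony is an acceptable method for your data and that the `harmonypy` package is installed, is to correct the PCA coordinates by `dataset_id` and rebuild the neighbor graph from the corrected embedding. `integrate_by_dataset` is a hypothetical wrapper name; the Scanpy calls inside it are the library's own.

```python
def integrate_by_dataset(adata, batch_key="dataset_id", resolution=0.5):
    """Correct PCA coordinates for batch, then recompute graph, UMAP, and clusters."""
    import scanpy as sc  # lazy import; Harmony also needs the `harmonypy` package

    # Expects adata.obsm["X_pca"] to exist, e.g. from sc.pp.pca(adata).
    sc.external.pp.harmony_integrate(adata, key=batch_key)
    # Harmony stores the corrected coordinates in adata.obsm["X_pca_harmony"].
    sc.pp.neighbors(adata, use_rep="X_pca_harmony")
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=resolution)
    return adata
```

Calling `integrate_by_dataset(kidney_atlas)` after `sc.pp.pca(kidney_atlas)` would replace the last three steps of the pipeline. scVI or BBKNN could be swapped in with the same structure.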
Configuration
| Parameter | Description | Default |
|---|---|---|
| `census_version` | Specific Census release (argument to `open_soma()`) | `"stable"` |
| `organism` | `"Homo sapiens"` or `"Mus musculus"` | Required |
| `obs_value_filter` | Cell-level filter expression | None (all cells) |
| `var_value_filter` | Gene-level filter expression | None (all genes) |
| `column_names` | Metadata columns to include | All columns |
Best Practices
- Start with metadata queries before downloading expression data. Use `get_obs()` to explore available tissues, cell types, and diseases before downloading the full expression matrix. This prevents downloading gigabytes of data you don't need.
- Filter to specific genes when possible. If you only need a handful of marker genes, use `var_value_filter` to retrieve only those genes. This dramatically reduces download size and memory usage compared to fetching the full transcriptome.
- Account for batch effects across datasets. CellxGene Census aggregates data from many studies with different protocols. Use batch correction methods (Harmony, scVI, BBKNN) when combining cells from multiple `dataset_id` values. Without correction, clusters tend to reflect the dataset of origin rather than biology.
- Use the Census version parameter for reproducibility. Pin a specific Census release version in your analysis scripts. The `"stable"` and `"latest"` aliases move to newer releases as datasets are added, which can change your results. For publications, always specify the exact version.
- Close the Census connection when done. Always call `census.close()` after your queries complete. The Census uses the SOMA (Stack of Matrices, Annotated) format, which maintains active connections. Unclosed connections may leak resources.
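Several of these practices can be combined in one small function. The sketch below pins a release, queries metadata only, and uses a `with` block so the connection is closed even if the query fails. `summarize_cell_types` is a hypothetical name, and the version string `"2023-12-15"` is only an example; `cellxgene_census.get_census_version_directory()` lists the releases that are actually available.

```python
def summarize_cell_types(value_filter, census_version="2023-12-15",
                         organism="Homo sapiens"):
    """Return cell counts per cell type without downloading expression data."""
    import cellxgene_census  # lazy import so this module loads without the package

    # The `with` block closes the Census handle even if the query raises.
    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs_df = cellxgene_census.get_obs(
            census,
            organism,
            value_filter=value_filter,
            column_names=["cell_type"],
        )
    return obs_df["cell_type"].value_counts()
```

A call such as `summarize_cell_types("tissue_general == 'heart'")` shows which exact `cell_type` labels exist, and how many cells each has, before any expression matrix is requested.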
Common Issues
Download is extremely slow or runs out of memory. The query is probably matching too many cells. Add more specific filters (tissue, cell type, disease, or assay) to reduce the query size. For large-scale analyses, use the TileDB-SOMA API directly, which supports out-of-core processing.
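As a rough illustration of out-of-core access, the sketch below tallies cell types one chunk at a time through the SOMA `obs` dataframe instead of loading every row at once. It assumes `census` is an open handle from `cellxgene_census.open_soma()` and that iterating over `obs.read(...)` yields Arrow tables; `count_cell_types_chunked` is a hypothetical helper name.

```python
from collections import Counter


def count_cell_types_chunked(census, value_filter, organism_key="homo_sapiens"):
    """Tally cell types one Arrow table at a time instead of loading all rows."""
    obs = census["census_data"][organism_key].obs
    counts = Counter()
    # Iterating over read() processes the result in chunks, so only one
    # chunk of metadata is held in memory at a time.
    for table in obs.read(value_filter=value_filter, column_names=["cell_type"]):
        counts.update(table.column("cell_type").to_pylist())
    return counts
```

The same loop shape applies to any per-chunk reduction, such as counting donors or summing cells per dataset.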
Cell type annotations don't match expected labels. CellxGene uses a controlled vocabulary (Cell Ontology) that may differ from labels used in individual papers. Check the available cell types with get_obs() first and use the exact ontology terms in your filters.
Expression values seem unnormalized. CellxGene Census provides raw counts by default, not normalized values. Apply your own normalization pipeline (e.g., sc.pp.normalize_total() + sc.pp.log1p()) after downloading. This is by design — different analyses require different normalization strategies.