Ultimate CellxGene Framework

Production-ready skill for querying single-cell data through the CellxGene Census. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for working with CellxGene (Cell by Gene) — the interactive single-cell data explorer and data portal maintained by the Chan Zuckerberg Initiative. Ultimate CellxGene Framework helps you discover, download, and analyze published single-cell datasets through the CellxGene Census API and Discover portal.

When to Use This Skill

Choose Ultimate CellxGene Framework when:

  • Searching the CellxGene Discover portal for published single-cell datasets
  • Downloading annotated scRNA-seq data via the CellxGene Census API
  • Querying across hundreds of datasets for specific cell types or genes
  • Building reference atlases from curated single-cell data

Consider alternatives when:

  • You need to analyze your own unpublished data (use Scanpy directly)
  • You need bulk RNA-seq data (use GEO or ArrayExpress)
  • You want the CellxGene visualization tool locally (use cellxgene launch)
  • You need spatial transcriptomics data (use Squidpy or SpatialData)

Quick Start

claude "Download human lung T cell data from CellxGene Census"

    import cellxgene_census

    # Open the Census (defaults to the most recent stable release)
    census = cellxgene_census.open_soma()

    # Query for human lung T cells
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=(
            "tissue_general == 'lung' and "
            "cell_type == 'T cell'"
        ),
        column_names={
            "obs": ["cell_type", "tissue", "disease", "donor_id", "dataset_id", "assay"],
            "var": ["feature_name", "feature_id"],
        },
    )

    print(f"Cells: {adata.n_obs}")
    print(f"Genes: {adata.n_vars}")
    print(f"Datasets: {adata.obs['dataset_id'].nunique()}")
    print(f"Cell types: {adata.obs['cell_type'].value_counts()}")

    census.close()

Core Concepts

CellxGene Census API

Function                Purpose                          Returns
open_soma()             Connect to Census database       Census connection
get_anndata()           Query and download data          AnnData object
get_obs()               Query cell metadata only         DataFrame
get_var()               Query gene metadata only         DataFrame
get_presence_matrix()   Which genes in which datasets    Sparse matrix

Query Filters

    import cellxgene_census

    census = cellxgene_census.open_soma()

    # Filter by tissue and disease
    disease_cells = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=(
            "disease != 'normal' and "
            "tissue_general == 'brain'"
        ),
    )

    # Filter by specific cell type and assay
    # (the apostrophe in the assay name is escaped inside the filter string)
    specific = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=(
            "cell_type == 'hepatocyte' and "
            "assay == '10x 3\\' v3'"
        ),
        var_value_filter="feature_name in ['ALB', 'AFP', 'HNF4A', 'CYP3A4']",
    )

    # Get metadata without expression data (fast)
    obs_df = cellxgene_census.get_obs(
        census,
        organism="Homo sapiens",
        value_filter="tissue_general == 'heart'",
        column_names=["cell_type", "disease", "donor_id"],
    )
    print(obs_df["cell_type"].value_counts())

    census.close()
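Filter strings like the ones above are easy to get wrong once values contain apostrophes or come from variables. The following is a minimal sketch of a string builder, not part of the cellxgene_census package: `quote_value` and `build_filter` are hypothetical helper names, and the sketch assumes that filters only need `==`, `!=`, `in [...]`, and `and`, which are the forms used in this section.

```python
def quote_value(value):
    """Render a Python value as a literal for a Census value filter."""
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes (e.g. "10x 3' v3").
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_filter(equals=None, not_equals=None, any_of=None):
    """Join simple conditions with `and` into one filter expression."""
    clauses = []
    for column, value in (equals or {}).items():
        clauses.append(f"{column} == {quote_value(value)}")
    for column, value in (not_equals or {}).items():
        clauses.append(f"{column} != {quote_value(value)}")
    for column, values in (any_of or {}).items():
        rendered = ", ".join(quote_value(v) for v in values)
        clauses.append(f"{column} in [{rendered}]")
    return " and ".join(clauses)


# Rebuild the hepatocyte query filters from the example above.
obs_filter = build_filter(equals={"cell_type": "hepatocyte", "assay": "10x 3' v3"})
var_filter = build_filter(any_of={"feature_name": ["ALB", "AFP", "HNF4A", "CYP3A4"]})
print(obs_filter)
print(var_filter)
```

The resulting strings can be passed as `obs_value_filter` and `var_value_filter`. Keeping the quoting in one place avoids hand-escaping assay names in every query.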

Building Reference Atlases

    import cellxgene_census
    import scanpy as sc

    census = cellxgene_census.open_soma()

    # Download a tissue-specific reference atlas
    kidney_atlas = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter="tissue_general == 'kidney' and disease == 'normal'",
        column_names={
            "obs": ["cell_type", "tissue", "donor_id", "dataset_id",
                    "assay", "sex", "development_stage"],
        },
    )

    # Standard analysis pipeline on downloaded data
    sc.pp.filter_cells(kidney_atlas, min_genes=200)
    sc.pp.normalize_total(kidney_atlas, target_sum=1e4)
    sc.pp.log1p(kidney_atlas)
    sc.pp.highly_variable_genes(kidney_atlas, n_top_genes=3000)
    sc.pp.pca(kidney_atlas)
    sc.pp.neighbors(kidney_atlas)
    sc.tl.umap(kidney_atlas)
    sc.tl.leiden(kidney_atlas, resolution=0.5)

    print(f"Atlas: {kidney_atlas.n_obs} cells, "
          f"{kidney_atlas.obs['cell_type'].nunique()} cell types")

    census.close()

Configuration

Parameter          Description                      Default
census_version     Specific Census release          "stable"
organism           Homo sapiens or Mus musculus     Required
obs_value_filter   Cell-level filter expression     None (all cells)
var_value_filter   Gene-level filter expression     None (all genes)
column_names       Metadata columns to include      All columns

Best Practices

  1. Start with metadata queries before downloading expression data. Use get_obs() to explore available tissues, cell types, and diseases before downloading the full expression matrix. This prevents downloading gigabytes of data you don't need.

  2. Filter to specific genes when possible. If you only need a handful of marker genes, use var_value_filter to retrieve only those genes. This dramatically reduces download size and memory usage compared to fetching the full transcriptome.

  3. Account for batch effects across datasets. CellxGene Census aggregates data from many studies with different protocols. Use batch correction methods (Harmony, scVI, BBKNN) when combining cells from multiple dataset_id values. Combining datasets without correction can produce clusters that reflect study or protocol rather than biology.

  4. Use the Census version parameter for reproducibility. Pin a specific Census release in your analysis scripts by passing a dated release tag as census_version to open_soma(). The "stable" and "latest" aliases move as new releases and builds are published, which can change your results. For publications, always specify the exact version.

  5. Close the Census connection when done. Always call census.close() after your queries complete. The Census uses SOMA (Stack of Matrices, Annotated) format which maintains active connections. Unclosed connections may leak resources.
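Practices 4 and 5 can be combined in one small wrapper. This is a sketch under stated assumptions, not part of the cellxgene_census package: `is_pinned` and `pinned_census` are hypothetical names, the `opener` argument exists only so the wrapper can be exercised offline, and pinned releases are assumed to be named by date (YYYY-MM-DD). The only library call used is `cellxgene_census.open_soma(census_version=...)`.

```python
import re
from contextlib import contextmanager

# Assumption: pinned releases are named by date (YYYY-MM-DD), while
# "stable" and "latest" are moving aliases.
_DATED_TAG = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_pinned(census_version):
    """Return True when the tag names one fixed, dated release."""
    return bool(_DATED_TAG.match(census_version))


@contextmanager
def pinned_census(census_version, opener=None):
    """Open a pinned Census release and close it on exit, even on error.

    `opener` is a hypothetical hook for offline testing; by default the
    real cellxgene_census.open_soma is imported lazily and used.
    """
    if not is_pinned(census_version):
        raise ValueError(f"{census_version!r} is a moving alias; pin a dated release")
    if opener is None:
        import cellxgene_census  # imported lazily so the helper loads without it
        opener = cellxgene_census.open_soma
    census = opener(census_version=census_version)
    try:
        yield census
    finally:
        census.close()
```

A script would then wrap its queries in `with pinned_census("<release tag>") as census:` and run them inside the block. The object returned by `open_soma()` can also be used directly in a `with` statement when no version check is needed.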

Common Issues

Download is extremely slow or runs out of memory. The query probably matches too many cells. Add more specific filters (tissue, cell type, disease, or assay) to reduce the query size. For large-scale analyses, use the TileDB-SOMA API directly, which supports out-of-core processing.
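Before downloading, it can help to estimate how much memory the expression matrix will need. The sketch below is a back-of-envelope calculation under stated assumptions: the matrix is held in CSR sparse form, the values are 4-byte floats, and the indices are 4-byte integers. `estimate_csr_bytes` is a hypothetical helper, and `mean_genes_per_cell` is a guess you supply; the cell count can come from the length of a `get_obs()` result.

```python
def estimate_csr_bytes(n_cells, mean_genes_per_cell, value_bytes=4, index_bytes=4):
    """Rough size of a CSR matrix with one row per cell.

    Assumes one stored value and one column index per non-zero entry,
    plus a row-pointer array of length n_cells + 1.
    """
    nnz = n_cells * mean_genes_per_cell
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes


def format_gb(n_bytes):
    """Format a byte count in decimal gigabytes."""
    return f"{n_bytes / 1e9:.1f} GB"


# One million cells with about 2,000 detected genes each (illustrative numbers).
# Very large matrices may need 8-byte indices, which raises the estimate.
print(format_gb(estimate_csr_bytes(1_000_000, 2_000)))
```

If the estimate is close to or above the available RAM, narrow the filters or restrict the genes before calling `get_anndata()`.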

Cell type annotations don't match expected labels. CellxGene uses a controlled vocabulary (Cell Ontology) that may differ from labels used in individual papers. Check the available cell types with get_obs() first and use the exact ontology terms in your filters.
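One way to catch label mismatches early is to compare the labels you plan to filter on against the labels actually present in the metadata. The sketch below uses only the standard library: `check_labels` is a hypothetical helper, and the `available` list stands in for the unique `cell_type` values you would collect from a `get_obs()` result.

```python
import difflib


def check_labels(requested, available, n=3, cutoff=0.6):
    """Map each label not found in `available` to its closest matches."""
    candidates = sorted(set(available))
    known = set(candidates)
    return {
        label: difflib.get_close_matches(label, candidates, n=n, cutoff=cutoff)
        for label in requested
        if label not in known
    }


# Illustrative values; in practice, take these from obs_df["cell_type"].unique().
available = ["T cell", "B cell", "hepatocyte", "natural killer cell"]
problems = check_labels(["T-cell", "Hepatocyte", "B cell"], available)
for label, suggestions in problems.items():
    print(f"{label!r} not found; closest labels: {suggestions}")
```

String similarity only catches spelling and casing differences. Labels that differ in wording still need to be looked up in the Cell Ontology.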

Expression values seem unnormalized. CellxGene Census provides raw counts by default, not normalized values. Apply your own normalization pipeline (e.g., sc.pp.normalize_total() + sc.pp.log1p()) after downloading. This is by design — different analyses require different normalization strategies.
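To make the two normalization steps concrete, here is a toy re-implementation for a single cell in plain Python. It is an illustration of the arithmetic only, not Scanpy's code: `normalize_and_log` is a hypothetical helper that scales the counts so they sum to `target_sum` and then applies the natural log of one plus the value.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros.
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]


# A cell with raw counts 1, 3, and 6 for three genes (total 10).
# After scaling, the values are 1000, 3000, and 6000 before the log step.
cell = [1, 3, 6]
print(normalize_and_log(cell))
```

Because the scaling depends on each cell's total count, the result changes if genes are filtered out before normalization, so the order of steps matters.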
