Pro Gene Workspace

A scientific computing skill for gene-centric bioinformatics analysis: retrieving gene information, annotations, expression data, and functional characterizations from major genomics databases. Pro Gene Workspace provides a unified workflow for investigating gene function across NCBI Gene, Ensembl, UniProt, and Gene Ontology.

When to Use This Skill

Choose Pro Gene Workspace when:

Looking up comprehensive gene information across databases
Retrieving gene annotations, GO terms, and pathway memberships
Analyzing gene expression patterns across tissues and conditions
Building gene-centric reports for research or clinical interpretation

Consider alternatives when:

You need variant-level data (use ClinVar, gnomAD)
You need protein structure data (use PDB, AlphaFold)
You need single-cell expression data (use CellxGene)
You need genome-wide analysis (use Ensembl BioMart for bulk queries)

Quick Start


claude "Get comprehensive information about the TP53 gene"


from Bio import Entrez
import requests

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a real contact address

# NCBI Gene search
handle = Entrez.esearch(db="gene", term="TP53[gene] AND Homo sapiens[orgn]")
results = Entrez.read(handle)
gene_id = results["IdList"][0]

# Get gene summary
handle = Entrez.efetch(db="gene", id=gene_id, rettype="gene_table", retmode="text")
gene_info = handle.read()
print(gene_info[:500])

# UniProt annotations
uniprot_url = "https://rest.uniprot.org/uniprotkb/search"
response = requests.get(uniprot_url, params={
    "query": "gene:TP53 AND organism_id:9606 AND reviewed:true",
    "format": "json",
    "fields": "accession,gene_names,protein_name,go_p,go_f,go_c,length"
})

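response.raise_for_status()  # stop here on an HTTP error instead of failing on the JSON lookup
protein = response.json()["results"][0]  # first reviewed (Swiss-Prot) hit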
print(f"\nProtein: {protein['proteinDescription']['recommendedName']['fullName']['value']}")
print(f"UniProt: {protein['primaryAccession']}")
print(f"Length: {protein['sequence']['length']} aa")
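The Quick Start request asks for the GO fields (`go_p`, `go_f`, `go_c`) but never reads them. The sketch below shows one way to pull GO terms out of a single UniProt JSON entry. It assumes GO annotations are listed under `uniProtKBCrossReferences` with a `GoTerm` property whose value is prefixed by the aspect letter (`P:`, `F:`, `C:`); check that layout against a live response before relying on it. The helper `extract_go_terms` and the sample entry are illustrative and not part of any library.

```python
def extract_go_terms(entry):
    """Return (go_id, aspect, term_name) tuples from one UniProt JSON entry.

    Assumes GO annotations appear as cross-references with database "GO"
    and a "GoTerm" property such as "P:apoptotic process".
    """
    terms = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "GO":
            continue  # skip cross-references to other databases
        props = {p["key"]: p["value"] for p in xref.get("properties", [])}
        # "P:apoptotic process" -> aspect "P", name "apoptotic process"
        aspect, _, name = props.get("GoTerm", "").partition(":")
        terms.append((xref["id"], aspect, name))
    return terms


# Hand-built stand-in for one element of response.json()["results"]
sample_entry = {
    "uniProtKBCrossReferences": [
        {
            "database": "GO",
            "id": "GO:0006915",
            "properties": [{"key": "GoTerm", "value": "P:apoptotic process"}],
        },
        {"database": "PDB", "id": "EXAMPLE", "properties": []},
    ]
}

for go_id, aspect, name in extract_go_terms(sample_entry):
    print(f"{go_id} [{aspect}] {name}")
```

With a live response, the same call would be `extract_go_terms(protein)`.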

Core Concepts

Gene Information Sources

Database	Focus	Key Data
NCBI Gene	Gene-centric aggregation	Summary, references, homologs
Ensembl	Genomic annotations	Coordinates, transcripts, regulation
UniProt	Protein annotations	Function, GO terms, domains
Gene Ontology	Functional classification	Molecular function, process, component
GTEx	Expression across tissues	TPM values, eQTLs
OMIM	Disease associations	Phenotype-gene relationships

Multi-Database Gene Report


def gene_report(gene_symbol, species="Homo sapiens"):
    """Compile gene information from multiple databases"""
    report = {"symbol": gene_symbol}

    # NCBI Gene
    handle = Entrez.esearch(db="gene",
                           term=f"{gene_symbol}[gene] AND {species}[orgn]")
    results = Entrez.read(handle)
    if results["IdList"]:
        report["ncbi_gene_id"] = results["IdList"][0]

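    # Ensembl (REST paths use lowercase, underscore-separated species names, e.g. "homo_sapiens")
    ens_species = species.lower().replace(" ", "_")
    ens_resp = requests.get(
        f"https://rest.ensembl.org/lookup/symbol/{ens_species}/{gene_symbol}",
        headers={"Content-Type": "application/json"}
    )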
    if ens_resp.ok:
        ens = ens_resp.json()
        report["ensembl_id"] = ens["id"]
        report["location"] = f"chr{ens['seq_region_name']}:{ens['start']}-{ens['end']}"
        report["biotype"] = ens["biotype"]

    # UniProt (this query is restricted to human, taxon 9606; change organism_id for other species)
    up_resp = requests.get(
        "https://rest.uniprot.org/uniprotkb/search",
        params={
            "query": f"gene:{gene_symbol} AND organism_id:9606 AND reviewed:true",
            "format": "json",
            "fields": "accession,protein_name,go_p,length"
        }
    )
    if up_resp.ok:
        up_results = up_resp.json().get("results", [])  # parse the body once
        if up_results:
            up = up_results[0]  # first reviewed (Swiss-Prot) hit
            report["uniprot_id"] = up["primaryAccession"]
            report["protein_length"] = up["sequence"]["length"]

    return report

tp53 = gene_report("TP53")
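Each lookup in `gene_report` is optional, so the returned dict can lack keys when a database has no match or a request fails. A minimal sketch of rendering such a dict as text while marking missing fields follows; `format_report` and its labels are hypothetical and only the key names come from the function above.

```python
def format_report(report):
    """Render a gene_report()-style dict as text, marking missing fields."""
    fields = [
        ("symbol", "Symbol"),
        ("ncbi_gene_id", "NCBI Gene ID"),
        ("ensembl_id", "Ensembl ID"),
        ("location", "Location"),
        ("biotype", "Biotype"),
        ("uniprot_id", "UniProt accession"),
        ("protein_length", "Protein length (aa)"),
    ]
    lines = []
    for key, label in fields:
        # Fall back to a visible marker rather than raising KeyError
        lines.append(f"{label}: {report.get(key, 'not found')}")
    return "\n".join(lines)


# A partial report, as might result when only some lookups succeed
partial = {"symbol": "TP53", "biotype": "protein_coding"}
print(format_report(partial))
```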

Gene Ontology Analysis


# GO term enrichment using goatools
from goatools.go_enrichment import GOEnrichmentStudy
from goatools.obo_parser import GODag

# Load GO DAG
obo_dag = GODag("go-basic.obo")

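# Run enrichment analysis
# Note: all_genes and gene2go are not defined here and must be supplied by the caller.
gene_list = ["TP53", "BRCA1", "MDM2", "CDKN2A", "RB1"]  # Study genes (short illustrative list)
background = all_genes  # Background population; prefer all expressed genes over the whole genome

goe = GOEnrichmentStudy(
    background,
    gene2go,  # Gene-to-GO mapping: gene ID -> set of GO IDs, using the same IDs as gene_list
    obo_dag,
    methods=["fdr_bh"]
)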

results = goe.run_study(gene_list)
significant = [r for r in results if r.p_fdr_bh < 0.05]
for r in sorted(significant, key=lambda x: x.p_fdr_bh)[:10]:
    print(f"{r.GO}: {r.name} (FDR={r.p_fdr_bh:.4f})")
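The enrichment snippet uses `gene2go` and `all_genes` without showing where they come from. As a rough sketch, the association can be a dict mapping each gene identifier to a set of GO IDs, and the population a collection of gene identifiers. The example below builds both from tab-separated `gene<TAB>GO ID` rows; that input layout, the helper name `build_gene2go`, and the rows themselves are assumptions for illustration, and the pairings carry no biological meaning.

```python
from collections import defaultdict


def build_gene2go(lines):
    """Build {gene: set of GO IDs} from tab-separated 'gene<TAB>GO ID' lines."""
    gene2go = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        gene, go_id = line.split("\t")[:2]
        gene2go[gene].add(go_id)
    return dict(gene2go)


# Placeholder rows; a real table would come from a GO annotation source
rows = [
    "# gene\tgo_id",
    "GENE_A\tGO:0000001",
    "GENE_A\tGO:0000002",
    "GENE_B\tGO:0000001",
]
gene2go = build_gene2go(rows)
all_genes = set(gene2go)  # toy background: only the annotated genes
print(gene2go)
```

The identifiers in the study list, the background, and `gene2go` need to use the same naming scheme (all symbols, or all Entrez IDs), otherwise nothing will overlap.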

Configuration

Parameter	Description	Default
`species`	Target organism	`Homo sapiens`
`databases`	Sources to query	`[ncbi, ensembl, uniprot]`
`include_go`	Retrieve GO annotations	`true`
`include_expression`	Retrieve GTEx expression	`false`
`enrichment_method`	GO enrichment p-value correction	`fdr_bh`
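The source does not say how these parameters are supplied. One possible representation, shown purely as a sketch, is a small dataclass whose defaults mirror the table; the class name `GeneWorkspaceConfig` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneWorkspaceConfig:
    """Hypothetical container for the parameters in the table above."""
    species: str = "Homo sapiens"
    databases: List[str] = field(
        default_factory=lambda: ["ncbi", "ensembl", "uniprot"]
    )
    include_go: bool = True
    include_expression: bool = False
    enrichment_method: str = "fdr_bh"


default_cfg = GeneWorkspaceConfig()
# Override only what differs from the defaults
with_expression = GeneWorkspaceConfig(include_expression=True)
print(default_cfg)
```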

Best Practices

Cross-reference across databases. No single database has complete gene information. Combine NCBI Gene (summary, references), Ensembl (coordinates, transcripts), and UniProt (protein function, domains) for a comprehensive picture.
Use approved gene symbols. HGNC (HUGO Gene Nomenclature Committee) maintains the official naming standard for human genes. Use approved symbols to avoid confusion from aliases: "p53" might not resolve correctly, but "TP53" will.
Check gene aliases for database mismatches. The same gene may have different names or IDs across databases. Use NCBI Gene's alias list or UniProt's gene name mappings to resolve discrepancies.
Include tissue-specific context. A gene's expression level, and often its functional role, varies by tissue. Include GTEx expression data to see where the gene is active, which matters when interpreting disease associations and drug target potential.
Use GO enrichment with appropriate backgrounds. GO enrichment requires a background gene set. Use all expressed genes (not all genes in the genome) as background to avoid inflating significance. The choice of background dramatically affects results.
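The background recommendation in the last practice can be sketched as a simple expression filter. The example assumes expression is available as a mapping from gene to a list of TPM values and treats a gene as expressed if any sample reaches a cutoff; the helper name, the 1.0 TPM default, and the values are illustrative assumptions, not settings from this skill.

```python
def expressed_background(tpm_by_gene, min_tpm=1.0):
    """Return genes whose expression reaches min_tpm in at least one sample.

    tpm_by_gene maps gene -> list of TPM values; min_tpm is an assumed cutoff.
    """
    return {
        gene
        for gene, values in tpm_by_gene.items()
        if any(v >= min_tpm for v in values)
    }


# Made-up TPM values for illustration only
toy_tpm = {
    "GENE_A": [12.5, 8.0],
    "GENE_B": [0.0, 0.2],
    "GENE_C": [0.4, 3.1],
}
background = expressed_background(toy_tpm)
print(sorted(background))
```

The resulting set would be passed as the population argument in place of a whole-genome gene list.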

Common Issues

Gene symbol maps to multiple Ensembl IDs. Some gene symbols refer to readthrough transcripts or pseudogenes that have separate Ensembl IDs. Filter by biotype: protein_coding to focus on the primary gene, and verify the chromosomal location matches expectations.
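A minimal sketch of that filter, assuming the candidate records are dicts with the `id`, `biotype`, and `seq_region_name` keys used in the Ensembl lookup earlier. The function name and the sample records are made up for illustration.

```python
def select_protein_coding(records, expected_chrom=None):
    """Keep Ensembl gene records whose biotype is 'protein_coding'.

    expected_chrom, if given, further restricts hits to one chromosome.
    """
    hits = [r for r in records if r.get("biotype") == "protein_coding"]
    if expected_chrom is not None:
        hits = [
            r for r in hits
            if r.get("seq_region_name") == str(expected_chrom)
        ]
    return hits


# Made-up records standing in for several Ensembl hits on one symbol
candidates = [
    {"id": "GENE_ID_1", "biotype": "protein_coding", "seq_region_name": "17"},
    {"id": "GENE_ID_2", "biotype": "processed_pseudogene", "seq_region_name": "1"},
]
primary = select_protein_coding(candidates, expected_chrom=17)
print([r["id"] for r in primary])
```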

GO enrichment returns no significant terms. Common causes: gene list too small (<10 genes), inappropriate background set, or genes don't share functional themes. Try relaxing the FDR threshold or using a different enrichment method.
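To see why very small gene lists rarely reach significance, it helps to compute the best p-value a list can attain. The sketch below uses a one-sided hypergeometric test, a common basis for enrichment analysis, though not necessarily the exact computation goatools performs. The background size and annotation count are made-up numbers, and the function name is hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Assumed background: 10,000 genes, 1,000 of them annotated with the term.
# Best case for each list size: every study gene carries the term.
N, K = 10_000, 1_000
best_case = {n: hypergeom_enrichment_p(n, n, K, N) for n in (3, 5, 10, 20)}
for n, p in best_case.items():
    print(f"list size {n:>2}: smallest attainable p = {p:.3g}")
```

Under these assumptions a 3-gene list cannot do better than roughly p = 0.001 for this term, which can easily fail to survive correction when thousands of GO terms are tested.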

Expression data varies between databases. GTEx, Human Protein Atlas, and NCBI GEO may show different expression patterns due to different sample preparations, normalization methods, and tissue definitions. Note the data source and version when reporting expression data.
