Pro Gene Workspace

A skill for querying NCBI Gene and related gene utilities. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for gene-centric bioinformatics analysis — retrieving gene information, annotations, expression data, and functional characterizations from major genomics databases. Pro Gene Workspace provides a unified workflow for investigating gene function across NCBI Gene, Ensembl, UniProt, and Gene Ontology.

When to Use This Skill

Choose Pro Gene Workspace when:

  • Looking up comprehensive gene information across databases
  • Retrieving gene annotations, GO terms, and pathway memberships
  • Analyzing gene expression patterns across tissues and conditions
  • Building gene-centric reports for research or clinical interpretation

Consider alternatives when:

  • You need variant-level data (use ClinVar, gnomAD)
  • You need protein structure data (use PDB, AlphaFold)
  • You need single-cell expression data (use CellxGene)
  • You need genome-wide analysis (use Ensembl BioMart for bulk queries)

Quick Start

    claude "Get comprehensive information about the TP53 gene"

A Python equivalent using Biopython and requests:

    from Bio import Entrez
    import requests

    # NCBI asks for a contact address; replace this placeholder with your own
    Entrez.email = "you@example.com"

    # NCBI Gene search
    handle = Entrez.esearch(db="gene", term="TP53[gene] AND Homo sapiens[orgn]")
    results = Entrez.read(handle)
    handle.close()
    gene_id = results["IdList"][0]

    # Get gene summary
    handle = Entrez.efetch(db="gene", id=gene_id, rettype="gene_table", retmode="text")
    gene_info = handle.read()
    handle.close()
    print(gene_info[:500])

    # UniProt annotations
    uniprot_url = "https://rest.uniprot.org/uniprotkb/search"
    response = requests.get(uniprot_url, params={
        "query": "gene:TP53 AND organism_id:9606 AND reviewed:true",
        "format": "json",
        "fields": "accession,gene_names,protein_name,go_p,go_f,go_c,length"
    })
    response.raise_for_status()
    protein = response.json()["results"][0]
    print(f"\nProtein: {protein['proteinDescription']['recommendedName']['fullName']['value']}")
    print(f"UniProt: {protein['primaryAccession']}")
    print(f"Length: {protein['sequence']['length']} aa")

Core Concepts

Gene Information Sources

| Database | Focus | Key Data |
| --- | --- | --- |
| NCBI Gene | Gene-centric aggregation | Summary, references, homologs |
| Ensembl | Genomic annotations | Coordinates, transcripts, regulation |
| UniProt | Protein annotations | Function, GO terms, domains |
| Gene Ontology | Functional classification | Molecular function, process, component |
| GTEx | Expression across tissues | TPM values, eQTLs |
| OMIM | Disease associations | Phenotype-gene relationships |

Multi-Database Gene Report

    from Bio import Entrez
    import requests

    # Set Entrez.email before calling gene_report (see Quick Start)

    def gene_report(gene_symbol, species="Homo sapiens"):
        """Compile gene information from multiple databases."""
        report = {"symbol": gene_symbol}

        # NCBI Gene
        handle = Entrez.esearch(db="gene", term=f"{gene_symbol}[gene] AND {species}[orgn]")
        results = Entrez.read(handle)
        handle.close()
        if results["IdList"]:
            report["ncbi_gene_id"] = results["IdList"][0]

        # Ensembl (expects a species name such as "homo_sapiens")
        ensembl_species = species.lower().replace(" ", "_")
        ens_resp = requests.get(
            f"https://rest.ensembl.org/lookup/symbol/{ensembl_species}/{gene_symbol}",
            headers={"Content-Type": "application/json"}
        )
        if ens_resp.ok:
            ens = ens_resp.json()
            report["ensembl_id"] = ens["id"]
            report["location"] = f"chr{ens['seq_region_name']}:{ens['start']}-{ens['end']}"
            report["biotype"] = ens["biotype"]

        # UniProt (organism_id 9606 is human; change it when species is not Homo sapiens)
        up_resp = requests.get(
            "https://rest.uniprot.org/uniprotkb/search",
            params={
                "query": f"gene:{gene_symbol} AND organism_id:9606 AND reviewed:true",
                "format": "json",
                "fields": "accession,protein_name,go_p,length"
            }
        )
        if up_resp.ok:
            hits = up_resp.json()["results"]  # parse the response body once
            if hits:
                report["uniprot_id"] = hits[0]["primaryAccession"]
                report["protein_length"] = hits[0]["sequence"]["length"]

        return report

    tp53 = gene_report("TP53")

Gene Ontology Analysis

    # GO term enrichment using goatools
    from goatools.go_enrichment import GOEnrichmentStudy
    from goatools.obo_parser import GODag

    # Load GO DAG (go-basic.obo must already be downloaded)
    obo_dag = GODag("go-basic.obo")

    # Inputs you must supply; they are not defined in this snippet:
    #   all_genes: background gene IDs, ideally the genes expressed in your samples
    #   gene2go:   dict mapping each gene ID to a set of GO term IDs
    # Study genes, background, and gene2go must all use the same ID type.
    gene_list = ["TP53", "BRCA1", "MDM2", "CDKN2A", "RB1"]  # Study genes
    background = all_genes

    # Run enrichment analysis
    goe = GOEnrichmentStudy(
        background,
        gene2go,  # Gene-to-GO mapping
        obo_dag,
        methods=["fdr_bh"]
    )
    results = goe.run_study(gene_list)

    significant = [r for r in results if r.p_fdr_bh < 0.05]
    for r in sorted(significant, key=lambda x: x.p_fdr_bh)[:10]:
        print(f"{r.GO}: {r.name} (FDR={r.p_fdr_bh:.4f})")

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| species | Target organism | Homo sapiens |
| databases | Sources to query | [ncbi, ensembl, uniprot] |
| include_go | Retrieve GO annotations | true |
| include_expression | Retrieve GTEx expression | false |
| enrichment_method | GO enrichment p-value correction | fdr_bh |
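
The page does not say how these parameters are passed to the skill. As one possible representation, the defaults listed above could be held in a small dataclass. The class name `GeneWorkspaceConfig` is hypothetical and not part of the skill; only the field names and defaults come from the table.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical container for the documented parameters; the skill itself
# may expose them in a different way.
@dataclass
class GeneWorkspaceConfig:
    species: str = "Homo sapiens"
    databases: List[str] = field(
        default_factory=lambda: ["ncbi", "ensembl", "uniprot"]
    )
    include_go: bool = True
    include_expression: bool = False
    enrichment_method: str = "fdr_bh"


# Override only what differs from the defaults
config = GeneWorkspaceConfig(include_expression=True)
print(config)
```

Using `default_factory` for `databases` gives each instance its own list, so editing one configuration does not change another.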

Best Practices

  1. Cross-reference across databases. No single database has complete gene information. Combine NCBI Gene (summary, references), Ensembl (coordinates, transcripts), and UniProt (protein function, domains) for a comprehensive picture.

  2. Use approved gene symbols. HGNC (the HUGO Gene Nomenclature Committee) maintains the official gene naming standard. Use approved symbols to avoid confusion from aliases: "p53" might not resolve correctly, but "TP53" will.

  3. Check gene aliases for database mismatches. The same gene may have different names or IDs across databases. Use NCBI Gene's alias list or UniProt's gene name mappings to resolve discrepancies.

  4. Include tissue-specific context. A gene's function varies by tissue. Include GTEx expression data to understand where the gene is active, which is critical for interpreting disease associations and drug target potential.

  5. Use GO enrichment with appropriate backgrounds. GO enrichment requires a background gene set. Use all expressed genes (not all genes in the genome) as background to avoid inflating significance. The choice of background dramatically affects results.
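
The background advice in item 5 can be sketched as a simple expression filter. This is a minimal illustration, assuming you already have one TPM value per gene for the tissue of interest. The function name `expressed_background`, the 1.0 TPM cutoff, and the gene names and values are all assumptions made for the example, not GTEx data or a recommended threshold.

```python
def expressed_background(tpm_by_gene, min_tpm=1.0):
    """Return the set of genes whose TPM meets the cutoff."""
    return {gene for gene, tpm in tpm_by_gene.items() if tpm >= min_tpm}


# Made-up TPM values for illustration only; not real measurements
example_tpm = {"GENE_A": 25.3, "GENE_B": 0.2, "GENE_C": 1.0, "GENE_D": 0.0}

background = expressed_background(example_tpm)
print(sorted(background))  # → ['GENE_A', 'GENE_C']
```

The resulting set can be passed as the background population to an enrichment tool in place of a genome-wide gene list.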

Common Issues

Gene symbol maps to multiple Ensembl IDs. Some gene symbols refer to readthrough transcripts or pseudogenes that have separate Ensembl IDs. Filter by biotype: protein_coding to focus on the primary gene, and verify the chromosomal location matches expectations.
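
The biotype and location check described above might look like the following sketch. It assumes each candidate is a dict with the `id`, `biotype`, and `seq_region_name` keys returned by the Ensembl lookup used earlier. The helper name `pick_primary_gene` and the records are placeholders, not real Ensembl output.

```python
def pick_primary_gene(records, expected_chrom=None):
    """Keep protein_coding records, optionally restricted to one chromosome."""
    hits = [r for r in records if r.get("biotype") == "protein_coding"]
    if expected_chrom is not None:
        hits = [
            r for r in hits
            if str(r.get("seq_region_name")) == str(expected_chrom)
        ]
    return hits


# Placeholder records shaped like Ensembl lookup results; the IDs are not real
records = [
    {"id": "ENSG_PLACEHOLDER_1", "biotype": "protein_coding", "seq_region_name": "17"},
    {"id": "ENSG_PLACEHOLDER_2", "biotype": "processed_pseudogene", "seq_region_name": "17"},
    {"id": "ENSG_PLACEHOLDER_3", "biotype": "protein_coding", "seq_region_name": "1"},
]

primary = pick_primary_gene(records, expected_chrom="17")
print([r["id"] for r in primary])  # → ['ENSG_PLACEHOLDER_1']
```

If more than one record survives both filters, the ambiguity is real and worth resolving by hand rather than by taking the first hit.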

GO enrichment returns no significant terms. Common causes: gene list too small (<10 genes), inappropriate background set, or genes don't share functional themes. Try relaxing the FDR threshold or using a different enrichment method.
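
Before relaxing thresholds, it can help to rule out the input problems listed above. The sketch below is one way to do that. The function name `check_enrichment_inputs` is hypothetical, the 10-gene minimum mirrors the rule of thumb in the text, and the background and GO mapping are placeholders rather than real annotations.

```python
def check_enrichment_inputs(study, background, gene2go, min_genes=10):
    """Return a list of warnings about the inputs to a GO enrichment run."""
    study, background = set(study), set(background)
    warnings = []
    if len(study) < min_genes:
        warnings.append(f"study list has {len(study)} genes (< {min_genes})")
    missing = sorted(study - background)
    if missing:
        warnings.append(f"study genes missing from background: {missing}")
    if not any(gene2go.get(gene) for gene in study):
        warnings.append("no study gene has a GO annotation; check ID types")
    return warnings


study = ["TP53", "BRCA1", "MDM2", "CDKN2A", "RB1"]
# Placeholder background and GO mapping, for illustration only
background = {"TP53", "BRCA1", "MDM2", "CDKN2A", "GENE_X"}
gene2go = {"TP53": {"GO:XXXXXXX"}}

for message in check_enrichment_inputs(study, background, gene2go):
    print(message)
```

An empty result means none of these three input problems was found; it does not guarantee that the genes share a functional theme.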

Expression data varies between databases. GTEx, Human Protein Atlas, and NCBI GEO may show different expression patterns due to different sample preparations, normalization methods, and tissue definitions. Note the data source and version when reporting expression data.
