Ensembl Database Kit

Battle-tested skill for querying the Ensembl genome database. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for querying Ensembl — the comprehensive genome database for vertebrates and other eukaryotes maintained by EMBL-EBI. Ensembl Database Kit helps you retrieve gene annotations, transcript variants, regulatory regions, and comparative genomics data through Ensembl's REST API and BioMart interface.

When to Use This Skill

Choose Ensembl Database Kit when:

  • Looking up gene coordinates, exon structures, and transcript variants
  • Retrieving ortholog/paralog information across species
  • Querying regulatory features (promoters, enhancers, TFBS)
  • Bulk downloading gene annotations via BioMart

Consider alternatives when:

  • You need raw sequencing data (use ENA or NCBI SRA)
  • You need clinical variant interpretation (use ClinVar)
  • You need protein function annotations (use UniProt)
  • You need non-vertebrate genomes (use Ensembl Genomes or NCBI)

Quick Start

    claude "Look up the BRCA1 gene and all its transcript variants in Ensembl"

The equivalent REST API call in Python:

    import requests

    # Ensembl REST API base URL
    server = "https://rest.ensembl.org"

    # Look up a gene by symbol; expand=1 also returns its transcripts
    response = requests.get(
        f"{server}/lookup/symbol/homo_sapiens/BRCA1",
        headers={"Content-Type": "application/json"},
        params={"expand": 1},
    )
    response.raise_for_status()  # fail early on HTTP errors such as 400 or 429
    gene = response.json()

    print(f"Gene: {gene['display_name']}")
    print(f"Ensembl ID: {gene['id']}")
    print(f"Location: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
    print(f"Strand: {'+' if gene['strand'] == 1 else '-'}")
    print(f"Biotype: {gene['biotype']}")
    print(f"Transcripts: {len(gene.get('Transcript', []))}")

    # One line per transcript; length is read defensively in case it is absent
    for tx in gene.get("Transcript", []):
        print(f"  {tx['id']} | {tx['biotype']} | {tx.get('length', 'N/A')} bp")

Core Concepts

Ensembl REST API Endpoints

| Endpoint | Purpose | Example |
| --- | --- | --- |
| /lookup/id/{id} | Look up by Ensembl ID | ENSG00000012048 |
| /lookup/symbol/{species}/{symbol} | Look up by gene symbol | BRCA1 |
| /sequence/id/{id} | Get sequence | DNA, cDNA, protein |
| /overlap/region/{species}/{region} | Features in region | Genes, transcripts, variants |
| /homology/id/{id} | Orthologs/paralogs | Cross-species comparisons |
| /variation/{species}/{variant} | Variant info | rsID lookup |
| /regulatory/species/{species}/id/{id} | Regulatory features | Promoters, enhancers |
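As a rough illustration of how one of the endpoints above might be called, the sketch below builds a URL for /overlap/region/{species}/{region} and fetches the overlapping genes. The helper names (`region_string`, `overlap_url`, `fetch_overlapping_genes`) are hypothetical, and `session` is assumed to be a `requests.Session` or any object with a compatible `get` method.

```python
from urllib.parse import quote

SERVER = "https://rest.ensembl.org"  # base URL from the Configuration section


def region_string(chrom: str, start: int, end: int) -> str:
    """Format a region as 'chrom:start-end', the form used in the URL path."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chrom}:{start}-{end}"


def overlap_url(species: str, chrom: str, start: int, end: int) -> str:
    """Build the URL for /overlap/region/{species}/{region} (hypothetical helper)."""
    return f"{SERVER}/overlap/region/{quote(species)}/{region_string(chrom, start, end)}"


def fetch_overlapping_genes(session, species: str, chrom: str, start: int, end: int):
    """Return gene features overlapping the region.

    `session` is assumed to behave like requests.Session; the `feature`
    query parameter selects which feature type to return.
    """
    response = session.get(
        overlap_url(species, chrom, start, end),
        params={"feature": "gene"},
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return response.json()
```

With `requests` installed, `fetch_overlapping_genes(requests.Session(), "homo_sapiens", "17", 43044295, 43125483)` would request genes in an illustrative window on chromosome 17. Keeping URL construction separate from the network call makes the path logic easy to check offline.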

BioMart Queries

    from pybiomart import Server

    # Connect to the Ensembl BioMart and select the human gene dataset
    server = Server(host="http://www.ensembl.org")
    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]

    # Get gene annotations for chromosomes 1-3
    results = dataset.query(
        attributes=[
            "ensembl_gene_id",
            "external_gene_name",
            "chromosome_name",
            "start_position",
            "end_position",
            "strand",
            "gene_biotype",
        ],
        filters={"chromosome_name": ["1", "2", "3"]},
    )
    print(f"Genes on chr1-3: {len(results)}")

Comparative Genomics

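    import requests

    server = "https://rest.ensembl.org"

    # Get mouse orthologs of human BRCA1 (ENSG00000012048).
    # Newer REST API releases may expect the species in the path
    # (/homology/id/{species}/{id}); check the docs for your server version.
    response = requests.get(
        f"{server}/homology/id/ENSG00000012048",
        headers={"Content-Type": "application/json"},
        params={
            "type": "orthologues",
            "target_taxon": "10090",  # Mouse
        },
    )
    response.raise_for_status()

    homologies = response.json()["data"][0]["homologies"]
    for h in homologies:
        target = h["target"]
        print(f"Ortholog: {target['species']} - {target.get('id', 'N/A')}")
        # Percent identity lives on the target record; dn_ds is a separate ratio
        print(f"  Percent identity: {target.get('perc_id', 'N/A')}")
        print(f"  dN/dS: {h.get('dn_ds', 'N/A')}")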

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| server | Ensembl REST API base URL | https://rest.ensembl.org |
| species | Default organism | homo_sapiens |
| assembly | Genome assembly version | GRCh38 |
| content_type | Response format | application/json |
| biomart_host | BioMart server | www.ensembl.org |
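One way these parameters could be carried around in code is a small immutable settings object. The sketch below is illustrative only: `EnsemblConfig` and its methods are hypothetical names, not part of any Ensembl library, and the defaults simply mirror the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnsemblConfig:
    """Hypothetical container for the parameters listed in the table."""

    server: str = "https://rest.ensembl.org"
    species: str = "homo_sapiens"
    assembly: str = "GRCh38"
    content_type: str = "application/json"
    biomart_host: str = "www.ensembl.org"

    def headers(self) -> dict:
        """HTTP headers that request the configured response format."""
        return {"Content-Type": self.content_type}

    def symbol_lookup_url(self, symbol: str) -> str:
        """URL for /lookup/symbol/{species}/{symbol} on the configured server."""
        return f"{self.server}/lookup/symbol/{self.species}/{symbol}"

    @classmethod
    def grch37(cls) -> "EnsemblConfig":
        """Settings pointing at the GRCh37 archive REST server."""
        return cls(server="https://grch37.rest.ensembl.org", assembly="GRCh37")


# Override only what differs from the defaults, e.g. mouse instead of human.
mouse = EnsemblConfig(species="mus_musculus", assembly="GRCm39")
```

Freezing the dataclass keeps one analysis from silently switching server or assembly midway, which matters when results must be reproducible.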

Best Practices

  1. Use Ensembl stable IDs for persistent references. Ensembl gene IDs (ENSG...) are versioned and stable across releases. Use these in publications and databases rather than gene symbols, which can be ambiguous or change over time.

  2. Check the Ensembl release version. Ensembl updates quarterly. Gene coordinates, annotations, and transcript models can change between releases. Note the release number when recording results for reproducibility.

  3. Use BioMart for bulk queries. For genome-wide data (all genes, all transcripts), use BioMart instead of individual REST API calls. BioMart is optimized for bulk retrieval and returns tabular data suitable for analysis.

  4. Rate limit REST API requests. Ensembl allows 15 requests per second. For batch lookups, add small delays or use the POST endpoint for multiple IDs in a single request.

  5. Use the GRCh37 archive for legacy coordinates. Some datasets use GRCh37 (hg19) coordinates. Access the GRCh37 version at grch37.rest.ensembl.org rather than converting coordinates, which can introduce errors.
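The batching advice in practice 4 can be sketched roughly as follows: split the ID list into chunks, send each chunk in one POST to /lookup/id, and pause briefly between requests. The names `chunked` and `batch_lookup` are hypothetical, `session` is assumed to behave like a `requests.Session`, and the chunk size of 1000 is an assumption to confirm against the endpoint documentation.

```python
import json
import time
from typing import Dict, Iterator, List

MAX_IDS_PER_POST = 1000  # assumed per-request limit; verify in the endpoint docs


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def batch_lookup(
    session,
    ids: List[str],
    server: str = "https://rest.ensembl.org",
    pause: float = 0.1,
    size: int = MAX_IDS_PER_POST,
) -> Dict[str, dict]:
    """Look up many stable IDs with one POST per chunk.

    `session` is assumed to be a requests.Session or a compatible object.
    The response is expected to be a JSON object keyed by the requested IDs.
    """
    results: Dict[str, dict] = {}
    for chunk in chunked(ids, size):
        response = session.post(
            f"{server}/lookup/id",
            headers={"Content-Type": "application/json", "Accept": "application/json"},
            data=json.dumps({"ids": chunk}),
        )
        response.raise_for_status()
        results.update(response.json())
        time.sleep(pause)  # stay well under the per-second request limit
    return results
```

For genome-wide pulls, BioMart remains the better fit; this pattern suits lists of a few hundred to a few thousand IDs.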

Common Issues

Gene symbol not found. Gene symbols are species-specific and case-sensitive. Use BRCA1 for human, Brca1 for mouse. If the symbol isn't recognized, search by Ensembl ID or use the /xrefs endpoint to find the correct symbol.
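A minimal sketch of the /xrefs fallback, assuming the endpoint returns a JSON list of entries that each carry an `id` and a `type` field. The helper names are hypothetical and `session` is assumed to behave like a `requests.Session`.

```python
from typing import Iterable, List


def xrefs_symbol_url(species: str, symbol: str,
                     server: str = "https://rest.ensembl.org") -> str:
    """Build the URL for /xrefs/symbol/{species}/{symbol}."""
    return f"{server}/xrefs/symbol/{species}/{symbol}"


def gene_ids_from_xrefs(xrefs: Iterable[dict]) -> List[str]:
    """Keep only entries whose type is 'gene' and return their stable IDs."""
    return [entry["id"] for entry in xrefs if entry.get("type") == "gene"]


def resolve_symbol(session, species: str, symbol: str) -> List[str]:
    """Return the Ensembl gene IDs linked to a symbol via /xrefs.

    `session` is assumed to be a requests.Session or a compatible object.
    An empty list means no gene was linked to the symbol.
    """
    response = session.get(
        xrefs_symbol_url(species, symbol),
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return gene_ids_from_xrefs(response.json())
```

Any ID returned this way can then be passed to /lookup/id/{id} to recover the display name Ensembl currently uses for that gene.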

REST API returns 429 Too Many Requests. You've exceeded the rate limit. Add time.sleep(0.1) between requests, or use POST endpoints to batch multiple queries into single requests. For large-scale analyses, use BioMart.
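Instead of a fixed sleep, a client can wait for however long the server asks. The sketch below assumes the 429 response carries a `Retry-After` header giving a delay in seconds and falls back to a default when it is missing or unparseable. `retry_delay` and `get_with_retry` are hypothetical helpers, and `session` is assumed to behave like a `requests.Session`.

```python
import time
from typing import Callable, Mapping


def retry_delay(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Seconds to wait before retrying, read from Retry-After when usable."""
    value = headers.get("Retry-After")
    try:
        return max(float(value), 0.0)
    except (TypeError, ValueError):
        # Header missing (None) or not a plain number of seconds
        return default


def get_with_retry(
    session,
    url: str,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs,
):
    """GET a URL, retrying only when the server answers 429.

    `session` is assumed to be a requests.Session or a compatible object.
    Other HTTP errors are raised immediately via raise_for_status().
    """
    for attempt in range(max_retries + 1):
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        if attempt < max_retries:
            sleep(retry_delay(response.headers))
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting.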

Transcript coordinates differ between databases. Ensembl and NCBI RefSeq may annotate different transcripts for the same gene. Discrepancies in exon boundaries are common. Specify which transcript annotation source you're using and stick to one system within an analysis.
