Ensembl Database Kit

A scientific computing skill for querying Ensembl, the genome database for vertebrates and other eukaryotes maintained by EMBL-EBI. Ensembl Database Kit helps you retrieve gene annotations, transcript variants, regulatory regions, and comparative genomics data through Ensembl's REST API and BioMart interface.

When to Use This Skill

Choose Ensembl Database Kit when:

Looking up gene coordinates, exon structures, and transcript variants
Retrieving ortholog/paralog information across species
Querying regulatory features (promoters, enhancers, TFBS)
Bulk downloading gene annotations via BioMart

Consider alternatives when:

You need raw sequencing data (use ENA or NCBI SRA)
You need clinical variant interpretation (use ClinVar)
You need protein function annotations (use UniProt)
You need non-vertebrate genomes (use Ensembl Genomes or NCBI)

Quick Start


claude "Look up the BRCA1 gene and all its transcript variants in Ensembl"


import requests

# Ensembl REST API
server = "https://rest.ensembl.org"

# Look up gene by symbol
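response = requests.get(
    f"{server}/lookup/symbol/homo_sapiens/BRCA1",
    headers={"Content-Type": "application/json"},
    params={"expand": 1},  # include transcripts in the response
    timeout=30
)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
gene = response.json()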

print(f"Gene: {gene['display_name']}")
print(f"Ensembl ID: {gene['id']}")
print(f"Location: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
print(f"Strand: {'+' if gene['strand'] == 1 else '-'}")
print(f"Biotype: {gene['biotype']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")

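# List each transcript; .get() guards against fields missing from the response
for tx in gene.get("Transcript", []):
    print(f"  {tx['id']} | {tx.get('biotype', 'N/A')} | {tx.get('length', 'N/A')} bp")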

Core Concepts

Ensembl REST API Endpoints

Endpoint	Purpose	Example
`/lookup/id/{id}`	Look up by Ensembl ID	`ENSG00000012048`
`/lookup/symbol/{species}/{symbol}`	Look up by gene symbol	`BRCA1`
`/sequence/id/{id}`	Get sequence	DNA, cDNA, protein
`/overlap/region/{species}/{region}`	Features in region	Genes, transcripts, variants
`/homology/id/{species}/{id}`	Orthologs/paralogs	Cross-species comparisons
`/variation/{species}/{variant}`	Variant info	rsID lookup
`/regulatory/species/{species}/id/{id}`	Regulatory features	Promoters, enhancers
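As an illustration of the `/overlap/region` row in the table above, here is a minimal sketch of building a request URL and reducing the response to gene records. The helper names `overlap_url` and `summarize_genes` are hypothetical, and the sketch assumes the response is a JSON list whose gene entries carry `feature_type`, `id`, `external_name`, and `biotype` fields; check the endpoint documentation for the exact schema.

```python
from urllib.parse import quote, urlencode

REST_SERVER = "https://rest.ensembl.org"


def overlap_url(species, region, features=("gene",)):
    """Build a GET URL for /overlap/region; the `feature` parameter may repeat."""
    query = urlencode(
        [("feature", f) for f in features]
        + [("content-type", "application/json")]
    )
    path = f"/overlap/region/{quote(species)}/{quote(region, safe=':-')}"
    return f"{REST_SERVER}{path}?{query}"


def summarize_genes(records):
    """Reduce overlap records to (id, name, biotype) tuples, genes only."""
    return [
        (r["id"], r.get("external_name", "N/A"), r.get("biotype", "N/A"))
        for r in records
        if r.get("feature_type") == "gene"
    ]
```

With `requests` installed, `summarize_genes(requests.get(overlap_url("homo_sapiens", "7:140424943-140624564")).json())` would list the genes overlapping that region.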

BioMart Queries


from pybiomart import Server

server = Server(host="http://www.ensembl.org")
dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]

# Get gene annotations
results = dataset.query(
    attributes=[
        "ensembl_gene_id",
        "external_gene_name",
        "chromosome_name",
        "start_position",
        "end_position",
        "strand",
        "gene_biotype"
    ],
    filters={"chromosome_name": ["1", "2", "3"]}
)
print(f"Genes on chr1-3: {len(results)}")

Comparative Genomics


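import requests

# Use a separate name for the REST base URL: `server` was rebound
# to a pybiomart Server in the BioMart example
rest_server = "https://rest.ensembl.org"

# Get orthologs across species
response = requests.get(
    f"{rest_server}/homology/id/homo_sapiens/ENSG00000012048",
    headers={"Content-Type": "application/json"},
    params={
        "type": "orthologues",
        "target_taxon": "10090"  # Mouse (NCBI taxonomy ID)
    }
)
response.raise_for_status()
homologies = response.json()["data"][0]["homologies"]

for h in homologies:
    target = h["target"]
    print(f"Ortholog: {target['species']} - {target.get('id', 'N/A')}")
    # perc_id is the percent identity; dn_ds is a separate selection metric
    print(f"  Percent identity: {target.get('perc_id', 'N/A')}")
    print(f"  dN/dS: {h.get('dn_ds', 'N/A')}")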

Configuration

Parameter	Description	Default
`server`	Ensembl REST API base URL	`https://rest.ensembl.org`
`species`	Default organism	`homo_sapiens`
`assembly`	Genome assembly version	`GRCh38`
`content_type`	Response format	`application/json`
`biomart_host`	BioMart server	`www.ensembl.org`
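One way to carry these settings through a script is a small immutable settings object. The sketch below is hypothetical: `EnsemblConfig` and its `rest_url` and `headers` properties are illustrative names, not part of any Ensembl library. The defaults mirror the table above, and the GRCh37 URL is the archive server mentioned under Best Practices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnsemblConfig:
    """Hypothetical container mirroring the configuration table."""
    server: str = "https://rest.ensembl.org"
    species: str = "homo_sapiens"
    assembly: str = "GRCh38"
    content_type: str = "application/json"
    biomart_host: str = "www.ensembl.org"

    @property
    def rest_url(self):
        """REST base URL, switching to the GRCh37 archive server if requested."""
        if self.assembly == "GRCh37":
            return "https://grch37.rest.ensembl.org"
        return self.server

    @property
    def headers(self):
        """Request headers selecting the configured response format."""
        return {"Content-Type": self.content_type}
```

For example, `EnsemblConfig(assembly="GRCh37").rest_url` yields the archive server, while `EnsemblConfig().rest_url` yields the default one.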

Best Practices

Use Ensembl stable IDs for persistent references. Ensembl gene IDs (ENSG...) are versioned and stable across releases. Use these in publications and databases rather than gene symbols, which can be ambiguous or change over time.
Check the Ensembl release version. Ensembl updates quarterly. Gene coordinates, annotations, and transcript models can change between releases. Note the release number when recording results for reproducibility.
Use BioMart for bulk queries. For genome-wide data (all genes, all transcripts), use BioMart instead of individual REST API calls. BioMart is optimized for bulk retrieval and returns tabular data suitable for analysis.
Rate limit REST API requests. The public Ensembl REST server allows about 15 requests per second per client. For batch lookups, add small delays or use a POST endpoint such as /lookup/id to send multiple IDs in a single request.
Use the GRCh37 archive for legacy coordinates. Some datasets use GRCh37 (hg19) coordinates. Access the GRCh37 version at grch37.rest.ensembl.org rather than converting coordinates, which can introduce errors.
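The batching advice above can be sketched as follows. This is a minimal illustration, not a definitive client: `chunk_ids` and `lookup_payloads` are hypothetical helper names, and the cap of 1000 IDs per request is an assumption to verify against the current `POST /lookup/id` documentation.

```python
import json

# Assumption: maximum number of IDs accepted per POST /lookup/id request
MAX_IDS_PER_POST = 1000


def chunk_ids(ids, size=MAX_IDS_PER_POST):
    """Split a list of stable IDs into batches no larger than `size`."""
    if size < 1:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def lookup_payloads(ids, size=MAX_IDS_PER_POST):
    """Build one JSON request body per batch for POST /lookup/id."""
    return [json.dumps({"ids": batch}) for batch in chunk_ids(ids, size)]
```

Each payload can then be sent with `requests.post(f"{server}/lookup/id", headers={"Content-Type": "application/json", "Accept": "application/json"}, data=payload)`, with a short `time.sleep` between batches to stay under the rate limit.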

Common Issues

Gene symbol not found. Gene symbols are species-specific and case-sensitive. Use BRCA1 for human, Brca1 for mouse. If the symbol isn't recognized, search by Ensembl ID or use the /xrefs endpoint to find the correct symbol.
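A minimal sketch of the `/xrefs` fallback, assuming the `/xrefs/symbol/{species}/{symbol}` form of the endpoint and a JSON list response whose entries carry `type` and `id` fields. The helper names `xrefs_symbol_url` and `gene_ids_from_xrefs` are hypothetical.

```python
from urllib.parse import quote

REST_SERVER = "https://rest.ensembl.org"


def xrefs_symbol_url(species, symbol):
    """Build a GET URL that searches Ensembl for objects linked to a symbol."""
    path = f"/xrefs/symbol/{quote(species)}/{quote(symbol)}"
    return f"{REST_SERVER}{path}?content-type=application/json"


def gene_ids_from_xrefs(records):
    """Keep stable IDs of entries typed as genes, in order, without duplicates."""
    gene_ids = []
    for record in records:
        if record.get("type") == "gene" and record["id"] not in gene_ids:
            gene_ids.append(record["id"])
    return gene_ids
```

The IDs returned by `gene_ids_from_xrefs` can be passed to `/lookup/id/{id}` to recover the symbol Ensembl currently uses as the display name.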

REST API returns 429 Too Many Requests. You've exceeded the rate limit. Add time.sleep(0.1) between requests, or use POST endpoints to batch multiple queries into single requests. For large-scale analyses, use BioMart.
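Instead of a fixed sleep, a client can wait for the period the server asks for. The sketch below assumes 429 responses carry a numeric `Retry-After` header (in seconds) and falls back to exponential backoff when it is absent. `wait_seconds` and `get_with_retry` are hypothetical helpers, and `fetch` is any caller-supplied function that returns `(status, headers, body)` for a URL.

```python
import time


def wait_seconds(headers, attempt, base=0.5, cap=30.0):
    """Honor a numeric Retry-After header, else back off exponentially."""
    value = headers.get("Retry-After")
    try:
        return min(float(value), cap)
    except (TypeError, ValueError):
        # Header missing or not a number: 0.5 s, 1 s, 2 s, ... up to `cap`
        return min(base * (2 ** attempt), cap)


def get_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until the response is not a 429, then return it."""
    for attempt in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        sleep(wait_seconds(headers, attempt))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts: {url}")
```

With `requests`, `fetch` could wrap `requests.get` and return `(r.status_code, r.headers, r.json())`. Passing `fetch` and `sleep` as arguments keeps the retry logic testable without network access.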

Transcript coordinates differ between databases. Ensembl and NCBI RefSeq may annotate different transcripts for the same gene. Discrepancies in exon boundaries are common. Specify which transcript annotation source you're using and stick to one system within an analysis.
