ENA Database Studio
A battle-tested skill for accessing the European Nucleotide Archive (ENA). Includes structured workflows, validation checks, and reusable patterns for scientific computing.
A scientific computing skill for accessing the European Nucleotide Archive (ENA) — the comprehensive public repository for nucleotide sequence data maintained by EMBL-EBI. ENA Database Studio helps you search for sequencing experiments, retrieve raw reads, and download assembled sequences across all domains of life.
When to Use This Skill
Choose ENA Database Studio when:
- Searching for raw sequencing data (FASTQ, BAM) by study or organism
- Retrieving assembled genomes, transcriptomes, or metagenomes
- Finding sequencing metadata (instrument, library, sample info)
- Downloading bulk sequence data for reanalysis projects
Consider alternatives when:
- You need US-centric data submission (use NCBI SRA)
- You need protein sequences (use UniProt)
- You need annotated genomes (use Ensembl)
- You need published analysis results (use GEO or ArrayExpress)
Quick Start
claude "Search ENA for SARS-CoV-2 whole genome sequencing data"
    import requests

    # ENA Portal API search endpoint
    base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
    params = {
        "result": "read_run",
        "query": 'tax_tree(2697049) AND library_strategy="WGS"',
        "fields": "run_accession,experiment_title,instrument_model,read_count,base_count,fastq_ftp",
        "format": "json",
        "limit": 10,
        "sortFields": "first_public",
        "sortOrder": "desc",
    }

    response = requests.get(base_url, params=params, timeout=60)
    response.raise_for_status()  # fail early on HTTP errors
    runs = response.json()

    for run in runs:
        print(f"Accession: {run['run_accession']}")
        print(f"  Title: {run['experiment_title'][:60]}...")
        print(f"  Instrument: {run['instrument_model']}")
        # read_count may be an empty string for some records
        reads = run.get("read_count") or "0"
        print(f"  Reads: {int(reads):,}")
        print(f"  FTP: {run['fastq_ftp'][:60]}...")
Core Concepts
ENA Data Types
| Result Type | Description | Key Fields |
|---|---|---|
| read_run | Raw sequencing reads | FASTQ, instrument, reads |
| study | Research project/study | Title, abstract, publication |
| sample | Biological sample | Organism, tissue, collection |
| assembly | Genome assemblies | Contigs, scaffolds, quality |
| sequence | Annotated sequences | Features, annotations |
| analysis | Processed analysis results | Methods, derived data |
Query Syntax
    # ENA search queries

    # By taxonomy (SARS-CoV-2 and descendants)
    query = "tax_tree(2697049)"

    # By organism name
    query = 'tax_name("Homo sapiens")'

    # By library strategy
    query = 'library_strategy="RNA-Seq"'

    # By instrument
    query = 'instrument_model="Illumina NovaSeq 6000"'

    # Combined queries
    query = (
        'tax_tree(9606) AND '
        'library_strategy="RNA-Seq" AND '
        'instrument_model="Illumina NovaSeq*" AND '
        'first_public>=2024-01-01'
    )
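Combined queries like the one above can also be assembled programmatically. The following is a minimal sketch, not part of any ENA client library: `build_query` and `field_equals` are hypothetical helper names, and the clauses are the ones already shown in this section.

```python
def field_equals(field: str, value: str) -> str:
    """Format a quoted equality clause such as library_strategy="RNA-Seq"."""
    if '"' in value:
        # Quoting rules for embedded quotes are not covered here, so reject them.
        raise ValueError("value must not contain double quotes")
    return f'{field}="{value}"'


def build_query(*clauses: str) -> str:
    """Join non-empty query clauses with AND."""
    parts = [c.strip() for c in clauses if c and c.strip()]
    if not parts:
        raise ValueError("at least one query clause is required")
    return " AND ".join(parts)


query = build_query(
    "tax_tree(9606)",
    field_equals("library_strategy", "RNA-Seq"),
    "first_public>=2024-01-01",
)
print(query)
```

Keeping each clause as a separate argument makes it easy to add or drop filters without editing one long string by hand.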
Bulk Download
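    # Download FASTQ files from ENA using the FTP links from API results

    # Single file (accessions longer than 9 characters sit in an extra
    # zero-padded subdirectory, here 007)
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz

    # Using enaBrowserTools for bulk download
    # (the group name for reads is "read"; confirm with enaGroupGet --help)
    pip install enaBrowserTools
    enaGroupGet -g read -f fastq PRJNA12345

    # Download by accession list.
    # This path pattern only holds for 9-character run accessions (e.g. ERR123456);
    # for longer accessions, prefer the fastq_ftp paths returned by the API.
    while read -r acc; do
        wget -c "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/${acc}/${acc}_1.fastq.gz"
    done < accession_list.txt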
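Rather than building FTP paths by hand, the `fastq_ftp` value returned by the API can be turned into download URLs. This is a rough sketch under two assumptions: that the field holds one or more paths separated by semicolons (for example, both files of a paired-end run) and that the paths come without a URL scheme. `fastq_urls` is a hypothetical helper and `ERR123456` is a placeholder accession.

```python
def fastq_urls(fastq_ftp: str, scheme: str = "ftp") -> list:
    """Split a semicolon-separated fastq_ftp value into full URLs."""
    urls = []
    for path in fastq_ftp.split(";"):
        path = path.strip()
        if not path:
            continue  # skip empty entries, e.g. when the field itself is empty
        # Add a scheme only when the value does not already carry one.
        urls.append(path if "://" in path else f"{scheme}://{path}")
    return urls


# Placeholder value shaped like a paired-end record.
example = (
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_2.fastq.gz"
)
for url in fastq_urls(example):
    print(url)
```

The resulting list can be written to a file and passed to `wget -c -i` for resumable downloads.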
Configuration
| Parameter | Description | Default |
|---|---|---|
| result_type | ENA result type to search | read_run |
| format | Response format (json, tsv, xml) | json |
| limit | Max results per query | 0 (all) |
| fields | Specific fields to return | Type-specific defaults |
| download_format | Data format (fastq, submitted) | fastq |
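One way these settings might translate into Portal API request parameters is sketched below. The mapping is an assumption made for illustration (the source does not describe how the skill applies its configuration), and `search_params` is a hypothetical helper. `download_format` is left out because it concerns file retrieval rather than the search request.

```python
def search_params(query, result_type="read_run", fmt="json", limit=0, fields=None):
    """Build a parameter dict for a search request from the settings above."""
    params = {
        "result": result_type,
        "query": query,
        "format": fmt,
        "limit": limit,  # 0 is documented above as "all results"
    }
    if fields:
        # Omitting "fields" leaves the type-specific defaults in place.
        params["fields"] = ",".join(fields)
    return params


print(search_params("tax_tree(2697049)", limit=10, fields=["run_accession", "fastq_ftp"]))
```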
Best Practices
- Use taxonomy tree searches for comprehensive results. tax_tree(9606) captures human data from all subspecies, while tax_eq(9606) matches the exact taxonomy ID only. Tree searches ensure you don't miss relevant samples classified under subspecific taxa.
- Request only needed fields. The fields parameter lets you select specific metadata columns. This reduces response size for large result sets and speeds up parsing.
- Download via Aspera for large datasets. ENA supports Aspera (ascp) downloads, which can be 5-10x faster than FTP/HTTP for large files. Install IBM Aspera Connect and use the fasp URLs from API results.
- Check data availability before large downloads. Some ENA records reference data stored at NCBI or DDBJ. Check the submitted_ftp and fastq_ftp fields: empty fields mean the FASTQ files aren't directly available from ENA and may need to be retrieved from SRA.
- Use study accessions for reproducibility. Reference data by study accession (PRJNA/ERP) rather than individual run accessions in publications. Study accessions group all related data and are stable identifiers for citation.
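The availability check described in the list above can be applied to search results before starting a bulk download. This is a minimal sketch that assumes each record is a dict as returned by a JSON search with `fastq_ftp` and `submitted_ftp` requested; `classify_run` and the record values are hypothetical.

```python
def classify_run(run: dict) -> str:
    """Report which kind of files a run record points to."""
    def filled(key: str) -> bool:
        # Treat missing keys, None, and whitespace-only strings as empty.
        return bool((run.get(key) or "").strip())

    if filled("fastq_ftp"):
        return "fastq"
    if filled("submitted_ftp"):
        return "submitted_only"
    return "unavailable"


# Placeholder records for illustration only.
records = [
    {"run_accession": "RUN_A", "fastq_ftp": "host/a.fastq.gz", "submitted_ftp": ""},
    {"run_accession": "RUN_B", "fastq_ftp": "", "submitted_ftp": "host/b.bam"},
    {"run_accession": "RUN_C", "fastq_ftp": "", "submitted_ftp": ""},
]
for record in records:
    print(record["run_accession"], classify_run(record))
```

Runs reported as `unavailable` are the ones that may need to be fetched from SRA instead.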
Common Issues
Search returns no results for known accessions. Check the result type — searching read_run won't find assembly accessions. Match the result type to your query: run accessions (SRR/ERR) use read_run, study accessions (PRJNA/ERP) use study.
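Picking the result type from the accession prefix can be automated. The sketch below uses accession patterns as I understand them from INSDC conventions (runs as ERR/SRR/DRR, studies as PRJ.. or ERP/SRP/DRP, samples as SAM.. or ERS/SRS/DRS, assemblies as GCA_) and maps them onto the result types from the table in this document. `guess_result_type` is a hypothetical helper, and the patterns should be checked against the current ENA accession documentation.

```python
import re

# (pattern, result type) pairs, checked in order.
_ACCESSION_PATTERNS = [
    (re.compile(r"^[EDS]RR\d+$"), "read_run"),
    (re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d+)$"), "study"),
    (re.compile(r"^(SAM[EDN][A-Z]?\d+|[EDS]RS\d+)$"), "sample"),
    (re.compile(r"^GCA_\d+(\.\d+)?$"), "assembly"),
]


def guess_result_type(accession: str):
    """Return the likely result type for an accession, or None if unknown."""
    accession = accession.strip().upper()
    for pattern, result_type in _ACCESSION_PATTERNS:
        if pattern.match(accession):
            return result_type
    return None


for acc in ["SRR1234567", "PRJNA12345", "GCA_000001405.29", "not-an-accession"]:
    print(acc, guess_result_type(acc))
```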
FASTQ download links are broken. ENA periodically reorganizes its FTP structure. Use the API to get current FTP paths rather than constructing URLs manually. If FTP links fail, try the alternative submitted_ftp or sra_ftp fields.
Downloaded FASTQ files are corrupted. Verify file integrity with md5sum against the checksums in the fastq_md5 API field. Large downloads over FTP are susceptible to interruption. Use wget -c for resumable downloads or Aspera for reliable transfer.
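The same integrity check can be done in Python with the standard library. This is a minimal sketch: `md5_of_file` and `verify_download` are hypothetical helper names, and the expected value is assumed to be the checksum for that one file (for runs with several files, the `fastq_md5` field is assumed to list one checksum per file).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5: str) -> bool:
    """Compare a file's MD5 with the expected checksum, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `verify_download("SRR1234567_1.fastq.gz", checksum)` returns `False` for a truncated or corrupted file, which is the signal to resume or repeat the download.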