ENA Database Studio

A scientific computing skill for accessing the European Nucleotide Archive (ENA) — the comprehensive public repository for nucleotide sequence data maintained by EMBL-EBI. ENA Database Studio helps you search for sequencing experiments, retrieve raw reads, and download assembled sequences across all domains of life.

When to Use This Skill

Choose ENA Database Studio when:

Searching for raw sequencing data (FASTQ, BAM) by study or organism
Retrieving assembled genomes, transcriptomes, or metagenomes
Finding sequencing metadata (instrument, library, sample info)
Downloading bulk sequence data for reanalysis projects

Consider alternatives when:

You are submitting data through US-based infrastructure (use NCBI SRA)
You need protein sequences (use UniProt)
You need annotated genomes (use Ensembl)
You need published analysis results (use GEO or ArrayExpress)

Quick Start


claude "Search ENA for SARS-CoV-2 whole genome sequencing data"


import requests

# ENA Portal API
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"

params = {
    "result": "read_run",
    "query": "tax_tree(2697049) AND library_strategy=\"WGS\"",
    "fields": "run_accession,experiment_title,instrument_model,read_count,base_count,fastq_ftp",
    "format": "json",
    "limit": 10,
    "sortFields": "first_public",
    "sortOrder": "desc"
}

response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
runs = response.json()

for run in runs:
    print(f"Accession: {run['run_accession']}")
    print(f"  Title: {run['experiment_title'][:60]}...")
    print(f"  Instrument: {run['instrument_model']}")
    # Metadata values arrive as strings and may be empty
    print(f"  Reads: {int(run['read_count'] or 0):,}")
    print(f"  FTP: {run['fastq_ftp'][:60]}...")

Core Concepts

ENA Data Types

Result Type	Description	Key Fields
`read_run`	Raw sequencing reads	FASTQ, instrument, reads
`study`	Research project/study	Title, abstract, publication
`sample`	Biological sample	Organism, tissue, collection
`assembly`	Genome assemblies	Contigs, scaffolds, quality
`sequence`	Annotated sequences	Features, annotations
`analysis`	Processed analysis results	Methods, derived data

Query Syntax


# ENA search queries

# By taxonomy (SARS-CoV-2 and descendants)
query = "tax_tree(2697049)"

# By organism name
query = 'tax_name("Homo sapiens")'

# By library strategy
query = 'library_strategy="RNA-Seq"'

# By instrument
query = 'instrument_model="Illumina NovaSeq 6000"'

# Combined queries
query = (
    'tax_tree(9606) AND '
    'library_strategy="RNA-Seq" AND '
    'instrument_model="Illumina NovaSeq*" AND '
    'first_public>=2024-01-01'
)
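
The clauses above are joined with AND. A minimal sketch of composing them programmatically might look like the following. `build_query` is a hypothetical helper written for this illustration, not part of any ENA client library, and its argument names are assumptions that mirror the filters shown in this section.

```python
from typing import Optional


def build_query(
    tax_id: Optional[int] = None,
    library_strategy: Optional[str] = None,
    instrument_model: Optional[str] = None,
    first_public_from: Optional[str] = None,
) -> str:
    """Join the supplied filters into one ENA query string using AND.

    Hypothetical helper; only the filters shown in this section are covered.
    """
    clauses = []
    if tax_id is not None:
        clauses.append(f"tax_tree({tax_id})")
    if library_strategy:
        clauses.append(f'library_strategy="{library_strategy}"')
    if instrument_model:
        clauses.append(f'instrument_model="{instrument_model}"')
    if first_public_from:
        clauses.append(f"first_public>={first_public_from}")
    if not clauses:
        raise ValueError("at least one filter is required")
    return " AND ".join(clauses)


query = build_query(
    tax_id=9606,
    library_strategy="RNA-Seq",
    first_public_from="2024-01-01",
)
print(query)
```

Building the string from optional parts keeps each filter in one place and avoids dangling AND operators when a filter is omitted.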

Bulk Download


# Download FASTQ files from ENA
# Using the FTP links from API results

# Single file
# Accessions longer than 9 characters sit under an extra zero-padded subdirectory (here 007)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz

# Using enaBrowserTools for bulk download
pip install enaBrowserTools
enaGroupGet -g read -f fastq PRJNA12345

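# Download by accession list (first mate of paired-end runs; single-end files omit the _1 suffix)
# The directory layout depends on accession length, so prefer API-provided paths when possible
while read -r acc; do
  case ${#acc} in
    9)  dir="${acc:0:6}/${acc}" ;;
    10) dir="${acc:0:6}/00${acc: -1}/${acc}" ;;
    11) dir="${acc:0:6}/0${acc: -2}/${acc}" ;;
    *)  echo "Unexpected accession: ${acc}" >&2; continue ;;
  esac
  wget -c "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${dir}/${acc}_1.fastq.gz"
done < accession_list.txt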
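
An alternative to building paths by hand is to read them from the `fastq_ftp` field returned by the Portal API. The sketch below assumes that field holds semicolon-separated paths without a scheme prefix. `fastq_urls` is a hypothetical helper, and the accession in the sample value is a placeholder.

```python
from typing import List


def fastq_urls(fastq_ftp: str, scheme: str = "ftp") -> List[str]:
    """Split a `fastq_ftp` value into full download URLs.

    Assumes semicolon-separated entries with no scheme prefix.
    An empty field yields an empty list.
    """
    return [f"{scheme}://{part}" for part in fastq_ftp.split(";") if part]


# Placeholder value shaped like a paired-end run record.
field = (
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000001/ERR000001_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000001/ERR000001_2.fastq.gz"
)
for url in fastq_urls(field):
    print(url)
```

Each resulting URL can be handed to `wget -c` or any other downloader. Passing `scheme="https"` is an untested variation that assumes the same path is served over HTTPS.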

Configuration

Parameter	Description	Default
`result_type`	ENA result type to search	`read_run`
`format`	Response format (json, tsv, xml)	`json`
`limit`	Max results per query	`0` (all)
`fields`	Specific fields to return	Type-specific defaults
`download_format`	Data format (fastq, submitted)	`fastq`
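
As a rough illustration of how these settings could map onto a Portal API request, the sketch below uses a hypothetical `SearchConfig` dataclass with the defaults from the table. The mapping of `result_type` to the API's `result` parameter follows the Quick Start example. `download_format` is left out because it concerns file retrieval rather than search.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Union


@dataclass
class SearchConfig:
    """Hypothetical container for the search-related settings in the table."""

    result_type: str = "read_run"
    format: str = "json"
    limit: int = 0  # 0 means "return all results"
    fields: Optional[Sequence[str]] = None  # None leaves the type-specific defaults

    def to_params(self, query: str) -> Dict[str, Union[str, int]]:
        """Build the query-string parameters for a search request."""
        params: Dict[str, Union[str, int]] = {
            "result": self.result_type,
            "query": query,
            "format": self.format,
            "limit": self.limit,
        }
        if self.fields:
            params["fields"] = ",".join(self.fields)
        return params


config = SearchConfig(limit=10, fields=["run_accession", "fastq_ftp"])
print(config.to_params("tax_tree(2697049)"))
```

The returned dictionary can be passed as `params` to `requests.get`, as in the Quick Start example.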

Best Practices

Use taxonomy tree searches for comprehensive results. tax_tree(9606) captures human data from all subspecies, while tax_eq(9606) matches exact taxonomy ID only. Tree searches ensure you don't miss relevant samples classified under subspecific taxa.
Request only needed fields. The fields parameter lets you select specific metadata columns. This dramatically reduces response size for large result sets and speeds up parsing.
Download via Aspera for large datasets. ENA supports Aspera (ascp) downloads, which are often several times faster than FTP/HTTP for large files. Install IBM Aspera Connect and use the paths in the fastq_aspera field from API results.
Check data availability before large downloads. Some ENA records reference data stored at NCBI or DDBJ. Check the submitted_ftp and fastq_ftp fields — empty fields mean the FASTQ files aren't directly available from ENA and may need to be retrieved from SRA.
Use study accessions for reproducibility. Reference data by study accession (PRJNA/ERP) rather than individual run accessions in publications. Study accessions group all related data and are stable identifiers for citation.
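
The availability check described above can be sketched as a small filter over API results. `split_by_availability` is a hypothetical helper. It assumes each record is a dictionary with `run_accession` and `fastq_ftp` keys, as requested in the Quick Start query, and the sample records are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_availability(
    runs: Iterable[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Separate run accessions by whether ENA lists FASTQ files for them.

    A missing or empty `fastq_ftp` value is treated as "not available from ENA".
    """
    available: List[str] = []
    missing: List[str] = []
    for run in runs:
        target = available if run.get("fastq_ftp") else missing
        target.append(run["run_accession"])
    return available, missing


# Made-up records with placeholder accessions.
records = [
    {"run_accession": "ERR000001", "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000001/ERR000001.fastq.gz"},
    {"run_accession": "SRR000002", "fastq_ftp": ""},
]
available, missing = split_by_availability(records)
print("Download from ENA:", available)
print("Check SRA instead:", missing)
```

Running the filter before a bulk download gives a list of accessions to fetch from ENA and a separate list to look up elsewhere.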

Common Issues

Search returns no results for known accessions. Check the result type — searching read_run won't find assembly accessions. Match the result type to your query: run accessions (SRR/ERR) use read_run, study accessions (PRJNA/ERP) use study.
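
One way to avoid a mismatched result type is to infer it from the accession prefix before querying. The sketch below is an assumption-laden illustration: `guess_result_type` is a hypothetical helper, and the patterns cover only common INSDC accession shapes (runs, studies and projects, samples, GCA assemblies, analyses), so unrecognized accessions return `None`.

```python
import re
from typing import Optional

# Patterns are assumptions based on common INSDC accession formats.
_ACCESSION_PATTERNS = [
    (re.compile(r"^[EDS]RR\d+$"), "read_run"),
    (re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d+)$"), "study"),
    (re.compile(r"^(SAM[EDN][A-Z]?\d+|[EDS]RS\d+)$"), "sample"),
    (re.compile(r"^GCA_\d+(\.\d+)?$"), "assembly"),
    (re.compile(r"^[EDS]RZ\d+$"), "analysis"),
]


def guess_result_type(accession: str) -> Optional[str]:
    """Return the ENA result type that likely matches an accession, or None."""
    normalized = accession.strip().upper()
    for pattern, result_type in _ACCESSION_PATTERNS:
        if pattern.match(normalized):
            return result_type
    return None


for acc in ["SRR1234567", "PRJNA12345", "ERP000001", "GCA_000001405.15"]:
    print(acc, "->", guess_result_type(acc))
```

The returned value can be used as the `result` parameter of a search request. Treat `None` as a signal to check the accession manually.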

FASTQ download links are broken. ENA periodically reorganizes its FTP structure. Use the API to get current FTP paths rather than constructing URLs manually. If FTP links fail, try the alternative submitted_ftp or sra_ftp fields.

Downloaded FASTQ files are corrupted. Verify file integrity with md5sum against the checksums in the fastq_md5 API field. Large downloads over FTP are susceptible to interruption. Use wget -c for resumable downloads or Aspera for reliable transfer.
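
The checksum comparison can also be done in Python with the standard library. This is a minimal sketch: `md5_matches` is a hypothetical helper, and it assumes you have already extracted the checksum for the file from the `fastq_md5` field (for paired-end runs that field is assumed to list one checksum per file, separated by semicolons in the same order as `fastq_ftp`).

```python
import hashlib
import os
import tempfile


def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's MD5 digest with an expected hex string.

    The file is read in chunks so large FASTQ files do not need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.strip().lower()


# Self-contained demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "sample.fastq")
    payload = b"@read1\nACGT\n+\nIIII\n"
    with open(sample_path, "wb") as handle:
        handle.write(payload)
    expected = hashlib.md5(payload).hexdigest()
    print("Checksum OK:", md5_matches(sample_path, expected))
```

A `False` result suggests an incomplete or corrupted transfer, and the file should be downloaded again.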
