ENA Database Studio

Battle-tested skill for accessing the European Nucleotide Archive. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for accessing the European Nucleotide Archive (ENA) — the comprehensive public repository for nucleotide sequence data maintained by EMBL-EBI. ENA Database Studio helps you search for sequencing experiments, retrieve raw reads, and download assembled sequences across all domains of life.

When to Use This Skill

Choose ENA Database Studio when:

  • Searching for raw sequencing data (FASTQ, BAM) by study or organism
  • Retrieving assembled genomes, transcriptomes, or metagenomes
  • Finding sequencing metadata (instrument, library, sample info)
  • Downloading bulk sequence data for reanalysis projects

Consider alternatives when:

  • You need US-centric data submission (use NCBI SRA)
  • You need protein sequences (use UniProt)
  • You need annotated genomes (use Ensembl)
  • You need published analysis results (use GEO or ArrayExpress)

Quick Start

    claude "Search ENA for SARS-CoV-2 whole genome sequencing data"

    import requests

    # ENA Portal API search endpoint
    base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
    params = {
        "result": "read_run",
        "query": "tax_tree(2697049) AND library_strategy=\"WGS\"",
        "fields": "run_accession,experiment_title,instrument_model,read_count,base_count,fastq_ftp",
        "format": "json",
        "limit": 10,
        "sortFields": "first_public",
        "sortOrder": "desc"
    }

    response = requests.get(base_url, params=params, timeout=60)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    runs = response.json()

    for run in runs:
        print(f"Accession: {run['run_accession']}")
        print(f"  Title: {run['experiment_title'][:60]}...")
        print(f"  Instrument: {run['instrument_model']}")
        # read_count can be an empty string for some records
        print(f"  Reads: {int(run['read_count'] or 0):,}")
        print(f"  FTP: {run['fastq_ftp'][:60]}...")

Core Concepts

ENA Data Types

Result Type | Description                | Key Fields
----------- | -------------------------- | ----------------------------
read_run    | Raw sequencing reads       | FASTQ, instrument, reads
study       | Research project/study     | Title, abstract, publication
sample      | Biological sample          | Organism, tissue, collection
assembly    | Genome assemblies          | Contigs, scaffolds, quality
sequence    | Annotated sequences        | Features, annotations
analysis    | Processed analysis results | Methods, derived data

Query Syntax

    # ENA search queries

    # By taxonomy (SARS-CoV-2 and descendants)
    query = "tax_tree(2697049)"

    # By organism name
    query = 'tax_name("Homo sapiens")'

    # By library strategy
    query = 'library_strategy="RNA-Seq"'

    # By instrument
    query = 'instrument_model="Illumina NovaSeq 6000"'

    # Combined queries
    query = (
        'tax_tree(9606) AND '
        'library_strategy="RNA-Seq" AND '
        'instrument_model="Illumina NovaSeq*" AND '
        'first_public>=2024-01-01'
    )
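Because the individual clauses are joined with AND, a small helper can assemble them from optional filters. This is a minimal sketch: `build_query` and its parameter names are hypothetical and not part of ENA's API or any client library. It only covers the four filters shown above.

```python
from typing import Optional


def build_query(
    taxon_id: Optional[int] = None,
    library_strategy: Optional[str] = None,
    instrument_model: Optional[str] = None,
    first_public_from: Optional[str] = None,
) -> str:
    """Join the supplied filters into one ENA query string with AND.

    Hypothetical helper for illustration; values are inserted as given,
    so wildcards such as "Illumina NovaSeq*" pass through unchanged.
    """
    clauses = []
    if taxon_id is not None:
        clauses.append(f"tax_tree({taxon_id})")
    if library_strategy:
        clauses.append(f'library_strategy="{library_strategy}"')
    if instrument_model:
        clauses.append(f'instrument_model="{instrument_model}"')
    if first_public_from:
        clauses.append(f"first_public>={first_public_from}")
    if not clauses:
        raise ValueError("at least one filter is required")
    return " AND ".join(clauses)


query = build_query(
    taxon_id=9606,
    library_strategy="RNA-Seq",
    first_public_from="2024-01-01",
)
print(query)
```

The resulting string can be passed as the `query` parameter of the search request shown in Quick Start.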

Bulk Download

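    # Download FASTQ files from ENA using the FTP links from API results

    # Single file (runs with a 7-digit number sit under an extra 00N directory,
    # where N is the last digit of the accession)
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz

    # Using enaBrowserTools for bulk download of a whole study
    pip install enaBrowserTools
    enaGroupGet -g read -f fastq PRJNA12345

    # Download by accession list
    # Note: this path pattern only holds for 9-character accessions (e.g. SRR123456);
    # for anything else, take the path from the fastq_ftp API field instead.
    while read -r acc; do
        wget -c "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/${acc}/${acc}_1.fastq.gz"
    done < accession_list.txt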
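Since hand-built paths depend on the accession length, it is usually safer to take the download locations from the `fastq_ftp` field of an API result. The sketch below assumes that field is a semicolon-separated list of host/path entries without a scheme (one entry per file of a paired-end run); `fastq_urls` is a hypothetical helper, and the example value is a placeholder, not a real record.

```python
from typing import List


def fastq_urls(fastq_ftp: str, scheme: str = "ftp") -> List[str]:
    """Split a fastq_ftp value into full URLs.

    Assumes semicolon-separated host/path entries with no scheme;
    empty entries (including an entirely empty field) are skipped.
    """
    return [
        f"{scheme}://{part.strip()}"
        for part in fastq_ftp.split(";")
        if part.strip()
    ]


# Placeholder value for illustration only (not a real ENA record).
example = (
    "ftp.sra.ebi.ac.uk/vol1/fastq/EXAMPLE_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/EXAMPLE_2.fastq.gz"
)
for url in fastq_urls(example):
    print(url)
```

Each printed URL can then be handed to `wget -c` or any other downloader.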

Configuration

Parameter       | Description                      | Default
--------------- | -------------------------------- | ----------------------
result_type     | ENA result type to search        | read_run
format          | Response format (json, tsv, xml) | json
limit           | Max results per query            | 0 (all)
fields          | Specific fields to return        | Type-specific defaults
download_format | Data format (fastq, submitted)   | fastq

Best Practices

  1. Use taxonomy tree searches for comprehensive results. tax_tree(9606) captures human data from all subspecies, while tax_eq(9606) matches exact taxonomy ID only. Tree searches ensure you don't miss relevant samples classified under subspecific taxa.

  2. Request only needed fields. The fields parameter lets you select specific metadata columns. This dramatically reduces response size for large result sets and speeds up parsing.

  3. Download via Aspera for large datasets. ENA supports Aspera (ascp) downloads, which can be several times faster than FTP/HTTP for large files. Install IBM Aspera Connect and use the Aspera paths returned by the API (the fastq_aspera field).

  4. Check data availability before large downloads. Some ENA records reference data stored at NCBI or DDBJ. Check the submitted_ftp and fastq_ftp fields — empty fields mean the FASTQ files aren't directly available from ENA and may need to be retrieved from SRA.

  5. Use study accessions for reproducibility. Reference data by study accession (PRJNA/ERP) rather than individual run accessions in publications. Study accessions group all related data and are stable identifiers for citation.
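Practice 4 can be applied to a list of search results before any download starts. The sketch below assumes each record is a dict as returned by the JSON search response and that `fastq_ftp` and `submitted_ftp` were requested in `fields`; `split_by_availability` is a hypothetical helper, and the sample records are placeholders.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_by_availability(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records with downloadable files from those without.

    A record counts as available when fastq_ftp or submitted_ftp is
    non-empty; missing keys are treated the same as empty values.
    """
    available: List[Record] = []
    missing: List[Record] = []
    for record in records:
        has_files = bool(record.get("fastq_ftp") or record.get("submitted_ftp"))
        (available if has_files else missing).append(record)
    return available, missing


# Placeholder records for illustration only.
records = [
    {"run_accession": "RUN_A", "fastq_ftp": "host/path/RUN_A.fastq.gz", "submitted_ftp": ""},
    {"run_accession": "RUN_B", "fastq_ftp": "", "submitted_ftp": ""},
]
available, missing = split_by_availability(records)
print("downloadable:", [r["run_accession"] for r in available])
print("check SRA:", [r["run_accession"] for r in missing])
```

Runs that land in the second list are the ones that may need to be retrieved from SRA instead.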

Common Issues

Search returns no results for known accessions. Check the result type — searching read_run won't find assembly accessions. Match the result type to your query: run accessions (SRR/ERR) use read_run, study accessions (PRJNA/ERP) use study.
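One way to avoid this mismatch is to choose the result type from the accession prefix before searching. The following is a rough sketch: `guess_result_type` is a hypothetical helper, and the patterns cover only the common run, study, sample, and assembly accession shapes, so anything unrecognized returns `None` rather than a guess.

```python
import re
from typing import Optional

# Prefix patterns for common accession shapes (not exhaustive).
_PATTERNS = [
    (re.compile(r"^[EDS]RR\d+$"), "read_run"),          # run accessions
    (re.compile(r"^PRJ[EDN][A-Z]\d+$"), "study"),       # project accessions
    (re.compile(r"^[EDS]RP\d+$"), "study"),             # secondary study accessions
    (re.compile(r"^SAM[EDN][A-Z]?\d+$"), "sample"),     # BioSample accessions
    (re.compile(r"^[EDS]RS\d+$"), "sample"),            # secondary sample accessions
    (re.compile(r"^GCA_\d+(\.\d+)?$"), "assembly"),     # assembly accessions
]


def guess_result_type(accession: str) -> Optional[str]:
    """Return the ENA result type that matches an accession, or None."""
    candidate = accession.strip().upper()
    for pattern, result_type in _PATTERNS:
        if pattern.match(candidate):
            return result_type
    return None


for acc in ["SRR1234567", "PRJNA12345", "ERP000001", "GCA_000001405.15"]:
    print(acc, "->", guess_result_type(acc))
```

The returned value can be used as the `result` parameter of the search request.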

FASTQ download links are broken. ENA periodically reorganizes its FTP structure. Use the API to get current FTP paths rather than constructing URLs manually. If FTP links fail, try the alternative submitted_ftp or sra_ftp fields.

Downloaded FASTQ files are corrupted. Verify file integrity with md5sum against the checksums in the fastq_md5 API field. Large downloads over FTP are susceptible to interruption. Use wget -c for resumable downloads or Aspera for reliable transfer.
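The same integrity check can be done from Python with the standard library. This is a minimal sketch: `md5_of_file` and `is_intact` are hypothetical helper names, and the expected checksum is assumed to come from the `fastq_md5` field, which for paired-end runs is assumed to list one checksum per file in the same order as `fastq_ftp`.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Union[str, Path], expected_md5: str) -> bool:
    """Compare a file's MD5 with the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that fails the check is best re-fetched with `wget -c` or re-downloaded from scratch before any analysis.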
