# Biopython Toolkit
A comprehensive skill for computational molecular biology using Biopython — the open-source Python library providing tools for sequence analysis, structure parsing, phylogenetics, motif analysis, and programmatic access to biological databases like NCBI, UniProt, and PDB.
## When to Use This Skill
Choose Biopython Toolkit when:
- Parsing and analyzing DNA, RNA, or protein sequences
- Querying NCBI databases (GenBank, PubMed, BLAST) programmatically
- Working with protein structures from PDB files
- Performing sequence alignment, translation, or motif analysis
Consider alternatives when:
- You need high-throughput sequencing pipelines (use Nextflow/Snakemake)
- You need single-cell analysis (use Scanpy/AnnData)
- You're doing machine learning on sequences (use PyTorch with ESM or similar)
- You need structural modeling (use PyMOL, Rosetta, or AlphaFold)
## Quick Start

```shell
claude "Parse a FASTA file and find ORFs in DNA sequences"
```
```python
from Bio import SeqIO

# Parse FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f">{record.id} | Length: {len(record.seq)} bp")

    # Find ORFs (Open Reading Frames) on both strands, all three frames
    for strand, seq in [("+", record.seq), ("-", record.seq.reverse_complement())]:
        for frame in range(3):
            # Trim to a multiple of 3 to avoid a partial-codon warning
            length = 3 * ((len(seq) - frame) // 3)
            protein = seq[frame:frame + length].translate()
            for orf in str(protein).split("*"):
                start = orf.find("M")
                if start != -1 and len(orf) - start >= 100:  # Min 100 aa
                    print(f"  ORF: {strand} strand, frame {frame + 1}, "
                          f"{len(orf) - start} aa starting at M")
```
## Core Concepts

### Biopython Modules
| Module | Purpose | Key Functions |
|---|---|---|
| `Bio.Seq` | Sequence manipulation | `Seq()`, `translate()`, `complement()` |
| `Bio.SeqIO` | Sequence file I/O | `parse()`, `read()`, `write()` |
| `Bio.Entrez` | NCBI database access | `esearch()`, `efetch()`, `einfo()` |
| `Bio.Blast` | BLAST searches | `qblast()`, `NCBIXML.parse()` |
| `Bio.PDB` | Protein structure parsing | `PDBParser()`, `MMCIFParser()` |
| `Bio.Align` | Sequence alignment | `PairwiseAligner()` |
| `Bio.Phylo` | Phylogenetic trees | `read()`, `draw()` |
| `Bio.motifs` | Sequence motifs | `create()`, `parse()` |
### NCBI Database Access

```python
from Bio import Entrez, SeqIO

Entrez.email = "[email protected]"  # Required by NCBI

# Search GenBank
handle = Entrez.esearch(db="nucleotide", term="BRCA1 Homo sapiens", retmax=5)
results = Entrez.read(handle)
print(f"Found {results['Count']} records")

# Fetch sequences
for accession in results["IdList"]:
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    print(f"{record.id}: {record.description[:60]}...")
    for feature in record.features:
        if feature.type == "CDS":
            print(f"  CDS: {feature.location}")
```
### Sequence Alignment

```python
from Bio.Align import PairwiseAligner
from Bio.Seq import Seq

# Modern pairwise alignment (Bio.pairwise2 is deprecated)
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5
aligner.extend_gap_score = -0.5

seq1 = Seq("ATGCGATCGATCGATCG")
seq2 = Seq("ATGCAATCAATCGATCG")

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(f"Score: {best.score}")
print(best)
```
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `Entrez.email` | Email for NCBI API | Required |
| `Entrez.api_key` | NCBI API key for higher rate limit | None |
| `sequence_format` | Default file format | `fasta` |
| `alignment_mode` | Global or local alignment | `global` |
| `blast_database` | Default BLAST database | `nr` |
## Best Practices

- **Always set `Entrez.email` before NCBI queries.** NCBI requires identification for API usage. Set your email, and optionally an API key (register at NCBI), to raise the rate limit from 3 to 10 requests per second.
- **Use `SeqIO.parse()` for multiple records, `SeqIO.read()` for a single record.** `parse()` returns an iterator (memory efficient for large files), while `read()` expects exactly one record and raises an error if the file contains more.
- **Handle large files with iterators.** When processing genome-scale FASTA files, use `SeqIO.parse()` in a loop rather than loading all records into memory. For indexed access to specific records, use `SeqIO.index()`, which builds a dictionary-like interface backed by the file.
- **Convert between formats with `SeqIO`.** Biopython supports dozens of sequence formats. Convert with `SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")`. Check the supported formats in the documentation before attempting a conversion.
- **Use `Bio.PDB` with structure quality checks.** When parsing PDB files, check for missing residues, alternative conformations, and disordered atoms. Use `get_unpacked_list()` on an entity (e.g. a residue) to see all of its children, including disordered ones with multiple conformations.
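The indexed-access pattern above can be sketched with a throwaway file (the record names and sequences are illustrative):

```python
import os
import tempfile

from Bio import SeqIO

# A small two-record FASTA written to a temp file for the demo
fasta = ">seq1\nATGAAATAG\n>seq2\nATGCCCGGGTAA\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as fh:
    fh.write(fasta)
    path = fh.name

# SeqIO.index gives dict-like random access without loading every record
idx = SeqIO.index(path, "fasta")
record = idx["seq2"]
print(record.id, len(record.seq))

idx.close()
os.remove(path)
```

Unlike `SeqIO.to_dict()`, the index keeps records on disk and only parses the one you ask for, so it scales to genome-sized files.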
## Common Issues
**`Entrez.efetch` returns empty or malformed results.** NCBI servers can be overloaded. Implement retry logic with 3-second delays between attempts. Also verify the database name and ID format: GenBank IDs and accession numbers are not interchangeable in all endpoints.
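A minimal retry sketch for such transient failures (the helper name, the retried status codes, and the delay handling are assumptions, not part of Biopython):

```python
import time
from urllib.error import HTTPError

def fetch_with_retry(fetch, retries=3, delay=3.0):
    """Call a zero-argument fetch function, retrying transient HTTP errors."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except HTTPError as err:
            # Give up on the last attempt or on non-transient errors
            if attempt == retries or err.code not in (429, 500, 502, 503):
                raise
            time.sleep(delay)
```

Used as, e.g., `handle = fetch_with_retry(lambda: Entrez.efetch(db="nucleotide", id=acc, rettype="gb", retmode="text"))`.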
**Sequence translation produces unexpected stop codons.** Check the reading frame and the genetic code table. Mitochondrial, bacterial, and nuclear genomes use different codon tables. Specify with `seq.translate(table="Bacterial")` or the appropriate NCBI table number.
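The effect of the table argument shows up on even a short sequence (the sequence is illustrative; table 2 is the NCBI vertebrate mitochondrial code):

```python
from Bio.Seq import Seq

seq = Seq("ATGATTTGA")

# Standard nuclear code (table 1): TGA is a stop codon
print(seq.translate())         # MI*

# Vertebrate mitochondrial code (table 2): TGA encodes Trp
print(seq.translate(table=2))  # MIW
```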
**PDB parser fails on large or non-standard files.** Some PDB files have formatting irregularities. Try `PDBParser(QUIET=True)` to suppress warnings, or switch to `MMCIFParser()` for the mmCIF format, which is more standardized. For very large structures, consider using only the atoms you need by filtering during parsing.
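One common pattern is to try the legacy PDB parser first and fall back to mmCIF; the helper below is a sketch of that pattern, not a Biopython built-in (the function name is illustrative):

```python
from Bio.PDB import MMCIFParser, PDBParser

def load_structure(struct_id, source):
    """Parse a structure from a path or open handle, preferring legacy
    PDB format and falling back to mmCIF if parsing fails."""
    try:
        return PDBParser(QUIET=True).get_structure(struct_id, source)
    except Exception:
        return MMCIFParser(QUIET=True).get_structure(struct_id, source)
```

Note that an in-memory handle can only be read once, so pass a file path whenever the fallback might actually trigger.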