Biopython Toolkit

A comprehensive skill for computational molecular biology using Biopython — the open-source Python library providing tools for sequence analysis, structure parsing, phylogenetics, motif analysis, and programmatic access to biological databases like NCBI, UniProt, and PDB.

When to Use This Skill

Choose Biopython Toolkit when:

Parsing and analyzing DNA, RNA, or protein sequences
Querying NCBI databases (GenBank, PubMed, BLAST) programmatically
Working with protein structures from PDB files
Performing sequence alignment, translation, or motif analysis

Consider alternatives when:

You need high-throughput sequencing pipelines (use Nextflow/Snakemake)
You need single-cell analysis (use Scanpy/AnnData)
You're doing machine learning on sequences (use PyTorch with ESM or similar)
You need structural modeling (use PyMOL, Rosetta, or AlphaFold)

Quick Start


claude "Parse a FASTA file and find ORFs in DNA sequences"


from Bio import SeqIO

MIN_ORF_LENGTH = 100  # minimum ORF length in amino acids

# Parse FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f">{record.id} | Length: {len(record.seq)} bp")

    # Find ORFs (Open Reading Frames) on both strands, in all three frames
    for strand, seq in [("+", record.seq), ("-", record.seq.reverse_complement())]:
        for frame in range(3):
            # Trim to whole codons so translate() never sees a partial codon
            coding = seq[frame:]
            coding = coding[: len(coding) - len(coding) % 3]
            # "*" marks a stop codon, so splitting yields stop-free segments
            for segment in str(coding.translate()).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= MIN_ORF_LENGTH:
                    print(f"  ORF: {strand} frame {frame + 1}, "
                          f"{len(segment) - start} aa starting at M")

Core Concepts

Biopython Modules

Module	Purpose	Key Functions
`Bio.Seq`	Sequence manipulation	`Seq()`, `translate()`, `complement()`
`Bio.SeqIO`	Sequence file I/O	`parse()`, `read()`, `write()`
`Bio.Entrez`	NCBI database access	`esearch()`, `efetch()`, `einfo()`
`Bio.Blast`	BLAST searches	`qblast()`, `NCBIXML.parse()`
`Bio.PDB`	Protein structure parsing	`PDBParser()`, `MMCIFParser()`
`Bio.Align`	Sequence alignment	`PairwiseAligner()`
`Bio.Phylo`	Phylogenetic trees	`read()`, `draw()`
`Bio.motifs`	Sequence motifs	`create()`, `parse()`

NCBI Database Access


from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # Placeholder; NCBI requires your real contact address

# Search GenBank
handle = Entrez.esearch(db="nucleotide", term="BRCA1 Homo sapiens", retmax=5)
results = Entrez.read(handle)
handle.close()
print(f"Found {results['Count']} records")

# Fetch sequences (IdList holds NCBI UIDs, not accession numbers)
for uid in results["IdList"]:
    handle = Entrez.efetch(db="nucleotide", id=uid,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print(f"{record.id}: {record.description[:60]}...")
    for feature in record.features:
        if feature.type == "CDS":
            print(f"  CDS: {feature.location}")

Sequence Alignment


from Bio.Align import PairwiseAligner
from Bio.Seq import Seq

# Modern pairwise alignment
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5
aligner.extend_gap_score = -0.5

seq1 = Seq("ATGCGATCGATCGATCG")
seq2 = Seq("ATGCAATCAATCGATCG")

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(f"Score: {best.score}")
print(best)

Configuration

Parameter	Description	Default
`Entrez.email`	Email for NCBI API (required)	Required
`Entrez.api_key`	NCBI API key for higher rate limit	None
`sequence_format`	Default file format	`fasta`
`alignment_mode`	Global or local alignment	`global`
`blast_database`	Default BLAST database	`nr`
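
As a minimal sketch of the two Entrez settings in the table, the snippet below sets them from environment variables. The helper name `configure_entrez` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for illustration; only `Entrez.email` and `Entrez.api_key` are Biopython attributes.

```python
import os

from Bio import Entrez


def configure_entrez(email, api_key=None):
    """Set the module-level identification used by every Entrez call."""
    if not email:
        raise ValueError("NCBI requires a contact email address")
    Entrez.email = email
    if api_key:
        # Optional: an API key raises the allowed request rate.
        Entrez.api_key = api_key


# NCBI_EMAIL and NCBI_API_KEY are hypothetical environment variable names.
configure_entrez(
    email=os.environ.get("NCBI_EMAIL", "you@example.com"),
    api_key=os.environ.get("NCBI_API_KEY"),
)
```

Keeping the key in the environment rather than in source code avoids committing it to version control.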

Best Practices

Always set Entrez.email before NCBI queries. NCBI requires identification for API usage. Set your email and optionally an API key (register at NCBI) to increase the rate limit from 3 to 10 requests per second.
Use SeqIO.parse() for multiple records, SeqIO.read() for a single record. parse() returns an iterator (memory efficient for large files), while read() expects exactly one record and raises ValueError if the file contains none or more than one.
Handle large files with iterators. When processing genome-scale FASTA files, use SeqIO.parse() in a loop rather than loading all records into memory. For indexed access to specific records, use SeqIO.index(), which builds a read-only, dictionary-like interface backed by the file.
Convert between formats with SeqIO. Biopython supports dozens of sequence formats. Convert with: SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta"). Check supported formats in the documentation before attempting conversion.
Use Bio.PDB with structure quality checks. When parsing PDB files, check for missing residues, alternative conformations, and disordered atoms. Call get_unpacked_list() on a residue to see all of its atoms, including every alternative location of a disordered atom.
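
The SeqIO practices above can be sketched on a tiny in-memory example. The two records, the variable names, and the temporary file name are made up for illustration; the `"tab"` output format is just one convenient conversion target.

```python
import os
import tempfile
from io import StringIO

from Bio import SeqIO

# Two tiny made-up records, used only for illustration.
FASTA_TEXT = ">seq1\nATGCGT\n>seq2\nGGGTTTAAA\n"

# parse(): lazily iterate over any number of records.
lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse(StringIO(FASTA_TEXT), "fasta")}

# read(): insists on exactly one record and raises ValueError otherwise.
try:
    SeqIO.read(StringIO(FASTA_TEXT), "fasta")
    read_failed = False
except ValueError:
    read_failed = True

# convert(): change formats in one call; returns the number of records written.
tab_output = StringIO()
converted = SeqIO.convert(StringIO(FASTA_TEXT), "fasta", tab_output, "tab")

# index(): dictionary-like random access backed by a file on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "demo.fasta")
    with open(fasta_path, "w") as handle:
        handle.write(FASTA_TEXT)
    index = SeqIO.index(fasta_path, "fasta")
    seq2 = str(index["seq2"].seq)
    index.close()

print(lengths, read_failed, converted, seq2)
```

Because `SeqIO.index()` needs a real file name, the sketch writes the records to a temporary directory first; with a genome-scale file you would pass its path directly.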

Common Issues

Entrez.efetch returns empty or malformed results. NCBI servers can be overloaded. Implement retry logic with 3-second delays between retries. Also verify the database name and ID format: numeric UIDs and accession numbers are not interchangeable in all endpoints.
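
One way to sketch the retry logic described above is a small wrapper around any zero-argument fetch callable. The helper name `fetch_with_retry` and its parameters are hypothetical, and catching only `OSError` (the base class of the `urllib` network errors) is an assumption about which failures are worth retrying.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=3.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as err:  # urllib's URLError and HTTPError subclass OSError
            last_error = err
            if attempt < attempts:
                sleep(delay)  # wait before the next try
    raise last_error


# Example use (needs network access, so it is left commented out):
# handle = fetch_with_retry(
#     lambda: Entrez.efetch(db="nucleotide", id=uid, rettype="gb", retmode="text")
# )
```

The `sleep` parameter exists only so the delay can be replaced in tests. Recent Biopython releases also appear to retry failed requests internally (see `Entrez.max_tries` and `Entrez.sleep_between_tries`), so check your installed version before layering a wrapper on top.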

Sequence translation produces unexpected stop codons. Check that you are translating in the correct reading frame and with the correct genetic code table. Mitochondrial, bacterial, and nuclear genomes use different codon tables. Specify one with seq.translate(table="Bacterial") or the appropriate NCBI table number.
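
To illustrate how the table choice changes the result, the sketch below translates one short made-up sequence with three codon tables. The helper `translate_frame` is hypothetical; it only trims the sequence to whole codons before calling `Seq.translate()`.

```python
from Bio.Seq import Seq


def translate_frame(seq, frame=0, table=1, to_stop=False):
    """Translate seq starting at the given frame offset (0, 1, or 2)."""
    sub = seq[frame:]
    sub = sub[: len(sub) - len(sub) % 3]  # drop any trailing partial codon
    return sub.translate(table=table, to_stop=to_stop)


dna = Seq("ATGTGATAA")  # made-up example: ATG TGA TAA

standard = translate_frame(dna)                      # NCBI table 1
mito = translate_frame(dna, table=2)                 # vertebrate mitochondrial
bacterial = translate_frame(dna, table="Bacterial")  # NCBI table 11

print(standard)   # M**
print(mito)       # MW*
print(bacterial)  # M**
```

TGA is a stop codon in the standard and bacterial tables but encodes tryptophan in the vertebrate mitochondrial table, which is why the second result differs.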

PDB parser fails on large or non-standard files. Some PDB files have formatting irregularities. Try PDBParser(QUIET=True) to suppress warnings, or switch to MMCIFParser() for the mmCIF format, which is more standardized and is not bound by the fixed-column limits of the PDB format. For very large structures, consider working with only the atoms you need.
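
A minimal sketch of the parser choice and a basic disorder check follows. The helper names `load_structure` and `summarize_disorder` are hypothetical, and choosing the parser from the file extension is an assumed convention rather than anything Bio.PDB enforces.

```python
from pathlib import Path

from Bio.PDB import MMCIFParser, PDBParser


def load_structure(path, structure_id="model"):
    """Parse a structure file, choosing the parser from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".cif", ".mmcif"}:
        parser = MMCIFParser(QUIET=True)
    else:
        parser = PDBParser(QUIET=True)
    return parser.get_structure(structure_id, str(path))


def summarize_disorder(structure):
    """Count atoms and how many of them have alternative locations."""
    atoms = list(structure.get_atoms())
    disordered = [atom for atom in atoms if atom.is_disordered()]
    return {"atoms": len(atoms), "disordered_atoms": len(disordered)}


# Example use with a local file of your own:
# summary = summarize_disorder(load_structure("my_structure.pdb"))
```

An atom that has several alternative locations is counted once here, because the parser groups its alternatives into a single disordered-atom entry.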
