Biopython Toolkit

A skill for molecular biology work with the Biopython toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A comprehensive skill for computational molecular biology using Biopython — the open-source Python library providing tools for sequence analysis, structure parsing, phylogenetics, motif analysis, and programmatic access to biological databases like NCBI, UniProt, and PDB.

When to Use This Skill

Choose Biopython Toolkit when:

  • Parsing and analyzing DNA, RNA, or protein sequences
  • Querying NCBI databases (GenBank, PubMed, BLAST) programmatically
  • Working with protein structures from PDB files
  • Performing sequence alignment, translation, or motif analysis

Consider alternatives when:

  • You need high-throughput sequencing pipelines (use Nextflow/Snakemake)
  • You need single-cell analysis (use Scanpy/AnnData)
  • You're doing machine learning on sequences (use PyTorch with ESM or similar)
  • You need structural modeling (use PyMOL, Rosetta, or AlphaFold)

Quick Start

claude "Parse a FASTA file and find ORFs in DNA sequences"

    from Bio import SeqIO

    MIN_ORF_LENGTH = 100  # minimum ORF length in amino acids

    # Parse the FASTA file one record at a time
    for record in SeqIO.parse("sequences.fasta", "fasta"):
        print(f">{record.id} | Length: {len(record.seq)} bp")

        # Scan all six reading frames (three per strand) for ORFs
        for strand, seq in [("+", record.seq), ("-", record.seq.reverse_complement())]:
            for frame in range(3):
                # Trim to a whole number of codons to avoid a partial-codon warning
                end = len(seq) - (len(seq) - frame) % 3
                protein = seq[frame:end].translate()
                # Each stop-delimited segment is a candidate ORF
                for segment in str(protein).split("*"):
                    start = segment.find("M")
                    if start != -1 and len(segment) - start >= MIN_ORF_LENGTH:
                        print(f"  ORF: {strand} frame {frame + 1}, "
                              f"{len(segment) - start} aa starting at M")

Core Concepts

Biopython Modules

| Module | Purpose | Key Functions |
|---|---|---|
| Bio.Seq | Sequence manipulation | Seq(), translate(), complement() |
| Bio.SeqIO | Sequence file I/O | parse(), read(), write() |
| Bio.Entrez | NCBI database access | esearch(), efetch(), einfo() |
| Bio.Blast | BLAST searches | qblast(), NCBIXML.parse() |
| Bio.PDB | Protein structure parsing | PDBParser(), MMCIFParser() |
| Bio.Align | Sequence alignment | PairwiseAligner() |
| Bio.Phylo | Phylogenetic trees | read(), draw() |
| Bio.motifs | Sequence motifs | create(), parse() |
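
As a small illustration of the Bio.Seq row above, the following sketch runs the basic sequence operations on a made-up six-base sequence. The sequence itself is an assumption chosen for readability, not data from a real gene.

```python
from Bio.Seq import Seq

# Hypothetical toy sequence: a start codon followed by one alanine codon
dna = Seq("ATGGCC")

complement = str(dna.complement())                   # base-by-base complement
reverse_complement = str(dna.reverse_complement())   # opposite strand, read 5' to 3'
protein = str(dna.translate())                       # standard genetic code

print(complement)          # TACCGG
print(reverse_complement)  # GGCCAT
print(protein)             # MA
```

Seq objects behave much like Python strings (slicing, len(), concatenation), which is why the Quick Start code can slice a sequence by reading frame before translating it.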

NCBI Database Access

    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.com"  # Required by NCBI; replace with your own address

    # Search the nucleotide database (GenBank)
    handle = Entrez.esearch(db="nucleotide", term="BRCA1 Homo sapiens", retmax=5)
    results = Entrez.read(handle)
    handle.close()
    print(f"Found {results['Count']} records")

    # Fetch each matching record in GenBank format
    for uid in results["IdList"]:
        handle = Entrez.efetch(db="nucleotide", id=uid, rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        print(f"{record.id}: {record.description[:60]}...")
        # List the coding sequences annotated on the record
        for feature in record.features:
            if feature.type == "CDS":
                print(f"  CDS: {feature.location}")

Sequence Alignment

    from Bio.Align import PairwiseAligner
    from Bio.Seq import Seq

    # Local pairwise alignment with affine gap penalties
    aligner = PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2
    aligner.mismatch_score = -1
    aligner.open_gap_score = -5
    aligner.extend_gap_score = -0.5

    seq1 = Seq("ATGCGATCGATCGATCG")
    seq2 = Seq("ATGCAATCAATCGATCG")

    alignments = aligner.align(seq1, seq2)
    best = alignments[0]  # one of the optimal (highest-scoring) alignments
    print(f"Score: {best.score}")
    print(best)

Configuration

| Parameter | Description | Default |
|---|---|---|
| Entrez.email | Email for NCBI API (required) | None (must be set) |
| Entrez.api_key | NCBI API key for higher rate limit | None |
| sequence_format | Default file format | fasta |
| alignment_mode | Global or local alignment | global |
| blast_database | Default BLAST database | nr |
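
One way to apply the two Entrez settings from the table without hard-coding credentials is to read them from a mapping such as the process environment. This is a minimal sketch: the function name `configure_entrez` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for illustration, not names defined by Biopython or NCBI.

```python
import os

from Bio import Entrez


def configure_entrez(env=os.environ):
    """Set Entrez.email and Entrez.api_key from a mapping of settings.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical variable names.
    Returns the documented request limit per second: 3 without a key, 10 with one.
    """
    email = env.get("NCBI_EMAIL")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL before querying NCBI")
    Entrez.email = email
    Entrez.api_key = env.get("NCBI_API_KEY")  # stays None when no key is provided
    return 10 if Entrez.api_key else 3


# Example with an explicit mapping instead of the real environment
limit = configure_entrez({"NCBI_EMAIL": "your.name@example.com"})
print(limit)
```

Calling `configure_entrez()` with no argument reads the real environment, which keeps the address and key out of version control.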

Best Practices

  1. Always set Entrez.email before NCBI queries. NCBI requires identification for API usage. Set your email and optionally an API key (register at NCBI) to increase the rate limit from 3 to 10 requests per second.

  2. Use SeqIO.parse() for multiple records, SeqIO.read() for single. parse() returns an iterator (memory efficient for large files), while read() expects exactly one record and raises a ValueError if the file contains none or more than one.

  3. Handle large files with iterators. When processing genome-scale FASTA files, use SeqIO.parse() in a loop rather than loading all records into memory. For indexed access to specific records, use SeqIO.index() which builds a dictionary-like interface backed by the file.

  4. Convert between formats with SeqIO. Biopython supports dozens of sequence formats. Convert with: SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta"). Check supported formats in the documentation before attempting conversion.

  5. Use Bio.PDB with structure quality checks. When parsing PDB files, check for missing residues, alternative conformations, and disordered atoms. Use residue.get_unpacked_list() to see all atoms of a residue, including those with multiple conformations.
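
The SeqIO practices above (streaming with parse(), keyed access with index(), and format conversion with convert()) can be sketched together on a small temporary file. The three records and the file names are made up for the demonstration, and the conversion target is the tab-separated "tab" format rather than GenBank, only because a FASTA file is easy to create inline.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical three-record FASTA file used only for this demonstration
FASTA_TEXT = ">seq1\nATGCATGC\n>seq2\nGGCCGGCC\n>seq3\nTTAATTAA\n"


def demo_seqio(workdir):
    fasta_path = os.path.join(workdir, "example.fasta")
    tab_path = os.path.join(workdir, "example.tab")
    with open(fasta_path, "w") as handle:
        handle.write(FASTA_TEXT)

    # parse(): stream records one at a time instead of loading them all
    total_bases = sum(len(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta"))

    # index(): dictionary-like access by record ID, backed by the file
    index = SeqIO.index(fasta_path, "fasta")
    try:
        picked = str(index["seq2"].seq)
    finally:
        index.close()

    # convert(): returns the number of records written
    converted = SeqIO.convert(fasta_path, "fasta", tab_path, "tab")
    return total_bases, picked, converted


with tempfile.TemporaryDirectory() as tmp:
    total_bases, picked, converted = demo_seqio(tmp)

print(total_bases, picked, converted)
```

For genome-scale files the same pattern applies; only the index() step changes character, since it scans the file once to record offsets and then reads individual records on demand.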

Common Issues

Entrez.efetch returns empty or malformed results. NCBI servers can be overloaded. Implement retry logic with 3-second delays between retries. Also verify the database name and ID format — GenBank IDs and accession numbers are not interchangeable in all endpoints.
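
The retry advice can be sketched as a small wrapper around any callable that performs the request. The helper name `fetch_with_retries` and its parameters are hypothetical; the 3-second delay follows the text above, and catching OSError is an assumption that covers the urllib network errors (URLError, HTTPError) typically raised by failed requests.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=3.0, retry_on=(OSError,), sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)


# Demonstration with a stand-in that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return "ok"


waits = []
result = fetch_with_retries(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

With Biopython, the callable would wrap the real request, for example a lambda that calls Entrez.efetch with the arguments from the NCBI Database Access example.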

Sequence translation produces unexpected stop codons. Check that you are translating in the correct reading frame and with the correct genetic code table. Mitochondrial, bacterial, and nuclear genomes use different codon tables. Specify with seq.translate(table="Bacterial") or the appropriate NCBI table number.
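
To make the effect of the codon table concrete, the sketch below translates one made-up nine-base sequence with two NCBI tables. TGA is a stop codon in the standard code (table 1) but encodes tryptophan in the vertebrate mitochondrial code (table 2). The sequence is an assumption chosen to show the difference.

```python
from Bio.Seq import Seq

# Hypothetical sequence: ATG (Met), TGA (stop or Trp, depending on table), TAA (stop)
dna = Seq("ATGTGATAA")

standard = str(dna.translate(table=1))        # NCBI table 1: Standard
mitochondrial = str(dna.translate(table=2))   # NCBI table 2: Vertebrate Mitochondrial
until_stop = str(dna.translate(table=1, to_stop=True))  # stop at the first stop codon

print(standard)       # M**
print(mitochondrial)  # MW*
print(until_stop)     # M
```

If the premature stops persist across tables, the reading frame is the more likely culprit; translating dna[1:] and dna[2:] (trimmed to whole codons) checks the other two frames.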

PDB parser fails on large or non-standard files. Some PDB files have formatting irregularities. Try PDBParser(QUIET=True) to suppress warnings, or switch to MMCIFParser() for mmCIF format which is more standardized. For very large structures, consider using only the atoms you need by filtering during parsing.
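
A minimal sketch of the quiet parsing and disorder checks mentioned in this document, using a tiny hand-built structure so it runs without downloading a file. The single alanine residue, its coordinates, and the helper `atom_line` are all made up for illustration; the CA atom is given two alternate locations (A and B) so the parser treats it as disordered.

```python
from io import StringIO

from Bio.PDB import PDBParser


def atom_line(serial, name, altloc, x, y, z, occupancy, element):
    """Build one fixed-column PDB ATOM record for residue ALA 1 of chain A."""
    return (f"ATOM  {serial:5d} {name:<4}{altloc}ALA A   1    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{20.0:6.2f}          {element:>2}")


# Hypothetical structure: one backbone N plus a CA with two conformations
PDB_TEXT = "\n".join([
    atom_line(1, " N  ", " ", 0.000, 0.000, 0.000, 1.00, "N"),
    atom_line(2, " CA ", "A", 1.458, 0.000, 0.000, 0.60, "C"),
    atom_line(3, " CA ", "B", 1.400, 0.300, 0.000, 0.40, "C"),
    "END",
]) + "\n"

parser = PDBParser(QUIET=True)  # suppress construction warnings
structure = parser.get_structure("demo", StringIO(PDB_TEXT))

residue = next(structure.get_residues())
atom_count = len(list(residue))            # disordered atoms count once
unpacked = residue.get_unpacked_list()     # every alternate conformation listed
disordered_names = [atom.get_id() for atom in residue if atom.is_disordered()]

print(atom_count, len(unpacked), disordered_names)
```

On a real file, the same loop over structure.get_residues() gives a quick inventory of which residues carry alternate conformations before any distance or geometry calculation.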
