Pysam Elite
Read, manipulate, and analyze genomic data files using Pysam, a Python wrapper for htslib. This skill covers SAM/BAM/CRAM alignment processing, VCF/BCF variant file handling, FASTA/FASTQ sequence access, and building efficient genomics analysis pipelines.
When to Use This Skill
Choose Pysam Elite when you need to:
- Read and filter aligned sequencing reads from BAM/CRAM files
- Extract coverage, pileup, and variant information from alignments
- Process VCF/BCF variant files with efficient indexed access
- Build custom genomics analysis tools with low-level file access
Consider alternatives when:
- You need a high-level analysis framework (use SpikeInterface for neural recordings, or scanpy for single-cell data)
- You need variant calling from raw data (use GATK or DeepVariant)
- You need genome assembly (use SPAdes or Hifiasm)
Quick Start
```shell
pip install pysam
```

```python
import pysam

# Open a BAM file
bamfile = pysam.AlignmentFile("sample.bam", "rb")

# Iterate over aligned reads in a region
for read in bamfile.fetch("chr1", 10000, 20000):
    print(f"Read: {read.query_name}, "
          f"Pos: {read.reference_start}, "
          f"MAPQ: {read.mapping_quality}, "
          f"Cigar: {read.cigarstring}")

# Get basic statistics
print(f"Total reads: {bamfile.mapped + bamfile.unmapped}")
print(f"Mapped: {bamfile.mapped}")
print(f"References: {bamfile.nreferences}")

bamfile.close()
```
Core Concepts
File Types and Classes
| Class | File Type | Purpose |
|---|---|---|
| `AlignmentFile` | SAM/BAM/CRAM | Read alignments |
| `VariantFile` | VCF/BCF | Variant records |
| `FastaFile` | FASTA | Reference sequences |
| `FastxFile` | FASTQ | Raw sequencing reads |
| `TabixFile` | Tabix-indexed | Generic indexed tab-delimited files |
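To make the class-to-file-type mapping concrete, here is a small helper that picks the right pysam class name and open mode from a file extension. This is a sketch: `pysam_opener` and the extension table are our own convention for illustration, not part of pysam's API.

```python
import os

# Extension -> (pysam class name, open mode). Modes follow pysam's
# convention: "r" for text SAM/VCF, "rb" for binary BAM/BCF, "rc" for CRAM.
# FastaFile and FastxFile take no mode argument, shown here as None.
EXTENSION_MAP = {
    ".sam": ("AlignmentFile", "r"),
    ".bam": ("AlignmentFile", "rb"),
    ".cram": ("AlignmentFile", "rc"),
    ".vcf": ("VariantFile", "r"),
    ".bcf": ("VariantFile", "rb"),
    ".fa": ("FastaFile", None),
    ".fasta": ("FastaFile", None),
    ".fastq": ("FastxFile", None),
}

def pysam_opener(path):
    """Return (class_name, mode) appropriate for a genomics file path."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTENSION_MAP:
        raise ValueError(f"Unrecognized genomics extension: {ext}")
    return EXTENSION_MAP[ext]

print(pysam_opener("sample.bam"))    # ('AlignmentFile', 'rb')
print(pysam_opener("variants.vcf"))  # ('VariantFile', 'r')
```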
Read Filtering and Analysis
```python
import pysam
import numpy as np

def analyze_region(bam_path, chrom, start, end):
    """Comprehensive analysis of reads in a genomic region."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")

    stats = {
        "total_reads": 0,
        "mapped": 0,
        "proper_pairs": 0,
        "duplicates": 0,
        "mapq_values": [],
        "insert_sizes": [],
        "base_qualities": [],
    }

    for read in bamfile.fetch(chrom, start, end):
        stats["total_reads"] += 1
        if read.is_unmapped:
            continue
        stats["mapped"] += 1
        if read.is_proper_pair:
            stats["proper_pairs"] += 1
        if read.is_duplicate:
            stats["duplicates"] += 1
        stats["mapq_values"].append(read.mapping_quality)
        # Count each pair's insert size once, from the forward mate
        if read.is_proper_pair and not read.is_reverse:
            stats["insert_sizes"].append(abs(read.template_length))
        quals = read.query_qualities
        if quals is not None:
            stats["base_qualities"].extend(quals)

    bamfile.close()

    # Summarize
    print(f"Region: {chrom}:{start}-{end}")
    print(f"Total reads: {stats['total_reads']}")
    print(f"Mapped: {stats['mapped']}")
    print(f"Proper pairs: {stats['proper_pairs']}")
    print(f"Duplicates: {stats['duplicates']}")
    if stats["mapq_values"]:
        print(f"Mean MAPQ: {np.mean(stats['mapq_values']):.1f}")
    if stats["insert_sizes"]:
        print(f"Mean insert size: {np.mean(stats['insert_sizes']):.0f}")
    return stats

analyze_region("sample.bam", "chr1", 100000, 200000)
```
Coverage Calculation
```python
import pysam
import numpy as np

def calculate_coverage(bam_path, chrom, start, end):
    """Calculate per-base coverage depth."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")
    coverage = np.zeros(end - start, dtype=int)

    for column in bamfile.pileup(chrom, start, end,
                                 min_base_quality=20,
                                 min_mapping_quality=30):
        if start <= column.reference_pos < end:
            coverage[column.reference_pos - start] = column.nsegments

    bamfile.close()

    print(f"Mean coverage: {coverage.mean():.1f}x")
    print(f"Median coverage: {np.median(coverage):.1f}x")
    print(f"Min/Max: {coverage.min()}/{coverage.max()}")
    print(f"Bases ≥30x: {(coverage >= 30).sum() / len(coverage):.1%}")
    return coverage

cov = calculate_coverage("sample.bam", "chr1", 100000, 200000)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `mode` | File open mode (`"rb"` read BAM, `"r"` read SAM) | `"rb"` |
| `min_mapping_quality` | Minimum MAPQ for pileup/counting | `0` |
| `min_base_quality` | Minimum base quality for pileup | `13` |
| `truncate` | Truncate pileup to region bounds | `False` |
| `threads` | Decompression threads for BAM reading | `1` |
| `index_filename` | Custom index file path | Auto-detected |
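One way to keep these settings consistent across an analysis is to bundle them in a small kwargs builder and unpack it into `pileup()` calls. The `pileup_kwargs` helper below is our own convenience, not part of pysam; the defaults mirror the table above.

```python
def pileup_kwargs(min_mapping_quality=0, min_base_quality=13, truncate=False):
    """Bundle pileup() filter settings (defaults match pysam's)."""
    return {
        "min_mapping_quality": min_mapping_quality,
        "min_base_quality": min_base_quality,
        "truncate": truncate,
    }

# Stricter settings for coverage work:
strict = pileup_kwargs(min_mapping_quality=30, min_base_quality=20,
                       truncate=True)
print(strict)

# Then, on an open AlignmentFile:
# for column in bamfile.pileup("chr1", 100000, 200000, **strict):
#     ...
```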
Best Practices
- **Always use indexed BAM files.** Sort BAM files by coordinate (`samtools sort`) and index them (`samtools index`). Without an index, `fetch()` and `pileup()` must scan the entire file, making region queries impossibly slow on large files.
- **Filter reads by mapping quality.** Low-MAPQ reads (< 20) are ambiguously mapped and introduce noise. Apply MAPQ filters early, e.g. `if read.mapping_quality >= 30:` inside the fetch loop. This is especially important for repetitive regions.
- **Use `fetch()` for read iteration, `pileup()` for coverage.** `fetch()` returns reads overlapping a region (fast, read-level). `pileup()` returns per-position summaries with read support (slower, position-level). Don't iterate reads with `pileup()` or calculate coverage with `fetch()`.
- **Close files explicitly or use context managers.** BAM file handles consume file descriptors. Use `with pysam.AlignmentFile("file.bam", "rb") as bam:` to ensure files are closed, especially in loops processing many files.
- **Set `threads` for large-file processing.** BAM decompression is CPU-intensive. Use `pysam.AlignmentFile("file.bam", "rb", threads=4)` to parallelize decompression; this gives a significant speedup on multi-core systems when reading large BAM files.
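The context-manager and streaming advice above can be combined in one pattern: keep the statistic as a pure function over plain values, and let the pysam loop feed it. `mean_mapq` below is an illustrative helper, not a pysam function.

```python
def mean_mapq(reads):
    """Mean mapping quality over (mapping_quality, is_unmapped) records."""
    total = n = 0
    for mapq, is_unmapped in reads:
        if is_unmapped:
            continue  # skip unmapped reads, as in the examples above
        total += mapq
        n += 1
    return total / n if n else 0.0

print(mean_mapq([(30, False), (10, True), (50, False)]))  # 40.0

# Wiring it to pysam with a context manager (the file is closed even if
# an exception is raised) and extra decompression threads:
#
# with pysam.AlignmentFile("sample.bam", "rb", threads=4) as bam:
#     avg = mean_mapq((r.mapping_quality, r.is_unmapped)
#                     for r in bam.fetch("chr1", 10000, 20000))
```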
Common Issues
**"file has no index" error on fetch/pileup.** BAM files must be sorted and indexed before random access. Run `pysam.sort("-o", "sorted.bam", "input.bam")` and `pysam.index("sorted.bam")`. CRAM files need both the CRAM index (`.crai`) and the reference FASTA.
**Reads appear outside the queried region.** `fetch()` returns reads that overlap the region, not just reads that start within it. A 150 bp read starting at position 9900 will be returned for a query at 10000-20000. Filter by `read.reference_start >= start` if you need reads starting within the region.
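The overlap-versus-start distinction is easy to check on plain coordinates (pysam uses 0-based, half-open intervals). The helper names below are ours, for illustration; in a real pysam loop you would apply the `read.reference_start >= start` test directly.

```python
def overlaps(read_start, read_end, region_start, region_end):
    """True if a read overlaps the half-open region: what fetch() returns."""
    return read_start < region_end and read_end > region_start

def starts_within(read_start, region_start, region_end):
    """True only if the read *starts* inside the region."""
    return region_start <= read_start < region_end

# A 150 bp read starting at 9900 overlaps a 10000-20000 query,
# but does not start within it:
print(overlaps(9900, 9900 + 150, 10000, 20000))  # True
print(starts_within(9900, 10000, 20000))         # False
```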
**Memory usage grows when iterating large BAM files.** Storing read objects in a list retains them in memory. Process reads in a streaming fashion: compute statistics inside the `for read in bamfile.fetch()` loop rather than collecting all reads first. For whole-genome analyses, process chromosome by chromosome.
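A streaming accumulator makes this concrete: it keeps O(1) state instead of a list of reads. `RunningStats` is a hypothetical helper, not part of pysam.

```python
class RunningStats:
    """Accumulate count and mean in O(1) memory instead of storing reads."""

    def __init__(self):
        self.n = 0
        self.total = 0

    def add(self, value):
        self.n += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

# Streaming use with pysam, one chromosome at a time (sketch):
#
# stats = RunningStats()
# with pysam.AlignmentFile("sample.bam", "rb") as bam:
#     for chrom in bam.references:
#         for read in bam.fetch(chrom):
#             stats.add(read.mapping_quality)
# print(stats.mean)
```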