
Pysam Elite

Streamline your workflow with this genomic file toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Read, manipulate, and analyze genomic data files using Pysam, a Python wrapper for htslib. This skill covers SAM/BAM/CRAM alignment processing, VCF/BCF variant file handling, FASTA/FASTQ sequence access, and building efficient genomics analysis pipelines.

When to Use This Skill

Choose Pysam Elite when you need to:

  • Read and filter aligned sequencing reads from BAM/CRAM files
  • Extract coverage, pileup, and variant information from alignments
  • Process VCF/BCF variant files with efficient indexed access
  • Build custom genomics analysis tools with low-level file access

Consider alternatives when:

  • You need a high-level genomics analysis framework (use SpikeInterface for neural, or scanpy for single-cell)
  • You need variant calling from raw data (use GATK or DeepVariant)
  • You need genome assembly (use SPAdes or Hifiasm)

Quick Start

pip install pysam
import pysam

# Open a BAM file
bamfile = pysam.AlignmentFile("sample.bam", "rb")

# Iterate over aligned reads in a region
for read in bamfile.fetch("chr1", 10000, 20000):
    print(f"Read: {read.query_name}, "
          f"Pos: {read.reference_start}, "
          f"MAPQ: {read.mapping_quality}, "
          f"Cigar: {read.cigarstring}")

# Get basic statistics
print(f"Total reads: {bamfile.mapped + bamfile.unmapped}")
print(f"Mapped: {bamfile.mapped}")
print(f"References: {bamfile.nreferences}")

bamfile.close()

Core Concepts

File Types and Classes

Class          File Type      Purpose
AlignmentFile  SAM/BAM/CRAM   Read alignments
VariantFile    VCF/BCF        Variant records
FastaFile      FASTA          Reference sequences
FastxFile      FASTQ          Raw sequencing reads
TabixFile      Tabix-indexed  Generic indexed tab files
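The class/mode pairings above map naturally onto a small dispatch helper. This is an illustrative sketch: `READERS` and `reader_for()` are our names, not part of pysam; the mode strings follow pysam's conventions (e.g. `"rc"` opens CRAM).

```python
from pathlib import Path

# Suffix -> (pysam class name, open mode); None means the class takes no mode.
# Illustrative only -- READERS and reader_for() are not part of pysam itself.
READERS = {
    ".sam":   ("AlignmentFile", "r"),
    ".bam":   ("AlignmentFile", "rb"),
    ".cram":  ("AlignmentFile", "rc"),
    ".vcf":   ("VariantFile",   "r"),
    ".bcf":   ("VariantFile",   "rb"),
    ".fa":    ("FastaFile",     None),
    ".fasta": ("FastaFile",     None),
    ".fq":    ("FastxFile",     None),
    ".fastq": ("FastxFile",     None),
}

def reader_for(path):
    """Pick the pysam reader class name and open mode from a file path."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".gz":  # e.g. calls.vcf.gz, reads.fastq.gz
        suffix = Path(p.stem).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unrecognized genomics file type: {path}")
    return READERS[suffix]
```

For example, `reader_for("sample.bam")` yields `("AlignmentFile", "rb")`.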

Read Filtering and Analysis

import pysam
import numpy as np

def analyze_region(bam_path, chrom, start, end):
    """Comprehensive analysis of reads in a genomic region."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")
    stats = {
        "total_reads": 0,
        "mapped": 0,
        "proper_pairs": 0,
        "duplicates": 0,
        "mapq_values": [],
        "insert_sizes": [],
        "base_qualities": [],
    }
    for read in bamfile.fetch(chrom, start, end):
        stats["total_reads"] += 1
        if read.is_unmapped:
            continue
        stats["mapped"] += 1
        if read.is_proper_pair:
            stats["proper_pairs"] += 1
        if read.is_duplicate:
            stats["duplicates"] += 1
        stats["mapq_values"].append(read.mapping_quality)
        if read.is_proper_pair and not read.is_reverse:
            stats["insert_sizes"].append(abs(read.template_length))
        quals = read.query_qualities
        if quals is not None:
            stats["base_qualities"].extend(quals)
    bamfile.close()

    # Summarize
    print(f"Region: {chrom}:{start}-{end}")
    print(f"Total reads: {stats['total_reads']}")
    print(f"Mapped: {stats['mapped']}")
    print(f"Proper pairs: {stats['proper_pairs']}")
    print(f"Duplicates: {stats['duplicates']}")
    if stats["mapq_values"]:
        print(f"Mean MAPQ: {np.mean(stats['mapq_values']):.1f}")
    if stats["insert_sizes"]:
        print(f"Mean insert size: {np.mean(stats['insert_sizes']):.0f}")
    return stats

analyze_region("sample.bam", "chr1", 100000, 200000)

Coverage Calculation

import pysam
import numpy as np

def calculate_coverage(bam_path, chrom, start, end):
    """Calculate per-base coverage depth."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")
    coverage = np.zeros(end - start, dtype=int)
    for column in bamfile.pileup(chrom, start, end,
                                 min_base_quality=20,
                                 min_mapping_quality=30):
        if start <= column.reference_pos < end:
            # get_num_aligned() respects the quality filters above;
            # nsegments would count every read at the column regardless.
            coverage[column.reference_pos - start] = column.get_num_aligned()
    bamfile.close()

    print(f"Mean coverage: {coverage.mean():.1f}x")
    print(f"Median coverage: {np.median(coverage):.1f}x")
    print(f"Min/Max: {coverage.min()}/{coverage.max()}")
    print(f"Bases ≥30x: {(coverage >= 30).sum() / len(coverage):.1%}")
    return coverage

cov = calculate_coverage("sample.bam", "chr1", 100000, 200000)

Configuration

Parameter            Description                                    Default
mode                 File open mode ("rb" read BAM, "r" read SAM)   "rb"
min_mapping_quality  Minimum MAPQ for pileup/counting               0
min_base_quality     Minimum base quality for pileup                13
truncate             Truncate pileup to region bounds               False
threads              Decompression threads for BAM reading          1
index_filename       Custom index file path                         Auto-detected

Best Practices

  1. Always use indexed BAM files — Sort BAM files by coordinate (samtools sort) and index them (samtools index). Without an index, fetch() and pileup() must scan the entire file, making region queries impossibly slow on large files.

  2. Filter reads by mapping quality — Low MAPQ reads (< 20) are ambiguously mapped and introduce noise. Apply MAPQ filters early: for read in bamfile.fetch(chrom, start, end): if read.mapping_quality >= 30: .... This is especially important for repetitive regions.

  3. Use fetch() for read iteration, pileup() for coverage — fetch() returns reads overlapping a region (fast, read-level). pileup() returns per-position summaries with read support (slower, position-level). Don't iterate reads with pileup or calculate coverage with fetch.

  4. Close files explicitly or use context managers — BAM file handles consume file descriptors. Use with pysam.AlignmentFile("file.bam", "rb") as bam: to ensure files are closed, especially in loops processing many files.

  5. Set threads for large-file processing — BAM decompression is CPU-intensive. Use pysam.AlignmentFile("file.bam", "rb", threads=4) to parallelize decompression. This provides significant speedup on multi-core systems when reading large BAM files.

Common Issues

"file has no index" error on fetch/pileup — BAM files must be sorted and indexed before random access. Run pysam.sort("-o", "sorted.bam", "input.bam") and pysam.index("sorted.bam"). CRAM files need both the CRAM index (.crai) and the reference FASTA.

Reads appear outside the queried region — fetch() returns reads that overlap the region, not just reads that start within it. A 150bp read starting at position 9900 will be returned for a query at 10000-20000. Filter by read.reference_start >= start if you need reads starting within the region.

Memory usage grows when iterating large BAM files — Storing read objects in a list retains them in memory. Process reads in a streaming fashion: compute statistics within the for read in bamfile.fetch() loop rather than collecting all reads first. For whole-genome analyses, process chromosome by chromosome.
