
Pysam Elite

Streamline your workflow with this genomic file toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Read, manipulate, and analyze genomic data files using Pysam, a Python wrapper for htslib. This skill covers SAM/BAM/CRAM alignment processing, VCF/BCF variant file handling, FASTA/FASTQ sequence access, and building efficient genomics analysis pipelines.

When to Use This Skill

Choose Pysam Elite when you need to:

  • Read and filter aligned sequencing reads from BAM/CRAM files
  • Extract coverage, pileup, and variant information from alignments
  • Process VCF/BCF variant files with efficient indexed access
  • Build custom genomics analysis tools with low-level file access

Consider alternatives when:

  • You need a high-level genomics analysis framework (use SpikeInterface for neural, or scanpy for single-cell)
  • You need variant calling from raw data (use GATK or DeepVariant)
  • You need genome assembly (use SPAdes or Hifiasm)

Quick Start

pip install pysam
import pysam

# Open a BAM file
bamfile = pysam.AlignmentFile("sample.bam", "rb")

# Iterate over aligned reads in a region
for read in bamfile.fetch("chr1", 10000, 20000):
    print(f"Read: {read.query_name}, "
          f"Pos: {read.reference_start}, "
          f"MAPQ: {read.mapping_quality}, "
          f"Cigar: {read.cigarstring}")

# Get basic statistics
print(f"Total reads: {bamfile.mapped + bamfile.unmapped}")
print(f"Mapped: {bamfile.mapped}")
print(f"References: {bamfile.nreferences}")

bamfile.close()

Core Concepts

File Types and Classes

Class          File Type      Purpose
AlignmentFile  SAM/BAM/CRAM   Read alignments
VariantFile    VCF/BCF        Variant records
FastaFile      FASTA          Reference sequences
FastxFile      FASTQ          Raw sequencing reads
TabixFile      Tabix-indexed  Generic indexed tab files
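The class/mode pairings above map naturally onto a small dispatch helper. This is an illustrative sketch: `READERS` and `reader_for()` are our names, not part of pysam; the mode strings follow pysam's conventions (e.g. `"rc"` opens CRAM).

```python
from pathlib import Path

# Suffix -> (pysam class name, open mode); None means the class takes no mode.
# Illustrative only -- READERS and reader_for() are not part of pysam itself.
READERS = {
    ".sam":   ("AlignmentFile", "r"),
    ".bam":   ("AlignmentFile", "rb"),
    ".cram":  ("AlignmentFile", "rc"),
    ".vcf":   ("VariantFile",   "r"),
    ".bcf":   ("VariantFile",   "rb"),
    ".fa":    ("FastaFile",     None),
    ".fasta": ("FastaFile",     None),
    ".fq":    ("FastxFile",     None),
    ".fastq": ("FastxFile",     None),
}

def reader_for(path):
    """Pick the pysam reader class name and open mode from a file path."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".gz":  # e.g. calls.vcf.gz, reads.fastq.gz
        suffix = Path(p.stem).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unrecognized genomics file type: {path}")
    return READERS[suffix]
```

For example, `reader_for("sample.bam")` yields `("AlignmentFile", "rb")`.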

Read Filtering and Analysis

import pysam
import numpy as np

def analyze_region(bam_path, chrom, start, end):
    """Comprehensive analysis of reads in a genomic region."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")
    stats = {
        "total_reads": 0,
        "mapped": 0,
        "proper_pairs": 0,
        "duplicates": 0,
        "mapq_values": [],
        "insert_sizes": [],
        "base_qualities": [],
    }
    for read in bamfile.fetch(chrom, start, end):
        stats["total_reads"] += 1
        if read.is_unmapped:
            continue
        stats["mapped"] += 1
        if read.is_proper_pair:
            stats["proper_pairs"] += 1
        if read.is_duplicate:
            stats["duplicates"] += 1
        stats["mapq_values"].append(read.mapping_quality)
        if read.is_proper_pair and not read.is_reverse:
            stats["insert_sizes"].append(abs(read.template_length))
        quals = read.query_qualities
        if quals is not None:
            stats["base_qualities"].extend(quals)
    bamfile.close()

    # Summarize
    print(f"Region: {chrom}:{start}-{end}")
    print(f"Total reads: {stats['total_reads']}")
    print(f"Mapped: {stats['mapped']}")
    print(f"Proper pairs: {stats['proper_pairs']}")
    print(f"Duplicates: {stats['duplicates']}")
    if stats["mapq_values"]:
        print(f"Mean MAPQ: {np.mean(stats['mapq_values']):.1f}")
    if stats["insert_sizes"]:
        print(f"Mean insert size: {np.mean(stats['insert_sizes']):.0f}")
    return stats

analyze_region("sample.bam", "chr1", 100000, 200000)

Coverage Calculation

import pysam
import numpy as np

def calculate_coverage(bam_path, chrom, start, end):
    """Calculate per-base coverage depth."""
    bamfile = pysam.AlignmentFile(bam_path, "rb")
    coverage = np.zeros(end - start, dtype=int)
    for column in bamfile.pileup(chrom, start, end,
                                 min_base_quality=20,
                                 min_mapping_quality=30):
        if start <= column.reference_pos < end:
            # get_num_aligned() respects the quality filters above;
            # nsegments would count every read at the column regardless.
            coverage[column.reference_pos - start] = column.get_num_aligned()
    bamfile.close()

    print(f"Mean coverage: {coverage.mean():.1f}x")
    print(f"Median coverage: {np.median(coverage):.1f}x")
    print(f"Min/Max: {coverage.min()}/{coverage.max()}")
    print(f"Bases ≥30x: {(coverage >= 30).sum() / len(coverage):.1%}")
    return coverage

cov = calculate_coverage("sample.bam", "chr1", 100000, 200000)

Configuration

Parameter            Description                                    Default
mode                 File open mode ("rb" read BAM, "r" read SAM)   "rb"
min_mapping_quality  Minimum MAPQ for pileup/counting               0
min_base_quality     Minimum base quality for pileup                13
truncate             Truncate pileup to region bounds               False
threads              Decompression threads for BAM reading          1
index_filename       Custom index file path                         Auto-detected

Best Practices

  1. Always use indexed BAM files — Sort BAM files by coordinate (samtools sort) and index them (samtools index). Without an index, fetch() and pileup() must scan the entire file, making region queries impossibly slow on large files.

  2. Filter reads by mapping quality — Low MAPQ reads (< 20) are ambiguously mapped and introduce noise. Apply MAPQ filters early: for read in bamfile.fetch(chrom, start, end): if read.mapping_quality >= 30: .... This is especially important for repetitive regions.

  3. Use fetch() for read iteration, pileup() for coverage — fetch() returns reads overlapping a region (fast, read-level). pileup() returns per-position summaries with read support (slower, position-level). Don't iterate reads with pileup or calculate coverage with fetch.

  4. Close files explicitly or use context managers — BAM file handles consume file descriptors. Use with pysam.AlignmentFile("file.bam", "rb") as bam: to ensure files are closed, especially in loops processing many files.

  5. Set threads for large-file processing — BAM decompression is CPU-intensive. Use pysam.AlignmentFile("file.bam", "rb", threads=4) to parallelize decompression. This provides significant speedup on multi-core systems when reading large BAM files.

Common Issues

"file has no index" error on fetch/pileup — BAM files must be sorted and indexed before random access. Run pysam.sort("-o", "sorted.bam", "input.bam") and pysam.index("sorted.bam"). CRAM files need both the CRAM index (.crai) and the reference FASTA.

Reads appear outside the queried region — fetch() returns reads that overlap the region, not just reads that start within it. A 150bp read starting at position 9900 will be returned for a query at 10000-20000. Filter by read.reference_start >= start if you need reads starting within the region.

Memory usage grows when iterating large BAM files — Storing read objects in a list retains them in memory. Process reads in a streaming fashion: compute statistics within the for read in bamfile.fetch() loop rather than collecting all reads first. For whole-genome analyses, process chromosome by chromosome.
