Scikit Bio Complete
Skill for biological sequence and community data analysis with the scikit-bio toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
Perform bioinformatics analyses with scikit-bio, a Python library for biological sequence manipulation, alignment, phylogenetics, microbial ecology diversity metrics, and multivariate ordination. This skill covers DNA/RNA/protein sequence operations, distance matrix computation, phylogenetic tree construction, and ecological diversity analysis.
When to Use This Skill
Choose Scikit Bio Complete when you need to:
- Manipulate, align, and analyze biological sequences (DNA, RNA, protein)
- Compute alpha and beta diversity metrics for microbial community data
- Build distance matrices and perform ordination (PCoA, NMDS) for ecological data
- Construct and manipulate phylogenetic trees programmatically
Consider alternatives when:
- You need high-throughput sequencing pipeline processing (use Snakemake + QIIME2)
- You need single-cell RNA-seq analysis (use Scanpy)
- You need protein structure prediction or analysis (use BioPython + AlphaFold)
Quick Start
pip install scikit-bio numpy pandas matplotlib
import skbio
from skbio import DNA, RNA, Protein, TabularMSA
from skbio import DistanceMatrix
from skbio.diversity import alpha_diversity, beta_diversity
import numpy as np

# Sequence manipulation
seq = DNA("ATGCGATCGATCGATCG")
print(f"Sequence: {seq}")
print(f"Length: {len(seq)}")
print(f"GC content: {seq.gc_content():.2%}")
print(f"Complement: {seq.complement()}")
print(f"Reverse complement: {seq.reverse_complement()}")

# Transcribe to RNA
rna = seq.transcribe()
print(f"RNA: {rna}")

# Alpha diversity on OTU table
counts = np.array([
    [10, 20, 5, 0, 3],   # Sample 1
    [0, 15, 25, 10, 1],  # Sample 2
    [30, 5, 0, 0, 8],    # Sample 3
])
sample_ids = ['S1', 'S2', 'S3']

# Calculate Shannon diversity
shannon = alpha_diversity('shannon', counts, ids=sample_ids)
print(f"\nShannon diversity:\n{shannon}")

# Calculate Bray-Curtis beta diversity
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)
print(f"\nBray-Curtis distance matrix:\n{bc_dm}")
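To make the two metrics in the Quick Start concrete, the following pure-Python sketch shows what Shannon diversity and Bray-Curtis dissimilarity compute. The helper names `shannon_index` and `bray_curtis` are illustrative and are not scikit-bio functions. The log base (2 here) is an assumption; scikit-bio's default base may differ between releases, so values from `alpha_diversity('shannon', ...)` can differ by a constant factor.

```python
import math

def shannon_index(counts, base=2.0):
    """Shannon diversity H = -sum(p_i * log(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

# Same counts as samples S1 and S2 in the Quick Start
s1 = [10, 20, 5, 0, 3]
s2 = [0, 15, 25, 10, 1]
print(f"Shannon (base 2) for S1: {shannon_index(s1):.3f}")
print(f"Bray-Curtis S1 vs S2: {bray_curtis(s1, s2):.3f}")
```

A sample dominated by one taxon scores near 0 on Shannon, and two samples with no shared taxa score 1 on Bray-Curtis, which is a quick sanity check when reading real output.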
Core Concepts
Module Overview
| Module | Purpose | Key Functions |
|---|---|---|
| skbio.sequence | Sequence types and operations | DNA, RNA, Protein, GeneticCode |
| skbio.alignment | Sequence alignment | TabularMSA, local_pairwise_align_ssw |
| skbio.diversity | Ecological diversity metrics | alpha_diversity, beta_diversity |
| skbio.stats.distance | Distance matrix operations | DistanceMatrix, permanova, anosim |
| skbio.stats.ordination | Ordination methods | pcoa, rda, cca |
| skbio.tree | Phylogenetic trees | TreeNode |
| skbio.io | File format I/O | FASTA, FASTQ, Newick, PHYLIP |
Microbial Ecology Pipeline
import skbio
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.distance import permanova, anosim
from skbio.stats.ordination import pcoa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate OTU count data (samples × species)
np.random.seed(42)
n_samples = 20
n_species = 50

# Two groups: healthy (10) and diseased (10)
healthy_counts = np.random.poisson(lam=5, size=(10, n_species))
diseased_counts = np.random.poisson(lam=3, size=(10, n_species))
# Increase abundance of some taxa in diseased
diseased_counts[:, :5] *= 3

counts = np.vstack([healthy_counts, diseased_counts])
sample_ids = [f'H{i}' for i in range(10)] + [f'D{i}' for i in range(10)]
grouping = pd.Series(['healthy'] * 10 + ['diseased'] * 10, index=sample_ids)

# Alpha diversity comparison
# Note: 'observed_otus' may be deprecated or renamed in newer scikit-bio
# releases; skbio.diversity.get_alpha_diversity_metrics() lists valid names.
metrics = ['shannon', 'observed_otus', 'simpson']
for metric in metrics:
    alpha = alpha_diversity(metric, counts, ids=sample_ids)
    h_mean = alpha[grouping == 'healthy'].mean()
    d_mean = alpha[grouping == 'diseased'].mean()
    print(f"{metric}: healthy={h_mean:.3f}, diseased={d_mean:.3f}")

# Beta diversity + PCoA
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)

# PERMANOVA test for group differences
result = permanova(bc_dm, grouping, permutations=999)
print(f"\nPERMANOVA: F={result['test statistic']:.3f}, "
      f"p={result['p-value']:.4f}")

# Ordination plot
pcoa_results = pcoa(bc_dm)
samples = pcoa_results.samples
prop_explained = pcoa_results.proportion_explained

fig, ax = plt.subplots(figsize=(6, 5))
for group, color in [('healthy', '#2ecc71'), ('diseased', '#e74c3c')]:
    mask = grouping == group
    ax.scatter(samples.loc[mask, 'PC1'], samples.loc[mask, 'PC2'],
               c=color, label=group, s=60, edgecolor='white', linewidth=0.5)
ax.set_xlabel(f'PC1 ({prop_explained.iloc[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({prop_explained.iloc[1]:.1%} variance)')
ax.set_title('PCoA of Bray-Curtis Distances')
ax.legend()
fig.tight_layout()
fig.savefig('pcoa_plot.pdf')
print("PCoA plot saved")
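The PERMANOVA step in the pipeline reports a pseudo-F statistic and a permutation p-value. The sketch below shows, in plain Python, one standard way those quantities (plus the R² effect size) can be derived from a distance matrix. It is a teaching sketch and not scikit-bio's internal code; the function names `pseudo_f` and `permutation_p_value` and the toy matrix are assumptions made for illustration.

```python
import random
from itertools import combinations

def pseudo_f(dm, labels):
    """Return (pseudo-F, R^2) for a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dm[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, one term per group
    ss_within = 0.0
    for g in groups:
        members = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dm[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f_stat, ss_among / ss_total

def permutation_p_value(dm, labels, permutations=999, seed=0):
    """Share of label shuffles whose pseudo-F is at least the observed one."""
    rng = random.Random(seed)
    observed, _ = pseudo_f(dm, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dm, shuffled)[0] >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations
    return (hits + 1) / (permutations + 1)

# Toy matrix: two tight groups that are far from each other
toy_dm = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
toy_labels = ['healthy', 'healthy', 'diseased', 'diseased']
f_stat, r_squared = pseudo_f(toy_dm, toy_labels)
p_value = permutation_p_value(toy_dm, toy_labels)
print(f"pseudo-F={f_stat:.1f}, R^2={r_squared:.3f}, p={p_value:.3f}")
```

With only four samples there are very few distinct labellings, so the p-value cannot get small no matter how strong the separation is. This is one reason to treat p-values from small designs with caution and to report R² alongside them.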
Phylogenetic Analysis
from skbio import TreeNode, DistanceMatrix
from skbio.tree import nj

# Build a neighbor-joining tree from a distance matrix
ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm_data = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]
dm = DistanceMatrix(dm_data, ids=ids)

# Construct tree
tree = nj(dm)
print("Newick format:")
print(tree)

# Tree operations
print(f"\nTotal tips: {len(list(tree.tips()))}")
print(f"Total nodes: {tree.count()}")

# Distance from each tip to the root node
for tip in tree.tips():
    dist_to_root = tip.distance(tree)
    print(f"  {tip.name}: distance to root = {dist_to_root:.3f}")

# Read/write Newick
tree.write('/tmp/phylo_tree.nwk', format='newick')
loaded_tree = TreeNode.read('/tmp/phylo_tree.nwk', format='newick')
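For intuition about what neighbor joining does with the matrix above, this sketch computes the textbook Q-criterion and picks the first pair of taxa to join. It covers only that first step, and it is an assumption that scikit-bio's `nj` follows the same criterion in detail; `first_nj_pair` is a hypothetical helper name.

```python
def first_nj_pair(ids, dm):
    """Return the pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j."""
    n = len(ids)
    row_sums = [sum(row) for row in dm]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dm[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (ids[i], ids[j]), q
    return best_pair, best_q

# Same IDs and distances as the phylogenetic example
ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm_data = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]
pair, q_value = first_nj_pair(ids, dm_data)
print(f"First pair joined: {pair} (Q = {q_value:.2f})")
```

The Q-criterion subtracts each taxon's total distance to everything else, so the chosen pair is not always the pair with the smallest raw distance, although here the two coincide.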
Configuration
| Parameter | Description | Default |
|---|---|---|
| alpha_metric | Alpha diversity metric | "shannon" |
| beta_metric | Beta diversity distance metric | "braycurtis" |
| permutations | Number of permutations for statistical tests | 999 |
| n_components | Number of ordination axes to retain | 3 |
| sequence_type | Default sequence class (DNA, RNA, Protein) | "DNA" |
| validate_sequences | Validate characters on sequence creation | true |
| tree_format | Phylogenetic tree I/O format | "newick" |
| random_seed | Seed for reproducible permutation tests | None |
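One way to carry these settings through a script is a small immutable object with the defaults from the table. This is a hypothetical sketch: the class name `SkillConfig` and its validation rules are assumptions, since the source does not say how the parameters are loaded or enforced.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SkillConfig:
    """Hypothetical container mirroring the parameter table's defaults."""
    alpha_metric: str = "shannon"
    beta_metric: str = "braycurtis"
    permutations: int = 999
    n_components: int = 3
    sequence_type: str = "DNA"
    validate_sequences: bool = True
    tree_format: str = "newick"
    random_seed: Optional[int] = None

    def __post_init__(self):
        # Basic sanity checks; the allowed values here are assumptions
        if self.permutations < 1:
            raise ValueError("permutations must be a positive integer")
        if self.n_components < 1:
            raise ValueError("n_components must be a positive integer")
        if self.sequence_type not in ("DNA", "RNA", "Protein"):
            raise ValueError("sequence_type must be DNA, RNA, or Protein")

default_config = SkillConfig()
publication_config = SkillConfig(permutations=9999, random_seed=42)
print(default_config)
print(publication_config.permutations)
```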
Best Practices
- Rarefy count data before diversity analysis: Uneven sequencing depth across samples biases diversity metrics. Rarefy to the minimum sample depth, or use rarefaction curves to choose an appropriate depth. Avoid comparing Shannon diversity between samples with vastly different total counts without rarefaction.
- Use appropriate distance metrics for your data type: Bray-Curtis for abundance data, Jaccard for presence/absence, UniFrac for phylogeny-aware distances. Euclidean distance on raw counts is rarely appropriate for ecological data because it overweights abundant species.
- Run PERMANOVA with sufficient permutations: Use at least 999 permutations (9999 for publication). Report the p-value together with an effect size such as R² (the among-group sum of squares divided by the total sum of squares). PERMANOVA is sensitive to unequal group dispersions, so check them with permdisp from skbio.stats.distance, the counterpart of betadisper in R's vegan package.
- Validate sequences on import: Keep validate=True (the default) when creating sequence objects to catch invalid characters early. DNA accepts the IUPAC degenerate characters (N, R, Y, and so on), so sequences with ambiguity codes pass validation. With validation turned off, invalid characters can cause silent errors in downstream analysis.
- Cache distance matrices for iterative analysis: Distance matrix computation on large datasets (>1000 samples) is slow. Save computed matrices with dm.write('/path/to/dm.txt') and reload with DistanceMatrix.read() rather than recomputing for each analysis or visualization tweak.
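The rarefaction advice in the first practice above can be sketched in plain Python as subsampling each sample to a common depth without replacement. The function name `rarefy` is illustrative; scikit-bio offers its own subsampling helper (`skbio.stats.subsample_counts`, to the best of my knowledge), which is the better choice in real analyses.

```python
import random

def rarefy(counts, depth, seed=None):
    """Subsample a count vector to `depth` reads without replacement."""
    total = sum(counts)
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    rng = random.Random(seed)
    # Expand counts into one entry per read, labelled by taxon index
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied

# Count table from the Quick Start (rows are samples)
table = [
    [10, 20, 5, 0, 3],
    [0, 15, 25, 10, 1],
    [30, 5, 0, 0, 8],
]
depth = min(sum(row) for row in table)  # rarefy to the shallowest sample
rarefied_table = [rarefy(row, depth, seed=42) for row in table]
for row in rarefied_table:
    print(row, "total:", sum(row))
```

Rarefying to the minimum depth keeps every sample but discards reads from deeper ones. A higher depth keeps more reads per sample at the cost of dropping the samples that fall below it.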
Common Issues
"ValueError: Counts must be non-negative integers" in diversity functions — Alpha and beta diversity functions require integer count data, not relative abundances or floats. If your data is in relative abundance format, multiply by a scaling factor and round: counts = np.round(rel_abundance * 10000).astype(int).
Distance matrix IDs don't match grouping variable — PERMANOVA and ANOSIM require the distance matrix sample IDs to exactly match the grouping Series index. Verify with set(dm.ids) == set(grouping.index). Mismatches often come from leading/trailing whitespace in sample names — strip both before analysis.
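A slightly more informative version of the ID check above reports which IDs are missing on each side after stripping whitespace. The helper name `compare_ids` and the sample IDs are made up for illustration.

```python
def compare_ids(dm_ids, grouping_ids):
    """Return (IDs only in the distance matrix, IDs only in the grouping)."""
    dm_clean = {s.strip() for s in dm_ids}
    grouping_clean = {s.strip() for s in grouping_ids}
    only_in_dm = sorted(dm_clean - grouping_clean)
    only_in_grouping = sorted(grouping_clean - dm_clean)
    return only_in_dm, only_in_grouping

# Hypothetical IDs: one has trailing whitespace, one is truly missing
dm_ids = ['H0', 'H1 ', 'D0', 'D1']
grouping_ids = ['H0', 'H1', 'D0']

raw_match = set(dm_ids) == set(grouping_ids)
only_in_dm, only_in_grouping = compare_ids(dm_ids, grouping_ids)
print(f"Raw sets match: {raw_match}")
print(f"Only in distance matrix: {only_in_dm}")
print(f"Only in grouping: {only_in_grouping}")
```

Stripping resolves the 'H1 ' mismatch, while 'D1' remains flagged as a sample with no group assignment, which has to be fixed in the metadata rather than by cleaning strings.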
TreeNode operations fail with "has no attribute" errors — scikit-bio tree methods differ from ETE3 or dendropy. The API uses tree.tips() (not get_leaves()), tree.count() (not len), and node.distance(other) (not get_distance()). Check the scikit-bio TreeNode documentation rather than assuming compatibility with other tree libraries.
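Returning to the first issue above, the scale-and-round conversion from relative abundances to integer counts can be sketched without NumPy as follows. The helper name `to_pseudo_counts` is hypothetical, and the scaling factor of 10000 is the one suggested in the text.

```python
def to_pseudo_counts(rel_abundance, scale=10000):
    """Scale relative abundances and round them to non-negative integers."""
    if any(x < 0 for x in rel_abundance):
        raise ValueError("relative abundances must be non-negative")
    return [int(round(x * scale)) for x in rel_abundance]

# Hypothetical sample with one very rare taxon
rel = [0.6, 0.39996, 0.00004]
pseudo_counts = to_pseudo_counts(rel)
print(pseudo_counts)
```

Taxa whose relative abundance is below half of 1/scale round to zero, so richness-based metrics computed on such pseudo-counts depend on the chosen scaling factor. Original read counts are preferable whenever they are available.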