Scikit Bio Complete

Perform bioinformatics analyses with scikit-bio, a Python library for biological sequence manipulation, alignment, phylogenetics, microbial ecology diversity metrics, and multivariate ordination. This skill covers DNA/RNA/protein sequence operations, distance matrix computation, phylogenetic tree construction, and ecological diversity analysis.

When to Use This Skill

Choose Scikit Bio Complete when you need to:

Manipulate, align, and analyze biological sequences (DNA, RNA, protein)
Compute alpha and beta diversity metrics for microbial community data
Build distance matrices and perform ordination (PCoA, CA, CCA, RDA) for ecological data
Construct and manipulate phylogenetic trees programmatically

Consider alternatives when:

You need high-throughput sequencing pipeline processing (use Snakemake + QIIME2)
You need single-cell RNA-seq analysis (use Scanpy)
You need protein structure prediction or analysis (use BioPython + AlphaFold)

Quick Start


pip install scikit-bio numpy pandas matplotlib


import skbio
from skbio import DNA, RNA, Protein, TabularMSA
from skbio import DistanceMatrix
from skbio.diversity import alpha_diversity, beta_diversity
import numpy as np

# Sequence manipulation
seq = DNA("ATGCGATCGATCGATCG")
print(f"Sequence: {seq}")
print(f"Length: {len(seq)}")
print(f"GC content: {seq.gc_content():.2%}")
print(f"Complement: {seq.complement()}")
print(f"Reverse complement: {seq.reverse_complement()}")

# Transcribe to RNA
rna = seq.transcribe()
print(f"RNA: {rna}")

# Alpha diversity on OTU table
counts = np.array([
    [10, 20, 5, 0, 3],   # Sample 1
    [0, 15, 25, 10, 1],   # Sample 2
    [30, 5, 0, 0, 8],     # Sample 3
])
sample_ids = ['S1', 'S2', 'S3']

# Calculate Shannon diversity
shannon = alpha_diversity('shannon', counts, ids=sample_ids)
print(f"\nShannon diversity:\n{shannon}")

# Calculate Bray-Curtis beta diversity
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)
print(f"\nBray-Curtis distance matrix:\n{bc_dm}")
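
To make the two metrics above concrete, here is a minimal pure-Python sketch of the textbook Shannon and Bray-Curtis formulas applied to the same counts. The helper names `shannon_index` and `bray_curtis` are hypothetical and are not part of scikit-bio. The log base is passed explicitly because the library's default base may depend on the installed version, so values printed by `alpha_diversity` can differ by a constant factor.

```python
import math

def shannon_index(counts, base=2.0):
    """Shannon entropy H = -sum(p_i * log_base(p_i)) over nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    # Two empty samples are treated as identical here (an assumption of this sketch).
    return num / den if den else 0.0

s1 = [10, 20, 5, 0, 3]
s2 = [0, 15, 25, 10, 1]
print(f"Shannon S1 (base 2): {shannon_index(s1):.3f}")
print(f"Bray-Curtis S1 vs S2: {bray_curtis(s1, s2):.3f}")  # 47 / 89
```

Absent taxa (zero counts) contribute nothing to Shannon diversity, and Bray-Curtis depends only on the abundances, not on any relationship between taxa.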

Core Concepts

Module Overview

Module	Purpose	Key Functions
`skbio.sequence`	Sequence types and operations	`DNA`, `RNA`, `Protein`, `GeneticCode`
`skbio.alignment`	Sequence alignment	`TabularMSA`, `local_pairwise_align_ssw` (0.5.x series; check your installed version)
`skbio.diversity`	Ecological diversity metrics	`alpha_diversity`, `beta_diversity`
`skbio.stats.distance`	Distance matrix operations	`DistanceMatrix`, `permanova`, `anosim`
`skbio.stats.ordination`	Ordination methods	`pcoa`, `rda`, `cca`
`skbio.tree`	Phylogenetic trees	`TreeNode`
`skbio.io`	File format I/O	FASTA, FASTQ, Newick, PHYLIP, GenBank

Microbial Ecology Pipeline


import skbio
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.distance import permanova, anosim
from skbio.stats.ordination import pcoa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate OTU count data (samples × species)
np.random.seed(42)
n_samples = 20
n_species = 50

# Two groups: healthy (10) and diseased (10)
healthy_counts = np.random.poisson(lam=5, size=(10, n_species))
diseased_counts = np.random.poisson(lam=3, size=(10, n_species))
# Increase abundance of some taxa in diseased
diseased_counts[:, :5] *= 3

counts = np.vstack([healthy_counts, diseased_counts])
sample_ids = [f'H{i}' for i in range(10)] + [f'D{i}' for i in range(10)]
grouping = pd.Series(['healthy'] * 10 + ['diseased'] * 10, index=sample_ids)

# Alpha diversity comparison
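# Metric names vary by version: newer scikit-bio releases rename
# 'observed_otus' (for example to 'observed_features').
metrics = ['shannon', 'observed_otus', 'simpson']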
for metric in metrics:
    alpha = alpha_diversity(metric, counts, ids=sample_ids)
    h_mean = alpha[grouping == 'healthy'].mean()
    d_mean = alpha[grouping == 'diseased'].mean()
    print(f"{metric}: healthy={h_mean:.3f}, diseased={d_mean:.3f}")

# Beta diversity + PCoA
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)

# PERMANOVA test for group differences
result = permanova(bc_dm, grouping, permutations=999)
print(f"\nPERMANOVA: F={result['test statistic']:.3f}, "
      f"p={result['p-value']:.4f}")

# Ordination plot
pcoa_results = pcoa(bc_dm)
samples = pcoa_results.samples
prop_explained = pcoa_results.proportion_explained

fig, ax = plt.subplots(figsize=(6, 5))
for group, color in [('healthy', '#2ecc71'), ('diseased', '#e74c3c')]:
    mask = grouping == group
    ax.scatter(samples.loc[mask, 'PC1'], samples.loc[mask, 'PC2'],
               c=color, label=group, s=60, edgecolor='white', linewidth=0.5)
ax.set_xlabel(f'PC1 ({prop_explained.iloc[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({prop_explained.iloc[1]:.1%} variance)')
ax.set_title('PCoA of Bray-Curtis Distances')
ax.legend()
fig.tight_layout()
fig.savefig('pcoa_plot.pdf')
print("PCoA plot saved")
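
The PERMANOVA step above reports a pseudo-F statistic and a p-value but no effect size. Under the usual one-way definition F = (SS_between / (a - 1)) / (SS_within / (N - a)), the effect size R² = SS_between / SS_total can be recovered from F, the sample size N, and the number of groups a. The sketch below assumes that definition; `permanova_r2` is a hypothetical helper name, not a scikit-bio function.

```python
def permanova_r2(f_stat, n_samples, n_groups):
    """Derive R^2 = SS_between / SS_total from a one-way pseudo-F."""
    if n_groups < 2 or n_samples <= n_groups:
        raise ValueError("need at least 2 groups and more samples than groups")
    # ratio = SS_between / SS_within, obtained by undoing the degrees of freedom
    ratio = f_stat * (n_groups - 1) / (n_samples - n_groups)
    # SS_total = SS_between + SS_within, so R^2 = ratio / (1 + ratio)
    return ratio / (1.0 + ratio)

# Worked check: SS_between = 2, SS_within = 8, a = 2, N = 20 gives F = 4.5
f_example = (2 / (2 - 1)) / (8 / (20 - 2))
print(f"R^2 = {permanova_r2(f_example, 20, 2):.3f}")  # R^2 = 0.200
```

With the pipeline's `result` Series, the call would look like `permanova_r2(result['test statistic'], result['sample size'], result['number of groups'])`.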

Phylogenetic Analysis


from skbio import TreeNode, DistanceMatrix
from skbio.tree import nj

# Build a neighbor-joining tree from a distance matrix
ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm_data = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]
dm = DistanceMatrix(dm_data, ids=ids)

# Construct tree
tree = nj(dm)
print("Newick format:")
print(tree)

# Tree operations
print(f"\nTotal tips: {len(list(tree.tips()))}")
print(f"Total nodes: {tree.count()}")

# Distance from each tip to the root
for tip in tree.tips():
    dist_to_root = tip.distance(tree)
    print(f"  {tip.name}: distance to root = {dist_to_root:.3f}")

# Read/write Newick
tree.write('/tmp/phylo_tree.nwk', format='newick')
loaded_tree = TreeNode.read('/tmp/phylo_tree.nwk', format='newick')
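
To show what neighbor joining does with the matrix above, the following pure-Python sketch computes the textbook NJ selection criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the row sum for taxon i, and reports the pair that would be joined first. The function `nj_first_join` is a hypothetical name; scikit-bio's `nj` may differ in details such as tie-breaking, so treat this as an illustration of the criterion only.

```python
from itertools import combinations

ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm_data = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]

def nj_first_join(ids, d):
    """Return the pair minimizing Q(i, j) = (n - 2) * d[i][j] - r_i - r_j."""
    n = len(ids)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float('inf')
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
        if q < best_q:
            best_pair, best_q = (ids[i], ids[j]), q
    return best_pair, best_q

pair, q = nj_first_join(ids, dm_data)
print(f"First join: {pair[0]} + {pair[1]} (Q = {q:.2f})")
```

For this matrix the minimum falls on E_coli and B_subtilis (Q = -4.20), narrowly ahead of P_aeruginosa and M_tuberculosis (Q = -4.15). The criterion accounts for each taxon's overall divergence as well as the pairwise distance.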

Configuration

Parameter	Description	Default
`alpha_metric`	Alpha diversity metric	`"shannon"`
`beta_metric`	Beta diversity distance metric	`"braycurtis"`
`permutations`	Number of permutations for statistical tests	`999`
`n_components`	Number of ordination axes to retain	`3`
`sequence_type`	Default sequence class (DNA, RNA, Protein)	`"DNA"`
`validate_sequences`	Validate characters on sequence creation	`true`
`tree_format`	Phylogenetic tree I/O format	`"newick"`
`random_seed`	Seed for reproducible permutation tests	`None`

Best Practices

Rarefy count data before diversity analysis: Uneven sequencing depth across samples biases diversity metrics. Rarefy to the minimum sample depth or use rarefaction curves to choose an appropriate depth. Avoid comparing Shannon diversity between samples with vastly different total counts unless the counts have been rarefied.
Use appropriate distance metrics for your data type: Bray-Curtis for abundance data, Jaccard for presence/absence, UniFrac for phylogeny-aware distances. Euclidean distance on raw counts is rarely appropriate for ecological data because it overweights abundant species.
Run PERMANOVA with sufficient permutations: Use at least 999 permutations (9999 for publication). Report an effect size alongside the p-value: R² is the between-group sum of squares divided by the total sum of squares. scikit-bio's permanova returns the pseudo-F statistic rather than R², so R² has to be derived separately. PERMANOVA is sensitive to unequal group dispersions; check them with permdisp from skbio.stats.distance (the counterpart of betadisper in R's vegan package).
Validate sequences on import: scikit-bio validates characters by default (validate=True), and DNA accepts the IUPAC degenerate codes (N, R, Y, and so on) without extra options. Invalid characters raise a ValueError at construction. Pass validate=False only for sequences already known to be clean, because skipping validation lets bad characters through to downstream analysis.
Cache distance matrices for iterative analysis: Distance matrix computation on large datasets (>1000 samples) is slow. Save computed matrices with dm.write('/path/to/dm.txt') and reload with DistanceMatrix.read() rather than recomputing for each analysis or visualization tweak.
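
The rarefaction advice at the top of this list can be sketched with plain NumPy: each sample is expanded to one entry per read, a fixed number of reads is drawn without replacement, and the draws are tallied back into a count vector. The helper `rarefy` is a hypothetical name used for illustration; scikit-bio ships `skbio.stats.subsample_counts` for the same purpose, and this version is only meant to show the mechanics.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample one sample's counts to `depth` reads without replacement."""
    counts = np.asarray(counts, dtype=int)
    if depth > counts.sum():
        raise ValueError("depth exceeds the sample's total count")
    # One entry per read, labeled by taxon index
    reads = np.repeat(np.arange(counts.size), counts)
    picked = rng.choice(reads, size=depth, replace=False)
    # Tally the drawn reads back into a per-taxon count vector
    return np.bincount(picked, minlength=counts.size)

rng = np.random.default_rng(42)
table = np.array([
    [10, 20, 5, 0, 3],
    [0, 15, 25, 10, 1],
    [30, 5, 0, 0, 8],
])
depth = int(table.sum(axis=1).min())   # rarefy to the shallowest sample
rarefied = np.array([rarefy(row, depth, rng) for row in table])
print(rarefied.sum(axis=1))            # every row now sums to `depth`
```

Rarefying to the minimum depth keeps every sample but discards reads from deeper ones. Choosing a higher depth retains more reads at the cost of dropping shallow samples.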

Common Issues

"ValueError: Counts must be non-negative integers" in diversity functions — Alpha and beta diversity functions require integer count data, not relative abundances or floats. If your data is in relative abundance format, multiply by a scaling factor and round: counts = np.round(rel_abundance * 10000).astype(int).

Distance matrix IDs don't match grouping variable — PERMANOVA and ANOSIM require the distance matrix sample IDs to exactly match the grouping Series index. Verify with set(dm.ids) == set(grouping.index). Mismatches often come from leading/trailing whitespace in sample names — strip both before analysis.
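
A minimal sketch of that check, using plain lists as stand-ins for `dm.ids` and `grouping.index`. The helper `compare_ids` is a hypothetical name; it strips whitespace from both sides and reports which IDs still fail to match.

```python
def compare_ids(dm_ids, group_ids):
    """Strip whitespace from both ID lists and report what fails to match."""
    dm_clean = [s.strip() for s in dm_ids]
    group_clean = [s.strip() for s in group_ids]
    only_dm = sorted(set(dm_clean) - set(group_clean))
    only_group = sorted(set(group_clean) - set(dm_clean))
    return dm_clean, group_clean, only_dm, only_group

dm_ids = ['H0', 'H1 ', 'D0']            # stand-in for dm.ids
group_ids = [' H0', 'H1', 'D0', 'D1']   # stand-in for grouping.index
dm_clean, group_clean, only_dm, only_group = compare_ids(dm_ids, group_ids)
print("Only in distance matrix:", only_dm)
print("Only in grouping:", only_group)
```

Once both difference lists are empty, the cleaned IDs can be assigned back to the distance matrix and the grouping Series before running the test.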

TreeNode operations fail with "has no attribute" errors — scikit-bio tree methods differ from ETE3 or dendropy. The API uses tree.tips() (not get_leaves()), tree.count() (not len), and node.distance(other) (not get_distance()). Check the scikit-bio TreeNode documentation rather than assuming compatibility with other tree libraries.
