Scikit Bio Complete

Enterprise-grade skill for biological data and sequence analysis with the scikit-bio toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill | Cliptics | scientific | v1.0.0 | MIT

Perform bioinformatics analyses with scikit-bio, a Python library for biological sequence manipulation, alignment, phylogenetics, microbial ecology diversity metrics, and multivariate ordination. This skill covers DNA/RNA/protein sequence operations, distance matrix computation, phylogenetic tree construction, and ecological diversity analysis.

When to Use This Skill

Choose Scikit Bio Complete when you need to:

  • Manipulate, align, and analyze biological sequences (DNA, RNA, protein)
  • Compute alpha and beta diversity metrics for microbial community data
  • Build distance matrices and perform ordination (PCoA, CA, CCA, RDA) for ecological data
  • Construct and manipulate phylogenetic trees programmatically

Consider alternatives when:

  • You need high-throughput sequencing pipeline processing (use Snakemake + QIIME 2)
  • You need single-cell RNA-seq analysis (use Scanpy)
  • You need protein structure prediction or analysis (use Biopython + AlphaFold)

Quick Start

```shell
pip install scikit-bio numpy pandas matplotlib
```

```python
import skbio
from skbio import DNA, RNA, Protein, TabularMSA
from skbio import DistanceMatrix
from skbio.diversity import alpha_diversity, beta_diversity
import numpy as np

# Sequence manipulation
seq = DNA("ATGCGATCGATCGATCG")
print(f"Sequence: {seq}")
print(f"Length: {len(seq)}")
print(f"GC content: {seq.gc_content():.2%}")
print(f"Complement: {seq.complement()}")
print(f"Reverse complement: {seq.reverse_complement()}")

# Transcribe to RNA
rna = seq.transcribe()
print(f"RNA: {rna}")

# Alpha diversity on OTU table
counts = np.array([
    [10, 20, 5, 0, 3],    # Sample 1
    [0, 15, 25, 10, 1],   # Sample 2
    [30, 5, 0, 0, 8],     # Sample 3
])
sample_ids = ['S1', 'S2', 'S3']

# Calculate Shannon diversity
shannon = alpha_diversity('shannon', counts, ids=sample_ids)
print(f"\nShannon diversity:\n{shannon}")

# Calculate Bray-Curtis beta diversity
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)
print(f"\nBray-Curtis distance matrix:\n{bc_dm}")
```
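To make the two metrics in the Quick Start concrete, the sketch below recomputes them by hand in plain Python for samples S1 and S2. It is an illustration of the textbook formulas, not scikit-bio's implementation. The helper names `shannon` and `bray_curtis` are made up for this example, and the natural-log base is an assumption: the default base used by scikit-bio's Shannon metric has differed between releases, so compare values only after fixing the base explicitly.

```python
import math


def shannon(counts, base=math.e):
    """Shannon entropy of one sample's counts; zero counts contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: summed absolute differences over summed totals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Same counts as samples S1 and S2 in the Quick Start table
s1 = [10, 20, 5, 0, 3]
s2 = [0, 15, 25, 10, 1]

print(f"Shannon S1 (natural log): {shannon(s1):.4f}")
print(f"Shannon S1 (base 2):      {shannon(s1, base=2):.4f}")
print(f"Bray-Curtis S1 vs S2:     {bray_curtis(s1, s2):.4f}")
```

For S1 and S2 the absolute differences sum to 47 and the combined totals to 89, so the Bray-Curtis dissimilarity is 47/89.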

Core Concepts

Module Overview

| Module | Purpose | Key Classes, Functions, and Formats |
| --- | --- | --- |
| skbio.sequence | Sequence types and operations | DNA, RNA, Protein, GeneticCode |
| skbio.alignment | Sequence alignment | TabularMSA, local_pairwise_align_ssw |
| skbio.diversity | Ecological diversity metrics | alpha_diversity, beta_diversity |
| skbio.stats.distance | Distance matrix operations | DistanceMatrix, permanova, anosim |
| skbio.stats.ordination | Ordination methods | pcoa, rda, cca |
| skbio.tree | Phylogenetic trees | TreeNode |
| skbio.io | File format I/O | FASTA, FASTQ, Newick, PHYLIP |

Microbial Ecology Pipeline

```python
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.distance import permanova
from skbio.stats.ordination import pcoa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate OTU count data (samples × species)
np.random.seed(42)
n_species = 50

# Two groups: healthy (10) and diseased (10)
healthy_counts = np.random.poisson(lam=5, size=(10, n_species))
diseased_counts = np.random.poisson(lam=3, size=(10, n_species))
# Increase abundance of some taxa in diseased
diseased_counts[:, :5] *= 3

counts = np.vstack([healthy_counts, diseased_counts])
sample_ids = [f'H{i}' for i in range(10)] + [f'D{i}' for i in range(10)]
grouping = pd.Series(['healthy'] * 10 + ['diseased'] * 10, index=sample_ids)

# Alpha diversity comparison
# Note: recent scikit-bio releases name 'observed_otus' as 'sobs';
# switch to that name if the older one is rejected.
metrics = ['shannon', 'observed_otus', 'simpson']
for metric in metrics:
    alpha = alpha_diversity(metric, counts, ids=sample_ids)
    h_mean = alpha[grouping == 'healthy'].mean()
    d_mean = alpha[grouping == 'diseased'].mean()
    print(f"{metric}: healthy={h_mean:.3f}, diseased={d_mean:.3f}")

# Beta diversity + PCoA
bc_dm = beta_diversity('braycurtis', counts, ids=sample_ids)

# PERMANOVA test for group differences
result = permanova(bc_dm, grouping, permutations=999)
print(f"\nPERMANOVA: F={result['test statistic']:.3f}, "
      f"p={result['p-value']:.4f}")

# Ordination plot
pcoa_results = pcoa(bc_dm)
samples = pcoa_results.samples
prop_explained = pcoa_results.proportion_explained

fig, ax = plt.subplots(figsize=(6, 5))
for group, color in [('healthy', '#2ecc71'), ('diseased', '#e74c3c')]:
    mask = grouping == group
    ax.scatter(samples.loc[mask, 'PC1'], samples.loc[mask, 'PC2'],
               c=color, label=group, s=60, edgecolor='white', linewidth=0.5)

ax.set_xlabel(f'PC1 ({prop_explained.iloc[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({prop_explained.iloc[1]:.1%} variance)')
ax.set_title('PCoA of Bray-Curtis Distances')
ax.legend()
fig.tight_layout()
fig.savefig('pcoa_plot.pdf')
print("PCoA plot saved")
```
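The pipeline prints the PERMANOVA pseudo-F and p-value but no effect size. For a one-way design, R² (between-group sum of squares over total sum of squares) can be recovered from the pseudo-F and the degrees of freedom. The sketch below shows that arithmetic; `permanova_r2` is a hypothetical helper written for this example, not a scikit-bio function.

```python
def permanova_r2(pseudo_f, n_samples, n_groups):
    """Recover R^2 = SS_between / SS_total from a one-way PERMANOVA pseudo-F.

    pseudo-F = (SS_between / (a - 1)) / (SS_within / (N - a)), so
    R^2 = F * (a - 1) / (F * (a - 1) + (N - a)).
    """
    df_between = n_groups - 1
    df_within = n_samples - n_groups
    if df_between < 1 or df_within < 1:
        raise ValueError("need at least 2 groups and more samples than groups")
    return (pseudo_f * df_between) / (pseudo_f * df_between + df_within)


# Worked check: SS_between = 2, SS_within = 6, N = 20 samples, a = 2 groups
# gives pseudo-F = (2 / 1) / (6 / 18) = 6 and R^2 = 2 / 8 = 0.25.
print(permanova_r2(6.0, n_samples=20, n_groups=2))
```

With the `result` Series from the pipeline, the call would be `permanova_r2(result['test statistic'], result['sample size'], result['number of groups'])`.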

Phylogenetic Analysis

```python
from skbio import TreeNode, DistanceMatrix
from skbio.tree import nj

# Build a neighbor-joining tree from a distance matrix
ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm_data = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]
dm = DistanceMatrix(dm_data, ids=ids)

# Construct tree
tree = nj(dm)
print("Newick format:")
print(tree)

# Tree operations
print(f"\nTotal tips: {len(list(tree.tips()))}")
print(f"Total nodes: {tree.count()}")

# Path length from each tip to the root
for tip in tree.tips():
    dist_to_root = tip.distance(tree)
    print(f"  {tip.name}: distance to root = {dist_to_root:.3f}")

# Read/write Newick
tree.write('/tmp/phylo_tree.nwk', format='newick')
loaded_tree = TreeNode.read('/tmp/phylo_tree.nwk', format='newick')
```
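To show what neighbor joining does with this distance matrix, the sketch below computes the first step of the textbook algorithm in plain Python: it scores every pair with the Q criterion and picks the pair with the lowest score. This is an illustration of the standard method under the assumption of simple first-found tie-breaking, not a copy of scikit-bio's `nj` implementation; `nj_first_pair` is a made-up helper name.

```python
ids = ['E_coli', 'S_aureus', 'B_subtilis', 'P_aeruginosa', 'M_tuberculosis']
dm = [
    [0.0, 0.7, 0.3, 0.8, 0.9],
    [0.7, 0.0, 0.6, 0.85, 0.75],
    [0.3, 0.6, 0.0, 0.7, 0.8],
    [0.8, 0.85, 0.7, 0.0, 0.65],
    [0.9, 0.75, 0.8, 0.65, 0.0],
]


def nj_first_pair(dm, ids):
    """Return (Q, id_i, id_j) for the pair that neighbor joining merges first.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    """
    n = len(ids)
    row_sums = [sum(row) for row in dm]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dm[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, ids[i], ids[j])
    return best


q, first, second = nj_first_pair(dm, ids)
print(f"First pair joined: {first} + {second} (Q = {q:.2f})")
```

For this matrix the lowest score belongs to E_coli and B_subtilis (Q = -4.2), narrowly ahead of P_aeruginosa and M_tuberculosis (Q = -4.15), which matches the intuition that the closest pair with otherwise large distances is joined first.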

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| alpha_metric | Alpha diversity metric | "shannon" |
| beta_metric | Beta diversity distance metric | "braycurtis" |
| permutations | Number of permutations for statistical tests | 999 |
| n_components | Number of ordination axes to retain | 3 |
| sequence_type | Default sequence class (DNA, RNA, Protein) | "DNA" |
| validate_sequences | Validate characters on sequence creation | true |
| tree_format | Phylogenetic tree I/O format | "newick" |
| random_seed | Seed for reproducible permutation tests | None |
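One way to carry these settings through a script is a small frozen dataclass that mirrors the table. The sketch below is only an illustration: `AnalysisConfig` and its validation rules are hypothetical and are not part of scikit-bio or of this skill's actual configuration loader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalysisConfig:
    """Hypothetical container mirroring the parameter table's defaults."""
    alpha_metric: str = "shannon"
    beta_metric: str = "braycurtis"
    permutations: int = 999
    n_components: int = 3
    sequence_type: str = "DNA"
    validate_sequences: bool = True
    tree_format: str = "newick"
    random_seed: Optional[int] = None

    def __post_init__(self):
        # Basic sanity checks; the allowed values here are assumptions
        if self.permutations < 1:
            raise ValueError("permutations must be a positive integer")
        if self.n_components < 1:
            raise ValueError("n_components must be a positive integer")
        if self.sequence_type not in ("DNA", "RNA", "Protein"):
            raise ValueError("sequence_type must be 'DNA', 'RNA', or 'Protein'")


# Override only what differs from the defaults
config = AnalysisConfig(permutations=9999, random_seed=42)
print(config)
```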

Best Practices

  1. Rarefy count data before diversity analysis: Uneven sequencing depth across samples biases diversity metrics. Rarefy to the minimum sample depth, or use rarefaction curves to choose a depth that retains most samples. Avoid comparing Shannon diversity between samples with very different total counts unless they have been rarefied.

  2. Use appropriate distance metrics for your data type: Bray-Curtis for abundance data, Jaccard for presence/absence, UniFrac for phylogeny-aware distances. Euclidean distance on raw counts is rarely appropriate for ecological data because it is dominated by abundant species and treats shared absences as similarity.

  3. Run PERMANOVA with sufficient permutations: Use at least 999 permutations (9999 for publication). Report an effect size alongside the p-value: R² is the between-group sum of squares divided by the total sum of squares, not the pseudo-F test statistic itself. PERMANOVA is sensitive to unequal group dispersions, so check them with permdisp from skbio.stats.distance (the counterpart of betadisper in R's vegan package).

  4. Validate sequences on import: Sequence objects validate their characters by default (validate=True), and DNA accepts IUPAC degenerate codes such as N, R, and Y without extra options. Invalid characters raise a ValueError at construction. Pass validate=False only for input you have already checked, because skipping validation lets invalid characters pass silently into downstream analysis.

  5. Cache distance matrices for iterative analysis: Distance matrix computation on large datasets (>1000 samples) is slow. Save computed matrices with dm.write('/path/to/dm.txt') and reload with DistanceMatrix.read('/path/to/dm.txt') rather than recomputing for each analysis or visualization tweak.
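The rarefaction step from the first practice can be sketched with NumPy alone: each sample is subsampled without replacement to a common depth. This is a minimal illustration, assuming a samples-by-taxa integer matrix; `rarefy` is a hypothetical helper, and scikit-bio ships its own `subsample_counts` function in `skbio.stats` for single count vectors.

```python
import numpy as np


def rarefy(counts, depth=None, seed=None):
    """Subsample each row (sample) without replacement to the same depth.

    If depth is None, the smallest sample total is used. Samples shallower
    than the requested depth raise an error rather than being dropped.
    """
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=1)
    if depth is None:
        depth = int(totals.min())
    if depth > totals.min():
        raise ValueError("depth exceeds the total count of at least one sample")
    rng = np.random.default_rng(seed)
    # Drawing `depth` reads without replacement from a sample's counts is a
    # multivariate hypergeometric draw over its taxa.
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])


counts = np.array([
    [10, 20, 5, 0, 3],
    [0, 15, 25, 10, 1],
    [30, 5, 0, 0, 8],
])
rarefied = rarefy(counts, seed=42)
print(rarefied)
print("Depth per sample:", rarefied.sum(axis=1))
```

Raising on shallow samples is a deliberate choice for the sketch; a real workflow might instead drop those samples and record which ones were removed.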

Common Issues

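"ValueError: Counts must be non-negative integers" or a similarly worded error in diversity functions: Diversity functions expect non-negative counts, and depending on the scikit-bio version, relative abundances or other float input may be rejected. If your data is in relative abundance format, multiply by a scaling factor and round: counts = np.round(rel_abundance * 10000).astype(int). Rounding can push very rare taxa to zero, so choose a scaling factor large enough to preserve them.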

Distance matrix IDs don't match grouping variable: PERMANOVA and ANOSIM match samples by ID, so the distance matrix sample IDs must match the grouping Series index exactly. Verify with set(dm.ids) == set(grouping.index). Mismatches often come from leading or trailing whitespace in sample names, so strip both sets of IDs before analysis.
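A quick diagnostic for this mismatch can be written in plain Python before calling any statistical test. The sketch below is an assumption-laden helper (`compare_ids` is a made-up name) that reports which IDs appear on only one side after stripping whitespace, and which IDs carried stray whitespace in the first place.

```python
def compare_ids(dm_ids, grouping_ids):
    """Report ID mismatches between a distance matrix and a grouping index."""
    dm_clean = {i.strip() for i in dm_ids}
    grouping_clean = {i.strip() for i in grouping_ids}
    return {
        # Still unmatched after stripping: genuinely missing samples
        "only_in_distance_matrix": sorted(dm_clean - grouping_clean),
        "only_in_grouping": sorted(grouping_clean - dm_clean),
        # IDs that would match only once whitespace is stripped
        "had_whitespace": sorted(
            i for i in set(dm_ids) | set(grouping_ids) if i != i.strip()
        ),
    }


# Made-up IDs standing in for dm.ids and grouping.index
report = compare_ids(
    dm_ids=["H0", "H1", "D0", "D1"],
    grouping_ids=["H0", "H1 ", "D0", "D2"],
)
for key, values in report.items():
    print(f"{key}: {values}")
```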

TreeNode operations fail with "has no attribute" errors: scikit-bio tree methods differ from those in ETE3 or DendroPy. The API uses tree.tips() (not get_leaves()), tree.count() (not len), and node.distance(other) (not get_distance()). Check the scikit-bio TreeNode documentation rather than assuming compatibility with other tree libraries.
