Geniml Dynamic
A scientific computing skill for machine learning on genomic interval data using Geniml — the Python package from the Sheffield Computational Biology Group for building ML models on BED files, region sets, and genomic interval collections.
When to Use This Skill
Choose Geniml Dynamic when:
- Building machine learning models on genomic regions (BED files)
- Computing embeddings for genomic interval sets
- Performing region set similarity and clustering analysis
- Working with collections of BED files for integrative analysis
Consider alternatives when:
- You need peak calling from BAM files (use MACS2)
- You need differential binding analysis (use DiffBind)
- You need general genomic arithmetic (use bedtools)
- You need deep learning on sequences (use Enformer or similar)
Quick Start
claude "Compute embeddings for a collection of BED files and cluster them"
```python
from geniml.region2vec import Region2VecExModel
from geniml.io import RegionSet
import numpy as np

# Load region sets (BED files)
region_sets = [
    RegionSet("sample1_peaks.bed"),
    RegionSet("sample2_peaks.bed"),
    RegionSet("sample3_peaks.bed"),
]

# Train Region2Vec model
model = Region2VecExModel()
model.train(region_sets, epochs=100, embedding_dim=100)

# Get embeddings for each region set
embeddings = []
for rs in region_sets:
    emb = model.encode(rs)
    embeddings.append(emb)
embeddings = np.array(embeddings)
print(f"Embedding matrix: {embeddings.shape}")

# Compute pairwise similarities
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(embeddings)
print("Similarity matrix:")
print(similarity)
```
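The quick start asks for clustering as well as similarity. A minimal sketch of hierarchical clustering on the embedding matrix using SciPy; the random matrix below is a stand-in for the `model.encode()` output shown above, and the cluster count of 2 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Stand-in embedding matrix (n_region_sets x embedding_dim);
# in practice this comes from model.encode() as in the quick start.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 100))

# Cosine distance + average-linkage hierarchical clustering
dist = pdist(embeddings, metric="cosine")
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

print(f"Cluster labels: {labels}")
```

Average linkage on cosine distance is a reasonable default for embedding vectors; swap in `fcluster(Z, t=0.5, criterion="distance")` to cut by distance threshold instead of a fixed cluster count.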
Core Concepts
Geniml Modules
| Module | Purpose | Input |
|---|---|---|
| region2vec | Learn region embeddings | BED files |
| eval | Evaluate region set models | Embeddings |
| io | Read/write genomic regions | BED, narrowPeak |
| tokenization | Tokenize genome into regions | Genome assembly |
| search | Search similar region sets | Query BED + database |
Region Tokenization
```python
from geniml.tokenization import TreeTokenizer

# Create genome tokenizer
tokenizer = TreeTokenizer()
tokenizer.fit("hg38.chrom.sizes", resolution=1000)

# Tokenize a BED file
tokens = tokenizer.tokenize("peaks.bed")
print(f"Regions: {len(tokens)}")
print(f"Unique tokens: {len(set(tokens))}")

# Tokens can be used as input for NLP-style models
```
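Conceptually, hard tokenization maps each interval onto the fixed-width genome bins it overlaps. This pure-Python sketch illustrates that idea at 1 kb resolution; it is not the geniml implementation, just the underlying binning scheme.

```python
def tokenize_region(chrom, start, end, resolution=1000):
    """Map an interval to the fixed-width genome bin tokens it overlaps."""
    first_bin = start // resolution
    last_bin = (end - 1) // resolution  # end is exclusive, BED-style
    return [f"{chrom}:{b * resolution}-{(b + 1) * resolution}"
            for b in range(first_bin, last_bin + 1)]

# A 1.7 kb interval spanning three 1 kb bins yields three tokens
tokens = tokenize_region("chr1", 1500, 3200, resolution=1000)
print(tokens)
```

Because every interval collapses onto a shared vocabulary of bins, two BED files tokenized this way become comparable sequences over the same token set, which is what makes NLP-style models applicable.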
Region Set Search
```python
from geniml.search import BED2BEDSearch

# Build search index from a collection
search = BED2BEDSearch()
search.build_index([
    "tf_binding_1.bed",
    "tf_binding_2.bed",
    "histone_marks.bed",
    "open_chromatin.bed",
])

# Query with a new BED file
results = search.query("my_peaks.bed", top_k=5)
for match in results:
    print(f"{match.name}: similarity = {match.score:.3f}")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| embedding_dim | Dimensionality of region embeddings | 100 |
| epochs | Training epochs for Region2Vec | 100 |
| resolution | Genome tokenization resolution (bp) | 1000 |
| genome | Reference genome assembly | hg38 |
| similarity_metric | Cosine, Jaccard, or overlap | cosine |
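The similarity metrics differ in what they operate on: cosine compares embedding vectors, while Jaccard and overlap compare the interval sets directly. A sketch of base-pair Jaccard between two region sets, assuming each set is sorted and internally merged (non-overlapping):

```python
def total_bp(intervals):
    """Total bases covered by a merged interval list."""
    return sum(end - start for start, end in intervals)

def overlap_bp(a, b):
    """Shared bases between two sorted, merged interval lists (merge-walk)."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def jaccard(a, b):
    inter = overlap_bp(a, b)
    union = total_bp(a) + total_bp(b) - inter
    return inter / union if union else 0.0

a = [(100, 200), (300, 400)]
b = [(150, 350)]
print(jaccard(a, b))  # 100 shared bp over a 300 bp union
```

Unlike cosine on embeddings, Jaccard is exact but ignores any learned structure; it is a useful sanity-check baseline for embedding-based similarity.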
Best Practices
- **Use consistent peak calling parameters.** When comparing region sets, ensure they were generated with the same peak caller and parameters. Different peak calling settings produce incomparable region sets that confound downstream ML analysis.
- **Normalize region sets before embedding.** Region sets of very different sizes (100 vs. 100,000 regions) produce embeddings dominated by set size rather than biological content. Consider subsampling large sets or using size-normalized embedding methods.
- **Train on diverse region sets for general embeddings.** Region2Vec models learn more useful representations when trained on diverse data (multiple cell types, assays, and conditions). A model trained on only one cell type won't generalize well.
- **Validate embeddings with known biology.** After computing embeddings, verify that biologically related region sets (same cell type, same TF family) cluster together. If the clustering doesn't match known biology, the model may need retraining or hyperparameter adjustment.
- **Use appropriate resolution for tokenization.** 1 kb resolution works for broad histone marks, but narrow TF binding sites may need 100-200 bp resolution. Match the tokenization resolution to the expected width of your genomic features.
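One way to implement the size-normalization practice above is to subsample every set to a common size, keeping the top-N regions by score when scores are available (e.g. narrowPeak column 5). The helper below is a hypothetical illustration, not a geniml function.

```python
import random

def subsample_regions(regions, n, scores=None, seed=0):
    """Reduce a region set to n regions: top-n by score, else a random sample."""
    if len(regions) <= n:
        return list(regions)
    if scores is not None:
        # Keep the n highest-scoring regions
        order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
        return [regions[i] for i in order[:n]]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(list(regions), n)

regions = [("chr1", i * 1000, i * 1000 + 500) for i in range(10)]
scores = list(range(10))
top3 = subsample_regions(regions, 3, scores=scores)
print(top3)  # the three highest-scoring regions
```

Score-based subsampling biases toward strong peaks, which is usually what you want for comparability; random sampling preserves the score distribution instead. Pick one and apply it uniformly across all sets.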
Common Issues
**Embeddings don't separate known groups.** The embedding dimension may be too low or training epochs too few. Try increasing `embedding_dim` to 200 and `epochs` to 500. Also check that input BED files are properly formatted (chrom, start, end columns).

**Memory error with large region set collections.** Training on thousands of BED files with millions of regions requires significant memory. Subsample regions (keep top N peaks by score) or process in batches. Use disk-backed tokenization for very large collections.

**Search returns unexpected matches.** Genomic regions near telomeres, centromeres, and repeat-rich regions create spurious similarities. Filter blacklisted regions (ENCODE blacklist) from all BED files before computing embeddings or similarity.
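Blacklist filtering is typically done with `bedtools subtract`, but as a self-contained illustration, a pure-Python filter that drops any peak overlapping a blacklist interval (the blacklist entry below is a made-up example, not an actual ENCODE coordinate):

```python
def overlaps_any(region, blacklist):
    """True if the region overlaps any blacklist interval on the same chrom."""
    chrom, start, end = region
    return any(chrom == c and start < e and s < end
               for c, s, e in blacklist)

def filter_blacklist(regions, blacklist):
    """Drop every region that touches a blacklisted interval."""
    return [r for r in regions if not overlaps_any(r, blacklist)]

blacklist = [("chr1", 0, 10000)]  # hypothetical blacklist entry
peaks = [("chr1", 5000, 5500), ("chr1", 20000, 20500), ("chr2", 100, 600)]
kept = filter_blacklist(peaks, blacklist)
print(kept)  # only peaks outside the blacklist remain
```

Apply the same blacklist to every BED file in the collection before embedding or search, so spurious blacklist-driven similarity cannot enter any pairwise comparison.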