Geniml Dynamic

A skill for working with genomic interval data. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT
A scientific computing skill for machine learning on genomic interval data using Geniml, the Python package from the Sheffield Computational Biology Group for building ML models on BED files, region sets, and genomic interval collections.

When to Use This Skill

Choose Geniml Dynamic when:

  • Building machine learning models on genomic regions (BED files)
  • Computing embeddings for genomic interval sets
  • Performing region set similarity and clustering analysis
  • Working with collections of BED files for integrative analysis

Consider alternatives when:

  • You need peak calling from BAM files (use MACS2)
  • You need differential binding analysis (use DiffBind)
  • You need general genomic arithmetic (use bedtools)
  • You need deep learning on sequences (use Enformer or similar)

Quick Start

claude "Compute embeddings for a collection of BED files and cluster them"
    from geniml.region2vec import Region2VecExModel
    from geniml.io import RegionSet
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Load region sets (BED files)
    region_sets = [
        RegionSet("sample1_peaks.bed"),
        RegionSet("sample2_peaks.bed"),
        RegionSet("sample3_peaks.bed"),
    ]

    # Train Region2Vec model
    model = Region2VecExModel()
    model.train(region_sets, epochs=100, embedding_dim=100)

    # Get embeddings for each region set
    embeddings = []
    for rs in region_sets:
        emb = model.encode(rs)
        embeddings.append(emb)

    embeddings = np.array(embeddings)
    print(f"Embedding matrix: {embeddings.shape}")

    # Compute pairwise similarities
    similarity = cosine_similarity(embeddings)
    print("Similarity matrix:")
    print(similarity)
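The prompt above asks for clustering, while the snippet stops at the similarity matrix. One simple way to turn that matrix into groups is single-linkage clustering with a similarity cutoff. The sketch below is independent of Geniml: `threshold_clusters` is a hypothetical helper, the 0.8 cutoff is an arbitrary assumption, and the matrix values are made up for illustration.

```python
def threshold_clusters(similarity, threshold=0.8):
    """Group items whose pairwise similarity is >= threshold (single linkage).

    Hypothetical helper for illustration; not part of Geniml.
    `similarity` is a square matrix (nested lists or a NumPy array).
    """
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up similarity values for three samples.
similarity = [
    [1.00, 0.92, 0.15],
    [0.92, 1.00, 0.20],
    [0.15, 0.20, 1.00],
]
clusters = threshold_clusters(similarity, threshold=0.8)
print(clusters)
```

Each cluster is a list of row indices into the similarity matrix, so the indices map back to the order of `region_sets`. For larger collections, a hierarchical clustering routine from SciPy or scikit-learn is likely a better fit.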

Core Concepts

Geniml Modules

Module        Purpose                        Input
region2vec    Learn region embeddings        BED files
eval          Evaluate region set models     Embeddings
io            Read/write genomic regions     BED, narrowPeak
tokenization  Tokenize genome into regions   Genome assembly
search        Search similar region sets     Query BED + database

Region Tokenization

    from geniml.tokenization import TreeTokenizer

    # Create genome tokenizer
    tokenizer = TreeTokenizer()
    tokenizer.fit("hg38.chrom.sizes", resolution=1000)

    # Tokenize a BED file
    tokens = tokenizer.tokenize("peaks.bed")
    print(f"Regions: {len(tokens)}")
    print(f"Unique tokens: {len(set(tokens))}")

    # Tokens can be used as input for NLP-style models
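To make the effect of `resolution` concrete, here is a minimal sketch of fixed-width binning, written without Geniml. `bin_tokens` is a hypothetical helper and only approximates the idea; Geniml's own tokenizer may map regions onto its vocabulary differently.

```python
def bin_tokens(regions, resolution=1000):
    """Map each (chrom, start, end) region to the fixed-width bins it overlaps.

    Hypothetical helper for illustration; not part of Geniml.
    Coordinates follow BED conventions: 0-based, half-open [start, end).
    """
    tokens = []
    for chrom, start, end in regions:
        if end <= start:
            continue  # skip empty or malformed intervals
        first_bin = start // resolution
        last_bin = (end - 1) // resolution  # the last covered base is end - 1
        for b in range(first_bin, last_bin + 1):
            tokens.append(f"{chrom}:{b * resolution}-{(b + 1) * resolution}")
    return tokens


# Made-up peaks for illustration.
peaks = [("chr1", 950, 1200), ("chr1", 1500, 1800), ("chr2", 0, 100)]
tokens = bin_tokens(peaks, resolution=1000)
print(f"Tokens: {len(tokens)}, unique: {len(set(tokens))}")
```

At 1000 bp the first two peaks share the token `chr1:1000-2000`; rerunning with `resolution=200` keeps them apart, which is the trade-off behind choosing a resolution that matches feature width.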
Region Set Search

    from geniml.search import BED2BEDSearch

    # Build search index from a collection
    search = BED2BEDSearch()
    search.build_index([
        "tf_binding_1.bed",
        "tf_binding_2.bed",
        "histone_marks.bed",
        "open_chromatin.bed",
    ])

    # Query with a new BED file
    results = search.query("my_peaks.bed", top_k=5)
    for match in results:
        print(f"{match.name}: similarity = {match.score:.3f}")

Configuration

Parameter          Description                              Default
embedding_dim      Dimensionality of region embeddings      100
epochs             Training epochs for Region2Vec           100
resolution         Genome tokenization resolution (bp)      1000
genome             Reference genome assembly                hg38
similarity_metric  Cosine, Jaccard, or overlap              cosine
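The `similarity_metric` options differ in what they compare: cosine works on embedding vectors, while Jaccard and overlap work directly on intervals. As a point of reference, a base-pair Jaccard index between two region sets might be computed as in the sketch below. It assumes both sets lie on a single chromosome and uses half-open intervals; the function names are hypothetical and this is not how Geniml implements the metric.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bp(intervals):
    """Number of base pairs covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_bp(a, b):
    """Number of base pairs covered by both interval lists."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Base-pair Jaccard index: |A and B| / |A or B|."""
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0


# Made-up intervals on one chromosome.
set_a = [(100, 200), (300, 400)]
set_b = [(150, 250), (300, 400)]
score = jaccard(set_a, set_b)
print(f"Jaccard: {score:.2f}")
```

Here the two sets share 150 bp out of a 250 bp union. For whole-genome data, the same calculation would be run per chromosome and the intersection and union totals summed before dividing.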

Best Practices

  1. Use consistent peak calling parameters. When comparing region sets, ensure they were generated with the same peak caller and parameters. Different peak calling settings produce incomparable region sets that confound downstream ML analysis.

  2. Normalize region sets before embedding. Region sets of very different sizes (100 vs. 100,000 regions) produce embeddings dominated by set size rather than biological content. Consider subsampling large sets or using size-normalized embedding methods.

  3. Train on diverse region sets for general embeddings. Region2Vec models learn more useful representations when trained on diverse data (multiple cell types, assays, and conditions). A model trained on only one cell type won't generalize well.

  4. Validate embeddings with known biology. After computing embeddings, verify that biologically related region sets (same cell type, same TF family) cluster together. If the clustering doesn't match known biology, the model may need retraining or hyperparameter adjustment.

  5. Use appropriate resolution for tokenization. 1 kb resolution works for broad histone marks, but narrow TF binding sites may need 100-200 bp resolution. Match the tokenization resolution to the expected width of your genomic features.
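The validation step in practice 4 can be made quantitative by comparing the mean cosine similarity within known groups to the mean between groups. The following is a minimal sketch in plain Python; the embeddings and cell-type labels are made-up placeholders, and `within_between` is a hypothetical helper rather than a Geniml function.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean(values):
    return sum(values) / len(values) if values else float("nan")


def within_between(embeddings, labels):
    """Return (mean within-group, mean between-group) cosine similarity."""
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)
    return mean(within), mean(between)


# Toy 2-D embeddings with made-up values; replace with real model output.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["K562", "K562", "HepG2", "HepG2"]
w, b = within_between(embeddings, labels)
print(f"within: {w:.3f}, between: {b:.3f}")
```

A within-group mean clearly above the between-group mean suggests the embeddings reflect the known grouping. When the two are close, retraining or hyperparameter changes are worth considering.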

Common Issues

Embeddings don't separate known groups. The embedding dimension may be too low or training epochs too few. Increase embedding_dim to 200 and epochs to 500. Also check that input BED files are properly formatted (chrom, start, end columns).
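A quick format check can rule out malformed input before any retraining. The sketch below validates only the first three BED columns and is a hypothetical helper, not a Geniml utility. It treats `#`, `track`, and `browser` lines as headers and accepts any iterable of lines.

```python
def check_bed_lines(lines):
    """Return a list of (line_number, problem) for malformed BED records.

    Hypothetical helper for illustration; checks chrom, start, end only.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue  # blank lines and header lines are skipped
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append((number, "fewer than 3 tab-separated columns"))
            continue
        start, end = fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start/end are not non-negative integers"))
        elif int(start) > int(end):
            problems.append((number, "start is greater than end"))
    return problems


# Made-up records: one valid, one space-separated, one with swapped coordinates.
sample = [
    "chr1\t100\t200\tpeak1",
    "chr1 300 400",
    "chr2\t500\t450",
]
problems = check_bed_lines(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

Because the function accepts any iterable of strings, an open file handle can be passed in place of the `sample` list.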

Memory error with large region set collections. Training on thousands of BED files with millions of regions requires significant memory. Subsample regions (keep top N peaks by score) or process in batches. Use disk-backed tokenization for very large collections.
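Keeping the top N peaks by score can be done before the data reaches the model. The following sketch assumes each record is a list of string fields and that the score sits in the fifth BED column (0-based index 4). For narrowPeak files, the signalValue column (index 6) may be a better choice. `top_n_peaks` is a hypothetical helper.

```python
import heapq


def top_n_peaks(records, n, score_column=4):
    """Keep the n highest-scoring records, returned in genomic order.

    Hypothetical helper for illustration; not part of Geniml.
    `score_column` is the 0-based index of the score field.
    """
    top = heapq.nlargest(n, records, key=lambda r: float(r[score_column]))
    # Restore ordering by chromosome name, then start coordinate.
    return sorted(top, key=lambda r: (r[0], int(r[1])))


# Made-up BED5 records: chrom, start, end, name, score.
records = [
    ["chr1", "100", "200", "p1", "50"],
    ["chr1", "300", "400", "p2", "900"],
    ["chr2", "10", "60", "p3", "700"],
    ["chr1", "500", "600", "p4", "20"],
]
kept = top_n_peaks(records, n=2)
print([r[3] for r in kept])
```

`heapq.nlargest` avoids sorting the full list, which helps when N is small relative to the number of peaks.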

Search returns unexpected matches. Regions near telomeres and centromeres, and regions in repeat-rich sequence, often show artifactual signal that creates spurious similarities. Filter blacklisted regions (ENCODE blacklist) from all BED files before computing embeddings or similarity.
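Blacklist filtering amounts to dropping every region that overlaps a blacklisted interval. The sketch below does this in plain Python with a sorted per-chromosome index and binary search. The helper names are hypothetical, and the coordinates are made up rather than taken from the ENCODE blacklist. In practice, `bedtools intersect -v` performs the same operation on files.

```python
import bisect
from collections import defaultdict


def build_blacklist_index(blacklist):
    """Group blacklist intervals by chromosome, sorted and merged."""
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def overlaps_blacklist(region, index):
    """True if a half-open (chrom, start, end) region overlaps the blacklist."""
    chrom, start, end = region
    intervals = index.get(chrom, [])
    # Count the merged intervals that start before the region ends; only
    # the last of those can reach past the region's start.
    pos = bisect.bisect_left(intervals, (end,))
    return pos > 0 and intervals[pos - 1][1] > start


# Made-up coordinates for illustration.
blacklist = [("chr1", 1000, 2000), ("chr1", 1500, 3000), ("chr2", 0, 500)]
peaks = [("chr1", 900, 1000), ("chr1", 2900, 3100), ("chr2", 600, 700), ("chr3", 10, 20)]

index = build_blacklist_index(blacklist)
clean = [p for p in peaks if not overlaps_blacklist(p, index)]
print(clean)
```

The peak ending at position 1000 is kept because half-open intervals that merely touch do not overlap.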
