Arboreto Engine
A scientific computing skill for gene regulatory network inference using Arboreto — a Python library implementing parallelized algorithms like GRNBoost2 and GENIE3 that scale from single machines to multi-node clusters via Dask.
When to Use This Skill
Choose Arboreto Engine when:
- Inferring gene regulatory networks (GRNs) from expression data
- Running GRNBoost2 or GENIE3 on large-scale scRNA-seq datasets
- Building transcription factor-target gene interaction networks
- Scaling GRN inference across multi-core machines or Dask clusters
Consider alternatives when:
- You need full single-cell analysis (use Scanpy)
- You want regulatory network visualization (use Cytoscape or NetworkX)
- You need ChIP-seq based regulatory analysis (use MACS2 or Homer)
- You're doing simple correlation analysis (use pandas/scipy directly)
Quick Start
claude "Infer a gene regulatory network from my expression data using GRNBoost2"
```python
import pandas as pd
from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

# Load expression data (genes as columns, cells as rows)
expression = pd.read_csv("expression_matrix.csv", index_col=0)
print(f"Expression matrix: {expression.shape[0]} cells × {expression.shape[1]} genes")

# Load transcription factor list
tf_names = load_tf_names("tf_list.txt")
print(f"Transcription factors: {len(tf_names)}")

# Run GRNBoost2 (faster than GENIE3)
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    verbose=True
)

# Results: DataFrame with TF, target, importance columns
print(network.head(10))

# Filter high-confidence edges
top_edges = network[network["importance"] > network["importance"].quantile(0.99)]
print(f"Top 1% edges: {len(top_edges)}")

# Save network
network.to_csv("grn_output.csv", index=False)
```
Core Concepts
Algorithm Comparison
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Base model | Gradient boosting (XGBoost) | Random forest |
| Speed | Fast (optimized early stopping) | Slower |
| Scalability | Excellent with Dask | Good with Dask |
| Accuracy | Comparable to GENIE3 | Gold standard |
| Memory | Lower | Higher |
| Recommended for | Large datasets (>10K cells) | Smaller datasets, benchmarking |
Dask-Parallel Execution
```python
from distributed import Client
from arboreto.algo import grnboost2

# Start Dask local cluster
client = Client(n_workers=8, threads_per_worker=1)
print(f"Dask dashboard: {client.dashboard_link}")

# Run GRNBoost2 with Dask parallelization
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    client_or_address=client,
    verbose=True
)

client.close()
```
SCENIC Integration
```python
# Arboreto is Step 1 of the SCENIC pipeline:
#   1. GRN inference (Arboreto/GRNBoost2)
#   2. Regulon prediction (RcisTarget, motif enrichment)
#   3. Cellular enrichment (AUCell, activity scoring)
from pyscenic.utils import modules_from_adjacencies

# Convert Arboreto output to SCENIC modules
adjacencies = network  # From grnboost2()
modules = list(modules_from_adjacencies(adjacencies, expression))
print(f"Inferred {len(modules)} regulatory modules")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `tf_names` | List of transcription factor gene names | All genes (if not specified) |
| `client_or_address` | Dask client for parallelization | None (local) |
| `early_stop_window_length` | GRNBoost2 early stopping window | 25 |
| `seed` | Random seed for reproducibility | None |
| `verbose` | Print progress during inference | False |
Best Practices
- Provide a curated transcription factor list. Running GRNBoost2 without a TF list treats every gene as a potential regulator, massively increasing computation time. Use organism-specific TF databases (e.g., AnimalTFDB for human, PlantTFDB for plants).
- Filter lowly expressed genes before inference. Remove genes expressed in fewer than 5-10% of cells. Noisy, sparse genes produce unreliable regulatory edges and increase computation time without improving network quality.
- Use GRNBoost2 for large datasets, GENIE3 for benchmarking. GRNBoost2 is 10-50x faster than GENIE3 with comparable accuracy. Use GENIE3 when you need to benchmark against the literature or when the dataset is small enough that speed doesn't matter.
- Set a random seed for reproducibility. Both algorithms involve randomness (tree-based ensemble methods). Setting `seed=42` ensures identical results across runs, which is essential for reproducible research.
- Validate networks with independent data. GRN inference produces many false positives. Validate top-ranked edges against known regulatory interactions (RegNetwork, TRRUST) or orthogonal data (ChIP-seq, perturbation experiments).
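The gene-filtering practice above can be sketched with plain pandas; the 10% detection cutoff and the toy gene names are illustrative assumptions, not Arboreto requirements:

```python
import numpy as np
import pandas as pd

# Toy expression matrix: cells as rows, genes as columns
rng = np.random.default_rng(0)
expression = pd.DataFrame(
    rng.poisson(0.3, size=(200, 5)),
    columns=["TP53", "MYC", "GATA1", "SPARSE1", "SPARSE2"],
)
# Make two genes artificially sparse (detected in very few cells)
expression["SPARSE1"] = 0
expression.loc[expression.index[:3], "SPARSE1"] = 5
expression["SPARSE2"] = 0

# Keep genes detected (count > 0) in at least 10% of cells
min_cells = int(0.10 * expression.shape[0])  # 20 of 200 cells
detected = (expression > 0).sum(axis=0)
filtered = expression.loc[:, detected >= min_cells]

print(f"Kept {filtered.shape[1]} of {expression.shape[1]} genes")
```

The resulting `filtered` DataFrame can be passed directly as `expression_data` to `grnboost2()`.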
Common Issues
GRNBoost2 runs out of memory on large datasets. Reduce the gene count by filtering to highly variable genes (2000-5000). For datasets with >50K cells, subsample to 10-20K representative cells. The regulatory network structure is typically captured with fewer cells than needed for clustering.
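The mitigation above can be sketched with numpy/pandas on a toy matrix; the counts here (20 genes, 200 cells) stand in for the 2,000-5,000 genes and 10-20K cells suggested in the text:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy matrix standing in for a large dataset: 500 cells × 50 genes
expression = pd.DataFrame(
    rng.gamma(2.0, 1.0, size=(500, 50)),
    columns=[f"gene_{i}" for i in range(50)],
)

# 1) Keep the most variable genes (top 20 here; 2000-5000 in practice)
n_top_genes = 20
variances = expression.var(axis=0)
hvg = variances.nlargest(n_top_genes).index
expression_hvg = expression[hvg]

# 2) Subsample cells (200 here; 10-20K in practice)
n_cells = 200
subsampled = expression_hvg.sample(n=n_cells, random_state=0)

print(subsampled.shape)
```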
Network has too many low-confidence edges. GRNBoost2 outputs an edge for every TF-target pair with non-zero importance. Filter to the top 1-5% of edges by importance score, or use a hard threshold based on your downstream analysis needs. The raw output is not a final network — it requires thresholding.
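Thresholding can be sketched as follows; the global quantile cutoff and the per-TF top-k variant are illustrative post-processing choices, not part of the Arboreto API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy Arboreto-style output: one row per TF-target edge
network = pd.DataFrame({
    "TF": rng.choice(["TF1", "TF2", "TF3"], size=300),
    "target": [f"gene_{i}" for i in rng.integers(0, 100, size=300)],
    "importance": rng.exponential(1.0, size=300),
})

# Option A: global quantile cutoff (keep top 5% of edges)
cutoff = network["importance"].quantile(0.95)
top_global = network[network["importance"] >= cutoff]

# Option B: keep the k strongest targets per TF
k = 10
top_per_tf = (
    network.sort_values("importance", ascending=False)
           .groupby("TF", group_keys=False)
           .head(k)
)

print(len(top_global), len(top_per_tf))
```

Option B guarantees every TF keeps some edges, which is useful when a few strong regulators would otherwise dominate a global cutoff.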
Dask cluster fails with serialization errors. Ensure the expression data is a pandas DataFrame (not AnnData or sparse matrix). Arboreto expects dense DataFrames. Convert with `expression = pd.DataFrame(adata.X.toarray(), index=adata.obs_names, columns=adata.var_names)` if starting from AnnData.
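The densification step can be sketched with scipy alone; the random sparse matrix and the `cell_*`/`gene_*` names below stand in for `adata.X`, `adata.obs_names`, and `adata.var_names`:

```python
import pandas as pd
from scipy import sparse

# Toy sparse counts standing in for adata.X (cells × genes)
X = sparse.random(100, 8, density=0.2, format="csr", random_state=0)
obs_names = [f"cell_{i}" for i in range(100)]
var_names = [f"gene_{i}" for i in range(8)]

# Arboreto expects a dense pandas DataFrame
expression = pd.DataFrame(X.toarray(), index=obs_names, columns=var_names)

print(expression.shape)
```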