
Skill · Cliptics · scientific · v1.0.0 · MIT

Arboreto Engine

A scientific computing skill for gene regulatory network inference using Arboreto — a Python library implementing parallelized algorithms like GRNBoost2 and GENIE3 that scale from single machines to multi-node clusters via Dask.

When to Use This Skill

Choose Arboreto Engine when:

  • Inferring gene regulatory networks (GRNs) from expression data
  • Running GRNBoost2 or GENIE3 on large-scale scRNA-seq datasets
  • Building transcription factor-target gene interaction networks
  • Scaling GRN inference across multi-core machines or Dask clusters

Consider alternatives when:

  • You need full single-cell analysis (use Scanpy)
  • You want regulatory network visualization (use Cytoscape or NetworkX)
  • You need ChIP-seq based regulatory analysis (use MACS2 or Homer)
  • You're doing simple correlation analysis (use pandas/scipy directly)
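
For the last case, a plain co-expression matrix is often all that is needed. A minimal pandas sketch with toy data (the gene names are hypothetical):

```python
import pandas as pd

# Toy expression matrix: 4 cells x 3 genes (hypothetical names)
expr = pd.DataFrame({
    "TF1":   [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with TF1
    "geneB": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with TF1
})

# Pairwise Pearson correlation: the "simple correlation analysis" alternative
corr = expr.corr()
print(round(corr.loc["TF1", "geneA"], 3))  # 1.0
print(round(corr.loc["TF1", "geneB"], 3))  # -1.0
```

Unlike GRN inference, this is symmetric and makes no regulator/target distinction, which is exactly why Arboreto is the better tool when directionality matters.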

Quick Start

claude "Infer a gene regulatory network from my expression data using GRNBoost2"
```python
import pandas as pd
from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

# Load expression data (genes as columns, cells as rows)
expression = pd.read_csv("expression_matrix.csv", index_col=0)
print(f"Expression matrix: {expression.shape[0]} cells × {expression.shape[1]} genes")

# Load transcription factor list
tf_names = load_tf_names("tf_list.txt")

print(f"Transcription factors: {len(tf_names)}")

# Run GRNBoost2 (faster than GENIE3)
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    verbose=True,
)

# Results: DataFrame with TF, target, importance columns
print(network.head(10))

# Filter high-confidence edges
top_edges = network[network["importance"] > network["importance"].quantile(0.99)]
print(f"Top 1% edges: {len(top_edges)}")

# Save network
network.to_csv("grn_output.csv", index=False)
```

Core Concepts

Algorithm Comparison

| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Base model | Stochastic gradient boosting | Random forest |
| Speed | Fast (optimized early stopping) | Slower |
| Scalability | Excellent with Dask | Good with Dask |
| Accuracy | Comparable to GENIE3 | Gold standard |
| Memory | Lower | Higher |
| Recommended for | Large datasets (>10K cells) | Smaller datasets, benchmarking |
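
Both algorithms return the same three-column edge list (TF, target, importance), so switching between them is a one-line change and their rankings can be compared directly. A sketch with toy edge lists standing in for real grnboost2()/genie3() output (the values are illustrative; importance scales differ between the two algorithms, so compare ranks rather than raw scores):

```python
import pandas as pd

# Toy edge lists in Arboreto's output format
grnboost2_net = pd.DataFrame({
    "TF":         ["TF1",   "TF1",   "TF2"],
    "target":     ["geneA", "geneB", "geneA"],
    "importance": [12.0,    3.0,     7.5],
})
genie3_net = pd.DataFrame({
    "TF":         ["TF1",   "TF2",   "TF1"],
    "target":     ["geneA", "geneA", "geneB"],
    "importance": [0.9,     0.5,     0.2],
})

# Rank edges within each network, then join on the (TF, target) pair
def ranked(net):
    out = net.copy()
    out["rank"] = out["importance"].rank(ascending=False)
    return out[["TF", "target", "rank"]]

merged = ranked(grnboost2_net).merge(
    ranked(genie3_net), on=["TF", "target"], suffixes=("_grnboost2", "_genie3")
)
print(merged)
```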

Dask-Parallel Execution

```python
from distributed import Client
from arboreto.algo import grnboost2

# Start Dask local cluster
client = Client(n_workers=8, threads_per_worker=1)
print(f"Dask dashboard: {client.dashboard_link}")

# Run GRNBoost2 with Dask parallelization
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    client_or_address=client,
    verbose=True,
)

client.close()
```

SCENIC Integration

```python
# Arboreto is Step 1 of the SCENIC pipeline:
#   1. GRN inference (Arboreto/GRNBoost2)
#   2. Regulon prediction (RcisTarget — motif enrichment)
#   3. Cellular enrichment (AUCell — activity scoring)
from pyscenic.utils import modules_from_adjacencies

# Convert Arboreto output to SCENIC modules
adjacencies = network  # From grnboost2()
modules = list(modules_from_adjacencies(adjacencies, expression))
print(f"Inferred {len(modules)} regulatory modules")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| tf_names | List of transcription factor gene names | 'all' (every gene is a candidate regulator) |
| client_or_address | Dask client for parallelization | 'local' (in-process) |
| early_stop_window_length | GRNBoost2 early stopping window | 25 |
| seed | Random seed for reproducibility | None |
| verbose | Print progress during inference | False |

Best Practices

  1. Provide a curated transcription factor list. Running GRNBoost2 without a TF list treats every gene as a potential regulator, massively increasing computation time. Use organism-specific TF databases (e.g., AnimalTFDB for human, PlantTFDB for plants).

  2. Filter lowly expressed genes before inference. Remove genes expressed in fewer than 5-10% of cells. Noisy, sparse genes produce unreliable regulatory edges and increase computation time without improving network quality.

  3. Use GRNBoost2 for large datasets, GENIE3 for benchmarking. GRNBoost2 is 10-50x faster than GENIE3 with comparable accuracy. Use GENIE3 when you need to benchmark against the literature or when dataset size is small enough that speed doesn't matter.

  4. Set a random seed for reproducibility. Both algorithms involve randomness (tree-based ensemble methods). Setting seed=42 ensures identical results across runs, which is essential for reproducible research.

  5. Validate networks with independent data. GRN inference produces many false positives. Validate top-ranked edges against known regulatory interactions (RegNetwork, TRRUST) or orthogonal data (ChIP-seq, perturbation experiments).
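
Practices 1, 2, and 4 above boil down to a few lines of preprocessing before the grnboost2() call. A sketch with synthetic data (the 5% threshold and TF names are illustrative, not Arboreto defaults):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # practice 4: fix the seed

# Toy matrix: 100 cells x 3 genes, plus one gene expressed in only 2 cells
expression = pd.DataFrame(
    rng.poisson(2.0, size=(100, 3)), columns=["TF1", "geneA", "geneB"]
).astype(float)
expression["sparse_gene"] = 0.0
expression.loc[[0, 1], "sparse_gene"] = 5.0

# Practice 2: drop genes detected in fewer than 5% of cells
detected_frac = (expression > 0).mean(axis=0)
expression = expression.loc[:, detected_frac >= 0.05]
print(list(expression.columns))  # sparse_gene is gone

# Practice 1: restrict the curated TF list to genes that survived filtering
tf_names = [tf for tf in ["TF1", "TF2"] if tf in expression.columns]
print(tf_names)
```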

Common Issues

GRNBoost2 runs out of memory on large datasets. Reduce the gene count by filtering to highly variable genes (2000-5000). For datasets with >50K cells, subsample to 10-20K representative cells. The regulatory network structure is typically captured with fewer cells than needed for clustering.
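
Both mitigations are plain numpy/pandas operations applied before inference. A sketch with toy sizes standing in for the real ones (here: top 10 genes and 100 cells instead of 2000-5000 genes and 10-20K cells):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in: 500 cells x 50 genes (real data would be far larger)
expression = pd.DataFrame(
    rng.normal(size=(500, 50)), columns=[f"gene{i}" for i in range(50)]
)

# 1. Keep only the most variable genes
top_genes = expression.var(axis=0).nlargest(10).index
expression = expression[top_genes]

# 2. Subsample cells without replacement
cells = rng.choice(expression.index, size=100, replace=False)
expression = expression.loc[cells]

print(expression.shape)  # (100, 10)
```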

Network has too many low-confidence edges. GRNBoost2 outputs an edge for every TF-target pair with non-zero importance. Filter to the top 1-5% of edges by importance score, or use a hard threshold based on your downstream analysis needs. The raw output is not a final network — it requires thresholding.

Dask cluster fails with serialization errors. Ensure the expression data is a pandas DataFrame (not AnnData or sparse matrix). Arboreto expects dense DataFrames. Convert with expression = pd.DataFrame(adata.X.toarray(), index=adata.obs_names, columns=adata.var_names) if starting from AnnData.
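
The same conversion applies to any SciPy sparse matrix, not just AnnData's .X. A sketch using a sparse matrix directly, with stand-ins for the adata.obs_names / adata.var_names labels mentioned above:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for adata.X: a sparse cells x genes matrix
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))
obs_names = ["cell1", "cell2", "cell3"]  # stand-in for adata.obs_names
var_names = ["geneA", "geneB"]           # stand-in for adata.var_names

# Arboreto expects a dense pandas DataFrame, not AnnData or a sparse matrix
expression = pd.DataFrame(X.toarray(), index=obs_names, columns=var_names)
print(expression.shape)  # (3, 2)
```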
