
Skill · Cliptics · scientific · v1.0.0 · MIT

Arboreto Engine

A scientific computing skill for gene regulatory network inference using Arboreto — a Python library implementing parallelized algorithms like GRNBoost2 and GENIE3 that scale from single machines to multi-node clusters via Dask.

When to Use This Skill

Choose Arboreto Engine when:

  • Inferring gene regulatory networks (GRNs) from expression data
  • Running GRNBoost2 or GENIE3 on large-scale scRNA-seq datasets
  • Building transcription factor-target gene interaction networks
  • Scaling GRN inference across multi-core machines or Dask clusters

Consider alternatives when:

  • You need full single-cell analysis (use Scanpy)
  • You want regulatory network visualization (use Cytoscape or NetworkX)
  • You need ChIP-seq based regulatory analysis (use MACS2 or Homer)
  • You're doing simple correlation analysis (use pandas/scipy directly)
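
For the last case, a plain co-expression matrix is often all that is needed. A minimal pandas sketch with toy data (the gene names are hypothetical):

```python
import pandas as pd

# Toy expression matrix: 4 cells x 3 genes (hypothetical names)
expr = pd.DataFrame({
    "TF1":   [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with TF1
    "geneB": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with TF1
})

# Pairwise Pearson correlation: the "simple correlation analysis" alternative
corr = expr.corr()
print(round(corr.loc["TF1", "geneA"], 3))  # 1.0
print(round(corr.loc["TF1", "geneB"], 3))  # -1.0
```

Unlike GRN inference, this is symmetric and makes no regulator/target distinction, which is exactly why Arboreto is the better tool when directionality matters.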

Quick Start

claude "Infer a gene regulatory network from my expression data using GRNBoost2"
```python
import pandas as pd
from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

# Load expression data (genes as columns, cells as rows)
expression = pd.read_csv("expression_matrix.csv", index_col=0)
print(f"Expression matrix: {expression.shape[0]} cells × {expression.shape[1]} genes")

# Load transcription factor list
tf_names = load_tf_names("tf_list.txt")

print(f"Transcription factors: {len(tf_names)}")

# Run GRNBoost2 (faster than GENIE3)
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    verbose=True,
)

# Results: DataFrame with TF, target, importance columns
print(network.head(10))

# Filter high-confidence edges
top_edges = network[network["importance"] > network["importance"].quantile(0.99)]
print(f"Top 1% edges: {len(top_edges)}")

# Save network
network.to_csv("grn_output.csv", index=False)
```

Core Concepts

Algorithm Comparison

| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Base model | Stochastic gradient boosting | Random forest |
| Speed | Fast (optimized early stopping) | Slower |
| Scalability | Excellent with Dask | Good with Dask |
| Accuracy | Comparable to GENIE3 | Gold standard |
| Memory | Lower | Higher |
| Recommended for | Large datasets (>10K cells) | Smaller datasets, benchmarking |
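
Both algorithms return the same three-column edge list (TF, target, importance), so switching between them is a one-line change and their rankings can be compared directly. A sketch with toy edge lists standing in for real grnboost2()/genie3() output (the values are illustrative; importance scales differ between the two algorithms, so compare ranks rather than raw scores):

```python
import pandas as pd

# Toy edge lists in Arboreto's output format
grnboost2_net = pd.DataFrame({
    "TF":         ["TF1",   "TF1",   "TF2"],
    "target":     ["geneA", "geneB", "geneA"],
    "importance": [12.0,    3.0,     7.5],
})
genie3_net = pd.DataFrame({
    "TF":         ["TF1",   "TF2",   "TF1"],
    "target":     ["geneA", "geneA", "geneB"],
    "importance": [0.9,     0.5,     0.2],
})

# Rank edges within each network, then join on the (TF, target) pair
def ranked(net):
    out = net.copy()
    out["rank"] = out["importance"].rank(ascending=False)
    return out[["TF", "target", "rank"]]

merged = ranked(grnboost2_net).merge(
    ranked(genie3_net), on=["TF", "target"], suffixes=("_grnboost2", "_genie3")
)
print(merged)
```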

Dask-Parallel Execution

```python
from distributed import Client
from arboreto.algo import grnboost2

# Start Dask local cluster
client = Client(n_workers=8, threads_per_worker=1)
print(f"Dask dashboard: {client.dashboard_link}")

# Run GRNBoost2 with Dask parallelization
network = grnboost2(
    expression_data=expression,
    tf_names=tf_names,
    client_or_address=client,
    verbose=True,
)

client.close()
```

SCENIC Integration

```python
# Arboreto is Step 1 of the SCENIC pipeline:
#   1. GRN inference (Arboreto/GRNBoost2)
#   2. Regulon prediction (RcisTarget — motif enrichment)
#   3. Cellular enrichment (AUCell — activity scoring)
from pyscenic.utils import modules_from_adjacencies

# Convert Arboreto output to SCENIC modules
adjacencies = network  # From grnboost2()
modules = list(modules_from_adjacencies(adjacencies, expression))
print(f"Inferred {len(modules)} regulatory modules")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| tf_names | List of transcription factor gene names | 'all' (every gene is a candidate regulator) |
| client_or_address | Dask client for parallelization | 'local' (in-process) |
| early_stop_window_length | GRNBoost2 early stopping window | 25 |
| seed | Random seed for reproducibility | None |
| verbose | Print progress during inference | False |

Best Practices

  1. Provide a curated transcription factor list. Running GRNBoost2 without a TF list treats every gene as a potential regulator, massively increasing computation time. Use organism-specific TF databases (e.g., AnimalTFDB for human, PlantTFDB for plants).

  2. Filter lowly expressed genes before inference. Remove genes expressed in fewer than 5-10% of cells. Noisy, sparse genes produce unreliable regulatory edges and increase computation time without improving network quality.

  3. Use GRNBoost2 for large datasets, GENIE3 for benchmarking. GRNBoost2 is 10-50x faster than GENIE3 with comparable accuracy. Use GENIE3 when you need to benchmark against the literature or when dataset size is small enough that speed doesn't matter.

  4. Set a random seed for reproducibility. Both algorithms involve randomness (tree-based ensemble methods). Setting seed=42 ensures identical results across runs, which is essential for reproducible research.

  5. Validate networks with independent data. GRN inference produces many false positives. Validate top-ranked edges against known regulatory interactions (RegNetwork, TRRUST) or orthogonal data (ChIP-seq, perturbation experiments).
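
Practices 1, 2, and 4 above boil down to a few lines of preprocessing before the grnboost2() call. A sketch with synthetic data (the 5% threshold and TF names are illustrative, not Arboreto defaults):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # practice 4: fix the seed

# Toy matrix: 100 cells x 3 genes, plus one gene expressed in only 2 cells
expression = pd.DataFrame(
    rng.poisson(2.0, size=(100, 3)), columns=["TF1", "geneA", "geneB"]
).astype(float)
expression["sparse_gene"] = 0.0
expression.loc[[0, 1], "sparse_gene"] = 5.0

# Practice 2: drop genes detected in fewer than 5% of cells
detected_frac = (expression > 0).mean(axis=0)
expression = expression.loc[:, detected_frac >= 0.05]
print(list(expression.columns))  # sparse_gene is gone

# Practice 1: restrict the curated TF list to genes that survived filtering
tf_names = [tf for tf in ["TF1", "TF2"] if tf in expression.columns]
print(tf_names)
```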

Common Issues

GRNBoost2 runs out of memory on large datasets. Reduce the gene count by filtering to highly variable genes (2000-5000). For datasets with >50K cells, subsample to 10-20K representative cells. The regulatory network structure is typically captured with fewer cells than needed for clustering.
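
Both mitigations are plain numpy/pandas operations applied before inference. A sketch with toy sizes standing in for the real ones (here: top 10 genes and 100 cells instead of 2000-5000 genes and 10-20K cells):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in: 500 cells x 50 genes (real data would be far larger)
expression = pd.DataFrame(
    rng.normal(size=(500, 50)), columns=[f"gene{i}" for i in range(50)]
)

# 1. Keep only the most variable genes
top_genes = expression.var(axis=0).nlargest(10).index
expression = expression[top_genes]

# 2. Subsample cells without replacement
cells = rng.choice(expression.index, size=100, replace=False)
expression = expression.loc[cells]

print(expression.shape)  # (100, 10)
```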

Network has too many low-confidence edges. GRNBoost2 outputs an edge for every TF-target pair with non-zero importance. Filter to the top 1-5% of edges by importance score, or use a hard threshold based on your downstream analysis needs. The raw output is not a final network — it requires thresholding.

Dask cluster fails with serialization errors. Ensure the expression data is a pandas DataFrame (not AnnData or sparse matrix). Arboreto expects dense DataFrames. Convert with expression = pd.DataFrame(adata.X.toarray(), index=adata.obs_names, columns=adata.var_names) if starting from AnnData.
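
The same conversion applies to any SciPy sparse matrix, not just AnnData's .X. A sketch using a sparse matrix directly, with stand-ins for the adata.obs_names / adata.var_names labels mentioned above:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for adata.X: a sparse cells x genes matrix
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))
obs_names = ["cell1", "cell2", "cell3"]  # stand-in for adata.obs_names
var_names = ["geneA", "geneB"]           # stand-in for adata.var_names

# Arboreto expects a dense pandas DataFrame, not AnnData or a sparse matrix
expression = pd.DataFrame(X.toarray(), index=obs_names, columns=var_names)
print(expression.shape)  # (3, 2)
```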
