Torchdrug Studio

Comprehensive skill designed for graph-based drug discovery. Includes structured workflows, validation checks, and reusable patterns for scientific work.


Apply machine learning to drug discovery and molecular science using TorchDrug, a PyTorch-based framework for graph neural networks on molecules. This skill covers molecular property prediction, molecule generation, retrosynthesis, protein representation learning, and knowledge graph reasoning for drug targets.

When to Use This Skill

Choose Torchdrug Studio when you need to:

  • Predict molecular properties (solubility, toxicity, binding affinity) from structure
  • Generate novel molecules with desired properties using deep generative models
  • Plan retrosynthetic routes for target molecules
  • Learn protein representations for function prediction or drug-target interaction

Consider alternatives when:

  • You need molecular dynamics simulation (use OpenMM or GROMACS)
  • You need cheminformatics without deep learning (use RDKit)
  • You need custom GNN architectures (use PyTorch Geometric directly)

Quick Start

```shell
pip install torchdrug torch
```

```python
import torch
from torchdrug import datasets, models, tasks, core

# Load molecular property dataset (downloads on first use)
dataset = datasets.ClinTox(
    "~/data/clintox/", node_feature="default", edge_feature="default"
)
print(f"Molecules: {len(dataset)}")
print(f"Tasks: {dataset.tasks}")

# 80/10/10 random split
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

# Define GIN model for property prediction
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256, 256],
    short_cut=True,
    batch_norm=True,
    concat_hidden=True,
)

# Define prediction task: binary cross-entropy over all ClinTox endpoints
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=("auprc", "auroc"),
)

# Train
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, train_set, valid_set, test_set, optimizer, batch_size=256, gpus=[0]
)
solver.train(num_epoch=100)
solver.evaluate("valid")
```

Core Concepts

TorchDrug Task Types

| Task | Class | Application |
|---|---|---|
| Property prediction | `tasks.PropertyPrediction` | ADMET, toxicity, solubility |
| Multi-task | `tasks.PropertyPrediction(task=[...])` | Multiple endpoints |
| Generation | `tasks.AutoregressiveGeneration` | De novo molecule design |
| Retrosynthesis | `tasks.CenterIdentification` | Synthesis planning |
| Protein function | `tasks.PropertyPrediction` | GO term prediction |
| Interaction | `tasks.InteractionPrediction` | Drug-target binding |

Molecule Generation

```python
import torch
from torchdrug import models, tasks, core, datasets

# Load training molecules; kekulize=True is required for generation tasks
dataset = datasets.ZINC250k(
    "~/data/zinc250k/", kekulize=True, node_feature="symbol"
)

# Relational GCN over bond types backs the GCPN generation task
model = models.RGCN(
    input_dim=dataset.node_feature_dim,
    num_relation=dataset.num_bond_type,
    hidden_dims=[256, 256, 256],
    batch_norm=True,
)
task = tasks.GCPNGeneration(
    model,
    dataset.atom_types,
    max_edge_unroll=12,
    max_node=38,
    criterion="nll",  # pretrain with likelihood; switch to "ppo" with
                      # reward_temperature for property-directed fine-tuning
)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, dataset, None, None, optimizer, batch_size=128, gpus=[0]
)
solver.train(num_epoch=10)

# Generate molecules
results = task.generate(num_sample=100)
print(f"Generated {len(results)} molecules")
for i, smiles in enumerate(results.to_smiles()[:5]):
    print(f"  Molecule {i}: {smiles}")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `hidden_dims` | Hidden layer dimensions | `[256, 256, 256]` |
| `batch_norm` | Use batch normalization | `True` |
| `short_cut` | Residual/skip connections | `True` |
| `concat_hidden` | Concatenate all hidden layers for readout | `True` |
| `batch_size` | Training batch size | `256` |
| `learning_rate` | Optimizer learning rate | `1e-3` |
| `num_epoch` | Training epochs | `100` |
| `criterion` | Loss function (`bce`, `ce`, `mse`) | Task-dependent |
| `metric` | Evaluation metrics | `("auroc",)` |

Best Practices

  1. Use multi-task learning for ADMET property prediction — Train a single model on multiple endpoints simultaneously. Multi-task models share molecular representations and often outperform individual models, especially when some endpoints have limited data. Pass multiple task names to PropertyPrediction.

  2. Start with GIN for property prediction, switch to MPNN for edge features — GIN (Graph Isomorphism Network) is the most expressive for graph-level tasks when using only node features. If edge features (bond type, stereochemistry) are important, use MPNN or SchNet which explicitly incorporate edge information.

  3. Featurize molecules consistently — Use node_feature="default" for standard atom features (element, degree, charge, hybridization). Custom featurization should be done before splitting to avoid data leakage. Document your featurization scheme for reproducibility.

  4. Validate on scaffold splits, not random splits — Random splits overestimate generalization because training and test molecules may share similar scaffolds. Use Murcko scaffold splits, which assign all molecules sharing a scaffold to the same fold, to evaluate whether the model generalizes to structurally novel molecules.
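
A scaffold split can be approximated by grouping molecules on their Murcko scaffold with RDKit and assigning whole groups to a single fold; a minimal sketch (assuming RDKit is installed):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_groups(smiles_list):
    """Group SMILES by Murcko scaffold; acyclic molecules share the '' key."""
    groups = defaultdict(list)
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(smi)
    return groups

groups = scaffold_groups(["CCc1ccccc1", "OCc1ccccc1", "CCO"])
for scaffold, members in groups.items():
    print(scaffold or "<no ring>", members)
```

Assigning each scaffold group wholesale to train, valid, or test guarantees no scaffold appears in more than one fold.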

  5. Monitor both AUROC and AUPRC for imbalanced datasets — Many molecular property datasets are heavily imbalanced (e.g., 5% toxic, 95% non-toxic). AUROC can be misleadingly high on imbalanced data. AUPRC (area under precision-recall curve) is more informative for the minority class.
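
The gap is easy to reproduce with a toy score distribution. Below, AUROC and average precision (a standard AUPRC estimate) are computed from scratch for a 5%-positive dataset; all numbers are illustrative:

```python
def auroc(pos, neg):
    """Probability a random positive outscores a random negative."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """Mean precision at each positive's rank (an AUPRC estimate)."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits

neg = [i / 100 for i in range(95)]           # 95 negatives
pos = [0.505, 0.805, 0.935, 0.955, 0.975]    # 5 positives near the top
print(f"AUROC: {auroc(pos, neg):.3f}")              # → AUROC: 0.876
print(f"AUPRC (AP): {average_precision(pos, neg):.3f}")  # → AUPRC (AP): 0.615
```

The same ranking that looks strong by AUROC is much weaker by AUPRC, because a handful of high-scoring negatives dominate the top of the list relative to only five positives.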

Common Issues

Dataset download fails or is corrupted — TorchDrug downloads datasets from external URLs that may be slow or unavailable. Set the path argument to a persistent directory (not /tmp) to avoid re-downloading. If downloads fail, manually download the dataset files and place them in the expected directory.

Training loss is NaN after a few epochs — This often happens with molecular data containing unusual atoms or extreme property values. Check for NaN values in the dataset (dataset.data.y), clip extreme values, and reduce learning rate. Also ensure the loss function matches the task type (BCE for classification, MSE for regression).
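
A quick pre-training sanity check on the labels, sketched in plain Python (the toy list and clipping bounds are illustrative — substitute your dataset's target values):

```python
import math

targets = [2.3, float("nan"), 5.1, -40.0, 0.7]  # toy regression targets

# Locate NaN labels before they poison the loss
nan_idx = [i for i, y in enumerate(targets) if math.isnan(y)]
print(f"NaN targets at indices: {nan_idx}")

# Clip extreme values to a plausible range (bounds are illustrative)
lo, hi = -10.0, 10.0
clipped = [min(max(y, lo), hi) for y in targets if not math.isnan(y)]
print(f"Clipped targets: {clipped}")
```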

Generated molecules are invalid or unrealistic — Autoregressive generation doesn't guarantee chemical validity. Post-filter with RDKit: a SMILES string is valid only if Chem.MolFromSmiles(smiles) returns a non-None molecule. Increase training epochs and dataset diversity to improve generation quality.
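
That RDKit check can be wrapped in a small helper (assuming RDKit is installed; the canonicalization step is optional and goes beyond the bare validity test):

```python
from rdkit import Chem

def filter_valid(smiles_list):
    """Keep only SMILES that RDKit can parse into a sanitized molecule."""
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # canonicalize on the way out
    return valid

print(filter_valid(["CCO", "not_a_smiles", "c1ccccc1"]))
```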
