Torchdrug Studio
Comprehensive skill designed for graph-based drug discovery. Includes structured workflows, validation checks, and reusable patterns for scientific work.
Apply machine learning to drug discovery and molecular science using TorchDrug, a PyTorch-based framework for graph neural networks on molecules. This skill covers molecular property prediction, molecule generation, retrosynthesis, protein representation learning, and knowledge graph reasoning for drug targets.
When to Use This Skill
Choose Torchdrug Studio when you need to:
- Predict molecular properties (solubility, toxicity, binding affinity) from structure
- Generate novel molecules with desired properties using deep generative models
- Plan retrosynthetic routes for target molecules
- Learn protein representations for function prediction or drug-target interaction
Consider alternatives when:
- You need molecular dynamics simulation (use OpenMM or GROMACS)
- You need cheminformatics without deep learning (use RDKit)
- You need custom GNN architectures (use PyTorch Geometric directly)
Quick Start
```bash
pip install torchdrug torch
```
```python
import torch

from torchdrug import core, datasets, models, tasks

# Load molecular property dataset
dataset = datasets.ClinTox(
    "~/data/clintox/",
    node_feature="default",
    edge_feature="default",
)
print(f"Molecules: {len(dataset)}")
print(f"Tasks: {dataset.tasks}")

# Split dataset
train_set, valid_set, test_set = dataset.split()

# Define GIN model for property prediction
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256, 256],
    short_cut=True,
    batch_norm=True,
    concat_hidden=True,
)

# Define prediction task
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=("auprc", "auroc"),
)

# Train
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, train_set, valid_set, test_set, optimizer,
    batch_size=256, gpus=[0],
)
solver.train(num_epoch=100)
solver.evaluate("valid")
```
Core Concepts
TorchDrug Task Types
| Task | Class | Application |
|---|---|---|
| Property Prediction | tasks.PropertyPrediction | ADMET, toxicity, solubility |
| Multi-task | tasks.PropertyPrediction(task=[...]) | Multiple endpoints |
| Generation | tasks.AutoregressiveGeneration | De novo molecule design |
| Retrosynthesis | tasks.Retrosynthesis (CenterIdentification + SynthonCompletion) | Synthesis planning |
| Protein Function | tasks.PropertyPrediction | GO term prediction |
| Interaction | tasks.InteractionPrediction | Drug-target binding |
Molecule Generation
```python
import torch

from torchdrug import core, datasets, models, tasks

# Load training molecules (e.g., ZINC250k)
dataset = datasets.ZINC250k(
    "~/data/zinc250k/",
    node_feature="symbol",
)
train_set, valid_set, test_set = dataset.split()

# Relational GCN backbone for autoregressive (GCPN-style) generation
model = models.RGCN(
    input_dim=dataset.node_feature_dim,
    num_relation=dataset.num_bond_type,
    hidden_dims=[256, 256, 256],
    batch_norm=True,
)
task = tasks.GCPNGeneration(
    model,
    dataset.atom_types,
    max_edge_unroll=12,
    max_node=38,
    criterion="ppo",
    reward_temperature=1,
    agent_update_interval=3,
)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, train_set, valid_set, test_set, optimizer,
    batch_size=128, gpus=[0],
)
solver.train(num_epoch=10)

# Generate molecules
results = task.generate(num_sample=100)
print(f"Generated {len(results)} molecules")
for i, mol in enumerate(results[:5]):
    print(f"  Molecule {i}: {mol.to_smiles()}")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `hidden_dims` | Hidden layer dimensions | `[256, 256, 256]` |
| `batch_norm` | Use batch normalization | `True` |
| `short_cut` | Residual/skip connections | `True` |
| `concat_hidden` | Concatenate all hidden layers for readout | `True` |
| `batch_size` | Training batch size | `256` |
| `learning_rate` | Optimizer learning rate | `1e-3` |
| `num_epoch` | Number of training epochs | `100` |
| `criterion` | Loss function (`bce`, `ce`, `mse`) | Task-dependent |
| `metric` | Evaluation metrics | `("auroc",)` |
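For reference, the defaults above can be summarized as a single configuration fragment. This is an illustrative sketch only: the key names mirror the table, not a file format TorchDrug mandates (TorchDrug itself accepts these values as plain Python keyword arguments).

```yaml
# Illustrative defaults, grouped by where they are passed
model:
  hidden_dims: [256, 256, 256]
  batch_norm: true
  short_cut: true
  concat_hidden: true
task:
  criterion: bce        # bce | ce | mse, depending on the task
  metric: [auprc, auroc]
engine:
  batch_size: 256
  learning_rate: 1.0e-3
  num_epoch: 100
```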
Best Practices
- **Use multi-task learning for ADMET property prediction** — Train a single model on multiple endpoints simultaneously. Multi-task models share molecular representations and often outperform individual models, especially when some endpoints have limited data. Pass multiple task names to `PropertyPrediction`.
- **Start with GIN for property prediction; switch to MPNN for edge features** — GIN (Graph Isomorphism Network) is the most expressive for graph-level tasks when using only node features. If edge features (bond type, stereochemistry) are important, use MPNN or SchNet, which explicitly incorporate edge information.
- **Featurize molecules consistently** — Use `node_feature="default"` for standard atom features (element, degree, charge, hybridization). Custom featurization should be done before splitting to avoid data leakage. Document your featurization scheme for reproducibility.
- **Validate on scaffold splits, not random splits** — Random splits overestimate generalization because training and test molecules may share similar scaffolds. Use Murcko scaffold splits (`dataset.split(method="scaffold")`) to evaluate whether the model generalizes to structurally novel molecules.
- **Monitor both AUROC and AUPRC for imbalanced datasets** — Many molecular property datasets are heavily imbalanced (e.g., 5% toxic, 95% non-toxic). AUROC can be misleadingly high on imbalanced data. AUPRC (area under the precision-recall curve) is more informative for the minority class.
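To see why the last point matters, here is a small self-contained sketch (plain Python, no TorchDrug required; the helper names are illustrative, not library functions) that computes both metrics on a toy imbalanced ranking:

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(labels, scores):
    """Average precision: mean precision at each true-positive rank."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)


# 2 positives among 20 molecules: one ranked 1st, one ranked 10th
scores = list(range(20, 0, -1))
labels = [1 if s in (20, 11) else 0 for s in scores]
print(f"AUROC: {auroc(labels, scores):.3f}")  # ~0.778, looks respectable
print(f"AUPRC: {auprc(labels, scores):.3f}")  # 0.600, a harsher verdict
```

With only two positives, pushing one of them down to rank 10 barely dents AUROC but cuts average precision sharply, which is exactly the behavior you want a minority-class metric to expose.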
Common Issues
**Dataset download fails or is corrupted** — TorchDrug downloads datasets from external URLs that may be slow or unavailable. Set the `path` argument to a persistent directory (not `/tmp`) to avoid re-downloading. If downloads fail, manually download the dataset files and place them in the expected directory.
**Training loss is NaN after a few epochs** — This often happens with molecular data containing unusual atoms or extreme property values. Check for NaN values in the dataset targets, clip extreme values, and reduce the learning rate. Also ensure the loss function matches the task type (BCE for classification, MSE for regression).
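As a minimal illustration of that sanity check (plain Python; a real pipeline would operate on the dataset's target tensors before splitting, and the function name here is hypothetical):

```python
import math


def clean_targets(values, clip=10.0):
    """Drop NaN labels and clip extreme values to a sane range.

    Toy sketch: NaN labels silently poison BCE/MSE losses, and
    un-clipped outliers can blow up gradients early in training.
    """
    cleaned = []
    for v in values:
        if math.isnan(v):
            continue  # drop unlabeled/garbage entries
        cleaned.append(max(-clip, min(clip, v)))
    return cleaned


print(clean_targets([0.3, float("nan"), 250.0, -42.0]))  # [0.3, 10.0, -10.0]
```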
**Generated molecules are invalid or unrealistic** — Autoregressive generation doesn't guarantee chemical validity. Post-filter with RDKit: a SMILES string is valid only if `Chem.MolFromSmiles(smiles)` returns a molecule rather than `None`. Increase training epochs and dataset diversity to improve generation quality.
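A hedged sketch of that post-filter (the helper name is illustrative; it assumes RDKit is installed, and degrades to a no-op if it is not so the pipeline still runs):

```python
try:
    from rdkit import Chem
except ImportError:  # RDKit unavailable: skip chemical validation
    Chem = None


def filter_valid_smiles(smiles_list):
    """Keep only the SMILES strings RDKit can parse into a molecule."""
    if Chem is None:
        return list(smiles_list)
    return [smi for smi in smiles_list
            if Chem.MolFromSmiles(smi) is not None]


generated = ["CCO", "c1ccccc1", "C1CC"]  # last SMILES has an unclosed ring
print(filter_valid_smiles(generated))
```

In practice you would feed this the `to_smiles()` output of the generation task and track the validity rate as a generation-quality metric alongside novelty and uniqueness.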