Torchdrug Studio

Comprehensive skill designed for graph-based drug discovery. Includes structured workflows, validation checks, and reusable patterns for scientific work.


Apply machine learning to drug discovery and molecular science using TorchDrug, a PyTorch-based framework for graph neural networks on molecules. This skill covers molecular property prediction, molecule generation, retrosynthesis, protein representation learning, and knowledge graph reasoning for drug targets.

When to Use This Skill

Choose Torchdrug Studio when you need to:

  • Predict molecular properties (solubility, toxicity, binding affinity) from structure
  • Generate novel molecules with desired properties using deep generative models
  • Plan retrosynthetic routes for target molecules
  • Learn protein representations for function prediction or drug-target interaction

Consider alternatives when:

  • You need molecular dynamics simulation (use OpenMM or GROMACS)
  • You need cheminformatics without deep learning (use RDKit)
  • You need custom GNN architectures (use PyTorch Geometric directly)

Quick Start

```shell
pip install torchdrug torch
```

```python
import torch
from torchdrug import datasets, models, tasks, core

# Load molecular property dataset (downloads on first use)
dataset = datasets.ClinTox(
    "~/data/clintox/", node_feature="default", edge_feature="default"
)
print(f"Molecules: {len(dataset)}")
print(f"Tasks: {dataset.tasks}")

# 80/10/10 random split
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

# Define GIN model for property prediction
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256, 256],
    short_cut=True,
    batch_norm=True,
    concat_hidden=True,
)

# Define prediction task: binary cross-entropy over all ClinTox endpoints
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=("auprc", "auroc"),
)

# Train
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, train_set, valid_set, test_set, optimizer, batch_size=256, gpus=[0]
)
solver.train(num_epoch=100)
solver.evaluate("valid")
```

Core Concepts

TorchDrug Task Types

| Task | Class | Application |
|---|---|---|
| Property prediction | `tasks.PropertyPrediction` | ADMET, toxicity, solubility |
| Multi-task | `tasks.PropertyPrediction(task=[...])` | Multiple endpoints |
| Generation | `tasks.AutoregressiveGeneration` | De novo molecule design |
| Retrosynthesis | `tasks.CenterIdentification` | Synthesis planning |
| Protein function | `tasks.PropertyPrediction` | GO term prediction |
| Interaction | `tasks.InteractionPrediction` | Drug-target binding |

Molecule Generation

```python
import torch
from torchdrug import models, tasks, core, datasets

# Load training molecules; kekulize=True is required for generation tasks
dataset = datasets.ZINC250k(
    "~/data/zinc250k/", kekulize=True, node_feature="symbol"
)

# Relational GCN over bond types backs the GCPN generation task
model = models.RGCN(
    input_dim=dataset.node_feature_dim,
    num_relation=dataset.num_bond_type,
    hidden_dims=[256, 256, 256],
    batch_norm=True,
)
task = tasks.GCPNGeneration(
    model,
    dataset.atom_types,
    max_edge_unroll=12,
    max_node=38,
    criterion="nll",  # pretrain with likelihood; switch to "ppo" with
                      # reward_temperature for property-directed fine-tuning
)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(
    task, dataset, None, None, optimizer, batch_size=128, gpus=[0]
)
solver.train(num_epoch=10)

# Generate molecules
results = task.generate(num_sample=100)
print(f"Generated {len(results)} molecules")
for i, smiles in enumerate(results.to_smiles()[:5]):
    print(f"  Molecule {i}: {smiles}")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `hidden_dims` | Hidden layer dimensions | `[256, 256, 256]` |
| `batch_norm` | Use batch normalization | `True` |
| `short_cut` | Residual/skip connections | `True` |
| `concat_hidden` | Concatenate all hidden layers for readout | `True` |
| `batch_size` | Training batch size | `256` |
| `learning_rate` | Optimizer learning rate | `1e-3` |
| `num_epoch` | Training epochs | `100` |
| `criterion` | Loss function (`bce`, `ce`, `mse`) | Task-dependent |
| `metric` | Evaluation metrics | `("auroc",)` |

Best Practices

  1. Use multi-task learning for ADMET property prediction — Train a single model on multiple endpoints simultaneously. Multi-task models share molecular representations and often outperform individual models, especially when some endpoints have limited data. Pass multiple task names to PropertyPrediction.

  2. Start with GIN for property prediction, switch to MPNN for edge features — GIN (Graph Isomorphism Network) is the most expressive for graph-level tasks when using only node features. If edge features (bond type, stereochemistry) are important, use MPNN or SchNet which explicitly incorporate edge information.

  3. Featurize molecules consistently — Use node_feature="default" for standard atom features (element, degree, charge, hybridization). Custom featurization should be done before splitting to avoid data leakage. Document your featurization scheme for reproducibility.

  4. Validate on scaffold splits, not random splits — Random splits overestimate generalization because training and test molecules may share similar scaffolds. Use Murcko scaffold splits, which assign all molecules sharing a scaffold to the same fold, to evaluate whether the model generalizes to structurally novel molecules.
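
A scaffold split can be approximated by grouping molecules on their Murcko scaffold with RDKit and assigning whole groups to a single fold; a minimal sketch (assuming RDKit is installed):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_groups(smiles_list):
    """Group SMILES by Murcko scaffold; acyclic molecules share the '' key."""
    groups = defaultdict(list)
    for smi in smiles_list:
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(smi)
    return groups

groups = scaffold_groups(["CCc1ccccc1", "OCc1ccccc1", "CCO"])
for scaffold, members in groups.items():
    print(scaffold or "<no ring>", members)
```

Assigning each scaffold group wholesale to train, valid, or test guarantees no scaffold appears in more than one fold.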

  5. Monitor both AUROC and AUPRC for imbalanced datasets — Many molecular property datasets are heavily imbalanced (e.g., 5% toxic, 95% non-toxic). AUROC can be misleadingly high on imbalanced data. AUPRC (area under precision-recall curve) is more informative for the minority class.
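
The gap is easy to reproduce with a toy score distribution. Below, AUROC and average precision (a standard AUPRC estimate) are computed from scratch for a 5%-positive dataset; all numbers are illustrative:

```python
def auroc(pos, neg):
    """Probability a random positive outscores a random negative."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """Mean precision at each positive's rank (an AUPRC estimate)."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits

neg = [i / 100 for i in range(95)]           # 95 negatives
pos = [0.505, 0.805, 0.935, 0.955, 0.975]    # 5 positives near the top
print(f"AUROC: {auroc(pos, neg):.3f}")              # → AUROC: 0.876
print(f"AUPRC (AP): {average_precision(pos, neg):.3f}")  # → AUPRC (AP): 0.615
```

The same ranking that looks strong by AUROC is much weaker by AUPRC, because a handful of high-scoring negatives dominate the top of the list relative to only five positives.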

Common Issues

Dataset download fails or is corrupted — TorchDrug downloads datasets from external URLs that may be slow or unavailable. Set the path argument to a persistent directory (not /tmp) to avoid re-downloading. If downloads fail, manually download the dataset files and place them in the expected directory.

Training loss is NaN after a few epochs — This often happens with molecular data containing unusual atoms or extreme property values. Check for NaN values in the dataset (dataset.data.y), clip extreme values, and reduce learning rate. Also ensure the loss function matches the task type (BCE for classification, MSE for regression).
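
A quick pre-training sanity check on the labels, sketched in plain Python (the toy list and clipping bounds are illustrative — substitute your dataset's target values):

```python
import math

targets = [2.3, float("nan"), 5.1, -40.0, 0.7]  # toy regression targets

# Locate NaN labels before they poison the loss
nan_idx = [i for i, y in enumerate(targets) if math.isnan(y)]
print(f"NaN targets at indices: {nan_idx}")

# Clip extreme values to a plausible range (bounds are illustrative)
lo, hi = -10.0, 10.0
clipped = [min(max(y, lo), hi) for y in targets if not math.isnan(y)]
print(f"Clipped targets: {clipped}")
```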

Generated molecules are invalid or unrealistic — Autoregressive generation doesn't guarantee chemical validity. Post-filter with RDKit: a SMILES string is valid only if Chem.MolFromSmiles(smiles) returns a non-None molecule. Increase training epochs and dataset diversity to improve generation quality.
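
That RDKit check can be wrapped in a small helper (assuming RDKit is installed; the canonicalization step is optional and goes beyond the bare validity test):

```python
from rdkit import Chem

def filter_valid(smiles_list):
    """Keep only SMILES that RDKit can parse into a sanitized molecule."""
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # canonicalize on the way out
    return valid

print(filter_valid(["CCO", "not_a_smiles", "c1ccccc1"]))
```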
