Pro DeepChem

Comprehensive skill for the DeepChem molecular machine learning toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill by Cliptics · scientific · v1.0.0 · MIT


A scientific computing skill for applying deep learning to chemistry, biology, and materials science using DeepChem — the comprehensive Python library providing molecular featurizers, dataset loaders, model architectures, and evaluation tools for scientific machine learning.

When to Use This Skill

Choose Pro DeepChem when:

  • Building ML/DL models for molecular property prediction
  • Training graph neural networks on molecular structures
  • Working with MoleculeNet benchmark datasets
  • Applying deep learning to drug discovery or materials design

Consider alternatives when:

  • You need simple descriptor-based ML (use scikit-learn + RDKit)
  • You need molecular docking (use DiffDock or AutoDock)
  • You need protein structure prediction (use AlphaFold)
  • You need cheminformatics without ML (use Datamol or RDKit)

Quick Start

claude "Train a graph neural network to predict molecular solubility"
```python
import deepchem as dc

# Load MoleculeNet dataset (solubility prediction)
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer="GraphConv"
)
train, valid, test = datasets
print(f"Tasks: {tasks}")
print(f"Train: {len(train)} | Valid: {len(valid)} | Test: {len(test)}")

# Build a Graph Convolutional Network
model = dc.models.GraphConvModel(
    n_tasks=len(tasks),
    mode="regression",
    batch_size=64,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=100)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
train_scores = model.evaluate(train, [metric], transformers)
test_scores = model.evaluate(test, [metric], transformers)
print(f"Train R²: {train_scores['pearson_r2_score']:.4f}")
print(f"Test R²: {test_scores['pearson_r2_score']:.4f}")
```

Core Concepts

DeepChem Pipeline

| Component | Purpose | Examples |
| --- | --- | --- |
| Featurizer | Convert molecules to model inputs | GraphConv, ECFP, Weave |
| Dataset | Store features, labels, weights | NumpyDataset, DiskDataset |
| Model | ML/DL model architectures | GCN, AttentiveFP, MPNN |
| Splitter | Train/valid/test splitting | Scaffold, Random, Stratified |
| Metric | Evaluation metrics | R², RMSE, ROC-AUC |
| Transformer | Data normalization | NormalizationTransformer |

Model Architectures

```python
import deepchem as dc

# Graph Convolutional Network
gcn = dc.models.GraphConvModel(n_tasks=1, mode="regression")

# Attentive Fingerprint (attention-based GNN)
attfp = dc.models.AttentiveFPModel(
    n_tasks=1, mode="regression", num_layers=3, graph_feat_size=200
)

# Message Passing Neural Network
mpnn = dc.models.MPNNModel(
    n_tasks=1, mode="regression", n_atom_feat=75, n_pair_feat=14
)

# Multitask Classifier (fingerprint-based)
mt = dc.models.MultitaskClassifier(
    n_tasks=12, n_features=2048, layer_sizes=[1024, 512]
)
```

MoleculeNet Benchmarks

```python
# Available benchmark datasets
datasets = {
    "delaney": "Solubility prediction (regression)",
    "tox21": "Toxicity prediction (12-task classification)",
    "hiv": "HIV inhibition (classification)",
    "bbbp": "Blood-brain barrier permeation (classification)",
    "sider": "Side effect prediction (27-task classification)",
    "clintox": "Clinical trial toxicity (2-task classification)",
    "qm7": "Quantum mechanical properties (regression)",
    "qm9": "Quantum mechanical properties (regression)",
}

# Load any MoleculeNet dataset
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="GraphConv")
train, valid, test = datasets
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| featurizer | Molecular representation | GraphConv |
| splitter | Data splitting strategy | scaffold |
| n_tasks | Number of prediction targets | Dataset-specific |
| batch_size | Training batch size | 64 |
| learning_rate | Optimizer learning rate | 0.001 |

Best Practices

  1. Use scaffold splitting for realistic evaluation. Random splits leak structural information between train and test sets. Scaffold splitting ensures test molecules are structurally dissimilar from training data, giving a more realistic estimate of performance on novel compounds.

  2. Start with Graph Convolutional Networks. GCNs are a good default for molecular property prediction — they operate directly on molecular graphs without manual feature engineering. Switch to AttentiveFP for better accuracy or fingerprint-based models for faster training.

  3. Apply appropriate transformers. Use NormalizationTransformer for regression tasks to standardize labels. This helps model convergence and makes learning rates more transferable across datasets.

  4. Use MoleculeNet benchmarks for validation. Before applying models to your proprietary data, verify your pipeline on MoleculeNet datasets with published baselines. If your scores are significantly below published results, your pipeline has issues.

  5. Ensemble multiple models for production. No single architecture dominates all molecular property prediction tasks. Train 3-5 diverse models (GCN, AttentiveFP, fingerprint-based) and ensemble their predictions for more robust results.
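The ensembling advice in point 5 amounts to averaging predictions from diverse models. A minimal sketch with placeholder prediction arrays; in practice these would come from `model.predict(test)` for each trained model:

```python
import numpy as np

# Placeholder predictions from three hypothetical models on the same test set
preds_gcn = np.array([0.8, 1.2, -0.3])
preds_attfp = np.array([0.9, 1.0, -0.1])
preds_fp = np.array([0.7, 1.1, -0.2])

# Simple unweighted ensemble: mean prediction across models
ensemble = np.mean([preds_gcn, preds_attfp, preds_fp], axis=0)
print(ensemble)  # [ 0.8  1.1 -0.2]
```

Weighted averages (e.g. weighting by validation score) are a common refinement, but the unweighted mean is a solid default.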

Common Issues

Model training loss doesn't decrease. Check learning rate (try 1e-4 to 1e-2 range), verify that the featurizer matches the model architecture (GraphConv models need GraphConv features), and ensure labels are properly normalized for regression tasks.

Extremely poor test set performance despite good training metrics. Likely overfitting on a small dataset. Use scaffold splitting (not random), add dropout, reduce model capacity, or use data augmentation (SMILES enumeration). For datasets with < 1000 molecules, simpler models often outperform deep learning.

Featurization fails on some molecules. Complex or unusual molecules (organometallics, polymers, salts) may fail featurization. Filter these out before training or use SMILES-based featurizers that handle a broader range of chemical structures.
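Filtering out molecules that fail parsing, as suggested above, can be done up front with RDKit before any featurization. A minimal sketch (the invalid SMILES are deliberate examples):

```python
from rdkit import Chem

smiles = ["CCO", "c1ccccc1", "not_a_smiles", "C(C)(C)(C)(C)C"]

# MolFromSmiles returns None for unparseable or unsanitizable molecules
# (here: a malformed string and a pentavalent carbon)
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(valid)  # ['CCO', 'c1ccccc1']
```

Logging the dropped SMILES alongside the kept ones makes it easier to audit how much of the dataset the filter removes.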
