Pro DeepChem
A scientific computing skill for applying deep learning to chemistry, biology, and materials science using DeepChem — the comprehensive Python library providing molecular featurizers, dataset loaders, model architectures, and evaluation tools for scientific machine learning.
When to Use This Skill
Choose Pro DeepChem when:
- Building ML/DL models for molecular property prediction
- Training graph neural networks on molecular structures
- Working with MoleculeNet benchmark datasets
- Applying deep learning to drug discovery or materials design
Consider alternatives when:
- You need simple descriptor-based ML (use scikit-learn + RDKit)
- You need molecular docking (use DiffDock or AutoDock)
- You need protein structure prediction (use AlphaFold)
- You need cheminformatics without ML (use Datamol or RDKit)
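For the first alternative, a descriptor-based baseline needs no deep learning stack at all. A minimal sketch, assuming RDKit and scikit-learn are installed; the SMILES strings and solubility labels are illustrative placeholders:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCCC"]
y = np.array([-0.77, -2.13, 0.09, -2.9])  # hypothetical solubility labels

def featurize(s):
    # Three classic RDKit descriptors: molecular weight, logP, polar surface area
    mol = Chem.MolFromSmiles(s)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]

X = np.array([featurize(s) for s in smiles])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```

If a random forest on a handful of descriptors already reaches acceptable accuracy, a graph neural network may not be worth the extra complexity.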
Quick Start
```shell
claude "Train a graph neural network to predict molecular solubility"
```

```python
import deepchem as dc

# Load MoleculeNet dataset (solubility prediction)
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer="GraphConv"
)
train, valid, test = datasets
print(f"Tasks: {tasks}")
print(f"Train: {len(train)} | Valid: {len(valid)} | Test: {len(test)}")

# Build a Graph Convolutional Network
model = dc.models.GraphConvModel(
    n_tasks=len(tasks),
    mode="regression",
    batch_size=64,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=100)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
train_scores = model.evaluate(train, [metric], transformers)
test_scores = model.evaluate(test, [metric], transformers)
print(f"Train R²: {train_scores['pearson_r2_score']:.4f}")
print(f"Test R²: {test_scores['pearson_r2_score']:.4f}")
```
Core Concepts
DeepChem Pipeline
| Component | Purpose | Examples |
|---|---|---|
| Featurizer | Convert molecules to model inputs | GraphConv, ECFP, Weave |
| Dataset | Store features, labels, weights | NumpyDataset, DiskDataset |
| Model | ML/DL model architectures | GCN, AttentiveFP, MPNN |
| Splitter | Train/valid/test splitting | Scaffold, Random, Stratified |
| Metric | Evaluation metrics | R², RMSE, ROC-AUC |
| Transformer | Data normalization | NormalizationTransformer |
Model Architectures
```python
import deepchem as dc

# Graph Convolutional Network
gcn = dc.models.GraphConvModel(n_tasks=1, mode="regression")

# Attentive Fingerprint (attention-based GNN)
attfp = dc.models.AttentiveFPModel(n_tasks=1, mode="regression",
                                   num_layers=3, graph_feat_size=200)

# Message Passing Neural Network
mpnn = dc.models.MPNNModel(n_tasks=1, mode="regression",
                           n_atom_feat=75, n_pair_feat=14)

# Multitask Classifier (fingerprint-based)
mt = dc.models.MultitaskClassifier(n_tasks=12, n_features=2048,
                                   layer_sizes=[1024, 512])
```
MoleculeNet Benchmarks
```python
# Available benchmark datasets
benchmarks = {
    "delaney": "Solubility prediction (regression)",
    "tox21": "Toxicity prediction (12-task classification)",
    "hiv": "HIV inhibition (classification)",
    "bbbp": "Blood-brain barrier permeation (classification)",
    "sider": "Side effect prediction (27-task classification)",
    "clintox": "Clinical trial toxicity (2-task classification)",
    "qm7": "Quantum mechanical properties (regression)",
    "qm9": "Quantum mechanical properties (regression)",
}

# Load any MoleculeNet dataset
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="GraphConv")
train, valid, test = datasets
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `featurizer` | Molecular representation | `GraphConv` |
| `splitter` | Data splitting strategy | `scaffold` |
| `n_tasks` | Number of prediction targets | Dataset-specific |
| `batch_size` | Training batch size | 64 |
| `learning_rate` | Optimizer learning rate | 0.001 |
Best Practices
- Use scaffold splitting for realistic evaluation. Random splits leak structural information between train and test sets. Scaffold splitting ensures test molecules are structurally dissimilar from training data, giving a more realistic estimate of performance on novel compounds.
- Start with Graph Convolutional Networks. GCNs are a good default for molecular property prediction: they operate directly on molecular graphs without manual feature engineering. Switch to AttentiveFP for better accuracy or fingerprint-based models for faster training.
- Apply appropriate transformers. Use `NormalizationTransformer` for regression tasks to standardize labels. This helps model convergence and makes learning rates more transferable across datasets.
- Use MoleculeNet benchmarks for validation. Before applying models to your proprietary data, verify your pipeline on MoleculeNet datasets with published baselines. If your scores are significantly below published results, your pipeline has issues.
- Ensemble multiple models for production. No single architecture dominates all molecular property prediction tasks. Train 3-5 diverse models (GCN, AttentiveFP, fingerprint-based) and ensemble their predictions for more robust results.
Common Issues
Model training loss doesn't decrease. Check learning rate (try 1e-4 to 1e-2 range), verify that the featurizer matches the model architecture (GraphConv models need GraphConv features), and ensure labels are properly normalized for regression tasks.
Extremely poor test set performance despite good training metrics. Likely overfitting on a small dataset. Use scaffold splitting (not random), add dropout, reduce model capacity, or use data augmentation (SMILES enumeration). For datasets with < 1000 molecules, simpler models often outperform deep learning.
Featurization fails on some molecules. Complex or unusual molecules (organometallics, polymers, salts) may fail featurization. Filter these out before training or use SMILES-based featurizers that handle a broader range of chemical structures.