Pro DeepChem

Comprehensive skill for the DeepChem molecular machine learning toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill by Cliptics · scientific · v1.0.0 · MIT


A scientific computing skill for applying deep learning to chemistry, biology, and materials science using DeepChem — the comprehensive Python library providing molecular featurizers, dataset loaders, model architectures, and evaluation tools for scientific machine learning.

When to Use This Skill

Choose Pro DeepChem when:

  • Building ML/DL models for molecular property prediction
  • Training graph neural networks on molecular structures
  • Working with MoleculeNet benchmark datasets
  • Applying deep learning to drug discovery or materials design

Consider alternatives when:

  • You need simple descriptor-based ML (use scikit-learn + RDKit)
  • You need molecular docking (use DiffDock or AutoDock)
  • You need protein structure prediction (use AlphaFold)
  • You need cheminformatics without ML (use Datamol or RDKit)

Quick Start

claude "Train a graph neural network to predict molecular solubility"
```python
import deepchem as dc

# Load MoleculeNet dataset (solubility prediction)
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer="GraphConv"
)
train, valid, test = datasets
print(f"Tasks: {tasks}")
print(f"Train: {len(train)} | Valid: {len(valid)} | Test: {len(test)}")

# Build a Graph Convolutional Network
model = dc.models.GraphConvModel(
    n_tasks=len(tasks),
    mode="regression",
    batch_size=64,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=100)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
train_scores = model.evaluate(train, [metric], transformers)
test_scores = model.evaluate(test, [metric], transformers)
print(f"Train R²: {train_scores['pearson_r2_score']:.4f}")
print(f"Test R²: {test_scores['pearson_r2_score']:.4f}")
```

Core Concepts

DeepChem Pipeline

| Component | Purpose | Examples |
| --- | --- | --- |
| Featurizer | Convert molecules to model inputs | GraphConv, ECFP, Weave |
| Dataset | Store features, labels, weights | NumpyDataset, DiskDataset |
| Model | ML/DL model architectures | GCN, AttentiveFP, MPNN |
| Splitter | Train/valid/test splitting | Scaffold, Random, Stratified |
| Metric | Evaluation metrics | R², RMSE, ROC-AUC |
| Transformer | Data normalization | NormalizationTransformer |

Model Architectures

```python
import deepchem as dc

# Graph Convolutional Network
gcn = dc.models.GraphConvModel(n_tasks=1, mode="regression")

# Attentive Fingerprint (attention-based GNN)
attfp = dc.models.AttentiveFPModel(
    n_tasks=1, mode="regression", num_layers=3, graph_feat_size=200
)

# Message Passing Neural Network
mpnn = dc.models.MPNNModel(
    n_tasks=1, mode="regression", n_atom_feat=75, n_pair_feat=14
)

# Multitask Classifier (fingerprint-based)
mt = dc.models.MultitaskClassifier(
    n_tasks=12, n_features=2048, layer_sizes=[1024, 512]
)
```

MoleculeNet Benchmarks

```python
# Available benchmark datasets
datasets = {
    "delaney": "Solubility prediction (regression)",
    "tox21": "Toxicity prediction (12-task classification)",
    "hiv": "HIV inhibition (classification)",
    "bbbp": "Blood-brain barrier permeation (classification)",
    "sider": "Side effect prediction (27-task classification)",
    "clintox": "Clinical trial toxicity (2-task classification)",
    "qm7": "Quantum mechanical properties (regression)",
    "qm9": "Quantum mechanical properties (regression)",
}

# Load any MoleculeNet dataset
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="GraphConv")
train, valid, test = datasets
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| featurizer | Molecular representation | GraphConv |
| splitter | Data splitting strategy | scaffold |
| n_tasks | Number of prediction targets | Dataset-specific |
| batch_size | Training batch size | 64 |
| learning_rate | Optimizer learning rate | 0.001 |

Best Practices

  1. Use scaffold splitting for realistic evaluation. Random splits leak structural information between train and test sets. Scaffold splitting ensures test molecules are structurally dissimilar from training data, giving a more realistic estimate of performance on novel compounds.

  2. Start with Graph Convolutional Networks. GCNs are a good default for molecular property prediction — they operate directly on molecular graphs without manual feature engineering. Switch to AttentiveFP for better accuracy or fingerprint-based models for faster training.

  3. Apply appropriate transformers. Use NormalizationTransformer for regression tasks to standardize labels. This helps model convergence and makes learning rates more transferable across datasets.

  4. Use MoleculeNet benchmarks for validation. Before applying models to your proprietary data, verify your pipeline on MoleculeNet datasets with published baselines. If your scores are significantly below published results, your pipeline has issues.

  5. Ensemble multiple models for production. No single architecture dominates all molecular property prediction tasks. Train 3-5 diverse models (GCN, AttentiveFP, fingerprint-based) and ensemble their predictions for more robust results.
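The ensembling advice in point 5 amounts to averaging predictions from diverse models. A minimal sketch with placeholder prediction arrays; in practice these would come from `model.predict(test)` for each trained model:

```python
import numpy as np

# Placeholder predictions from three hypothetical models on the same test set
preds_gcn = np.array([0.8, 1.2, -0.3])
preds_attfp = np.array([0.9, 1.0, -0.1])
preds_fp = np.array([0.7, 1.1, -0.2])

# Simple unweighted ensemble: mean prediction across models
ensemble = np.mean([preds_gcn, preds_attfp, preds_fp], axis=0)
print(ensemble)  # [ 0.8  1.1 -0.2]
```

Weighted averages (e.g. weighting by validation score) are a common refinement, but the unweighted mean is a solid default.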

Common Issues

Model training loss doesn't decrease. Check learning rate (try 1e-4 to 1e-2 range), verify that the featurizer matches the model architecture (GraphConv models need GraphConv features), and ensure labels are properly normalized for regression tasks.

Extremely poor test set performance despite good training metrics. Likely overfitting on a small dataset. Use scaffold splitting (not random), add dropout, reduce model capacity, or use data augmentation (SMILES enumeration). For datasets with < 1000 molecules, simpler models often outperform deep learning.

Featurization fails on some molecules. Complex or unusual molecules (organometallics, polymers, salts) may fail featurization. Filter these out before training or use SMILES-based featurizers that handle a broader range of chemical structures.
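Filtering out molecules that fail parsing, as suggested above, can be done up front with RDKit before any featurization. A minimal sketch (the invalid SMILES are deliberate examples):

```python
from rdkit import Chem

smiles = ["CCO", "c1ccccc1", "not_a_smiles", "C(C)(C)(C)(C)C"]

# MolFromSmiles returns None for unparseable or unsanitizable molecules
# (here: a malformed string and a pentavalent carbon)
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(valid)  # ['CCO', 'c1ccccc1']
```

Logging the dropped SMILES alongside the kept ones makes it easier to audit how much of the dataset the filter removes.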
