# PyTDC Complete
Access curated drug discovery and therapeutic datasets from the Therapeutics Data Commons (TDC) for machine learning applications. This skill covers dataset loading, benchmark evaluation, molecular property prediction, drug-target interaction modeling, and ADMET prediction workflows.
## When to Use This Skill
Choose PyTDC Complete when you need to:
- Access standardized benchmarks for drug discovery ML tasks
- Load curated datasets for ADMET property prediction
- Evaluate drug-target interaction or drug combination models
- Compare ML models against established leaderboards and baselines
Consider alternatives when:
- You need molecular featurization only (use Molfeat or RDKit)
- You need molecular generation models (use REINVENT or GuacaMol)
- You need clinical trial data (use ClinicalTrials.gov API)
## Quick Start

```bash
pip install PyTDC
```

```python
from tdc.single_pred import ADME

# Load a drug absorption dataset
data = ADME(name="Caco2_Wang")
split = data.get_split()

print(f"Train: {len(split['train'])}")
print(f"Valid: {len(split['valid'])}")
print(f"Test: {len(split['test'])}")
print(f"\nColumns: {split['train'].columns.tolist()}")
print(split['train'].head())
```
## Core Concepts
### Task Categories
| Category | Class | Example Datasets |
|---|---|---|
| Single-pred ADME | ADME | Caco-2, Lipophilicity, Solubility |
| Single-pred Toxicity | Tox | hERG, AMES, LD50 |
| Single-pred HTS | HTS | HIV, SARS-CoV-2 3CLPro |
| Drug-Target | DTI | DAVIS, KIBA, BindingDB |
| Drug-Drug | DDI | DrugBank, TWOSIDES |
| Drug Combination | DrugComb | DrugComb, OncoPolyPharmacology |
| Molecular Generation | MolGen | MOSES, GuacaMol |
| Reaction | Reaction | USPTO, Buchwald-Hartwig |
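As a compact reference, the table above can be flattened into a lookup dict. The `tdc.single_pred` and `tdc.multi_pred` paths appear elsewhere in this document; the `tdc.generation` paths for the generation and reaction classes are an assumption, so verify them against the TDC docs before relying on this map:

```python
# Plain reference dict built from the task-category table above.
# Real usage is an import, e.g.:  from tdc.single_pred import ADME
TDC_TASK_MAP = {
    "ADME": ("tdc.single_pred", "ADME"),
    "Toxicity": ("tdc.single_pred", "Tox"),
    "HTS": ("tdc.single_pred", "HTS"),
    "Drug-Target": ("tdc.multi_pred", "DTI"),
    "Drug-Drug": ("tdc.multi_pred", "DDI"),
    "Drug Combination": ("tdc.multi_pred", "DrugComb"),
    "Molecular Generation": ("tdc.generation", "MolGen"),  # module path assumed
    "Reaction": ("tdc.generation", "Reaction"),            # module path assumed
}

def import_line(category: str) -> str:
    """Build the 'from X import Y' statement for a task category."""
    module, cls = TDC_TASK_MAP[category]
    return f"from {module} import {cls}"

print(import_line("Drug-Target"))  # from tdc.multi_pred import DTI
```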
### ADMET Prediction Pipeline
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from tdc.benchmark_group import admet_group
from molfeat.trans.fp import FPVecTransformer

# Load ADMET benchmark suite
group = admet_group(path="./data")
predictions_all = {}

for task_name in group.dataset_names[:5]:
    benchmark = group.get(task_name)
    train, test = benchmark["train_val"], benchmark["test"]

    # Featurize molecules
    morgan = FPVecTransformer(kind="morgan", n_bits=2048)
    X_train = morgan(train["Drug"].tolist())
    X_test = morgan(test["Drug"].tolist())
    y_train = train["Y"].values

    # Train model (regression or classification depending on the task)
    if train["Y"].nunique() > 10:
        model = GradientBoostingRegressor(n_estimators=100)
    else:
        model = GradientBoostingClassifier(n_estimators=100)

    # Handle NaN features
    valid_train = ~np.isnan(X_train).any(axis=1)
    model.fit(X_train[valid_train], y_train[valid_train])

    valid_test = ~np.isnan(X_test).any(axis=1)
    preds = model.predict(X_test[valid_test])
    predictions_all[task_name] = preds
    print(f"{task_name}: trained on {valid_train.sum()} samples")

# Evaluate this single run against the benchmark
# (evaluate_many expects a list of per-seed prediction dicts for
# leaderboard submissions; evaluate scores one run)
results = group.evaluate(predictions_all)
print("\nBenchmark Results:")
print(pd.DataFrame(results))
```
### Drug-Target Interaction
```python
from tdc.multi_pred import DTI

# Load DAVIS kinase dataset
data = DTI(name="DAVIS")
split = data.get_split()

print(f"Train interactions: {len(split['train'])}")
print(f"Unique drugs: {split['train']['Drug'].nunique()}")
print(f"Unique targets: {split['train']['Target'].nunique()}")

# The dataset provides:
# - Drug: SMILES strings
# - Target: Protein sequences
# - Y: Binding affinity (Kd values)
print(split['train'][["Drug", "Target", "Y"]].head())
```
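The `Target` column is a raw protein sequence, so a simple DTI baseline needs a sequence featurizer alongside a molecular fingerprint for the `Drug` column. A minimal amino-acid-composition sketch (plain Python and NumPy, no TDC required; the example sequence is made up):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def aa_composition(sequence: str) -> np.ndarray:
    """Fraction of each of the 20 standard amino acids in a sequence."""
    seq = sequence.upper()
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

# Example: a short, hypothetical kinase-like fragment
vec = aa_composition("MKKFFDSRREQGGSGLGSG")
print(vec.shape)             # (20,)
print(round(vec.sum(), 6))   # 1.0
```

Concatenating this 20-dim vector with a Morgan fingerprint of the drug gives a feature matrix any scikit-learn regressor can fit against the `Y` affinity column.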
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `name` | Dataset name within category | Required |
| `path` | Local data cache directory | `"./data"` |
| `label_name` | Column name for labels | `"Y"` |
| `convert_format` | Output format (`df`, `dict`, `DeepPurpose`) | `"df"` |
| `split_method` | Data splitting strategy | `"random"` |
| `seed` | Random seed for splitting | `42` |
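A sketch of the splitting parameters in use (assumes PyTDC is installed and the dataset download succeeds, so it needs network access on first run; `method` and `seed` are passed to `get_split` as in the Best Practices section below):

```python
from tdc.single_pred import ADME

# Cache downloads under ./data so retries reuse local files
data = ADME(name="Caco2_Wang", path="./data")

# Scaffold split with a fixed seed for reproducibility
split = data.get_split(method="scaffold", seed=42)
print({name: len(df) for name, df in split.items()})
```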
## Best Practices
- **Use scaffold splitting for realistic evaluation.** The default random split overestimates performance because similar molecules appear in both train and test. Use `data.get_split(method="scaffold")`, which ensures structurally different molecules land in the test set, mimicking real drug discovery scenarios.
- **Benchmark against TDC leaderboards.** TDC maintains official leaderboards for each dataset. Use `benchmark_group` to get standardized splits and evaluation metrics, and compare your model against published baselines before claiming improvements.
- **Combine multiple ADMET endpoints.** Don't optimize for a single property. Use TDC's multi-task datasets to predict solubility, permeability, toxicity, and metabolic stability simultaneously. Multi-task models often outperform single-task models and provide a more complete drug profile.
- **Handle label noise in HTS data.** High-throughput screening datasets have significant measurement noise (10-20% label error rates). Use ensemble models or label smoothing to handle the noise, and don't over-optimize test-set metrics that may reflect noise rather than signal.
- **Report confidence intervals.** Run experiments with 5 different random seeds and report mean ± standard deviation. Single-run results on small test sets can vary significantly, and TDC leaderboards require multi-seed results for submissions.
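The multi-seed protocol from the last bullet can be sketched with synthetic data (scikit-learn stands in for a real TDC benchmark; only the seed loop and the mean ± std report carry over to a leaderboard run):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression task standing in for an ADMET endpoint
X, y = make_regression(n_samples=300, n_features=16, noise=5.0, random_state=0)

scores = []
for seed in range(5):  # TDC leaderboards expect results over 5 seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GradientBoostingRegressor(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(mean_absolute_error(y_te, model.predict(X_te)))

# Report mean ± standard deviation across seeds
print(f"MAE: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```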
## Common Issues
**Dataset download fails or is corrupted.** TDC downloads datasets from Harvard Dataverse, which can have connectivity issues. Set a local cache directory with `path="./data"` and retry. If the problem persists, download the datasets manually from the TDC website and place them in the cache directory.

**Scaffold splitting produces very unbalanced splits.** Some datasets have low structural diversity, so scaffold splitting can concentrate most molecules in one partition. Check split sizes and class balance after splitting. If the test set is too small (fewer than 50 samples), consider random splitting with a disclaimer.

**Feature dimensions don't match between train and test.** If you use RDKit descriptors, some descriptors may return NaN for certain molecules. Use the same featurizer object for both splits and handle NaN values consistently (impute or remove) across train and test sets.
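One way to keep NaN handling consistent across splits is to fit imputation statistics on train only and reuse them on test. A minimal NumPy sketch (column-mean imputation is an illustrative choice, not a TDC API; any train-derived statistic works the same way):

```python
import numpy as np

def fit_imputer(X_train: np.ndarray) -> np.ndarray:
    """Column means computed on train only (NaNs ignored)."""
    return np.nanmean(X_train, axis=0)

def impute(X: np.ndarray, col_means: np.ndarray) -> np.ndarray:
    """Replace NaNs with the train-derived column means."""
    X = X.copy()
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

X_train = np.array([[1.0, np.nan], [3.0, 4.0]])
X_test = np.array([[np.nan, 10.0]])

means = fit_imputer(X_train)          # [2.0, 4.0]
X_test_clean = impute(X_test, means)  # [[2.0, 10.0]]
print(X_test_clean)
```

Because `means` comes from the training split alone, the test set never leaks into the preprocessing, and both splits keep identical feature dimensions.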