
PyTDC Complete

Access curated drug discovery and therapeutic datasets from the Therapeutics Data Commons (TDC) for machine learning applications. This skill covers dataset loading, benchmark evaluation, molecular property prediction, drug-target interaction modeling, and ADMET prediction workflows.

When to Use This Skill

Choose PyTDC Complete when you need to:

  • Access standardized benchmarks for drug discovery ML tasks
  • Load curated datasets for ADMET property prediction
  • Evaluate drug-target interaction or drug combination models
  • Compare ML models against established leaderboards and baselines

Consider alternatives when:

  • You need molecular featurization only (use Molfeat or RDKit)
  • You need molecular generation models (use REINVENT or GuacaMol)
  • You need clinical trial data (use ClinicalTrials.gov API)

Quick Start

```
pip install PyTDC
```

```python
from tdc.single_pred import ADME

# Load a drug absorption dataset
data = ADME(name="Caco2_Wang")
split = data.get_split()

print(f"Train: {len(split['train'])}")
print(f"Valid: {len(split['valid'])}")
print(f"Test: {len(split['test'])}")
print(f"\nColumns: {split['train'].columns.tolist()}")
print(split['train'].head())
```

Core Concepts

Task Categories

| Category | Class | Example Datasets |
| --- | --- | --- |
| Single-pred ADME | ADME | Caco-2, Lipophilicity, Solubility |
| Single-pred Toxicity | Tox | hERG, AMES, LD50 |
| Single-pred HTS | HTS | HIV, SARS-CoV-2 3CLPro |
| Drug-Target | DTI | DAVIS, KIBA, BindingDB |
| Drug-Drug | DDI | DrugBank, TWOSIDES |
| Drug Combination | DrugComb | DrugComb, OncoPolyPharmacology |
| Molecular Generation | MolGen | MOSES, GuacaMol |
| Reaction | Reaction | USPTO, Buchwald-Hartwig |
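Each class in the table lives in one of three TDC submodules (`tdc.single_pred`, `tdc.multi_pred`, `tdc.generation`). As an illustrative sketch, the mapping can be captured in a small dynamic-import helper; `MODULE_FOR_CLASS` and `load_dataset` are names invented here, not part of the TDC API, and the class names simply follow the table above:

```python
import importlib

# Illustrative mapping from loader class (per the table above) to its TDC
# submodule. This dict and helper are sketch-only, not TDC functionality.
MODULE_FOR_CLASS = {
    "ADME": "tdc.single_pred",
    "Tox": "tdc.single_pred",
    "HTS": "tdc.single_pred",
    "DTI": "tdc.multi_pred",
    "DDI": "tdc.multi_pred",
    "DrugComb": "tdc.multi_pred",
    "MolGen": "tdc.generation",
    "Reaction": "tdc.generation",
}

def load_dataset(task_class, name):
    """Dynamically import the right loader, e.g. load_dataset('Tox', 'hERG')."""
    module = importlib.import_module(MODULE_FOR_CLASS[task_class])
    return getattr(module, task_class)(name=name)
```

This keeps dataset selection data-driven, which is convenient when iterating over many task categories in one script.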

ADMET Prediction Pipeline

```python
from tdc.benchmark_group import admet_group
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from molfeat.trans.fp import FPVecTransformer

# Load the ADMET benchmark suite (standardized splits and metrics)
group = admet_group(path="./data")

predictions_all = {}
for task_name in group.dataset_names[:5]:
    benchmark = group.get(task_name)
    train, test = benchmark["train_val"], benchmark["test"]

    # Featurize molecules with 2048-bit Morgan fingerprints
    morgan = FPVecTransformer(kind="morgan", n_bits=2048)
    X_train = morgan(train["Drug"].tolist())
    X_test = morgan(test["Drug"].tolist())
    y_train = train["Y"].values

    # Choose regression or classification based on label cardinality
    if train["Y"].nunique() > 10:
        model = GradientBoostingRegressor(n_estimators=100)
    else:
        model = GradientBoostingClassifier(n_estimators=100)

    # Drop rows with NaN features from training only; impute NaN in the
    # test features so every test row still gets a prediction, keeping
    # prediction and label counts aligned for benchmark evaluation
    ok = ~np.isnan(X_train).any(axis=1)
    model.fit(X_train[ok], y_train[ok])
    preds = model.predict(np.nan_to_num(X_test))

    predictions_all[task_name] = preds
    print(f"{task_name}: trained on {ok.sum()} samples")

# Evaluate this run against the benchmark metrics (use evaluate_many with
# a list of five per-seed prediction dicts for leaderboard submissions)
results = group.evaluate(predictions_all)
print("\nBenchmark Results:")
print(pd.DataFrame(results))
```

Drug-Target Interaction

```python
from tdc.multi_pred import DTI

# Load the DAVIS kinase binding dataset
data = DTI(name="DAVIS")
split = data.get_split()

print(f"Train interactions: {len(split['train'])}")
print(f"Unique drugs: {split['train']['Drug'].nunique()}")
print(f"Unique targets: {split['train']['Target'].nunique()}")

# The dataset provides:
# - Drug: SMILES strings
# - Target: protein sequences
# - Y: binding affinity (Kd values)
print(split['train'][["Drug", "Target", "Y"]].head())
```
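A DTI model needs a joint representation of the drug and its target. As a minimal, library-free sketch (not TDC functionality; the alphabets and function names below are invented for illustration), one classic baseline concatenates a character-composition vector for the SMILES with an amino-acid-composition vector for the protein sequence:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"                 # 20 standard amino acids
SMILES_ALPHABET = "BCNOSPFIHbcnops()=#123"  # crude single-character vocabulary

def composition(text, alphabet):
    """Normalized character-count vector over a fixed alphabet."""
    counts = np.array([text.count(ch) for ch in alphabet], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def featurize_pair(smiles, target_seq):
    """Concatenate drug and target composition vectors into one feature row."""
    return np.concatenate([
        composition(smiles, SMILES_ALPHABET),
        composition(target_seq, AA),
    ])

x = featurize_pair("CC(=O)Oc1ccccc1C(=O)O", "MKTAYIAKQR")
print(x.shape)  # one fixed-length row per (drug, target) pair
```

Stacking one such row per interaction yields a feature matrix you can feed to any standard regressor against the `Y` affinity column; sequence-aware encoders (CNNs, transformers) typically improve on this baseline considerably.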

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| name | Dataset name within the category | Required |
| path | Local data cache directory | "./data" |
| label_name | Column name for labels | "Y" |
| convert_format | Output format (df, dict, DeepPurpose) | "df" |
| split_method | Data splitting strategy | "random" |
| seed | Random seed for splitting | 42 |
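The `seed` parameter makes splits reproducible: calling `get_split` twice with the same seed returns identical partitions. The snippet below illustrates that contract with a plain NumPy shuffle; `random_split` is a stand-in invented for this sketch (not TDC's implementation), assuming the common 70/10/20 train/valid/test fractions:

```python
import numpy as np

def random_split(n_items, seed=42, frac=(0.7, 0.1, 0.2)):
    """Reproducible train/valid/test index split, mirroring the
    seed-controlled semantics of get_split (illustrative only)."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_items)
    n_train = int(frac[0] * n_items)
    n_valid = int(frac[1] * n_items)
    return {
        "train": idx[:n_train],
        "valid": idx[n_train:n_train + n_valid],
        "test": idx[n_train + n_valid:],
    }

a = random_split(1000, seed=42)
b = random_split(1000, seed=42)
print(all((a[k] == b[k]).all() for k in a))  # same seed -> identical split
```

Pinning the seed is what makes results comparable across runs and against other groups using the same benchmark splits.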

Best Practices

  1. Use scaffold splitting for realistic evaluation — The default random split overestimates performance because structurally similar molecules appear in both train and test. Use data.get_split(method="scaffold"), which assigns structurally distinct scaffolds to the test set, mimicking real drug discovery scenarios where models must generalize to new chemotypes.

  2. Benchmark against TDC leaderboards — TDC maintains official leaderboards for each dataset. Use benchmark_group to get standardized splits and evaluation metrics. Compare your model against published baselines before claiming improvements.

  3. Combine multiple ADMET endpoints — Don't optimize for a single property. Use TDC's multi-task datasets to predict solubility, permeability, toxicity, and metabolic stability simultaneously. Multi-task models often outperform single-task models and provide a more complete drug profile.

  4. Handle label noise in HTS data — High-throughput screening datasets have significant measurement noise (10-20% label error rate). Use ensemble models or label smoothing to handle noise, and don't over-optimize test set metrics that may reflect noise rather than signal.

  5. Report confidence intervals — Run experiments with 5 different random seeds and report mean ± standard deviation. Single-run results on small test sets can vary significantly, and TDC leaderboards require confidence intervals for submissions.
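Practice 5 amounts to a small evaluation loop. The sketch below uses synthetic scikit-learn data so it runs anywhere; with TDC you would instead draw each seed's benchmark split (the estimator choice and dataset here are placeholders, not prescribed by TDC):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a TDC regression task, so the sketch is runnable
X, y = make_regression(n_samples=500, n_features=32, noise=10.0, random_state=0)

maes = []
for seed in range(5):  # five seeds, then report mean +/- std
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = GradientBoostingRegressor(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    maes.append(mean_absolute_error(y_te, model.predict(X_te)))

print(f"MAE: {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```

The spread across seeds is the honest error bar: if the standard deviation is comparable to the gap between two models, the "improvement" is not yet demonstrated.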

Common Issues

Dataset download fails or is corrupted — TDC downloads datasets from Harvard Dataverse, which may have connectivity issues. Set a local cache directory with path="./data" and retry downloads. If persistent, download datasets manually from the TDC website and place them in the cache directory.

Scaffold splitting produces very unbalanced splits — Some datasets have low structural diversity, causing scaffold splitting to produce splits where most molecules cluster in one partition. Check split sizes and class balance after splitting. If the test set is too small (<50 samples), consider random splitting with a disclaimer.

Feature dimensions don't match between train and test — If using RDKit descriptors, some descriptors may return NaN for certain molecules. Use the same featurizer object for both splits and handle NaN values consistently (impute or remove) across train and test sets.
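For the feature-mismatch issue above, one robust pattern is to fit a single imputer on the training features and reuse that fitted object on the test features, so both splits get the same dimensionality and the same fill values. A minimal sketch with scikit-learn (the toy arrays stand in for descriptor matrices):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy descriptor matrices with NaN holes, as when RDKit descriptors
# fail for certain molecules
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
X_test = np.array([[2.0, np.nan], [np.nan, np.nan]])

# Fit on train only, then apply the SAME fitted imputer to test:
# test NaNs are filled with train column means, so there is no leakage
# from test statistics and no shape drift between the two splits
imputer = SimpleImputer(strategy="mean")
X_train_clean = imputer.fit_transform(X_train)
X_test_clean = imputer.transform(X_test)

print(np.isnan(X_test_clean).any())  # False: every NaN has been filled
```

The same rule applies to any fitted preprocessing (scalers, feature selectors): fit once on train, transform both splits with the identical object.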
