
PyTDC Complete

Access curated drug discovery and therapeutic datasets from the Therapeutics Data Commons (TDC) for machine learning applications. This skill covers dataset loading, benchmark evaluation, molecular property prediction, drug-target interaction modeling, and ADMET prediction workflows.

When to Use This Skill

Choose PyTDC Complete when you need to:

  • Access standardized benchmarks for drug discovery ML tasks
  • Load curated datasets for ADMET property prediction
  • Evaluate drug-target interaction or drug combination models
  • Compare ML models against established leaderboards and baselines

Consider alternatives when:

  • You need molecular featurization only (use Molfeat or RDKit)
  • You need molecular generation models (use REINVENT or GuacaMol)
  • You need clinical trial data (use ClinicalTrials.gov API)

Quick Start

```
pip install PyTDC
```

```python
from tdc.single_pred import ADME

# Load a drug absorption dataset
data = ADME(name="Caco2_Wang")
split = data.get_split()

print(f"Train: {len(split['train'])}")
print(f"Valid: {len(split['valid'])}")
print(f"Test: {len(split['test'])}")
print(f"\nColumns: {split['train'].columns.tolist()}")
print(split['train'].head())
```

Core Concepts

Task Categories

| Category | Class | Example Datasets |
| --- | --- | --- |
| Single-pred ADME | ADME | Caco-2, Lipophilicity, Solubility |
| Single-pred Toxicity | Tox | hERG, AMES, LD50 |
| Single-pred HTS | HTS | HIV, SARS-CoV-2 3CLPro |
| Drug-Target | DTI | DAVIS, KIBA, BindingDB |
| Drug-Drug | DDI | DrugBank, TWOSIDES |
| Drug Combination | DrugComb | DrugComb, OncoPolyPharmacology |
| Molecular Generation | MolGen | MOSES, GuacaMol |
| Reaction | Reaction | USPTO, Buchwald-Hartwig |
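Each class in the table lives in one of three TDC submodules (`tdc.single_pred`, `tdc.multi_pred`, `tdc.generation`). As an illustrative sketch, the mapping can be captured in a small dynamic-import helper; `MODULE_FOR_CLASS` and `load_dataset` are names invented here, not part of the TDC API, and the class names simply follow the table above:

```python
import importlib

# Illustrative mapping from loader class (per the table above) to its TDC
# submodule. This dict and helper are sketch-only, not TDC functionality.
MODULE_FOR_CLASS = {
    "ADME": "tdc.single_pred",
    "Tox": "tdc.single_pred",
    "HTS": "tdc.single_pred",
    "DTI": "tdc.multi_pred",
    "DDI": "tdc.multi_pred",
    "DrugComb": "tdc.multi_pred",
    "MolGen": "tdc.generation",
    "Reaction": "tdc.generation",
}

def load_dataset(task_class, name):
    """Dynamically import the right loader, e.g. load_dataset('Tox', 'hERG')."""
    module = importlib.import_module(MODULE_FOR_CLASS[task_class])
    return getattr(module, task_class)(name=name)
```

This keeps dataset selection data-driven, which is convenient when iterating over many task categories in one script.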

ADMET Prediction Pipeline

```python
from tdc.benchmark_group import admet_group
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from molfeat.trans.fp import FPVecTransformer

# Load the ADMET benchmark suite (standardized splits and metrics)
group = admet_group(path="./data")

predictions_all = {}
for task_name in group.dataset_names[:5]:
    benchmark = group.get(task_name)
    train, test = benchmark["train_val"], benchmark["test"]

    # Featurize molecules with 2048-bit Morgan fingerprints
    morgan = FPVecTransformer(kind="morgan", n_bits=2048)
    X_train = morgan(train["Drug"].tolist())
    X_test = morgan(test["Drug"].tolist())
    y_train = train["Y"].values

    # Choose regression or classification based on label cardinality
    if train["Y"].nunique() > 10:
        model = GradientBoostingRegressor(n_estimators=100)
    else:
        model = GradientBoostingClassifier(n_estimators=100)

    # Drop rows with NaN features from training only; impute NaN in the
    # test features so every test row still gets a prediction, keeping
    # prediction and label counts aligned for benchmark evaluation
    ok = ~np.isnan(X_train).any(axis=1)
    model.fit(X_train[ok], y_train[ok])
    preds = model.predict(np.nan_to_num(X_test))

    predictions_all[task_name] = preds
    print(f"{task_name}: trained on {ok.sum()} samples")

# Evaluate this run against the benchmark metrics (use evaluate_many with
# a list of five per-seed prediction dicts for leaderboard submissions)
results = group.evaluate(predictions_all)
print("\nBenchmark Results:")
print(pd.DataFrame(results))
```

Drug-Target Interaction

```python
from tdc.multi_pred import DTI

# Load the DAVIS kinase binding dataset
data = DTI(name="DAVIS")
split = data.get_split()

print(f"Train interactions: {len(split['train'])}")
print(f"Unique drugs: {split['train']['Drug'].nunique()}")
print(f"Unique targets: {split['train']['Target'].nunique()}")

# The dataset provides:
# - Drug: SMILES strings
# - Target: protein sequences
# - Y: binding affinity (Kd values)
print(split['train'][["Drug", "Target", "Y"]].head())
```
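A DTI model needs a joint representation of the drug and its target. As a minimal, library-free sketch (not TDC functionality; the alphabets and function names below are invented for illustration), one classic baseline concatenates a character-composition vector for the SMILES with an amino-acid-composition vector for the protein sequence:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"                 # 20 standard amino acids
SMILES_ALPHABET = "BCNOSPFIHbcnops()=#123"  # crude single-character vocabulary

def composition(text, alphabet):
    """Normalized character-count vector over a fixed alphabet."""
    counts = np.array([text.count(ch) for ch in alphabet], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def featurize_pair(smiles, target_seq):
    """Concatenate drug and target composition vectors into one feature row."""
    return np.concatenate([
        composition(smiles, SMILES_ALPHABET),
        composition(target_seq, AA),
    ])

x = featurize_pair("CC(=O)Oc1ccccc1C(=O)O", "MKTAYIAKQR")
print(x.shape)  # one fixed-length row per (drug, target) pair
```

Stacking one such row per interaction yields a feature matrix you can feed to any standard regressor against the `Y` affinity column; sequence-aware encoders (CNNs, transformers) typically improve on this baseline considerably.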

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| name | Dataset name within the category | Required |
| path | Local data cache directory | "./data" |
| label_name | Column name for labels | "Y" |
| convert_format | Output format (df, dict, DeepPurpose) | "df" |
| split_method | Data splitting strategy | "random" |
| seed | Random seed for splitting | 42 |
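The `seed` parameter makes splits reproducible: calling `get_split` twice with the same seed returns identical partitions. The snippet below illustrates that contract with a plain NumPy shuffle; `random_split` is a stand-in invented for this sketch (not TDC's implementation), assuming the common 70/10/20 train/valid/test fractions:

```python
import numpy as np

def random_split(n_items, seed=42, frac=(0.7, 0.1, 0.2)):
    """Reproducible train/valid/test index split, mirroring the
    seed-controlled semantics of get_split (illustrative only)."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_items)
    n_train = int(frac[0] * n_items)
    n_valid = int(frac[1] * n_items)
    return {
        "train": idx[:n_train],
        "valid": idx[n_train:n_train + n_valid],
        "test": idx[n_train + n_valid:],
    }

a = random_split(1000, seed=42)
b = random_split(1000, seed=42)
print(all((a[k] == b[k]).all() for k in a))  # same seed -> identical split
```

Pinning the seed is what makes results comparable across runs and against other groups using the same benchmark splits.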

Best Practices

  1. Use scaffold splitting for realistic evaluation — The default random split overestimates performance because structurally similar molecules appear in both train and test. Use data.get_split(method="scaffold"), which assigns structurally distinct scaffolds to the test set, mimicking real drug discovery scenarios where models must generalize to new chemotypes.

  2. Benchmark against TDC leaderboards — TDC maintains official leaderboards for each dataset. Use benchmark_group to get standardized splits and evaluation metrics. Compare your model against published baselines before claiming improvements.

  3. Combine multiple ADMET endpoints — Don't optimize for a single property. Use TDC's multi-task datasets to predict solubility, permeability, toxicity, and metabolic stability simultaneously. Multi-task models often outperform single-task models and provide a more complete drug profile.

  4. Handle label noise in HTS data — High-throughput screening datasets have significant measurement noise (10-20% label error rate). Use ensemble models or label smoothing to handle noise, and don't over-optimize test set metrics that may reflect noise rather than signal.

  5. Report confidence intervals — Run experiments with 5 different random seeds and report mean ± standard deviation. Single-run results on small test sets can vary significantly, and TDC leaderboards require confidence intervals for submissions.
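Practice 5 amounts to a small evaluation loop. The sketch below uses synthetic scikit-learn data so it runs anywhere; with TDC you would instead draw each seed's benchmark split (the estimator choice and dataset here are placeholders, not prescribed by TDC):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a TDC regression task, so the sketch is runnable
X, y = make_regression(n_samples=500, n_features=32, noise=10.0, random_state=0)

maes = []
for seed in range(5):  # five seeds, then report mean +/- std
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = GradientBoostingRegressor(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    maes.append(mean_absolute_error(y_te, model.predict(X_te)))

print(f"MAE: {np.mean(maes):.3f} +/- {np.std(maes):.3f}")
```

The spread across seeds is the honest error bar: if the standard deviation is comparable to the gap between two models, the "improvement" is not yet demonstrated.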

Common Issues

Dataset download fails or is corrupted — TDC downloads datasets from Harvard Dataverse, which may have connectivity issues. Set a local cache directory with path="./data" and retry downloads. If persistent, download datasets manually from the TDC website and place them in the cache directory.

Scaffold splitting produces very unbalanced splits — Some datasets have low structural diversity, causing scaffold splitting to produce splits where most molecules cluster in one partition. Check split sizes and class balance after splitting. If the test set is too small (<50 samples), consider random splitting with a disclaimer.

Feature dimensions don't match between train and test — If using RDKit descriptors, some descriptors may return NaN for certain molecules. Use the same featurizer object for both splits and handle NaN values consistently (impute or remove) across train and test sets.
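For the feature-mismatch issue above, one robust pattern is to fit a single imputer on the training features and reuse that fitted object on the test features, so both splits get the same dimensionality and the same fill values. A minimal sketch with scikit-learn (the toy arrays stand in for descriptor matrices):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy descriptor matrices with NaN holes, as when RDKit descriptors
# fail for certain molecules
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
X_test = np.array([[2.0, np.nan], [np.nan, np.nan]])

# Fit on train only, then apply the SAME fitted imputer to test:
# test NaNs are filled with train column means, so there is no leakage
# from test statistics and no shape drift between the two splits
imputer = SimpleImputer(strategy="mean")
X_train_clean = imputer.fit_transform(X_train)
X_test_clean = imputer.transform(X_test)

print(np.isnan(X_test_clean).any())  # False: every NaN has been filled
```

The same rule applies to any fitted preprocessing (scalers, feature selectors): fit once on train, transform both splits with the identical object.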
