Datamol Kit

Enterprise-grade skill for Datamol, a Pythonic wrapper around RDKit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for cheminformatics using Datamol — the Python library providing a lightweight, Pythonic abstraction layer over RDKit for molecular manipulation, property calculation, fingerprinting, and chemical data processing in drug discovery workflows.

When to Use This Skill

Choose Datamol Kit when:

  • Processing and standardizing chemical structures from SMILES
  • Calculating molecular descriptors and drug-likeness properties
  • Computing molecular fingerprints for similarity search
  • Preparing chemical datasets for machine learning models

Consider alternatives when:

  • You need full RDKit functionality directly (use RDKit)
  • You need deep learning on molecules (use DeepChem)
  • You need molecular docking (use DiffDock or AutoDock)
  • You need reaction prediction (use RXNMapper or specialized tools)

Quick Start

```shell
claude "Process a set of SMILES and calculate drug-likeness properties"
```

```python
import datamol as dm

# Parse and standardize a molecule
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
mol = dm.to_mol(smiles)
standardized = dm.standardize_mol(mol)
canonical = dm.to_smiles(standardized)
print(f"Canonical SMILES: {canonical}")

# Calculate properties on the standardized molecule
props = dm.descriptors.compute_many_descriptors(standardized)
print(f"MW: {props['mw']:.1f}")
print(f"LogP: {props['clogp']:.2f}")
# The default descriptor set reports donors and acceptors under the Lipinski
# definitions (n_lipinski_hbd / n_lipinski_hba). Inspect props.keys() if your
# datamol version names them differently.
print(f"HBD: {props['n_lipinski_hbd']}")
print(f"HBA: {props['n_lipinski_hba']}")
print(f"TPSA: {props['tpsa']:.1f}")
print(f"Rotatable bonds: {props['n_rotatable_bonds']}")

# Lipinski Rule of 5 check
lipinski = (
    props["mw"] <= 500
    and props["clogp"] <= 5
    and props["n_lipinski_hbd"] <= 5
    and props["n_lipinski_hba"] <= 10
)
print(f"Lipinski compliant: {lipinski}")
```
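
A single boolean hides which rule failed. As a minimal sketch, assuming the descriptor values are already available in a plain dictionary, the helper below lists each exceeded threshold by name. The function name `lipinski_violations`, the key names, and the example values are hypothetical; map the keys to whatever your descriptor dictionary actually contains.

```python
from typing import Dict, List

# Upper limits from Lipinski's Rule of Five. The key names are placeholders
# chosen for this sketch, not names guaranteed by any library.
LIPINSKI_LIMITS: Dict[str, float] = {
    "mw": 500.0,    # molecular weight
    "clogp": 5.0,   # calculated logP
    "hbd": 5.0,     # hydrogen-bond donors
    "hba": 10.0,    # hydrogen-bond acceptors
}


def lipinski_violations(props: Dict[str, float]) -> List[str]:
    """Return the names of the properties that exceed their limit."""
    return [name for name, limit in LIPINSKI_LIMITS.items() if props[name] > limit]


# Hand-entered values for a hypothetical compound, not computed by any toolkit.
example = {"mw": 612.7, "clogp": 5.8, "hbd": 2, "hba": 9}
violations = lipinski_violations(example)
print(f"Violations: {violations}")
print(f"Lipinski compliant: {not violations}")
```

Returning the list rather than a flag makes it easy to log why a compound was filtered out, or to apply a relaxed policy such as tolerating one violation.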

Core Concepts

Datamol Core Functions

| Function | Purpose | Example / Notes |
|---|---|---|
| dm.to_mol() | Parse SMILES to molecule | dm.to_mol("CCO") |
| dm.to_smiles() | Convert molecule to SMILES | dm.to_smiles(mol) |
| dm.standardize_mol() | Standardize representation | Normalizes and reionizes by default; uncharging is opt-in |
| dm.to_inchi() | Convert to InChI identifier | Unique structure ID |
| dm.to_selfies() | Convert to SELFIES notation | ML-friendly encoding |
| dm.from_smarts() | Parse SMARTS pattern | Substructure queries |

Molecular Fingerprints

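```python
import datamol as dm
from rdkit import DataStructs

mols = [dm.to_mol(s) for s in [
    "CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    "CC(=O)NC1=CC=C(O)C=C1",  # Acetaminophen
    "CC12CCC3C(CCC4=CC(=O)CCC34C)C1CCC2O",  # Testosterone (note the ring-A C=C)
]]

# Morgan (circular, ECFP) fingerprints; datamol registers them as fp_type="ecfp".
# as_array=False returns RDKit bit vectors, which DataStructs expects
# (the default return type is a NumPy array).
# The bit length is left at datamol's default because the keyword for it
# follows RDKit and differs between versions.
fps = [dm.to_fp(mol, fp_type="ecfp", radius=2, as_array=False) for mol in mols]

# Tanimoto similarity
sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
print(f"Aspirin vs Acetaminophen: {sim:.3f}")

sim2 = DataStructs.TanimotoSimilarity(fps[0], fps[2])
print(f"Aspirin vs Testosterone: {sim2:.3f}")
```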
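
For reference, the Tanimoto similarity used above is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below computes it on plain Python sets of on-bit indices, so it runs without any chemistry toolkit. The function name `tanimoto` and the example bit sets are made up for illustration.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / len(a | b)


# Made-up on-bit indices standing in for two fingerprints.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 100, 256}

# 3 shared bits out of 6 distinct bits overall
print(f"{tanimoto(fp_a, fp_b):.3f}")  # 0.500
```

Because the score depends only on which bits are set, two fingerprints are comparable only if the same bit index means the same thing in both, which is why type, radius, and length must match.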

Batch Processing

```python
import datamol as dm
import pandas as pd

# Process SMILES dataset
df = pd.read_csv("compounds.csv")

# Parallelized parsing; unparseable SMILES come back as None
df["mol"] = dm.parallelized(dm.to_mol, df["smiles"].tolist(), n_jobs=-1)
# Drop failures and reset the index so the later concat aligns row by row
df = df.dropna(subset=["mol"]).reset_index(drop=True)

# Standardize
df["mol"] = dm.parallelized(dm.standardize_mol, df["mol"].tolist())
df["canonical_smiles"] = dm.parallelized(dm.to_smiles, df["mol"].tolist())

# Calculate descriptors in batch (one dict per molecule)
descriptors = dm.parallelized(
    dm.descriptors.compute_many_descriptors,
    df["mol"].tolist(),
    n_jobs=-1,
)
desc_df = pd.DataFrame(descriptors)
df = pd.concat([df, desc_df], axis=1)
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| fp_type | Fingerprint type (ecfp, i.e. Morgan; maccs; topological) | ecfp (Morgan) |
| n_bits | Fingerprint bit length | 2048 |
| radius | Morgan fingerprint radius | 2 |
| n_jobs | Parallel workers for batch ops | -1 (all cores) |
| standardize | Auto-standardize on parse | true |

Best Practices

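  1. Always standardize molecules before comparison. Different SMILES can represent the same molecule, so apply the same standardization to every structure before computing fingerprints or descriptors. With its default arguments dm.standardize_mol() normalizes functional groups and reionizes the molecule; neutralizing charges is opt-in (uncharge=True), and stripping salts or selecting a canonical tautomer are separate steps that you add when your comparison needs them.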

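  2. Use dm.parallelized() for batch operations. Processing thousands of molecules sequentially is slow. Datamol's parallel wrapper handles batching and multiprocessing for you. The speedup depends on the per-molecule cost: it can approach linear for expensive steps such as descriptor calculation, while for cheap steps such as parsing, the overhead of moving data between processes can cancel much of the gain.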

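  3. Choose fingerprints appropriate to your task. Morgan (ECFP) fingerprints are a common default for similarity search and ML models. MACCS keys are a fixed set of 166 predefined structural keys, which suits coarse substructure-based filtering. Topological torsion fingerprints encode paths of four consecutively bonded atoms, so they emphasize connectivity along chains rather than circular atom environments.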

  4. Filter invalid SMILES early. Use dm.to_mol() with error handling to catch unparseable SMILES before they cause downstream failures. Drop None results immediately rather than propagating invalid molecules through your pipeline.

  5. Store molecules as canonical SMILES. After standardization, convert back to SMILES with dm.to_smiles() for storage. Canonical SMILES are compact, human-readable, and can be re-parsed consistently.
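
The "filter early" practice can be sketched without any chemistry toolkit. The helper below parses each input once, keeps the successes paired with their source string, and sets the failures aside for inspection. `split_parsed` and `toy_parse` are hypothetical names; `toy_parse` is only a stand-in for a real parser such as dm.to_mol() that returns None on failure.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def split_parsed(
    smiles_list: Iterable[str],
    parse: Callable[[str], Optional[object]],
) -> Tuple[List[Tuple[str, object]], List[str]]:
    """Parse each SMILES once; return (kept pairs, rejected inputs)."""
    kept: List[Tuple[str, object]] = []
    rejected: List[str] = []
    for smi in smiles_list:
        mol = parse(smi)
        if mol is None:
            rejected.append(smi)  # keep the raw string for later inspection
        else:
            kept.append((smi, mol))
    return kept, rejected


def toy_parse(smi: str) -> Optional[str]:
    """Stand-in parser: rejects empty strings and unbalanced parentheses."""
    if not smi or smi.count("(") != smi.count(")"):
        return None
    return smi


kept, rejected = split_parsed(["CCO", "CC(=O", "", "c1ccccc1"], toy_parse)
print(f"kept={len(kept)} rejected={rejected}")
```

Keeping the rejected strings, rather than silently dropping them, makes it possible to report how much of a dataset was lost and why.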

Common Issues

dm.to_mol() returns None for valid SMILES. Some SMILES conventions differ between providers. Try with dm.to_mol(smiles, sanitize=False) then dm.standardize_mol() to handle edge cases. If it still fails, the SMILES may contain non-standard notation.

Fingerprint similarity seems wrong. Verify both molecules were standardized and that you're using the same fingerprint parameters (type, radius, n_bits) for both. Different radii capture different structural features and produce incomparable fingerprints.

Parallel processing crashes with large datasets. Avoid holding every molecule object in memory at once: process the data in chunks of 10,000 molecules and write results incrementally. Set n_jobs carefully, because too many workers competing for memory can cause out-of-memory (OOM) failures.
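
The chunk-and-write pattern can be sketched with the standard library alone. The names `chunked`, `process_in_chunks`, and `toy_process` are hypothetical; in a real pipeline `toy_process` would be replaced by the parsing and descriptor steps, and the in-memory buffer by an open file.

```python
import csv
import io
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Sequence, Tuple


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(
    items: Iterable[str],
    process: Callable[[List[str]], Sequence[Tuple[Any, ...]]],
    writer: Any,  # any object with a csv-style writerow() method
    chunk_size: int = 10_000,
) -> int:
    """Process one chunk at a time and write its rows before loading the next."""
    written = 0
    for chunk in chunked(items, chunk_size):
        for row in process(chunk):
            writer.writerow(row)
            written += 1
    return written


def toy_process(chunk: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for real work: pair each SMILES with its string length."""
    return [(smi, len(smi)) for smi in chunk]


buffer = io.StringIO()  # stands in for an output file
count = process_in_chunks(
    ["CCO", "c1ccccc1", "CC(=O)O"], toy_process, csv.writer(buffer), chunk_size=2
)
print(f"rows written: {count}")
```

Only one chunk of intermediate objects is alive at a time, so peak memory is bounded by the chunk size rather than by the dataset size.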
