Datamol Kit
Enterprise-grade skill for working with Datamol, a Pythonic wrapper around RDKit. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
A scientific computing skill for cheminformatics using Datamol — the Python library providing a lightweight, Pythonic abstraction layer over RDKit for molecular manipulation, property calculation, fingerprinting, and chemical data processing in drug discovery workflows.
When to Use This Skill
Choose Datamol Kit when:
- Processing and standardizing chemical structures from SMILES
- Calculating molecular descriptors and drug-likeness properties
- Computing molecular fingerprints for similarity search
- Preparing chemical datasets for machine learning models
Consider alternatives when:
- You need full RDKit functionality directly (use RDKit)
- You need deep learning on molecules (use DeepChem)
- You need molecular docking (use DiffDock or AutoDock)
- You need reaction prediction (use RXNMapper or specialized tools)
Quick Start
claude "Process a set of SMILES and calculate drug-likeness properties"
```python
import datamol as dm

# Parse and standardize a molecule
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
mol = dm.to_mol(smiles)
standardized = dm.standardize_mol(mol)
canonical = dm.to_smiles(standardized)
print(f"Canonical SMILES: {canonical}")

# Calculate properties (returns a dict keyed by descriptor name).
# Key names follow datamol's default descriptor set; verify them
# against your installed version.
props = dm.descriptors.compute_many_descriptors(mol)
print(f"MW: {props['mw']:.1f}")
print(f"LogP: {props['clogp']:.2f}")
print(f"HBD: {props['n_lipinski_hbd']}")
print(f"HBA: {props['n_lipinski_hba']}")
print(f"TPSA: {props['tpsa']:.1f}")
print(f"Rotatable bonds: {props['n_rotatable_bonds']}")

# Lipinski Rule of 5 check
lipinski = (
    props["mw"] <= 500
    and props["clogp"] <= 5
    and props["n_lipinski_hbd"] <= 5
    and props["n_lipinski_hba"] <= 10
)
print(f"Lipinski compliant: {lipinski}")
```
Core Concepts
Datamol Core Functions
| Function | Purpose | Example / notes |
|---|---|---|
| `dm.to_mol()` | Parse SMILES to molecule | `dm.to_mol("CCO")` |
| `dm.to_smiles()` | Convert molecule to SMILES | `dm.to_smiles(mol)` |
| `dm.standardize_mol()` | Standardize representation | Normalize and reionize; neutralize with `uncharge=True` |
| `dm.to_inchi()` | Convert to InChI identifier | Unique structure ID |
| `dm.to_selfies()` | Convert to SELFIES notation | ML-friendly encoding |
| `dm.from_smarts()` | Parse SMARTS pattern | Substructure queries |
Molecular Fingerprints
```python
import datamol as dm
from rdkit import DataStructs

mols = [dm.to_mol(s) for s in [
    "CC(=O)Oc1ccccc1C(=O)O",               # Aspirin
    "CC(=O)NC1=CC=C(O)C=C1",               # Acetaminophen
    "CC12CCC3C(CCC4=CC(=O)CCC34C)C1CCC2O", # Testosterone
]]

# Morgan (circular) fingerprints; datamol names this type "ecfp".
# as_array=False returns RDKit bit vectors, which DataStructs expects.
# Radius and size are left at datamol's defaults; check dm.to_fp in your
# installed version for the keyword names that override them.
fps = [dm.to_fp(mol, fp_type="ecfp", as_array=False) for mol in mols]

# Tanimoto similarity
sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
print(f"Aspirin vs Acetaminophen: {sim:.3f}")
sim2 = DataStructs.TanimotoSimilarity(fps[0], fps[2])
print(f"Aspirin vs Testosterone: {sim2:.3f}")
```
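For intuition about what the Tanimoto score in the example above measures, here is a minimal pure-Python sketch that treats each fingerprint as the set of its "on" bit positions. The function name `tanimoto` and the toy bit sets are illustrative assumptions, not part of Datamol or RDKit. The convention for two empty fingerprints is also a choice made here.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints: the ratio is undefined, so this sketch
        # returns 0.0 by convention (an assumption, not a library rule).
        return 0.0
    return len(bits_a & bits_b) / union


# Toy fingerprints with hypothetical bit positions, not real molecules
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 100}
score = tanimoto(fp_a, fp_b)  # 2 shared bits out of 5 distinct bits
print(f"Tanimoto: {score:.3f}")
```

The score only depends on which bits are set, which is why both fingerprints must be generated with identical settings before the comparison means anything.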
Batch Processing
```python
import datamol as dm
import pandas as pd

# Load a SMILES dataset (expects a "smiles" column)
df = pd.read_csv("compounds.csv")

# Parse in parallel; unparseable SMILES come back as None
df["mol"] = dm.parallelized(dm.to_mol, df["smiles"].tolist(), n_jobs=-1)
# Drop failures and reset the index so the descriptor table lines up below
df = df.dropna(subset=["mol"]).reset_index(drop=True)

# Standardize
df["mol"] = dm.parallelized(dm.standardize_mol, df["mol"].tolist())
df["canonical_smiles"] = dm.parallelized(dm.to_smiles, df["mol"].tolist())

# Calculate descriptors in batch (one dict per molecule)
descriptors = dm.parallelized(
    dm.descriptors.compute_many_descriptors,
    df["mol"].tolist(),
    n_jobs=-1,
)
desc_df = pd.DataFrame(descriptors)
df = pd.concat([df, desc_df], axis=1)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `fp_type` | Fingerprint type (morgan, maccs, topological) | morgan |
| `n_bits` | Fingerprint bit length | 2048 |
| `radius` | Morgan fingerprint radius | 2 |
| `n_jobs` | Parallel workers for batch ops | -1 (all cores) |
| `standardize` | Auto-standardize on parse | true |
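One way to carry these settings through a pipeline is a small validated container. The sketch below is hypothetical: the class name `KitConfig`, the `ALLOWED_FP_TYPES` set, and the validation rules are assumptions built only from the table above, not an API that Datamol or this skill provides.

```python
from dataclasses import dataclass

# Hypothetical: allowed values are taken from the table's description column.
ALLOWED_FP_TYPES = {"morgan", "maccs", "topological"}


@dataclass(frozen=True)
class KitConfig:
    """Illustrative settings object mirroring the configuration table."""

    fp_type: str = "morgan"
    n_bits: int = 2048
    radius: int = 2
    n_jobs: int = -1          # -1 means "use all cores"
    standardize: bool = True

    def __post_init__(self) -> None:
        # Fail fast on values the table does not describe.
        if self.fp_type not in ALLOWED_FP_TYPES:
            raise ValueError(f"unknown fp_type: {self.fp_type!r}")
        if self.n_bits <= 0:
            raise ValueError("n_bits must be positive")
        if self.radius < 0:
            raise ValueError("radius must be non-negative")
        if self.n_jobs != -1 and self.n_jobs < 1:
            raise ValueError("n_jobs must be -1 or a positive integer")


config = KitConfig(n_jobs=4)
print(config)
```

Freezing the dataclass keeps one set of fingerprint parameters in force for a whole run, which helps avoid comparing fingerprints built with different settings.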
Best Practices
- Always standardize molecules before comparison. Different SMILES can represent the same molecule. Use `dm.standardize_mol()` to normalize functional groups and reionize the structure (pass `uncharge=True` to neutralize charges) before computing fingerprints or descriptors. Salt removal and tautomer canonicalization are not part of its default behavior and need their own steps.
- Use `dm.parallelized()` for batch operations. Processing thousands of molecules sequentially is slow. Datamol's parallel wrapper distributes the work across worker processes for you; the speedup depends on core count and on how expensive each per-molecule call is.
- Choose fingerprints appropriate to your task. Morgan (ECFP) fingerprints are a common default for similarity search and ML models. MACCS keys are better suited to substructure-based filtering. Topological torsion fingerprints encode linear paths of four bonded atoms rather than circular neighborhoods.
- Filter invalid SMILES early. `dm.to_mol()` returns `None` for SMILES it cannot parse, so check its result before anything downstream uses it. Drop `None` results immediately rather than propagating invalid molecules through your pipeline.
- Store molecules as canonical SMILES. After standardization, convert back to SMILES with `dm.to_smiles()` for storage. Canonical SMILES are compact, human-readable, and can be re-parsed consistently.
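The "filter invalid SMILES early" practice can be sketched as a single partitioning pass. The helper name `split_valid` and the stand-in parser `toy_parse` are hypothetical and exist only for illustration; the parser is passed in as an argument so that, with Datamol installed, `dm.to_mol` could be supplied instead.

```python
from typing import Any, Callable, Iterable, Optional


def split_valid(
    smiles_list: Iterable[str],
    parse: Callable[[str], Optional[Any]],
) -> tuple[list[tuple[str, Any]], list[str]]:
    """Return (parsed, rejected); parsed pairs each SMILES with its molecule."""
    parsed: list[tuple[str, Any]] = []
    rejected: list[str] = []
    for smi in smiles_list:
        try:
            mol = parse(smi)
        except Exception:
            # Treat a parser exception the same as a None result.
            mol = None
        if mol is None:
            rejected.append(smi)
        else:
            parsed.append((smi, mol))
    return parsed, rejected


# Stand-in parser for demonstration only (not real SMILES validation).
def toy_parse(smi: str) -> Optional[str]:
    return smi.upper() if smi and "?" not in smi else None


ok, bad = split_valid(["CCO", "C?C", "", "c1ccccc1"], toy_parse)
print(len(ok), "parsed; rejected:", bad)
```

Keeping the rejected strings, rather than silently dropping them, makes it possible to report how much of a dataset was lost at the parsing stage.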
Common Issues
`dm.to_mol()` returns `None` for valid SMILES. Some SMILES conventions differ between providers. Try `dm.to_mol(smiles, sanitize=False)` followed by `dm.standardize_mol()` to handle edge cases. If parsing still fails, the SMILES may contain non-standard notation.
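A strict-then-lenient parsing strategy like the one described here can be written as an ordered fallback chain that also records which parser succeeded. The function `parse_with_fallback` and the toy parsers are hypothetical; with Datamol installed, the list could plausibly be `[("strict", dm.to_mol), ("lenient", lambda s: dm.to_mol(s, sanitize=False))]`.

```python
from typing import Any, Callable, Optional, Sequence

Parser = Callable[[str], Optional[Any]]


def parse_with_fallback(
    smiles: str,
    parsers: Sequence[tuple[str, Parser]],
) -> tuple[Optional[Any], Optional[str]]:
    """Try each (label, parser) in order; return the first non-None result and its label."""
    for label, parser in parsers:
        mol = parser(smiles)
        if mol is not None:
            return mol, label
    return None, None


# Toy parsers for demonstration only; they do not validate chemistry.
def strict(s: str) -> Optional[str]:
    return s if s.isalpha() else None


def lenient(s: str) -> Optional[str]:
    return s.strip() or None


parsers = [("strict", strict), ("lenient", lenient)]
for smi in ["CCO", "C C", "   "]:
    print(repr(smi), "->", parse_with_fallback(smi, parsers))
```

Recording the label makes it easy to flag molecules that only parsed leniently, so they can be reviewed before they enter a training set.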
Fingerprint similarity seems wrong. Verify that both molecules were standardized and that you used the same fingerprint parameters (type, radius, n_bits) for both. Fingerprints built with different radii or bit lengths set different bits for the same substructure, so similarity scores between them are not meaningful.
Parallel processing crashes with large datasets. Avoid holding every molecule object in memory at once. Process in chunks of 10,000 molecules and write results incrementally. Set n_jobs carefully: too many workers competing for memory can cause out-of-memory (OOM) failures.
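The chunk-and-write-incrementally advice can be sketched with the standard library alone. The helpers `chunked` and `process_in_chunks` are hypothetical names, and `compute` stands in for whatever per-molecule work you run; with Datamol, the per-chunk step is where a `dm.parallelized(...)` call would go.

```python
import csv
import io
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, TextIO


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(
    smiles: Iterable[str],
    compute: Callable[[str], Any],
    out: TextIO,
    chunk_size: int = 10_000,
) -> int:
    """Write one CSV row per input as each chunk finishes; return rows written."""
    writer = csv.writer(out)
    writer.writerow(["smiles", "value"])
    written = 0
    for chunk in chunked(smiles, chunk_size):
        # Only one chunk of results is held in memory at a time.
        results = [compute(smi) for smi in chunk]
        for smi, value in zip(chunk, results):
            writer.writerow([smi, value])
            written += 1
    return written


buffer = io.StringIO()  # stands in for an open output file
n = process_in_chunks(["CCO", "CCN", "CCC"], len, buffer, chunk_size=2)
print(n, "rows written")
```

Because rows are flushed per chunk, a crash partway through loses at most one chunk of work instead of the whole run.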