Datamol Kit

A scientific computing skill for cheminformatics using Datamol, a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular manipulation, property calculation, fingerprinting, and chemical data processing in drug discovery workflows.

When to Use This Skill

Choose Datamol Kit when:

Processing and standardizing chemical structures from SMILES
Calculating molecular descriptors and drug-likeness properties
Computing molecular fingerprints for similarity search
Preparing chemical datasets for machine learning models

Consider alternatives when:

You need full RDKit functionality directly (use RDKit)
You need deep learning on molecules (use DeepChem)
You need molecular docking (use DiffDock or AutoDock)
You need reaction prediction (use RXNMapper or specialized tools)

Quick Start


claude "Process a set of SMILES and calculate drug-likeness properties"


import datamol as dm

# Parse and standardize a molecule
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
mol = dm.to_mol(smiles)
standardized = dm.standardize_mol(mol)
canonical = dm.to_smiles(standardized)
print(f"Canonical SMILES: {canonical}")

# Calculate properties on the standardized molecule
props = dm.descriptors.compute_many_descriptors(standardized)
print(f"MW: {props['mw']:.1f}")
print(f"LogP: {props['clogp']:.2f}")
# Donor/acceptor counts are stored under Lipinski-specific keys
print(f"HBD: {props['n_lipinski_hbd']}")
print(f"HBA: {props['n_lipinski_hba']}")
print(f"TPSA: {props['tpsa']:.1f}")
print(f"Rotatable bonds: {props['n_rotatable_bonds']}")

# Lipinski Rule of 5 check
lipinski = (
    props["mw"] <= 500 and
    props["clogp"] <= 5 and
    props["n_lipinski_hbd"] <= 5 and
    props["n_lipinski_hba"] <= 10
)
print(f"Lipinski compliant: {lipinski}")
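Lipinski's rule is usually applied with a tolerance of one violation rather than as a strict all-or-nothing test. The sketch below counts violations from a descriptor dictionary. It assumes the key names `mw`, `clogp`, `n_lipinski_hbd`, and `n_lipinski_hba`; the helper name `lipinski_violations` and the hard-coded aspirin values are illustrative and not part of Datamol.

```python
# Thresholds for the Rule of 5: (descriptor key, upper limit).
# The key names are an assumption about the descriptor dictionary.
RULE_OF_FIVE = [
    ("mw", 500),
    ("clogp", 5),
    ("n_lipinski_hbd", 5),
    ("n_lipinski_hba", 10),
]


def lipinski_violations(props):
    """Return the descriptor keys whose values exceed their limit."""
    return [key for key, limit in RULE_OF_FIVE if props[key] > limit]


# Approximate values for aspirin, hard-coded for illustration only.
aspirin = {"mw": 180.16, "clogp": 1.31, "n_lipinski_hbd": 1, "n_lipinski_hba": 4}

violations = lipinski_violations(aspirin)
print(f"Violations: {violations}")
print(f"Passes (at most one violation): {len(violations) <= 1}")
```

Returning the list of failing keys, rather than a single boolean, makes it easier to see why a compound was rejected when filtering a whole dataset.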

Core Concepts

Datamol Core Functions

Function	Purpose	Example
`dm.to_mol()`	Parse SMILES to molecule	`dm.to_mol("CCO")`
`dm.to_smiles()`	Convert molecule to SMILES	`dm.to_smiles(mol)`
`dm.standardize_mol()`	Standardize representation (normalize, reionize)	`dm.standardize_mol(mol)`
`dm.to_inchi()`	Convert to InChI identifier (structure ID)	`dm.to_inchi(mol)`
`dm.to_selfies()`	Convert to SELFIES notation (ML-friendly encoding)	`dm.to_selfies(mol)`
`dm.from_smarts()`	Parse SMARTS pattern for substructure queries	`dm.from_smarts("[OX2H]")`

Molecular Fingerprints


import datamol as dm
from rdkit import DataStructs

mols = [dm.to_mol(s) for s in [
    "CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    "CC(=O)NC1=CC=C(O)C=C1",   # Acetaminophen
    "CC12CCC3C(CCC4=CC(=O)CCC34C)C1CCC2O"  # Testosterone
]]

# Morgan (ECFP) fingerprints; Datamol's name for this type is "ecfp"
# (defaults: radius 2, 2048 bits). as_array=False keeps the RDKit bit
# vector that DataStructs expects instead of a NumPy array.
fps = [dm.to_fp(mol, fp_type="ecfp", as_array=False) for mol in mols]

# Tanimoto similarity
sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
print(f"Aspirin vs Acetaminophen: {sim:.3f}")

sim2 = DataStructs.TanimotoSimilarity(fps[0], fps[2])
print(f"Aspirin vs Testosterone: {sim2:.3f}")
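To make the similarity score less of a black box, here is a minimal sketch of what a Tanimoto coefficient computes: the number of bits set in both fingerprints divided by the number set in either. The function name `tanimoto` and the toy bit sets are illustrative; this is not how RDKit implements it internally.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto (Jaccard) similarity between two collections of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Toy fingerprints given as the indices of their set bits.
fp_x = {1, 5, 9, 42}
fp_y = {5, 9, 42, 77, 100}

# 3 shared bits out of 6 distinct bits overall.
print(f"{tanimoto(fp_x, fp_y):.3f}")  # prints 0.500
```

For an RDKit bit vector, `list(fp.GetOnBits())` yields the on-bit indices that this function expects.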

Batch Processing


import datamol as dm
import pandas as pd

# Process SMILES dataset
df = pd.read_csv("compounds.csv")

# Parallelized processing
df["mol"] = dm.parallelized(dm.to_mol, df["smiles"].tolist(), n_jobs=-1)
# Reset the index so rows line up with desc_df in the concat below
df = df.dropna(subset=["mol"]).reset_index(drop=True)

# Standardize
df["mol"] = dm.parallelized(dm.standardize_mol, df["mol"].tolist())
df["canonical_smiles"] = dm.parallelized(dm.to_smiles, df["mol"].tolist())

# Calculate descriptors in batch
descriptors = dm.parallelized(
    dm.descriptors.compute_many_descriptors,
    df["mol"].tolist(),
    n_jobs=-1
)
desc_df = pd.DataFrame(descriptors)
df = pd.concat([df, desc_df], axis=1)

Configuration

Parameter	Description	Default
`fp_type`	Fingerprint type passed to `dm.to_fp()` (`ecfp` for Morgan, `maccs`, `topological`)	`ecfp`
`n_bits`	Fingerprint bit length	`2048`
`radius`	Morgan fingerprint radius	`2`
`n_jobs`	Parallel workers for batch ops	`-1` (all cores)
`standardize`	Auto-standardize on parse	`true`

Best Practices

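Always standardize molecules before comparison. Different SMILES can represent the same molecule. Run dm.standardize_mol() before computing fingerprints or descriptors; with its default options it normalizes functional groups and reionizes the molecule, and passing uncharge=True also neutralizes charges. Salt stripping and tautomer canonicalization are separate steps, so add them explicitly if your data needs them.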
Use dm.parallelized() for batch operations. Processing thousands of molecules sequentially is slow. Datamol's parallel wrapper distributes the work across worker processes for you; the speedup you get depends on how expensive each call is relative to the cost of passing molecules between processes.
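Choose fingerprints appropriate to your task. Morgan (ECFP) fingerprints are a common default for similarity search and ML models. MACCS keys encode a fixed set of predefined substructure keys, which makes them easy to interpret and useful for coarse substructure-based filtering. Topological torsion fingerprints encode paths of four consecutively bonded atoms, so they emphasize linear connectivity rather than circular atom environments.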
Filter invalid SMILES early. Use dm.to_mol() with error handling to catch unparseable SMILES before they cause downstream failures. Drop None results immediately rather than propagating invalid molecules through your pipeline.
Store molecules as canonical SMILES. After standardization, convert back to SMILES with dm.to_smiles() for storage. Canonical SMILES are compact, human-readable, and can be re-parsed consistently.
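The "filter invalid SMILES early" practice can be sketched as a small helper that parses each string once and separates successes from failures. The parser is passed in as an argument; with Datamol you would pass `dm.to_mol`, which returns None when parsing fails. The names `split_valid` and `toy_parse` are hypothetical, and `toy_parse` is only a stand-in so the example runs without a chemistry toolkit.

```python
def split_valid(smiles_list, parse):
    """Parse each SMILES once and return (valid_pairs, rejected_strings)."""
    valid, rejected = [], []
    for smi in smiles_list:
        try:
            mol = parse(smi)
        except Exception:
            # Treat a parser exception the same as a None result.
            mol = None
        if mol is None:
            rejected.append(smi)
        else:
            valid.append((smi, mol))
    return valid, rejected


def toy_parse(smi):
    """Stand-in parser for this demo: rejects empty or unbalanced strings."""
    if smi and smi.count("(") == smi.count(")"):
        return smi
    return None


valid, rejected = split_valid(["CCO", "CC(=O", ""], toy_parse)
print(f"kept {len(valid)}, rejected {rejected}")
```

Keeping the rejected strings, instead of silently dropping them, gives you a record to inspect when a supplier's SMILES conventions turn out to be the problem.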

Common Issues

dm.to_mol() returns None for valid SMILES. Some SMILES conventions differ between providers. Try dm.to_mol(smiles, sanitize=False), then repair the result with dm.fix_mol() and dm.sanitize_mol() before calling dm.standardize_mol(). If it still fails, the SMILES may contain non-standard notation.

Fingerprint similarity seems wrong. Verify both molecules were standardized and that you're using the same fingerprint parameters (type, radius, n_bits) for both. Different radii capture different structural features and produce incomparable fingerprints.

Parallel processing crashes with large datasets. Avoid holding every molecule object in memory at once. Process in chunks of 10,000 molecules and write results incrementally, keeping only the current chunk's molecules alive. Set n_jobs carefully: each worker holds its own copy of the data it processes, so too many workers competing for memory can cause out-of-memory (OOM) failures.
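The chunked approach can be sketched as follows, assuming results are written as CSV rows. The names `chunked`, `process_in_chunks`, and `process_chunk` are hypothetical; the demo uses an in-memory buffer and a stand-in processing function so that it runs without Datamol.

```python
import csv
import io
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(smiles_iter, process_chunk, out_file, chunk_size=10_000):
    """Process one chunk at a time and write its rows before loading the next."""
    writer = csv.writer(out_file)
    n_rows = 0
    for chunk in chunked(smiles_iter, chunk_size):
        # Only this chunk's results are held in memory at any moment.
        for row in process_chunk(chunk):
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Demo: the lambda stands in for real per-chunk work.
buffer = io.StringIO()
n = process_in_chunks(
    ["CCO", "CCN", "CCC"],
    lambda chunk: [(smi, len(smi)) for smi in chunk],
    buffer,
    chunk_size=2,
)
print(f"wrote {n} rows")
```

In a real pipeline, `process_chunk` could call dm.parallelized() on the chunk and return one row of canonical SMILES and descriptors per molecule, with `out_file` opened via `open(path, "w", newline="")`.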
