RDKit Elite

Perform cheminformatics analysis using RDKit, a widely used open-source toolkit for molecular informatics. This skill covers molecule handling, fingerprint generation, substructure search, descriptor calculation, chemical reactions, and building molecular analysis pipelines.

When to Use This Skill

Choose RDKit Elite when you need to:

Parse, validate, and manipulate chemical structures (SMILES, SDF, MOL)
Calculate molecular descriptors and fingerprints for ML models
Perform substructure and similarity searching across compound libraries
Enumerate chemical reactions and perform R-group decomposition

Consider alternatives when:

You need a unified molecular featurization API (use Molfeat)
You need drug-likeness filtering with pre-built rules (use medchem)
You need 3D molecular docking (use AutoDock Vina)

Quick Start


pip install rdkit


from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# Parse and analyze a molecule
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
print(f"Formula: {rdMolDescriptors.CalcMolFormula(mol)}")
print(f"MW: {Descriptors.MolWt(mol):.2f}")
print(f"LogP: {Descriptors.MolLogP(mol):.2f}")
print(f"HBD: {Descriptors.NumHDonors(mol)}")
print(f"HBA: {Descriptors.NumHAcceptors(mol)}")
print(f"TPSA: {Descriptors.TPSA(mol):.2f}")
print(f"Rotatable bonds: {Descriptors.NumRotatableBonds(mol)}")

Core Concepts

Molecule Operations

Operation	Function	Description
Parse SMILES	`Chem.MolFromSmiles(smi)`	Create mol from SMILES
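Parse MOL file	`Chem.MolFromMolFile(path)`	Read a single molecule from a MOL file
Parse SDF file	`Chem.SDMolSupplier(path)`	Iterate over the molecules in an SDF file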
Canonical SMILES	`Chem.MolToSmiles(mol)`	Canonical string representation
Add Hs	`Chem.AddHs(mol)`	Add explicit hydrogens
3D coordinates	`AllChem.EmbedMolecule(mol)`	Generate a 3D conformer (add hydrogens with `Chem.AddHs` first)
Sanitize	`Chem.SanitizeMol(mol)`	Validate and clean structure
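
The operations in the table can be chained into a small round trip. The following is a minimal sketch rather than a prescribed workflow; the ethanol input and the fixed `randomSeed` are illustrative choices, not part of the original text.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Parse a SMILES string; MolFromSmiles returns None if parsing fails.
mol = Chem.MolFromSmiles("OCC")  # ethanol, written in a non-canonical atom order
if mol is None:
    raise ValueError("Could not parse SMILES")

# Canonical SMILES: one string per molecule, independent of the input atom order.
canonical = Chem.MolToSmiles(mol)

# Hydrogens are implicit after parsing; add them before generating 3D coordinates.
mol_h = Chem.AddHs(mol)

# EmbedMolecule returns the new conformer ID, or -1 if embedding fails.
conf_id = AllChem.EmbedMolecule(mol_h, randomSeed=42)

print(canonical, mol.GetNumAtoms(), mol_h.GetNumAtoms(), conf_id)
```

`Chem.AddHs` returns a new molecule instead of modifying its argument, so the heavy-atom molecule stays available for descriptor or fingerprint work.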

Similarity Search and Fingerprints


from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity_search(query_smi, database_smiles, top_n=10):
    """Find most similar molecules using Morgan fingerprints."""
    query_mol = Chem.MolFromSmiles(query_smi)
    if query_mol is None:
        raise ValueError(f"Invalid query SMILES: {query_smi!r}")
    # Legacy fingerprint call; recent RDKit releases emit a deprecation warning for it
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        query_mol, radius=2, nBits=2048
    )

    results = []
    for smi in database_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(
            mol, radius=2, nBits=2048
        )
        similarity = DataStructs.TanimotoSimilarity(query_fp, fp)
        results.append((smi, similarity))

    results.sort(key=lambda x: -x[1])
    return results[:top_n]

# Search
query = "c1ccc(NC(=O)c2ccccc2)cc1"  # Benzanilide
database = [
    "c1ccc(NC(=O)c2ccc(Cl)cc2)cc1",
    "CC(=O)Oc1ccccc1C(=O)O",
    "c1ccc(NC(=O)c2ccc(F)cc2)cc1",
    "c1ccccc1",
    "c1ccc(NC(=O)c2ccncc2)cc1"
]

results = similarity_search(query, database)
for smi, sim in results:
    print(f"  {sim:.3f} | {smi}")
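
For larger libraries, the same search can be written with RDKit's fingerprint generator API and a bulk similarity call. This is a sketch under the assumption that your RDKit version ships `rdFingerprintGenerator`; the function name `bulk_similarity` is hypothetical and not part of the original skill.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# One generator object is reused for every molecule (radius 2, 2048 bits).
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def bulk_similarity(query_smi, database_smiles):
    """Return (smiles, Tanimoto) pairs sorted by decreasing similarity."""
    query_mol = Chem.MolFromSmiles(query_smi)
    if query_mol is None:
        raise ValueError(f"Could not parse query SMILES: {query_smi!r}")
    query_fp = morgan_gen.GetFingerprint(query_mol)

    valid_smiles, fps = [], []
    for smi in database_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable entries
        valid_smiles.append(smi)
        fps.append(morgan_gen.GetFingerprint(mol))

    # One call computes the similarity of the query against every fingerprint.
    sims = DataStructs.BulkTanimotoSimilarity(query_fp, fps)
    return sorted(zip(valid_smiles, sims), key=lambda pair: -pair[1])

hits = bulk_similarity(
    "c1ccc(NC(=O)c2ccccc2)cc1",
    ["c1ccccc1", "c1ccc(NC(=O)c2ccccc2)cc1"],
)
```

Computing the database fingerprints once and reusing them across queries avoids repeating the most expensive step.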

Substructure Analysis


from rdkit import Chem
from rdkit.Chem import rdFMCS

def find_common_substructure(smiles_list):
    """Find maximum common substructure across molecules."""
    mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]
    mols = [m for m in mols if m is not None]

    mcs = rdFMCS.FindMCS(
        mols,
        threshold=0.8,
        ringMatchesRingOnly=True,
        timeout=60
    )

    print(f"MCS SMARTS: {mcs.smartsString}")
    print(f"MCS atoms: {mcs.numAtoms}")
    print(f"MCS bonds: {mcs.numBonds}")

    # Check which molecules contain the MCS
    pattern = Chem.MolFromSmarts(mcs.smartsString)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol and mol.HasSubstructMatch(pattern):
            print(f"  ✓ {smi}")

    return mcs

smiles = [
    "c1ccc(NC(=O)c2ccccc2)cc1",
    "c1ccc(NC(=O)c2ccc(Cl)cc2)cc1",
    "c1ccc(NC(=O)c2ccc(F)cc2)cc1"
]
find_common_substructure(smiles)
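
When the pattern is known in advance, a direct SMARTS match is simpler than an MCS search. The sketch below assumes an amide query chosen purely for illustration; `match_substructure` is a hypothetical helper name.

```python
from rdkit import Chem

# Illustrative query: an amide N-C(=O) linkage written as SMARTS.
amide = Chem.MolFromSmarts("[NX3][CX3](=[OX1])")

def match_substructure(smiles_list, pattern):
    """Map each parseable SMILES to the atom-index tuples matching the pattern."""
    matches = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable entries
        matches[smi] = mol.GetSubstructMatches(pattern)
    return matches

amide_hits = match_substructure(
    ["c1ccc(NC(=O)c2ccccc2)cc1", "c1ccccc1"], amide
)
for smi, hit_atoms in amide_hits.items():
    print(f"{len(hit_atoms)} match(es) | {smi}")
```

`GetSubstructMatches` returns the matching atom indices, which can be passed on to highlighting or R-group analysis; `HasSubstructMatch` is enough when only a yes/no answer is needed.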

Configuration

Parameter	Description	Default
`sanitize`	Validate molecular structure on creation	`True`
`radius`	Morgan fingerprint radius	`2`
`nBits`	Fingerprint bit vector length	`2048`
`useChirality`	Include stereochemistry in fingerprints	`False`
`force_field`	3D optimization force field	`"MMFF94"`
`max_attempts`	3D embedding retries	`10`
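
To illustrate how the fingerprint parameters above change the output, here is a small sketch using the same legacy Morgan call as the similarity example. The alanine enantiomers and the helper name `morgan_fp` are assumptions chosen for illustration.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two enantiomers of alanine: identical connectivity, opposite stereocenter.
l_ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
d_ala = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")

def morgan_fp(mol, radius=2, nBits=2048, useChirality=False):
    """Thin wrapper exposing the parameters listed in the table."""
    return AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=nBits, useChirality=useChirality
    )

# Without chirality the two enantiomers are indistinguishable.
sim_default = DataStructs.TanimotoSimilarity(morgan_fp(l_ala), morgan_fp(d_ala))

# With chirality enabled, the stereocenter environment is encoded differently.
sim_chiral = DataStructs.TanimotoSimilarity(
    morgan_fp(l_ala, useChirality=True), morgan_fp(d_ala, useChirality=True)
)

# nBits sets the length of the returned bit vector.
short_fp = morgan_fp(l_ala, nBits=1024)
print(sim_default, sim_chiral, short_fp.GetNumBits())
```

Leaving `useChirality` off is a reasonable default for most similarity work; turn it on when stereoisomers must be told apart.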

Best Practices

Always check for None after parsing: Chem.MolFromSmiles() returns None for invalid SMILES instead of raising an exception. Test if mol is None before using the result, and log the failed input for investigation rather than skipping it silently.
Use canonical SMILES for deduplication: different SMILES strings can represent the same molecule. Convert every SMILES to canonical form with Chem.MolToSmiles(mol) before comparing or storing. Canonical forms are toolkit-specific, so generate all of them with RDKit.
Choose the right fingerprint for your task: Morgan (ECFP-like) fingerprints encode circular atom environments and work well for activity prediction. MACCS keys encode a fixed set of predefined substructure patterns, which suits coarse substructure-based similarity. Topological fingerprints encode path-based features. Benchmark several types on your specific task.
Sanitize molecules explicitly in custom workflows: after modifying a molecule programmatically (adding atoms, changing bonds), call Chem.SanitizeMol(mol) to validate the structure. Unsanitized molecules can carry invalid valences that produce incorrect descriptors.
Use Chem.MolFromSmiles(smi, sanitize=False) for debugging: when a SMILES string fails to parse, disable sanitization to see whether the problem is in the SMILES syntax or in chemical validation. If the molecule parses, call Chem.SanitizeMol(mol) separately to get a specific error message.
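
The parsing, deduplication, and debugging practices above can be sketched together as follows. The helper names `load_unique_molecules` and `diagnose`, the logger name, and the sample inputs are hypothetical; adapt them to your own pipeline.

```python
import logging
from rdkit import Chem

logger = logging.getLogger("rdkit_pipeline")  # hypothetical logger name

def load_unique_molecules(smiles_list):
    """Parse SMILES, record failures, and deduplicate by canonical SMILES."""
    unique, failed = {}, []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            failed.append(smi)
            logger.warning("Could not parse SMILES: %s", smi)
            continue
        # setdefault keeps the first molecule seen for each canonical form.
        unique.setdefault(Chem.MolToSmiles(mol), mol)
    return unique, failed

def diagnose(smi):
    """Report whether a failure comes from SMILES syntax or from sanitization."""
    mol = Chem.MolFromSmiles(smi, sanitize=False)
    if mol is None:
        return "syntax error"
    try:
        Chem.SanitizeMol(mol)
    except Exception as exc:  # sanitization problems surface as exceptions
        return f"sanitization error: {exc}"
    return "ok"

inputs = ["OCC", "CCO", "c1ccccc1", "C(C)(C)(C)(C)(C)C", "C1CC"]
unique, failed = load_unique_molecules(inputs)
for smi in failed:
    print(smi, "->", diagnose(smi))
```

Here the six-bonded carbon parses syntactically but fails valence checks, while the unclosed ring fails at the syntax stage, so the two failures get different diagnoses.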

Common Issues

Morgan fingerprints give different results than ECFP: RDKit's Morgan fingerprints use a different hashing algorithm than Pipeline Pilot's ECFP, so bit positions differ even with the same radius (Morgan radius 2 corresponds to ECFP4, whose name counts the diameter). The fingerprints are functionally equivalent for similarity ranking but not bit-for-bit identical, so never compare fingerprints generated by different tools.

3D conformer generation fails for some molecules: AllChem.EmbedMolecule() returns -1 when it can't find a valid 3D conformation. Add explicit hydrogens with Chem.AddHs(mol) before embedding, and try AllChem.EmbedMolecule(mol, maxAttempts=100, useRandomCoords=True) for difficult cases. Once embedding succeeds, some molecules (strained rings, unusual valences) may still need force field relaxation with AllChem.MMFFOptimizeMolecule(mol).
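
A possible fallback strategy for embedding, sketched under the assumption that a single conformer is enough. The helper name `embed_3d` and the fixed seed are illustrative choices.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def embed_3d(smiles, seed=42):
    """Return a molecule with one 3D conformer, or None if embedding fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.AddHs(mol)  # explicit hydrogens give more realistic geometries

    conf_id = AllChem.EmbedMolecule(mol, randomSeed=seed)
    if conf_id == -1:
        # Fallback for difficult cases: more attempts, random starting coordinates.
        conf_id = AllChem.EmbedMolecule(
            mol, maxAttempts=100, useRandomCoords=True, randomSeed=seed
        )
    if conf_id == -1:
        return None

    # Relax the embedded geometry only if MMFF parameters exist for every atom.
    if AllChem.MMFFHasAllMoleculeParams(mol):
        AllChem.MMFFOptimizeMolecule(mol)
    return mol

aspirin_3d = embed_3d("CC(=O)Oc1ccccc1C(=O)O")
```

Returning None on failure keeps the behaviour consistent with RDKit's own parsers, so callers can reuse the same None check.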

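Descriptor calculation returns NaN or infinity: most common descriptors are defined for any sanitized molecule (TPSA, for example, is simply 0 for a molecule without polar atoms), but a few can produce non-finite values, for example the partial-charge descriptors when Gasteiger charges cannot be assigned to an atom. Check for such values with np.isnan() or np.isfinite() and handle them before feeding descriptors into ML models. Use Descriptors.MolWt(mol) as a sanity check: it should be a finite, positive number for any valid molecule.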
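
One way to screen descriptor values before modelling is sketched below. The helper name `finite_descriptors` and the choice of descriptors are assumptions made for illustration; it uses `math.isfinite`, which covers both NaN and infinity.

```python
import math
from rdkit import Chem
from rdkit.Chem import Descriptors

DEFAULT_NAMES = ("MolWt", "MolLogP", "TPSA", "MaxPartialCharge")

def finite_descriptors(mol, names=DEFAULT_NAMES):
    """Compute the named descriptors and separate finite from non-finite values."""
    good, bad = {}, []
    for name in names:
        value = getattr(Descriptors, name)(mol)  # look up the function by name
        if math.isfinite(value):
            good[name] = value
        else:
            bad.append(name)  # NaN or +/- infinity
    return good, bad

good, bad = finite_descriptors(Chem.MolFromSmiles("c1ccccc1"))
print(good, bad)
```

Recording which descriptors were rejected, rather than dropping them silently, makes it easier to decide later whether to impute values or remove the column.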
