PubChem Database Engine

Query and analyze chemical data from PubChem, the world's largest open chemical database with 110M+ compounds. This skill covers compound search, property retrieval, bioactivity data access, structure similarity search, and chemical informatics workflows.

When to Use This Skill

Choose PubChem Database Engine when you need to:

Look up chemical properties, structures, and identifiers for compounds
Search compounds by name, SMILES, InChI, or molecular formula
Retrieve bioactivity assay data for drug discovery research
Perform structure and substructure similarity searches across the database

Consider alternatives when:

You need drug-target binding data specifically (use ChEMBL)
You need metabolomics pathway data (use HMDB or KEGG)
You need protein structure data (use PDB)

Quick Start


pip install pubchempy requests pandas


import pubchempy as pcp

# Search by name
compounds = pcp.get_compounds("aspirin", "name")
aspirin = compounds[0]

print(f"Name: {aspirin.iupac_name}")
print(f"CID: {aspirin.cid}")
print(f"Formula: {aspirin.molecular_formula}")
print(f"MW: {aspirin.molecular_weight}")
print(f"SMILES: {aspirin.canonical_smiles}")
print(f"XLogP: {aspirin.xlogp}")

# Search by SMILES
results = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")
print(f"Found {len(results)} matches")

Core Concepts

Search Types

Method	Input	Description
`get_compounds(name, "name")`	Chemical name	Name-based lookup
`get_compounds(smiles, "smiles")`	SMILES string	Exact structure match
`get_compounds(inchi, "inchi")`	InChI string	Standard identifier lookup
`get_compounds(formula, "formula")`	Molecular formula	Formula-based search
`get_compounds(cid, "cid")`	PubChem CID	Direct ID lookup

Available Properties

Attribute	Type	Description
`molecular_weight`	Float	Molecular weight (g/mol)
`molecular_formula`	String	Chemical formula
`canonical_smiles`	String	Canonical SMILES
`isomeric_smiles`	String	SMILES with stereochemistry
`xlogp`	Float	Predicted octanol-water partition coefficient (log P)
`tpsa`	Float	Topological polar surface area (Å²)
`h_bond_donor_count`	Int	Hydrogen bond donors
`h_bond_acceptor_count`	Int	Hydrogen bond acceptors
`rotatable_bond_count`	Int	Rotatable bonds
`exact_mass`	Float	Exact mass (Da)

Bulk Property Retrieval


import pubchempy as pcp
import pandas as pd

def batch_compound_properties(names):
    """Retrieve properties for a list of compound names."""
    results = []
    for name in names:
        try:
            compounds = pcp.get_compounds(name, "name")
            if compounds:
                c = compounds[0]
                results.append({
                    "name": name,
                    "cid": c.cid,
                    "formula": c.molecular_formula,
                    "mw": c.molecular_weight,
                    "smiles": c.canonical_smiles,
                    "xlogp": c.xlogp,
                    "tpsa": c.tpsa,
                    "hbd": c.h_bond_donor_count,
                    "hba": c.h_bond_acceptor_count,
                    "rotatable_bonds": c.rotatable_bond_count
                })
        except Exception as e:
            results.append({"name": name, "error": str(e)})

    return pd.DataFrame(results)

# Get properties for common drugs
drugs = ["aspirin", "ibuprofen", "acetaminophen", "caffeine", "metformin"]
df = batch_compound_properties(drugs)
print(df[["name", "mw", "xlogp", "tpsa"]].to_string(index=False))
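
The loop above issues one request per name. When the CIDs are already known, PUG REST accepts a comma-separated CID list, so many compounds can share one request. The sketch below only builds the URLs; `chunked`, `property_urls`, and the chunk size of 100 are illustrative assumptions, not part of PubChemPy or of this skill.

```python
from itertools import islice

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def property_urls(cids, properties, chunk_size=100):
    """Build one PUG REST property URL per chunk of CIDs (hypothetical helper)."""
    prop_list = ",".join(properties)
    return [
        f"{PUG_BASE}/compound/cid/{','.join(map(str, batch))}"
        f"/property/{prop_list}/JSON"
        for batch in chunked(cids, chunk_size)
    ]

# CIDs for aspirin, ibuprofen, and acetaminophen, split into chunks of two
urls = property_urls(
    [2244, 3672, 1983],
    ["MolecularFormula", "MolecularWeight"],
    chunk_size=2,
)
for url in urls:
    print(url)
```

Each URL can then be fetched with `requests.get` and the `PropertyTable` entries concatenated into a single DataFrame.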

Similarity Search


import requests
import pandas as pd
from urllib.parse import quote

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def similarity_search(smiles, threshold=90, max_results=50):
    """Find structurally similar compounds in PubChem (2D similarity)."""
    # fastsimilarity_2d answers synchronously with a CID list; the older
    # "similarity" endpoint is asynchronous and returns a ListKey to poll.
    response = requests.get(
        f"{PUG}/compound/fastsimilarity_2d/smiles/{quote(smiles)}/cids/JSON",
        params={"Threshold": threshold, "MaxRecords": max_results},
        timeout=30,
    )
    if response.status_code != 200:
        return pd.DataFrame()

    cids = response.json().get("IdentifierList", {}).get("CID", [])
    if not cids:
        return pd.DataFrame()

    # Get properties for the first 20 similar compounds in one request
    props = requests.get(
        f"{PUG}/compound/cid/{','.join(map(str, cids[:20]))}/property/"
        "MolecularFormula,MolecularWeight,CanonicalSMILES,XLogP/JSON",
        timeout=30,
    )
    if props.status_code != 200:
        return pd.DataFrame()
    return pd.DataFrame(props.json()["PropertyTable"]["Properties"])

# Find compounds similar to caffeine
similar = similarity_search("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", threshold=85)
print(f"Retrieved properties for {len(similar)} similar compounds")
print(similar.head(10))

Configuration

Parameter	Description	Default
`search_type`	Lookup method (name, smiles, cid, etc.)	`"name"`
`max_records`	Maximum results to return	`100`
`similarity_threshold`	Tanimoto similarity cutoff (%)	`90`
`properties`	Properties to retrieve	All available
`output_format`	Response format (JSON, CSV, SDF)	`"JSON"`
`timeout`	Request timeout (seconds)	`30`
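
The `similarity_threshold` value is a Tanimoto score expressed as a percentage. As a rough illustration of what the cutoff means, the sketch below computes the Tanimoto coefficient for two toy sets of "on" fingerprint bits. The bit sets are invented for the example and are not real PubChem fingerprints; `tanimoto` and `passes_threshold` are hypothetical helper names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def passes_threshold(bits_a, bits_b, threshold_percent=90):
    """True if the similarity, as a percentage, meets the cutoff."""
    return tanimoto(bits_a, bits_b) * 100 >= threshold_percent

# Toy fingerprints: 9 shared bits, 11 bits set in total
fp_query = {1, 4, 7, 9, 12, 15, 18, 21, 30, 42}
fp_hit = {1, 4, 7, 9, 12, 15, 18, 21, 30, 57}

score = tanimoto(fp_query, fp_hit)
print(f"Tanimoto: {score:.3f}")
print("Passes 90% cutoff:", passes_threshold(fp_query, fp_hit, 90))
print("Passes 80% cutoff:", passes_threshold(fp_query, fp_hit, 80))
```

Lowering the threshold widens the hit list quickly, so it is usually worth pairing a lower cutoff with a smaller `max_records`.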

Best Practices

Search by SMILES for exact structure matching: Name-based searches can return multiple compounds because of synonyms and naming ambiguity. Use canonical SMILES for unambiguous structure lookups, and verify that the returned CID matches your intended compound.
Use the PUG REST API for bulk operations: PubChemPy is convenient for small queries, but the PUG REST API is more efficient for batch operations. It accepts comma-separated CID lists and returns multiple properties in a single request.
Cache compound data for repeated analyses: PubChem API calls are rate-limited to 5 requests per second and 400 requests per minute. Cache CID-to-property mappings locally in a SQLite database or dictionary for datasets you query repeatedly.
Handle isomers explicitly: Many compound names map to multiple stereoisomers. Use isomeric_smiles instead of canonical_smiles when stereochemistry matters for your application (e.g., drug activity prediction).
Cross-reference with other databases using InChIKey: CIDs are specific to PubChem. Use the InChIKey (a fixed-length hash of the standard InChI) to link compounds across PubChem, ChEMBL, DrugBank, and other chemical databases.
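
The caching advice above could be sketched as a small SQLite-backed store keyed by CID. This is a minimal illustration, not part of PubChemPy: `PropertyCache`, `cached_properties`, and the table layout are assumptions, and `fake_fetch` stands in for whatever function actually calls PubChem.

```python
import json
import sqlite3

class PropertyCache:
    """Tiny CID -> property-record cache stored in SQLite (hypothetical)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS props "
            "(cid INTEGER PRIMARY KEY, data TEXT NOT NULL)"
        )

    def get(self, cid):
        row = self.conn.execute(
            "SELECT data FROM props WHERE cid = ?", (cid,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, cid, record):
        self.conn.execute(
            "INSERT OR REPLACE INTO props (cid, data) VALUES (?, ?)",
            (cid, json.dumps(record)),
        )
        self.conn.commit()

def cached_properties(cid, cache, fetch):
    """Return the cached record, calling `fetch(cid)` only on a miss."""
    record = cache.get(cid)
    if record is None:
        record = fetch(cid)
        cache.put(cid, record)
    return record

# Offline demo: fake_fetch is a stand-in for a real PubChem request
calls = []

def fake_fetch(cid):
    calls.append(cid)
    return {"cid": cid, "formula": "C9H8O4"}

cache = PropertyCache()
first = cached_properties(2244, cache, fake_fetch)
second = cached_properties(2244, cache, fake_fetch)
print(first == second, "network calls:", len(calls))
```

Passing a file path instead of `":memory:"` keeps the cache across runs; storing the record as JSON avoids committing to a fixed set of property columns.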

Common Issues

Name search returns no results: Name lookups depend on the names and synonyms PubChem has on record, so misspellings and uncommon names can come back empty. Try alternative names or trade names, or search by molecular formula instead. PubChem may have no entry for some peptides and biologics; these are better found in UniProt or DrugBank.
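
One way to apply this is to try a list of candidate names in order and stop at the first that returns hits. The sketch below assumes a hypothetical helper, `first_match`, that takes the lookup function as an argument; `pubchem_lookup` wraps the `pcp.get_compounds` call used earlier, and the offline demo substitutes a small dictionary for the real service.

```python
def first_match(candidates, lookup):
    """Return (name, results) for the first candidate that yields hits."""
    for name in candidates:
        results = lookup(name)
        if results:
            return name, results
    return None, []

def pubchem_lookup(name):
    """Name lookup through PubChemPy (imported lazily; needs network access)."""
    import pubchempy as pcp
    return pcp.get_compounds(name, "name")

# Offline demo: a stand-in index instead of a real PubChem query
fake_index = {"acetaminophen": [1983]}
name, hits = first_match(
    ["acetaminofen", "acetaminophen"],
    lambda n: fake_index.get(n, []),
)
print(name, hits)
```

With network access, `first_match(["Tylenol", "paracetamol"], pubchem_lookup)` would run the same fallback against PubChem.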

Rate limiting causes request failures: PubChem allows at most 5 requests per second and 400 requests per minute. Throttle your client, for example with time.sleep(0.2) between requests. For large-scale data retrieval, use the PubChem FTP download files instead of the API.
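
A fixed `time.sleep(0.2)` wastes time when the request itself already took longer than that. The sketch below waits only for the remainder of the interval. `RateLimiter` is a hypothetical class written for this example; the clock and sleep functions are injectable so the behavior can be checked without real delays.

```python
import time

class RateLimiter:
    """Enforce a minimum spacing between calls (hypothetical helper)."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now

# Demo: the second call is delayed by roughly 0.2 s
limiter = RateLimiter(max_per_second=5)
start = time.monotonic()
limiter.wait()
limiter.wait()
print(f"Elapsed: {time.monotonic() - start:.2f} s")
```

Calling `limiter.wait()` immediately before each `requests.get` or `pcp.get_compounds` call keeps a loop under the per-second limit. It does not track the per-minute limit, which long-running jobs would need to handle separately.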

SMILES parsing differences between tools: PubChem's canonical SMILES may differ from RDKit's canonicalization. When comparing structures, convert both to InChIKey, which does not depend on either tool's SMILES canonicalization, or use PubChem's own standardization endpoint to normalize before comparison.
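
Once both tools have produced an InChIKey (for example, PubChemPy's `inchikey` attribute and RDKit's `Chem.MolToInchiKey`), the comparison itself is plain string handling. The sketch below assumes standard 27-character InChIKeys; `is_inchikey`, `same_structure`, and `same_skeleton` are hypothetical helper names. The first 14-character block hashes the connectivity, so matching first blocks with different second blocks typically indicates stereoisomers or isotopic variants.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def is_inchikey(key):
    """True if `key` has the shape of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(key))

def same_structure(key_a, key_b):
    """Full-key match: same structure, including stereochemistry."""
    return is_inchikey(key_a) and is_inchikey(key_b) and key_a == key_b

def same_skeleton(key_a, key_b):
    """First-block match: same connectivity, stereo layers ignored."""
    return (
        is_inchikey(key_a)
        and is_inchikey(key_b)
        and key_a[:14] == key_b[:14]
    )

aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
l_alanine = "QNAYBMKLOCPYGJ-REOHSTBHSA-N"
alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(same_structure(l_alanine, alanine_no_stereo))  # full keys differ
print(same_skeleton(l_alanine, alanine_no_stereo))   # connectivity matches
print(same_skeleton(aspirin, l_alanine))             # different compounds
```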
