
PubChem Database Engine

Production-ready skill for querying PubChem through the PUG REST API and PubChemPy. Includes structured workflows, validation checks, and reusable patterns for scientific work.

Skill | Cliptics | scientific | v1.0.0 | MIT

Query and analyze chemical data from PubChem, the world's largest open chemical database with 110M+ compounds. This skill covers compound search, property retrieval, bioactivity data access, structure similarity search, and chemical informatics workflows.

When to Use This Skill

Choose PubChem Database Engine when you need to:

  • Look up chemical properties, structures, and identifiers for compounds
  • Search compounds by name, SMILES, InChI, or molecular formula
  • Retrieve bioactivity assay data for drug discovery research
  • Perform structure and substructure similarity searches across the database

Consider alternatives when:

  • You need drug-target binding data specifically (use ChEMBL)
  • You need metabolomics pathway data (use HMDB or KEGG)
  • You need protein structure data (use PDB)

Quick Start

Install the dependencies:

    pip install pubchempy requests pandas

Then look up a compound by name or by SMILES:

    import pubchempy as pcp

    # Search by name
    compounds = pcp.get_compounds("aspirin", "name")
    aspirin = compounds[0]
    print(f"Name: {aspirin.iupac_name}")
    print(f"CID: {aspirin.cid}")
    print(f"Formula: {aspirin.molecular_formula}")
    print(f"MW: {aspirin.molecular_weight}")
    print(f"SMILES: {aspirin.canonical_smiles}")
    print(f"XLogP: {aspirin.xlogp}")

    # Search by SMILES
    results = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")
    print(f"Found {len(results)} matches")

Core Concepts

Search Types

| Method | Input | Description |
| --- | --- | --- |
| get_compounds(name, "name") | Chemical name | Name-based lookup |
| get_compounds(smiles, "smiles") | SMILES string | Exact structure match |
| get_compounds(inchi, "inchi") | InChI string | Standard identifier lookup |
| get_compounds(formula, "formula") | Molecular formula | Formula-based search |
| get_compounds(cid, "cid") | PubChem CID | Direct ID lookup |

Available Properties

| Attribute | Type | Description |
| --- | --- | --- |
| molecular_weight | Float | Molecular weight (g/mol) |
| molecular_formula | String | Chemical formula |
| canonical_smiles | String | Canonical SMILES |
| isomeric_smiles | String | SMILES with stereochemistry |
| xlogp | Float | Predicted octanol-water partition coefficient (log P) |
| tpsa | Float | Topological polar surface area (Å²) |
| h_bond_donor_count | Int | Hydrogen bond donors |
| h_bond_acceptor_count | Int | Hydrogen bond acceptors |
| rotatable_bond_count | Int | Rotatable bonds |
| exact_mass | Float | Monoisotopic mass (Da) |
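These properties are often combined into quick screening checks. As an illustration only, the sketch below applies Lipinski's rule of five to a plain dictionary of values. The function name `lipinski_violations` and the dictionary keys are assumptions made for this example and are not part of PubChemPy. Missing values (PubChem has no XLogP for some compounds) are skipped rather than counted as violations.

```python
def lipinski_violations(props):
    """Return the Lipinski rule-of-five criteria that a compound violates.

    `props` is a plain dict with the hypothetical keys "mw", "xlogp",
    "hbd" and "hba". Values that are None or absent are skipped.
    """
    limits = {
        "mw": 500.0,   # molecular weight (g/mol)
        "xlogp": 5.0,  # predicted log P
        "hbd": 5,      # hydrogen bond donors
        "hba": 10,     # hydrogen bond acceptors
    }
    violations = []
    for key, limit in limits.items():
        value = props.get(key)
        if value is None:
            continue
        # float() also accepts numeric strings, in case a value arrives as text.
        if float(value) > limit:
            violations.append(key)
    return violations


# Values for aspirin as listed by PubChem (CID 2244).
aspirin = {"mw": 180.16, "xlogp": 1.2, "hbd": 1, "hba": 4}
print(lipinski_violations(aspirin))
```

An empty list means no criterion is exceeded. Whether such a filter is appropriate depends on the application.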

Bulk Property Retrieval

    import pubchempy as pcp
    import pandas as pd

    def batch_compound_properties(names):
        """Retrieve properties for a list of compound names."""
        results = []
        for name in names:
            try:
                compounds = pcp.get_compounds(name, "name")
                if compounds:
                    c = compounds[0]
                    results.append({
                        "name": name,
                        "cid": c.cid,
                        "formula": c.molecular_formula,
                        "mw": c.molecular_weight,
                        "smiles": c.canonical_smiles,
                        "xlogp": c.xlogp,
                        "tpsa": c.tpsa,
                        "hbd": c.h_bond_donor_count,
                        "hba": c.h_bond_acceptor_count,
                        "rotatable_bonds": c.rotatable_bond_count
                    })
            except Exception as e:
                results.append({"name": name, "error": str(e)})
        return pd.DataFrame(results)

    # Get properties for common drugs
    drugs = ["aspirin", "ibuprofen", "acetaminophen", "caffeine", "metformin"]
    df = batch_compound_properties(drugs)
    print(df[["name", "mw", "xlogp", "tpsa"]].to_string(index=False))
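The loop above issues one request per name. Once the CIDs are known, a single PUG REST property request can cover many compounds at once. The sketch below shows one way to do that using only the standard library. The helper names, the chunk size of 100, and the 30-second timeout are assumptions chosen for illustration, not documented limits.

```python
import json
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def property_url(cids, properties):
    """Build a PUG REST URL that requests `properties` for several CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_BASE}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def fetch_properties(cids, properties, chunk_size=100, timeout=30):
    """Fetch properties for all CIDs, one request per chunk (needs network)."""
    rows = []
    for chunk in chunked(list(cids), chunk_size):
        url = property_url(chunk, properties)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
        rows.extend(payload["PropertyTable"]["Properties"])
    return rows


# Building the URL does not touch the network.
print(property_url([2244, 3672], ["MolecularFormula", "MolecularWeight"]))
```

Calling `fetch_properties([2244, 3672], ["MolecularFormula", "MolecularWeight"])` would then return one dictionary per compound, which can be passed straight to `pd.DataFrame`.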

Similarity Search

    import requests
    import pandas as pd

    PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

    def similarity_search(smiles, threshold=90, max_results=50):
        """Find structurally similar compounds in PubChem."""
        # fastsimilarity_2d answers synchronously; the plain "similarity"
        # operation is asynchronous and returns a ListKey to poll instead.
        response = requests.post(
            f"{PUG}/compound/fastsimilarity_2d/smiles/cids/JSON",
            data={"smiles": smiles},  # SMILES in the body avoids URL-escaping issues
            params={"Threshold": threshold, "MaxRecords": max_results},
            timeout=30,
        )
        if response.status_code != 200:
            return pd.DataFrame()

        cids = response.json().get("IdentifierList", {}).get("CID", [])
        if not cids:
            return pd.DataFrame()

        # Get properties for the first 20 similar compounds in one request
        props = requests.get(
            f"{PUG}/compound/cid/{','.join(map(str, cids[:20]))}/property/"
            "MolecularFormula,MolecularWeight,CanonicalSMILES,XLogP/JSON",
            timeout=30,
        )
        props.raise_for_status()
        return pd.DataFrame(props.json()["PropertyTable"]["Properties"])

    # Find compounds similar to caffeine
    similar = similarity_search("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", threshold=85)
    print(f"Found {len(similar)} similar compounds")
    print(similar.head(10))

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| search_type | Lookup method (name, smiles, cid, etc.) | "name" |
| max_records | Maximum results to return | 100 |
| similarity_threshold | Tanimoto similarity cutoff (%) | 90 |
| properties | Properties to retrieve | All available |
| output_format | Response format (JSON, CSV, SDF) | "JSON" |
| timeout | Request timeout (seconds) | 30 |
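One way to carry these settings through a script is a small dataclass. The sketch below is hypothetical: the class name `PubChemConfig`, its validation rules, and the `query_params` helper are assumptions for illustration. Only the `Threshold` and `MaxRecords` query parameter names come from PUG REST.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PubChemConfig:
    """Hypothetical container for the parameters listed above."""
    search_type: str = "name"
    max_records: int = 100
    similarity_threshold: int = 90
    properties: Optional[Tuple[str, ...]] = None  # None means all available
    output_format: str = "JSON"
    timeout: int = 30

    def __post_init__(self):
        # Reject values that cannot be meaningful before any request is made.
        if not 0 <= self.similarity_threshold <= 100:
            raise ValueError("similarity_threshold must be between 0 and 100")
        if self.max_records < 1:
            raise ValueError("max_records must be at least 1")
        if self.output_format not in ("JSON", "CSV", "SDF"):
            raise ValueError("output_format must be JSON, CSV, or SDF")

    def query_params(self):
        """Query-string parameters for a PUG REST similarity request."""
        return {
            "Threshold": self.similarity_threshold,
            "MaxRecords": self.max_records,
        }


config = PubChemConfig(similarity_threshold=85, max_records=20)
print(config.query_params())
```

Freezing the dataclass keeps a configuration from changing halfway through a batch run, which makes cached results easier to attribute to the settings that produced them.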

Best Practices

  1. Search by SMILES for exact structure matching — Name-based searches can return multiple compounds due to synonyms and naming ambiguity. Use canonical SMILES for unambiguous structure lookups and verify the returned CID matches your intended compound.

  2. Use the PUG REST API for bulk operations — PubChemPy is convenient for small queries but the PUG REST API is more efficient for batch operations. It supports comma-separated CID lists and returns multiple properties in a single request.

  3. Cache compound data for repeated analyses — PubChem API calls are rate-limited to 5 requests per second. Cache CID-to-property mappings locally in a SQLite database or dictionary for datasets you query repeatedly.

  4. Handle isomers explicitly — Many compound names map to multiple stereoisomers. Use isomeric_smiles instead of canonical_smiles when stereochemistry matters for your application (e.g., drug activity prediction).

  5. Cross-reference with other databases using InChIKey — PubChem CIDs are PubChem-specific. Use the InChIKey (standard hash of the structure) to link compounds across PubChem, ChEMBL, DrugBank, and other chemical databases unambiguously.
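The caching advice in item 3 can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions: the table layout, the class name `PropertyCache`, and the `fetch` callback (any function that takes a CID and returns a JSON-serializable dict) are hypothetical and not part of PubChemPy.

```python
import json
import sqlite3


class PropertyCache:
    """Cache CID -> property dict in SQLite (hypothetical helper)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS props (cid INTEGER PRIMARY KEY, data TEXT)"
        )

    def get(self, cid, fetch):
        """Return cached properties, calling `fetch(cid)` only on a miss."""
        row = self.conn.execute(
            "SELECT data FROM props WHERE cid = ?", (cid,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        data = fetch(cid)
        self.conn.execute(
            "INSERT INTO props (cid, data) VALUES (?, ?)", (cid, json.dumps(data))
        )
        self.conn.commit()
        return data


# Stand-in for a real PubChem lookup so the example runs offline.
calls = []


def fake_fetch(cid):
    calls.append(cid)
    return {"cid": cid, "formula": "C9H8O4"}


cache = PropertyCache()
cache.get(2244, fake_fetch)
cache.get(2244, fake_fetch)  # served from SQLite; fake_fetch is not called again
print(len(calls))
```

Passing a file path instead of `":memory:"` keeps the cache between runs. In real use, `fetch` would wrap a PubChemPy or PUG REST call.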

Common Issues

Name search returns no results — PubChem is strict about chemical names. Try alternative names, trade names, or search by molecular formula instead. For peptides and biologics, PubChem may not have entries — these are better found in UniProt or DrugBank.

Rate limiting causes request failures — PubChem allows a maximum of 5 requests per second. Implement a rate limiter with time.sleep(0.2) between requests. For large-scale data retrieval, use the PubChem FTP download files instead of the API.
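A fixed `time.sleep(0.2)` works, but it also waits when the previous request was already slow. A small limiter that sleeps only for the remaining part of the interval is sketched below. The class name `RateLimiter` and the injectable `clock` and `sleep` arguments (which allow testing without real delays) are assumptions for illustration.

```python
import time


class RateLimiter:
    """Space calls at least `1 / max_per_second` seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


limiter = RateLimiter(max_per_second=5)
for cid in (2244, 3672, 2519):
    limiter.wait()  # call this before each PubChem request
    print("requesting CID", cid)
```

Using `time.monotonic` rather than `time.time` keeps the spacing correct even if the system clock is adjusted during a long run.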

SMILES parsing differences between tools — PubChem's canonical SMILES may differ from RDKit's canonicalization. When comparing structures, convert both to InChIKey which is canonicalization-independent, or use PubChem's own standardization endpoint to normalize before comparison.
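Once both tools have produced an InChIKey, the comparison itself is a string check. The sketch below validates the standard 27-character layout and compares either the full key or only the first 14-character block, which encodes connectivity without stereochemistry or isotope information. The function names are assumptions for illustration. Generating the keys (for example from PubChemPy's `inchikey` attribute or from RDKit) is outside this snippet.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(key):
    """Return True if `key` has the shape of a standard InChIKey."""
    return INCHIKEY_RE.fullmatch(key) is not None


def same_structure(key_a, key_b, ignore_stereo=False):
    """Compare two InChIKeys, optionally on the connectivity block only."""
    if not (is_inchikey(key_a) and is_inchikey(key_b)):
        raise ValueError("both arguments must be InChIKeys")
    if ignore_stereo:
        # The first 14 characters hash the connectivity layers only.
        return key_a[:14] == key_b[:14]
    return key_a == key_b


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
caffeine = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
print(same_structure(aspirin, aspirin))
print(same_structure(aspirin, caffeine))
```

The `ignore_stereo` option is useful when one source stores a specific stereoisomer and the other stores the flat structure.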
