PubChem Database Engine
Query and analyze chemical data from PubChem, the world's largest open chemical database with 110M+ compounds. This skill covers compound search, property retrieval, bioactivity data access, structure similarity search, and chemical informatics workflows.
When to Use This Skill
Choose PubChem Database Engine when you need to:
- Look up chemical properties, structures, and identifiers for compounds
- Search compounds by name, SMILES, InChI, or molecular formula
- Retrieve bioactivity assay data for drug discovery research
- Run structure similarity and substructure searches across the database
Consider alternatives when:
- You need drug-target binding data specifically (use ChEMBL)
- You need metabolomics pathway data (use HMDB or KEGG)
- You need protein structure data (use PDB)
Quick Start
```bash
pip install pubchempy requests pandas
```

```python
import pubchempy as pcp

# Search by name
compounds = pcp.get_compounds("aspirin", "name")
aspirin = compounds[0]
print(f"Name: {aspirin.iupac_name}")
print(f"CID: {aspirin.cid}")
print(f"Formula: {aspirin.molecular_formula}")
print(f"MW: {aspirin.molecular_weight}")
print(f"SMILES: {aspirin.canonical_smiles}")
print(f"XLogP: {aspirin.xlogp}")

# Search by SMILES
results = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")
print(f"Found {len(results)} matches")
```
Core Concepts
Search Types
| Method | Input | Description |
|---|---|---|
| `get_compounds(name, "name")` | Chemical name | Name-based lookup |
| `get_compounds(smiles, "smiles")` | SMILES string | Exact structure match |
| `get_compounds(inchi, "inchi")` | InChI string | Standard identifier lookup |
| `get_compounds(formula, "formula")` | Molecular formula | Formula-based search |
| `get_compounds(cid, "cid")` | PubChem CID | Direct ID lookup |
Available Properties
| Attribute | Type | Description |
|---|---|---|
| `molecular_weight` | Float | Molecular weight (g/mol) |
| `molecular_formula` | String | Chemical formula |
| `canonical_smiles` | String | Canonical SMILES |
| `isomeric_smiles` | String | SMILES with stereochemistry |
| `xlogp` | Float | Computed log P (octanol-water partition coefficient) |
| `tpsa` | Float | Topological polar surface area (Å²) |
| `h_bond_donor_count` | Int | Hydrogen bond donors |
| `h_bond_acceptor_count` | Int | Hydrogen bond acceptors |
| `rotatable_bond_count` | Int | Rotatable bonds |
| `exact_mass` | Float | Exact mass (Da) |
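
As one illustration of what these properties are typically used for, the sketch below counts Lipinski rule-of-five violations from a plain dictionary. The function name `lipinski_violations` and the short keys (`mw`, `xlogp`, `hbd`, `hba`) are assumptions chosen to match the dictionary keys used in the bulk retrieval example. The aspirin values are typed in by hand for illustration rather than fetched from PubChem.

```python
from typing import List, Mapping, Optional, Union

Number = Union[int, float, str]


def lipinski_violations(props: Mapping[str, Optional[Number]]) -> List[str]:
    """Return the rule-of-five criteria that a compound exceeds."""
    rules = [
        ("mw", 500, "molecular weight > 500"),
        ("xlogp", 5, "XLogP > 5"),
        ("hbd", 5, "H-bond donors > 5"),
        ("hba", 10, "H-bond acceptors > 10"),
    ]
    violations = []
    for key, limit, label in rules:
        value = props.get(key)
        # Some computed properties can be missing; skip them rather than guess.
        if value is not None and float(value) > limit:
            violations.append(label)
    return violations


# Illustrative, hand-entered values for aspirin (not retrieved at runtime).
aspirin = {"mw": 180.16, "xlogp": 1.2, "hbd": 1, "hba": 4}
print(lipinski_violations(aspirin))  # → []
```

Values pass through `float()` so that a molecular weight delivered as a string is still compared numerically.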
Bulk Property Retrieval
```python
import pubchempy as pcp
import pandas as pd


def batch_compound_properties(names):
    """Retrieve properties for a list of compound names."""
    results = []
    for name in names:
        try:
            compounds = pcp.get_compounds(name, "name")
            if compounds:
                c = compounds[0]
                results.append({
                    "name": name,
                    "cid": c.cid,
                    "formula": c.molecular_formula,
                    "mw": c.molecular_weight,
                    "smiles": c.canonical_smiles,
                    "xlogp": c.xlogp,
                    "tpsa": c.tpsa,
                    "hbd": c.h_bond_donor_count,
                    "hba": c.h_bond_acceptor_count,
                    "rotatable_bonds": c.rotatable_bond_count,
                })
        except Exception as e:
            results.append({"name": name, "error": str(e)})
    return pd.DataFrame(results)


# Get properties for common drugs
drugs = ["aspirin", "ibuprofen", "acetaminophen", "caffeine", "metformin"]
df = batch_compound_properties(drugs)
print(df[["name", "mw", "xlogp", "tpsa"]].to_string(index=False))
```
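
The function above sends one request per name. When CIDs are already known, PUG REST accepts a comma-separated CID list, so several compounds can share one request. The sketch below only builds the request URLs and sends nothing. The helper names `chunked` and `property_urls` and the default chunk size of 100 are assumptions, chosen to keep URLs short; they are not PubChem limits.

```python
from typing import Iterable, Iterator, List

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def property_urls(cids: Iterable[int], properties: Iterable[str],
                  chunk_size: int = 100) -> List[str]:
    """Build one PUG REST property URL per chunk of CIDs."""
    prop_list = ",".join(properties)
    return [
        f"{PUG_REST}/compound/cid/{','.join(map(str, chunk))}"
        f"/property/{prop_list}/JSON"
        for chunk in chunked(list(cids), chunk_size)
    ]


# CIDs for aspirin, ibuprofen, acetaminophen, caffeine, metformin
cids = [2244, 3672, 1983, 2519, 4091]
urls = property_urls(cids, ["MolecularWeight", "XLogP", "TPSA"], chunk_size=2)
for url in urls:
    print(url)
```

With `chunk_size=2` the five CIDs become three URLs. Each could be passed to `requests.get` and the `PropertyTable` in the response loaded into a DataFrame.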
Similarity Search
```python
from urllib.parse import quote

import pandas as pd
import requests

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def similarity_search(smiles, threshold=90, max_results=50):
    """Find structurally similar compounds in PubChem (2D Tanimoto)."""
    # fastsimilarity_2d answers synchronously with a CID list. The plain
    # "similarity" operation is asynchronous and returns a ListKey to poll,
    # so its response contains no IdentifierList.
    encoded = quote(smiles, safe="")  # SMILES may contain URL-unsafe characters
    response = requests.get(
        f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}/cids/JSON",
        params={"Threshold": threshold, "MaxRecords": max_results},
        timeout=30,
    )
    if response.status_code != 200:
        return pd.DataFrame()

    cids = response.json().get("IdentifierList", {}).get("CID", [])
    if not cids:
        return pd.DataFrame()

    # Get properties for the first 20 similar compounds in one request
    props = requests.get(
        f"{PUG_REST}/compound/cid/{','.join(map(str, cids[:20]))}/property/"
        "MolecularFormula,MolecularWeight,CanonicalSMILES,XLogP/JSON",
        timeout=30,
    )
    props.raise_for_status()
    return pd.DataFrame(props.json()["PropertyTable"]["Properties"])


# Find compounds similar to caffeine
similar = similarity_search("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", threshold=85)
print(f"Found {len(similar)} similar compounds")
print(similar.head(10))
```
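
The `threshold` argument is a Tanimoto score expressed as a percentage. To make that number concrete, the sketch below computes the Tanimoto coefficient for two toy fingerprints represented as sets of "on" bit positions. The bit positions are invented for illustration and are not real PubChem fingerprint bits, and the function name `tanimoto` is an assumption.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints: four bits in common, six bits set overall.
query = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 9, 15}
score = tanimoto(query, candidate)
print(f"{score:.2%}")  # → 66.67%
```

A pair like this one would be excluded at `threshold=85`, because 4 shared bits out of 6 gives about 67%.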
Configuration
| Parameter | Description | Default |
|---|---|---|
| `search_type` | Lookup method (name, smiles, cid, etc.) | `"name"` |
| `max_records` | Maximum results to return | 100 |
| `similarity_threshold` | Tanimoto similarity cutoff (%) | 90 |
| `properties` | Properties to retrieve | All available |
| `output_format` | Response format (JSON, CSV, SDF) | `"JSON"` |
| `timeout` | Request timeout (seconds) | 30 |
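
One way to carry these settings through a script is a small validated dataclass. This is a sketch only: the class name `PubChemConfig`, its validation rules, and the `similarity_params` helper are assumptions, not part of PubChemPy or of this skill. The field names and defaults are taken from the table above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class PubChemConfig:
    """Hypothetical settings holder mirroring the configuration table."""
    search_type: str = "name"
    max_records: int = 100
    similarity_threshold: int = 90                  # Tanimoto cutoff in percent
    properties: Optional[Tuple[str, ...]] = None    # None means all available
    output_format: str = "JSON"
    timeout: int = 30                               # seconds

    def __post_init__(self) -> None:
        if not 0 <= self.similarity_threshold <= 100:
            raise ValueError("similarity_threshold must be between 0 and 100")
        if self.max_records < 1:
            raise ValueError("max_records must be at least 1")
        if self.output_format.upper() not in {"JSON", "CSV", "SDF"}:
            raise ValueError("output_format must be JSON, CSV, or SDF")

    def similarity_params(self) -> Dict[str, int]:
        """Query parameters in the form used by the similarity search."""
        return {"Threshold": self.similarity_threshold,
                "MaxRecords": self.max_records}


config = PubChemConfig(similarity_threshold=85)
print(config.similarity_params())  # → {'Threshold': 85, 'MaxRecords': 100}
```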
Best Practices
- Search by SMILES for exact structure matching. Name-based searches can return multiple compounds due to synonyms and naming ambiguity. Use canonical SMILES for unambiguous structure lookups and verify that the returned CID matches your intended compound.
- Use the PUG REST API for bulk operations. PubChemPy is convenient for small queries, but the PUG REST API is more efficient for batch operations. It supports comma-separated CID lists and returns multiple properties in a single request.
- Cache compound data for repeated analyses. PubChem API calls are rate-limited to 5 requests per second. Cache CID-to-property mappings locally in a SQLite database or dictionary for datasets you query repeatedly.
- Handle isomers explicitly. Many compound names map to multiple stereoisomers. Use `isomeric_smiles` instead of `canonical_smiles` when stereochemistry matters for your application (e.g., drug activity prediction).
- Cross-reference with other databases using InChIKey. PubChem CIDs are PubChem-specific. Use the InChIKey (a fixed-length hash of the InChI string) to link compounds across PubChem, ChEMBL, DrugBank, and other chemical databases.
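
The caching advice in the list above can be sketched with the standard-library `sqlite3` module. The class name `PropertyCache` and the `fetch` callback are assumptions made for illustration. In real use the callback would wrap a PubChemPy or PUG REST call, while here a fake fetcher stands in so the example runs offline.

```python
import json
import sqlite3
from typing import Callable, Dict, List


class PropertyCache:
    """Minimal CID-to-properties cache backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS props "
            "(cid INTEGER PRIMARY KEY, data TEXT NOT NULL)"
        )

    def get(self, cid: int, fetch: Callable[[int], Dict]) -> Dict:
        """Return cached properties, calling `fetch` only on a cache miss."""
        row = self.conn.execute(
            "SELECT data FROM props WHERE cid = ?", (cid,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        data = fetch(cid)
        self.conn.execute(
            "INSERT INTO props (cid, data) VALUES (?, ?)",
            (cid, json.dumps(data)),
        )
        self.conn.commit()
        return data


calls: List[int] = []


def fake_fetch(cid: int) -> Dict:
    """Stand-in for a network lookup; records each call."""
    calls.append(cid)
    return {"cid": cid, "formula": "C9H8O4"}


cache = PropertyCache()          # pass a file path to persist between runs
cache.get(2244, fake_fetch)      # miss: fetcher is called
cache.get(2244, fake_fetch)      # hit: served from SQLite
print(len(calls))  # → 1
```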
Common Issues
Name search returns no results. PubChem matches names against its list of registered synonyms, so misspelled or uncommon names return nothing. Try alternative names or trade names, or search by molecular formula instead. Peptides and biologics may have no PubChem entry; these are better found in UniProt or DrugBank.
Rate limiting causes request failures. PubChem allows a maximum of 5 requests per second. Space requests out, for example with time.sleep(0.2) between calls. For large-scale data retrieval, use the PubChem FTP download files instead of the API.
SMILES parsing differences between tools. PubChem's canonical SMILES may differ from RDKit's canonicalization. When comparing structures, convert both to InChIKey, which does not depend on how the SMILES was canonicalized, or use PubChem's own standardization endpoint to normalize before comparison.
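
A fixed `time.sleep(0.2)` always pauses for the full interval, even when the previous request itself took longer than that. The sketch below sleeps only for the time still remaining. The class name `RateLimiter` and its injectable `clock` and `sleep` arguments are assumptions made so the logic can be exercised without real delays.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last call was released

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


limiter = RateLimiter(max_per_second=5)
for cid in (2244, 3672, 1983):
    limiter.wait()
    # a PubChem request for `cid` would be sent here
```

The first call returns immediately. Later calls pause for at most 0.2 s, which keeps the loop at or below 5 requests per second.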
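
Once both tools have produced an InChIKey, the comparison itself is plain string handling. The sketch below validates the 14-10-1 block layout and optionally compares only the first block, which ignores the stereochemistry and isotope information hashed into the second block. The helper names `is_inchikey` and `same_structure` are assumptions. The keys are typed in by hand as examples, and generating them (with RDKit or from PubChem) is outside this sketch.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(key: str) -> bool:
    """Check that a string has the InChIKey block layout."""
    return bool(INCHIKEY_RE.match(key))


def same_structure(key_a: str, key_b: str, ignore_stereo: bool = False) -> bool:
    """Compare two InChIKeys, optionally on the first block only."""
    if not (is_inchikey(key_a) and is_inchikey(key_b)):
        raise ValueError("both arguments must be valid InChIKeys")
    if ignore_stereo:
        return key_a[:14] == key_b[:14]
    return key_a == key_b


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
caffeine = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
l_alanine = "QNAYBMKLOCPYGJ-REOHGBOMSA-N"
alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(same_structure(aspirin, caffeine))                                   # → False
print(same_structure(l_alanine, alanine_no_stereo))                        # → False
print(same_structure(l_alanine, alanine_no_stereo, ignore_stereo=True))    # → True
```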