Zinc Database Smart

Production-ready skill for accessing purchasable compounds in the ZINC database. Includes structured workflows, validation checks, and reusable patterns for scientific work.

Skill | Cliptics | scientific | v1.0.0 | MIT

Search and retrieve chemical compounds from the ZINC database, a freely accessible repository of 230M+ commercially available molecules for virtual screening and drug discovery. This skill covers ZINC API queries, SMILES-based search, substructure and similarity filtering, 3D conformer retrieval, and integration with docking workflows.

When to Use This Skill

Choose Zinc Database Smart when you need to:

  • Search for purchasable compounds by structure, similarity, or molecular properties
  • Download 3D-ready conformers for molecular docking campaigns
  • Filter compounds by drug-likeness (Lipinski rules), reactivity, or availability
  • Build compound libraries for high-throughput virtual screening

Consider alternatives when:

  • You need bioactivity data for known compounds (use PubChem or ChEMBL)
  • You need protein-ligand interaction prediction (use docking software directly)
  • You need compound synthesis routes (use retrosynthesis tools)

Quick Start

    pip install requests rdkit

    import requests
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    ZINC_API = "https://zinc15.docking.org"


    def search_zinc_by_smiles(smiles, similarity=0.7, max_results=20):
        """Search ZINC for compounds similar to a query molecule."""
        # Endpoint path and parameter names are kept as given in this skill;
        # check them against the current ZINC documentation before relying on them.
        params = {
            "smiles": smiles,
            "similarity": similarity,
            "count": max_results,
            "output_format": "json",
        }
        response = requests.get(
            f"{ZINC_API}/substances/search/", params=params, timeout=30
        )
        response.raise_for_status()
        return response.json()


    def get_compound(zinc_id):
        """Retrieve detailed compound information by ZINC ID."""
        response = requests.get(f"{ZINC_API}/substances/{zinc_id}.json", timeout=30)
        response.raise_for_status()
        return response.json()


    # Search for compounds similar to aspirin
    aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
    print("Searching for compounds similar to aspirin...")
    # results = search_zinc_by_smiles(aspirin_smiles, similarity=0.6)


    # Drug-likeness filter (Lipinski's Rule of Five)
    def check_lipinski(smiles):
        """Compute Rule-of-Five descriptors; return None if the SMILES is invalid."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        props = {
            "MW": Descriptors.MolWt(mol),
            "LogP": Descriptors.MolLogP(mol),
            "HBD": Descriptors.NumHDonors(mol),
            "HBA": Descriptors.NumHAcceptors(mol),
        }
        # Reuse the values computed above instead of recalculating each descriptor
        props["passes"] = (
            props["MW"] <= 500
            and props["LogP"] <= 5
            and props["HBD"] <= 5
            and props["HBA"] <= 10
        )
        return props


    props = check_lipinski(aspirin_smiles)
    print(f"Aspirin: MW={props['MW']:.1f}, LogP={props['LogP']:.2f}, "
          f"HBD={props['HBD']}, HBA={props['HBA']}, Lipinski={props['passes']}")

Core Concepts

ZINC Subsets and Tranches

| Subset | Description | Size |
|---|---|---|
| ZINC-All | Complete database | ~230M compounds |
| ZINC-In-Stock | Immediately purchasable | ~10M |
| ZINC-Drug-Like | Lipinski-compliant | ~20M |
| ZINC-Lead-Like | 250 < MW < 350, LogP < 3.5 | ~6M |
| ZINC-Fragment-Like | MW < 250, LogP < 2.5 | ~1M |
| ZINC-Goldilocks | 200 < MW < 500, -2 < LogP < 5 | ~15M |
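The property windows in the table can be read as simple predicates on molecular weight and LogP. The sketch below is illustrative only: it encodes the cutoffs exactly as they are listed in this table (they are not pulled from ZINC itself), the name `matching_subsets` is hypothetical, and the drug-like subset is left out because Lipinski compliance also needs donor and acceptor counts.

```python
# Property windows copied from the table above; treat them as this
# document's working definitions, not as official ZINC cutoffs.
SUBSET_RULES = {
    "lead-like": lambda mw, logp: 250 < mw < 350 and logp < 3.5,
    "fragment-like": lambda mw, logp: mw < 250 and logp < 2.5,
    "goldilocks": lambda mw, logp: 200 < mw < 500 and -2 < logp < 5,
}


def matching_subsets(mw, logp):
    """Return the names of the property-defined subsets a compound falls into."""
    return [name for name, rule in SUBSET_RULES.items() if rule(mw, logp)]


print(matching_subsets(180.2, 1.3))  # roughly aspirin-sized
print(matching_subsets(300.0, 2.0))
```

A compound can match more than one window, which is why the function returns a list rather than a single label.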

Virtual Screening Library Builder

    from rdkit import Chem
    from rdkit.Chem import DataStructs, Descriptors, rdFingerprintGenerator

    # Morgan fingerprint generator: radius 2, 2048 bits (ECFP4-like)
    MORGAN_GEN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


    class ScreeningLibrary:
        """Build and filter compound libraries for virtual screening."""

        def __init__(self):
            self.compounds = []

        def add_smiles(self, smiles_list, ids=None):
            """Parse SMILES, skip invalid entries, and cache basic descriptors."""
            for i, smi in enumerate(smiles_list):
                mol = Chem.MolFromSmiles(smi)
                if mol is None:
                    continue
                self.compounds.append({
                    'id': ids[i] if ids else f'CPD-{i}',
                    'smiles': Chem.MolToSmiles(mol),  # canonical SMILES
                    'mol': mol,
                    'mw': Descriptors.MolWt(mol),
                    'logp': Descriptors.MolLogP(mol),
                    'hbd': Descriptors.NumHDonors(mol),
                    'hba': Descriptors.NumHAcceptors(mol),
                    'tpsa': Descriptors.TPSA(mol),
                    'rotatable_bonds': Descriptors.NumRotatableBonds(mol),
                })

        def fingerprints(self):
            """Morgan bit-vector fingerprints for every stored compound."""
            return [MORGAN_GEN.GetFingerprint(c['mol']) for c in self.compounds]

        def filter_druglike(self, rule="lipinski"):
            """Filter by drug-likeness rules ("lipinski" or "veber")."""
            if rule == "lipinski":
                return [c for c in self.compounds if (
                    c['mw'] <= 500 and c['logp'] <= 5
                    and c['hbd'] <= 5 and c['hba'] <= 10
                )]
            if rule == "veber":
                return [c for c in self.compounds if (
                    c['tpsa'] <= 140 and c['rotatable_bonds'] <= 10
                )]
            # Previously an unknown rule silently returned None
            raise ValueError(f"Unknown rule: {rule!r}")

        def deduplicate(self, threshold=0.95):
            """Remove near-duplicate compounds by Tanimoto similarity."""
            fps = self.fingerprints()
            keep = [True] * len(fps)
            for i in range(len(fps)):
                if not keep[i]:
                    continue
                for j in range(i + 1, len(fps)):
                    if keep[j] and DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold:
                        keep[j] = False
            return [c for c, k in zip(self.compounds, keep) if k]

        def diversity_select(self, n_select, fps=None):
            """Select a diverse subset using the MaxMin algorithm."""
            if fps is None:
                fps = self.fingerprints()
            if not fps or n_select < 1:
                return []  # nothing to select from (avoids indexing an empty list)
            selected = [0]
            remaining = set(range(1, len(fps)))
            while len(selected) < n_select and remaining:
                # Pick the candidate whose nearest selected neighbour is farthest away
                best_idx = max(remaining, key=lambda idx: min(
                    1 - DataStructs.TanimotoSimilarity(fps[idx], fps[s])
                    for s in selected
                ))
                selected.append(best_idx)
                remaining.discard(best_idx)
            return [self.compounds[i] for i in selected]


    # Usage
    lib = ScreeningLibrary()
    lib.add_smiles([
        "CC(=O)OC1=CC=CC=C1C(=O)O",   # Aspirin
        "CC1=CC=C(C=C1)C(C)C(=O)O",   # Ibuprofen
        "OC(=O)C1=CC=CC=C1O",         # Salicylic acid
        "CC(=O)NC1=CC=C(C=C1)O",      # Acetaminophen
    ])
    druglike = lib.filter_druglike("lipinski")
    print(f"Drug-like compounds: {len(druglike)}/{len(lib.compounds)}")

Configuration

| Parameter | Description | Default |
|---|---|---|
| api_url | ZINC API base URL | "https://zinc15.docking.org" |
| similarity_threshold | Minimum Tanimoto similarity for search | 0.7 |
| output_format | Response format (json, smi, sdf, mol2) | "json" |
| subset | ZINC subset to search | "drug-like" |
| max_results | Maximum results per query | 100 |
| purchasability | Filter by availability (in-stock, make-on-demand) | "all" |
| mw_range | Molecular weight filter | [150, 500] |
| logp_range | LogP filter | [-2, 5] |
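One way to carry these settings through a workflow is a small validated container. The sketch below is an assumption about how the parameters might be held in code: the class name `ZincConfig` and its checks are hypothetical, and only the field names and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass
class ZincConfig:
    """Hypothetical container for the parameters listed in the table above."""
    api_url: str = "https://zinc15.docking.org"
    similarity_threshold: float = 0.7
    output_format: str = "json"
    subset: str = "drug-like"
    max_results: int = 100
    purchasability: str = "all"
    mw_range: tuple = (150, 500)
    logp_range: tuple = (-2, 5)

    def __post_init__(self):
        # Reject values that cannot be meaningful before any query is built.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0 and 1")
        if self.output_format not in {"json", "smi", "sdf", "mol2"}:
            raise ValueError(f"unsupported output_format: {self.output_format!r}")
        if self.max_results < 1:
            raise ValueError("max_results must be at least 1")
        for name in ("mw_range", "logp_range"):
            low, high = getattr(self, name)
            if low > high:
                raise ValueError(f"{name} lower bound exceeds upper bound")


config = ZincConfig(similarity_threshold=0.6, subset="lead-like")
print(config)
```

Validating at construction time means a typo such as an unsupported output format fails immediately rather than after a slow network round trip.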

Best Practices

  1. Start with lead-like or fragment-like subsets for early discovery. The full 230M-compound database is too large to screen exhaustively. For focused campaigns, use ZINC-Lead-Like (~6M, suited to hit-to-lead work) or ZINC-Fragment-Like (~1M, for fragment-based drug design).

  2. Deduplicate before virtual screening. ZINC contains many near-identical compounds (stereoisomers, salt forms). Remove compounds whose Tanimoto similarity to an already-kept compound is 0.95 or higher, so that docking time is not spent on essentially the same molecule several times.

  3. Use Morgan fingerprints (radius=2) for similarity searching. Morgan fingerprints with radius 2 (the ECFP4 equivalent) capture local chemical environments and are a standard choice for similarity-based virtual screening. Radius 2 covers substructures up to four bonds in diameter, balancing specificity and generalization.

  4. Verify purchasability before ordering hits. Computational hits are worthless if the compounds can't be obtained. Filter by "in-stock" status for immediate availability or "make-on-demand" for 4-8 week lead times. Check vendor catalogs directly, as ZINC availability data may be outdated.

  5. Apply diversity selection for screening libraries. Rather than screening many similar compounds, use MaxMin or another diversity selection algorithm to pick a maximally diverse subset. A diverse library of 10K compounds can outperform a redundant library of 100K in hit discovery.
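The Tanimoto threshold used for deduplication is easier to reason about with a toy case. The following pure-Python sketch treats a fingerprint as the set of its "on" bit positions; the bit sets are made up for illustration, the function names are hypothetical, and in practice the fingerprints would come from RDKit Morgan fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def drop_near_duplicates(fingerprints, threshold=0.95):
    """Greedily keep each fingerprint unless it is too similar to one already kept."""
    kept = []
    for idx, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Made-up bit sets: the second is identical to the first, the third overlaps partly.
fps = [set(range(40)), set(range(40)), set(range(20, 60))]
print(tanimoto(fps[0], fps[2]))    # 20 shared bits out of 60 in the union
print(drop_near_duplicates(fps))   # indices of the fingerprints that survive
```

The same greedy pass is order-dependent: the first member of each near-duplicate cluster is the one that is kept, so sorting by availability or price beforehand controls which representative survives.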

Common Issues

ZINC API returns HTTP 503 during peak hours. ZINC is an academic resource with limited server capacity. Implement retry logic with exponential backoff (for example, waiting 1 s, 2 s, then 4 s between attempts) and avoid bulk downloads during US business hours. For large-scale downloads, fetch the ZINC tranche files directly instead of querying the API.
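A minimal sketch of retry with exponential backoff, assuming the request is wrapped in a zero-argument callable. The helper name `retry_with_backoff` and its parameters are hypothetical, and the flaky function below only simulates a 503 so the example runs without network access.

```python
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky request: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated HTTP 503")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

With the requests library, the callable could fetch the URL and call `raise_for_status()`, with `retry_on=(requests.RequestException,)` so that only network and HTTP errors trigger a retry.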

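SMILES strings don't match between ZINC and other databases. Different databases write the same molecule with different atom orderings and aromaticity conventions. Convert both sides to RDKit canonical SMILES before comparing: Chem.MolToSmiles(Chem.MolFromSmiles(smiles)), checking first that MolFromSmiles did not return None. This normalizes atom ordering and aromatic versus Kekulé notation. It keeps stereochemistry as written and does not unify tautomers, charge states, or salt forms; those need a separate standardization step, such as RDKit's rdMolStandardize module.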
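As a small illustration of the canonicalization step, the sketch below compares two spellings of salicylic acid. The helper name `canonical_smiles` is hypothetical; the RDKit calls are the ones named above.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Two spellings of salicylic acid: Kekulé form and aromatic form.
a = canonical_smiles("OC(=O)C1=CC=CC=C1O")
b = canonical_smiles("O=C(O)c1ccccc1O")
print(a == b)

# Unbalanced parenthesis: RDKit logs a parse error and returns None.
print(canonical_smiles("C(C"))
```

Comparing canonical strings is only reliable when both sides were produced by the same toolkit, since canonical forms differ between toolkits and can differ between versions.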

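3D conformers have unrealistic geometry. ZINC provides pre-generated 3D conformers that may not be the lowest-energy conformation. To rebuild a geometry with RDKit before docking, add explicit hydrogens with Chem.AddHs(mol), generate a fresh conformer with AllChem.EmbedMolecule(mol), and relax it with AllChem.MMFFOptimizeMolecule(mol). EmbedMolecule discards any existing conformer by default, so skip the embedding step if the goal is only to relax the downloaded geometry.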
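The embedding and optimization steps can be sketched as follows. The helper name `build_3d` and the fixed random seed are assumptions made for reproducibility; the RDKit functions are the ones mentioned above.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def build_3d(smiles, seed=42):
    """Embed one 3D conformer with explicit hydrogens and relax it with MMFF."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # force fields need explicit hydrogens
    if AllChem.EmbedMolecule(mol, randomSeed=seed) == -1:
        raise RuntimeError("3D embedding failed")
    # Returns 0 when converged, 1 when more iterations are needed,
    # and -1 when MMFF parameters are unavailable for the molecule.
    status = AllChem.MMFFOptimizeMolecule(mol, maxIters=500)
    return mol, status


mol3d, status = build_3d("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
print(mol3d.GetNumConformers(), mol3d.GetNumAtoms(), status)
```

Checking the optimizer's return value matters in batch runs: a status of 1 means the geometry may still be strained, and -1 means the molecule was not optimized at all.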
