PDB Database Complete

A skill for accessing protein and nucleic acid structures in the RCSB PDB. Includes structured workflows, validation checks, and reusable patterns for scientific work.

Skill · Cliptics · scientific · v1.0.0 · MIT
Query and analyze 3D structural data from the RCSB Protein Data Bank (PDB), the worldwide repository of macromolecular structures. This skill covers structure search, coordinate retrieval, sequence-structure analysis, visualization preparation, and structural bioinformatics workflows.

When to Use This Skill

Choose PDB Database Complete when you need to:

  • Search for protein or nucleic acid structures by sequence, name, or function
  • Retrieve atomic coordinates and structural metadata for analysis
  • Perform structure-based drug design or binding site analysis
  • Compare structures through alignment and RMSD calculations

Consider alternatives when:

  • You need to predict structures from sequence (use AlphaFold or ESMFold)
  • You need molecular dynamics simulations (use GROMACS or OpenMM)
  • You need cryo-EM density map analysis (use EMDB directly)

Quick Start

pip install biotite requests pandas
import biotite.database.rcsb as rcsb
import biotite.structure.io.pdb as pdb
import biotite.structure as struc

# Search for structures
query = rcsb.BasicQuery("insulin")
pdb_ids = rcsb.search(query)
print(f"Found {len(pdb_ids)} insulin structures")

# Download and parse a structure
file = rcsb.fetch("4HHB", "pdb")  # Hemoglobin
structure = pdb.PDBFile.read(file).get_structure(model=1)
print(f"Atoms: {structure.array_length()}")
print(f"Chains: {struc.get_chains(structure)}")
print(f"Residues: {struc.get_residue_count(structure)}")

Core Concepts

Search Query Types

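| Query Type | Description | Example |
| --- | --- | --- |
| BasicQuery | Text search across all fields | "kinase inhibitor" |
| SequenceQuery | BLAST-like sequence search | Protein/DNA sequence |
| StructureQuery | 3D structure similarity | PDB ID reference |
| AttributeQuery (FieldQuery in biotite) | Filter by metadata fields | Resolution, method, organism |
| ChemicalQuery | Search by ligand or compound | SMILES, InChI |

BasicQuery, SequenceQuery, and StructureQuery are classes in biotite.database.rcsb, where attribute filters are written as FieldQuery. Chemical search is a service of the RCSB Search API and may not be available as a class under this name in your installed biotite version.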
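The query types in the table correspond to services of the RCSB Search API. As a rough sketch of what a combined text-plus-attribute search looks like at the request level, the snippet below builds the JSON payload with the standard library only and does not send it. The helper name `build_search_payload` is hypothetical, and the payload layout and attribute names (`rcsb_entry_info.resolution_combined`, `exptl.method`) are assumptions based on the public Search API schema; check them against the current RCSB documentation before relying on them.

```python
import json


def build_search_payload(text, max_resolution=None, method=None, rows=25):
    """Build a Search API style request body (hypothetical helper; nothing is sent)."""
    # Full-text search, comparable to a basic text query
    nodes = [{
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": text},
    }]
    # Optional attribute filters (attribute names are assumptions, see lead-in)
    if max_resolution is not None:
        nodes.append({
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "rcsb_entry_info.resolution_combined",
                "operator": "less_or_equal",
                "value": max_resolution,
            },
        })
    if method is not None:
        nodes.append({
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "exptl.method",
                "operator": "exact_match",
                "value": method,
            },
        })
    return {
        # All conditions must hold, so the nodes are joined with a logical AND
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": rows}},
    }


payload = build_search_payload(
    "kinase inhibitor", max_resolution=3.0, method="X-RAY DIFFRACTION"
)
print(json.dumps(payload, indent=2))
```

In day-to-day use the biotite query classes build and send an equivalent request for you; the sketch is only meant to show how text and attribute conditions combine.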

Structure Analysis

import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx
import biotite.structure as struc
import numpy as np


def analyze_structure(pdb_id):
    """Comprehensive structural analysis of a PDB entry."""
    # Download mmCIF format (more complete than PDB)
    file = rcsb.fetch(pdb_id, "cif")
    # CIFFile is available in recent biotite releases; older ones use PDBxFile
    cif_file = pdbx.CIFFile.read(file)
    structure = pdbx.get_structure(cif_file, model=1)

    # Basic statistics
    info = {
        "pdb_id": pdb_id,
        "total_atoms": structure.array_length(),
        "chains": list(struc.get_chains(structure)),
        "residue_count": struc.get_residue_count(structure),
    }

    # Separate protein from ligands and water
    is_protein = struc.filter_amino_acids(structure)
    protein = structure[is_protein]
    water = structure[structure.res_name == "HOH"]
    ligands = structure[
        ~is_protein & (structure.res_name != "HOH") & structure.hetero
    ]

    info["protein_residues"] = struc.get_residue_count(protein)
    info["water_molecules"] = len(set(zip(water.chain_id, water.res_id)))
    info["ligand_names"] = sorted(set(ligands.res_name))

    # Secondary structure composition ('a' = helix, 'b' = sheet, 'c' = coil)
    sse = struc.annotate_sse(protein)
    helix_frac = np.sum(sse == "a") / len(sse)
    sheet_frac = np.sum(sse == "b") / len(sse)
    info["helix_fraction"] = f"{helix_frac:.1%}"
    info["sheet_fraction"] = f"{sheet_frac:.1%}"

    return info


info = analyze_structure("4HHB")
for key, val in info.items():
    print(f"{key}: {val}")

Binding Site Analysis

import biotite.structure as struc
import numpy as np


def find_binding_site(structure, ligand_name, distance_cutoff=5.0):
    """Identify residues within distance of a ligand."""
    protein = structure[struc.filter_amino_acids(structure)]
    ligand = structure[structure.res_name == ligand_name]

    if len(ligand) == 0:
        raise ValueError(f"Ligand {ligand_name} not found")

    # Find protein atoms within cutoff of any ligand atom
    binding_site_mask = np.zeros(len(protein), dtype=bool)
    for lig_atom in range(len(ligand)):
        distances = struc.distance(protein, ligand[lig_atom])
        binding_site_mask |= (distances <= distance_cutoff)

    binding_residues = protein[binding_site_mask]

    # Get unique residues
    residue_ids = set(zip(
        binding_residues.chain_id,
        binding_residues.res_id,
        binding_residues.res_name,
    ))

    print(f"Binding site for {ligand_name}:")
    print(f"  Contact residues: {len(residue_ids)}")
    for chain, resid, resname in sorted(residue_ids):
        print(f"    {chain}:{resname}{resid}")

    return binding_residues, residue_ids


# Find binding site residues (structure is the 4HHB AtomArray from Quick Start)
binding, residues = find_binding_site(structure, "HEM")

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| format | Download format (pdb, cif, xml) | "cif" |
| model | Model number for NMR structures | 1 |
| distance_cutoff | Binding site distance threshold (Å) | 5.0 |
| sequence_identity | Redundancy filter threshold | 0.9 |
| resolution_max | Maximum resolution filter (Å) | 3.0 |
| experimental_method | Method filter (X-ray, NMR, cryo-EM) | All |
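One way to carry these parameters through a workflow is a small validated settings object. The sketch below is an illustration only: the class name `PDBSkillConfig` is hypothetical, the field names and defaults are copied from the table, and representing "All" as `None` is an assumption made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PDBSkillConfig:
    """Hypothetical container for the parameters listed in the table."""
    format: str = "cif"
    model: int = 1
    distance_cutoff: float = 5.0          # Å
    sequence_identity: float = 0.9        # fraction between 0 and 1
    resolution_max: float = 3.0           # Å
    experimental_method: Optional[str] = None  # None stands for "All"

    def __post_init__(self):
        # Reject values that cannot be meaningful before any download happens
        if self.format not in ("pdb", "cif", "xml"):
            raise ValueError(f"Unsupported format: {self.format!r}")
        if self.model < 1:
            raise ValueError("model numbers start at 1")
        if self.distance_cutoff <= 0 or self.resolution_max <= 0:
            raise ValueError("distance_cutoff and resolution_max must be positive")
        if not 0.0 < self.sequence_identity <= 1.0:
            raise ValueError("sequence_identity must be in (0, 1]")


default = PDBSkillConfig()
strict = PDBSkillConfig(resolution_max=2.5, experimental_method="X-ray")
print(default)
print(strict)
```

Making the object frozen keeps a run reproducible: a stricter variant is a new instance rather than a mutation of shared settings.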

Best Practices

  1. Use mmCIF format over legacy PDB format: The legacy PDB flat file has hard limits (99,999 atoms, single-character chain IDs). mmCIF has no atom limit, supports multi-character chain IDs, and carries richer metadata. PDBx/mmCIF is the archive's standard format, and entries too large for the legacy format are available only as mmCIF.

  2. Filter by resolution for structure quality: For structure-based drug design, use structures with resolution ≤2.5 Å. For general analysis, ≤3.0 Å is acceptable. For crystal structures, also check the R-free value as an independent quality indicator; values above 0.35 suggest potential issues.

  3. Handle multiple models in NMR structures: NMR entries usually contain an ensemble of models (typically around 20). Either analyze all models and report statistics, or use model 1, which by convention is usually the representative conformer. Don't average coordinates across NMR models, because the averaged coordinates are generally not a physically valid structure.

  4. Remove alternate conformations before analysis: Many crystal structures contain alternate side-chain conformations (altlocs A, B). Pick the highest-occupancy conformer, or conformer A by default. Keeping both creates duplicate atoms that distort distance calculations and structure comparisons.

  5. Use sequence-based search for finding homologs: Text search finds structures by annotation, which can miss unannotated or differently named entries. Use SequenceQuery with your protein sequence to find homologous entries regardless of how they were annotated in the database.
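Practices 2 and 4 can be sketched in plain Python, independent of any structure library. The function names and the dict-based atom records below are hypothetical; the thresholds are the ones quoted in the list, and the altloc rule (largest summed occupancy per residue, ties going to the alphabetically first label) is one reasonable reading of "highest-occupancy conformer", not the only one.

```python
from collections import defaultdict


def passes_quality(resolution, r_free=None, drug_design=False):
    """Apply the resolution and R-free rules of thumb from practice 2."""
    limit = 2.5 if drug_design else 3.0
    if resolution is None or resolution > limit:
        return False
    # R-free is optional here because not every entry reports one
    return r_free is None or r_free <= 0.35


def pick_highest_occupancy(atoms):
    """Keep one conformer per residue (practice 4).

    atoms: list of dicts with keys chain, res_id, name, altloc, occupancy.
    Atoms without an altloc label (empty string) are always kept.
    """
    # Sum occupancy per residue and altloc label
    totals = defaultdict(float)
    for atom in atoms:
        if atom["altloc"]:
            key = (atom["chain"], atom["res_id"], atom["altloc"])
            totals[key] += atom["occupancy"]

    # Choose the label with the largest total; sorted order makes ties go to "A"
    chosen = {}
    for (chain, res_id, label), total in sorted(totals.items()):
        residue = (chain, res_id)
        if residue not in chosen or total > chosen[residue][1]:
            chosen[residue] = (label, total)

    return [
        a for a in atoms
        if not a["altloc"] or chosen[(a["chain"], a["res_id"])][0] == a["altloc"]
    ]


atoms = [
    {"chain": "A", "res_id": 10, "name": "CA", "altloc": "", "occupancy": 1.0},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "A", "occupancy": 0.4},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "B", "occupancy": 0.6},
    {"chain": "A", "res_id": 11, "name": "OG", "altloc": "A", "occupancy": 0.5},
    {"chain": "A", "res_id": 11, "name": "OG", "altloc": "B", "occupancy": 0.5},
]
kept = pick_highest_occupancy(atoms)
print([(a["res_id"], a["name"], a["altloc"]) for a in kept])
print(passes_quality(2.0, r_free=0.22, drug_design=True))
```

Recent biotite readers also accept an `altloc` argument in `get_structure` (with values such as "first" and "occupancy"), so a manual filter like this may only be needed when working with raw atom records.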

Common Issues

Structure has missing residues (gaps in sequence): Many crystal structures have disordered loops that could not be resolved. Check for gaps in residue numbering and consult the _pdbx_unobs_or_zero_occ_residues category in the mmCIF file. If you need a complete model, fill in the missing regions from a predicted model such as one from AlphaFold.
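A numbering-gap check needs nothing more than the residue numbers of one chain. The helper below is a minimal sketch with a hypothetical name (`find_gaps`); it is a heuristic, since jumps in author numbering or insertion codes can also produce gaps that are not missing residues.

```python
def find_gaps(res_ids):
    """Return (last_observed, next_observed, n_missing) for each numbering gap.

    res_ids: residue numbers of a single chain, in any order.
    """
    ordered = sorted(set(res_ids))
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            gaps.append((prev, nxt, nxt - prev - 1))
    return gaps


observed = [1, 2, 3, 4, 9, 10, 11, 15]
for start, end, missing in find_gaps(observed):
    print(f"Gap between residues {start} and {end}: {missing} residue(s) missing")
```

With a biotite AtomArray, the input could be the `res_id` values of one chain; run the check per chain so that numbering restarts between chains are not reported as gaps.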

Ligand coordinates are in a separate entity: In mmCIF files, ligands may be in a different entity or assembly than the protein chains. Use the structure.hetero annotation combined with res_name filtering rather than chain-based selection to locate ligands.

RMSD calculation gives unexpectedly high values: RMSD is only meaningful after equivalent atoms have been matched and the structures superimposed. Use sequence alignment to establish residue correspondence, then extract backbone CA atoms from both structures. Computing RMSD on all atoms without alignment gives meaningless numbers.
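The two steps, pairing equivalent CA atoms and then superimposing before measuring, can be sketched with NumPy. This is an illustration under stated assumptions: the function names are hypothetical, residue keys are assumed to be already equivalent between the two structures (for example after mapping through a sequence alignment), and the superposition uses the standard Kabsch algorithm rather than any particular library routine.

```python
import numpy as np


def match_ca(ca_a, ca_b):
    """Pair CA coordinates by residue keys present in both structures.

    ca_a, ca_b: dicts mapping a residue key to an (x, y, z) tuple.
    """
    shared = sorted(set(ca_a) & set(ca_b))
    P = np.array([ca_a[k] for k in shared], dtype=float)
    Q = np.array([ca_b[k] for k in shared], dtype=float)
    return shared, P, Q


def superimposed_rmsd(P, Q):
    """RMSD after optimal rigid-body superposition of P onto Q (Kabsch)."""
    if P.shape != Q.shape:
        raise ValueError("P and Q must contain the same number of atoms")
    Pc = P - P.mean(axis=0)            # remove translation
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Guard against an improper rotation (reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


# Toy data: structure B is structure A rotated 90 degrees about z and shifted,
# with residue 1 missing and an extra residue 6
ca_a = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (5.0, 3.6, 0.0),
        4: (8.5, 4.0, 1.5), 5: (9.0, 7.5, 3.0)}
ca_b = {k: (-y + 10.0, x - 5.0, z + 3.0) for k, (x, y, z) in ca_a.items() if k != 1}
ca_b[6] = (0.0, 0.0, 0.0)

shared, P, Q = match_ca(ca_a, ca_b)
naive = float(np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1))))
print(f"Matched residues: {shared}")
print(f"RMSD without superposition: {naive:.2f}")
print(f"RMSD after superposition:   {superimposed_rmsd(P, Q):.2f}")
```

The two structures are identical up to a rigid motion, so the large value without superposition is purely an artifact of skipping that step. biotite.structure offers `superimpose` and `rmsd` for the same purpose on AtomArray objects.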
