PDB Database Complete
A skill for accessing protein and nucleic acid structures in the RCSB PDB. Includes structured workflows, validation checks, and reusable patterns for scientific work.
Query and analyze 3D structural data from the RCSB Protein Data Bank (PDB), the worldwide repository of macromolecular structures. This skill covers structure search, coordinate retrieval, sequence-structure analysis, visualization preparation, and structural bioinformatics workflows.
When to Use This Skill
Choose PDB Database Complete when you need to:
- Search for protein or nucleic acid structures by sequence, name, or function
- Retrieve atomic coordinates and structural metadata for analysis
- Perform structure-based drug design or binding site analysis
- Compare structures through alignment and RMSD calculations
Consider alternatives when:
- You need to predict structures from sequence (use AlphaFold or ESMFold)
- You need molecular dynamics simulations (use GROMACS or OpenMM)
- You need cryo-EM density map analysis (use EMDB directly)
Quick Start
```bash
pip install biotite requests pandas
```

```python
import biotite.database.rcsb as rcsb
import biotite.structure.io.pdb as pdb
import biotite.structure as struc

# Search for structures
query = rcsb.BasicQuery("insulin")
pdb_ids = rcsb.search(query)
print(f"Found {len(pdb_ids)} insulin structures")

# Download and parse a structure
file = rcsb.fetch("4HHB", "pdb")  # Hemoglobin
structure = pdb.PDBFile.read(file).get_structure(model=1)
print(f"Atoms: {structure.array_length()}")
print(f"Chains: {struc.get_chains(structure)}")
print(f"Residues: {struc.get_residue_count(structure)}")
```
Core Concepts
Search Query Types
| Query Type | Description | Example |
|---|---|---|
| BasicQuery | Text search across all fields | "kinase inhibitor" |
| SequenceQuery | BLAST-like sequence search | Protein/DNA sequence |
| StructureQuery | 3D structure similarity | PDB ID reference |
| AttributeQuery | Filter by metadata fields | Resolution, method, organism |
| ChemicalQuery | Search by ligand or compound | SMILES, InChI |
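The query types in the table correspond to services of the RCSB Search API, and they can be combined into one request. The sketch below builds such a request body with the standard library only and does not send it. The endpoint URL, the service names `full_text` and `text`, and the attribute names `rcsb_entry_info.resolution_combined` and `exptl.method` are assumptions based on the v2 JSON schema and should be checked against the RCSB documentation. The helper names are hypothetical.

```python
import json

# Assumed endpoint of the RCSB Search API (v2); verify before use
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def full_text(value):
    """Terminal node for a free-text search (the BasicQuery row)."""
    return {"type": "terminal", "service": "full_text",
            "parameters": {"value": value}}


def attribute(field, operator, value):
    """Terminal node for a metadata filter (the AttributeQuery row)."""
    return {"type": "terminal", "service": "text",
            "parameters": {"attribute": field, "operator": operator,
                           "value": value}}


def all_of(*nodes):
    """Group node that requires every child query to match."""
    return {"type": "group", "logical_operator": "and", "nodes": list(nodes)}


def build_request(query, return_type="entry"):
    """Wrap a query tree in the top-level request object."""
    return {"query": query, "return_type": return_type}


request = build_request(all_of(
    full_text("kinase inhibitor"),
    attribute("rcsb_entry_info.resolution_combined", "less_or_equal", 2.5),
    attribute("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
))
print(json.dumps(request, indent=2))
```

When working through biotite rather than raw JSON, metadata filters of this kind are written with `rcsb.FieldQuery`, and query objects can be combined with `&` and `|`.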
Structure Analysis
```python
import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx
import biotite.structure as struc
import numpy as np


def analyze_structure(pdb_id):
    """Comprehensive structural analysis of a PDB entry."""
    # Download mmCIF format (more complete than PDB)
    file = rcsb.fetch(pdb_id, "cif")
    # biotite reads mmCIF through its pdbx module, where get_structure is a
    # module-level function (older releases name the class PDBxFile)
    structure = pdbx.get_structure(pdbx.CIFFile.read(file), model=1)

    # Basic statistics
    info = {
        "pdb_id": pdb_id,
        "total_atoms": structure.array_length(),
        "chains": list(struc.get_chains(structure)),
        "residue_count": struc.get_residue_count(structure),
    }

    # Separate protein from ligands and water
    protein = structure[struc.filter_amino_acids(structure)]
    water = structure[structure.res_name == "HOH"]
    ligands = structure[
        ~struc.filter_amino_acids(structure)
        & (structure.res_name != "HOH")
        & (structure.hetero == True)
    ]
    info["protein_residues"] = struc.get_residue_count(protein)
    info["water_molecules"] = len(set(zip(water.chain_id, water.res_id)))
    info["ligand_names"] = list(set(ligands.res_name))

    # Secondary structure composition ('a' = helix, 'b' = sheet)
    sse = struc.annotate_sse(protein)
    helix_frac = np.sum(sse == 'a') / len(sse)
    sheet_frac = np.sum(sse == 'b') / len(sse)
    info["helix_fraction"] = f"{helix_frac:.1%}"
    info["sheet_fraction"] = f"{sheet_frac:.1%}"
    return info


info = analyze_structure("4HHB")
for key, val in info.items():
    print(f"{key}: {val}")
```
Binding Site Analysis
```python
import biotite.structure as struc
import numpy as np


def find_binding_site(structure, ligand_name, distance_cutoff=5.0):
    """Identify residues within distance of a ligand."""
    protein = structure[struc.filter_amino_acids(structure)]
    ligand = structure[structure.res_name == ligand_name]
    if len(ligand) == 0:
        raise ValueError(f"Ligand {ligand_name} not found")

    # Find protein atoms within cutoff of any ligand atom
    binding_site_mask = np.zeros(len(protein), dtype=bool)
    for lig_atom in range(len(ligand)):
        distances = struc.distance(protein, ligand[lig_atom])
        binding_site_mask |= (distances <= distance_cutoff)
    binding_residues = protein[binding_site_mask]

    # Get unique residues
    residue_ids = set(zip(
        binding_residues.chain_id,
        binding_residues.res_id,
        binding_residues.res_name
    ))
    print(f"Binding site for {ligand_name}:")
    print(f"  Contact residues: {len(residue_ids)}")
    for chain, resid, resname in sorted(residue_ids):
        print(f"  {chain}:{resname}{resid}")
    return binding_residues, residue_ids


# Find binding site residues
# (`structure` is the hemoglobin structure loaded in Quick Start)
binding, residues = find_binding_site(structure, "HEM")
```
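The per-atom loop in the binding site search can also be expressed as a single broadcast over coordinate arrays. The following is a minimal sketch of that idea using plain NumPy arrays; the function name `contact_mask` is hypothetical, and with biotite you would pass `protein.coord` and `ligand.coord` as the two inputs.

```python
import numpy as np


def contact_mask(protein_coord, ligand_coord, cutoff=5.0):
    """Flag protein atoms within `cutoff` Å of any ligand atom.

    protein_coord has shape (N, 3), ligand_coord has shape (M, 3).
    Returns a boolean array of length N.
    """
    protein_coord = np.asarray(protein_coord, dtype=float)
    ligand_coord = np.asarray(ligand_coord, dtype=float)
    # Broadcast to an (N, M, 3) array of displacement vectors
    diff = protein_coord[:, None, :] - ligand_coord[None, :, :]
    # (N, M) distance matrix, reduced to the nearest ligand atom per protein atom
    nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
    return nearest <= cutoff


# Toy coordinates: only the first protein atom lies within 5 Å of the ligand
protein_xyz = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [10.0, 0.0, 0.0]]
ligand_xyz = [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
mask = contact_mask(protein_xyz, ligand_xyz)
print(mask)
```

The intermediate array holds N × M × 3 floats, so this trades memory for speed. For very large complexes a spatial index such as biotite's `CellList` is likely the better fit.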
Configuration
| Parameter | Description | Default |
|---|---|---|
| format | Download format (pdb, cif, xml) | "cif" |
| model | Model number for NMR structures | 1 |
| distance_cutoff | Binding site distance threshold (Å) | 5.0 |
| sequence_identity | Redundancy filter threshold | 0.9 |
| resolution_max | Maximum resolution filter (Å) | 3.0 |
| experimental_method | Method filter (X-ray, NMR, cryo-EM) | All |
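One way to carry these parameters through a workflow is a small validated settings object. The sketch below is illustrative only: `PdbConfig` is a hypothetical name, the defaults are copied from the table, and `None` is used here to stand for the "All" method filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdbConfig:
    """Hypothetical container for the parameters listed in the table."""
    format: str = "cif"
    model: int = 1
    distance_cutoff: float = 5.0        # Å
    sequence_identity: float = 0.9
    resolution_max: float = 3.0         # Å
    experimental_method: Optional[str] = None  # None means no method filter

    def __post_init__(self):
        # Reject values that cannot be meaningful before any download happens
        if self.format not in ("pdb", "cif", "xml"):
            raise ValueError(f"unsupported format: {self.format!r}")
        if self.model < 1:
            raise ValueError("model numbers start at 1")
        if self.distance_cutoff <= 0 or self.resolution_max <= 0:
            raise ValueError("distance_cutoff and resolution_max must be positive")
        if not 0 < self.sequence_identity <= 1:
            raise ValueError("sequence_identity must be in (0, 1]")


# Stricter settings, e.g. for structure-based drug design
strict = PdbConfig(resolution_max=2.5, experimental_method="X-ray")
print(strict)
```

Making the object frozen keeps a run's settings from changing halfway through an analysis.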
Best Practices
- Use mmCIF format over legacy PDB format. The PDB flat file format is limited to 99,999 atoms and single-character chain IDs. mmCIF has no atom limit, supports multi-character chain IDs, and carries richer metadata. PDBx/mmCIF is also the standard format of the PDB archive.
- Filter by resolution for structure quality. For structure-based drug design, use structures with resolution ≤2.5 Å. For general analysis, ≤3.0 Å is acceptable. Always check the R-free value as an independent quality indicator: values above 0.35 suggest potential issues.
- Handle multiple models in NMR structures. NMR structures contain an ensemble of models (typically 20). Either analyze all models and report statistics, or use model 1, which is usually treated as the representative conformer. Don't average coordinates across NMR models, because that creates physically impossible structures.
- Remove alternate conformations before analysis. Many crystal structures contain alternate side-chain conformations (altlocs A, B). Pick the highest-occupancy conformer, or conformer A by default. Keeping both creates duplicate atoms that break distance calculations and structure comparisons.
- Use sequence-based search for finding homologs. Text search finds structures by annotation, which can miss unannotated or differently named entries. Use `SequenceQuery` with your protein sequence to find structural homologs regardless of how they were annotated in the database.
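The alternate-conformation rule can be sketched without any structure library. The example below assumes atoms are plain dictionaries with `chain`, `res_id`, `name`, `altloc`, and `occupancy` keys (a made-up representation for illustration) and keeps, per residue, the altloc label with the highest summed occupancy. Choosing per residue rather than per atom is a design assumption that keeps each side chain internally consistent.

```python
from collections import defaultdict


def select_altlocs(atoms):
    """Keep one conformer per residue: the altloc with the highest total occupancy.

    Atoms with an empty altloc label are always kept. Ties go to the
    alphabetically first label (so 'A' wins over 'B').
    """
    # Sum occupancy per (chain, residue, altloc label)
    totals = defaultdict(float)
    for atom in atoms:
        if atom["altloc"]:
            totals[(atom["chain"], atom["res_id"], atom["altloc"])] += atom["occupancy"]

    # Pick the best label for each residue, visiting labels in sorted order
    best = {}
    for (chain, res_id, label), occ in sorted(totals.items()):
        key = (chain, res_id)
        if key not in best or occ > best[key][1]:
            best[key] = (label, occ)

    return [
        atom for atom in atoms
        if not atom["altloc"]
        or best[(atom["chain"], atom["res_id"])][0] == atom["altloc"]
    ]


atoms = [
    {"chain": "A", "res_id": 10, "name": "N", "altloc": "", "occupancy": 1.0},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "A", "occupancy": 0.4},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "B", "occupancy": 0.6},
    {"chain": "A", "res_id": 10, "name": "CG", "altloc": "A", "occupancy": 0.4},
    {"chain": "A", "res_id": 10, "name": "CG", "altloc": "B", "occupancy": 0.6},
]
kept = select_altlocs(atoms)
print([(a["name"], a["altloc"]) for a in kept])
```

Structure readers often offer this selection directly. biotite's `get_structure`, for instance, accepts an `altloc` argument, so check the reader's options before filtering by hand.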
Common Issues
Structure has missing residues (gaps in sequence). Many crystal structures have disordered loops that couldn't be resolved. Check for gaps in residue numbering and consult the `pdbx_unobs_or_zero_occ_residues` records in mmCIF. If you need a complete model, rebuild the missing regions with a modeling tool, for example by taking them from an AlphaFold prediction of the same sequence.
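Checking for gaps in residue numbering can be sketched as a single pass over (chain, residue number) pairs. The function name `find_gaps` is hypothetical. With biotite, the input could be `zip(structure.chain_id, structure.res_id)`. A jump in numbering is only a candidate gap, since some entries use non-sequential numbering on purpose.

```python
def find_gaps(res_ids):
    """Report breaks in residue numbering, chain by chain.

    `res_ids` is an iterable of (chain_id, residue_number) pairs in file
    order; repeated pairs (one per atom) are fine.
    Returns a list of (chain_id, last_seen, next_seen, n_missing) tuples.
    """
    gaps = []
    previous = {}  # last residue number seen in each chain
    for chain, number in res_ids:
        last = previous.get(chain)
        if last is not None and number - last > 1:
            gaps.append((chain, last, number, number - last - 1))
        previous[chain] = number
    return gaps


observed = [("A", 1), ("A", 2), ("A", 2), ("A", 6),
            ("B", 10), ("B", 11), ("B", 15)]
for chain, before, after, missing in find_gaps(observed):
    print(f"Chain {chain}: {missing} residues missing between {before} and {after}")
```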
Ligand coordinates are in a separate entity. In mmCIF files, ligands may be in a different entity or assembly than the protein chains. Use the `structure.hetero` flag combined with `res_name` filtering rather than chain-based selection to locate ligands.
RMSD calculation gives unexpectedly high values. RMSD is only meaningful over equivalent atoms in superimposed structures. Use a sequence alignment to establish residue correspondence, extract the matching backbone CA atoms from both structures, superimpose them, and then compute the RMSD. Computing RMSD over all atoms without correspondence or superposition gives meaningless numbers.
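The correspondence step can be illustrated with CA coordinates keyed by residue. This is a sketch under two loudly stated assumptions: both structures use the same residue numbering (otherwise derive the key mapping from a sequence alignment), and the coordinates are already superimposed, for example with `struc.superimpose` in biotite. The function name `ca_rmsd` is hypothetical.

```python
import math


def ca_rmsd(ca_a, ca_b):
    """RMSD over CA atoms that are present in both structures.

    `ca_a` and `ca_b` map a residue key, e.g. (chain, res_id), to an
    (x, y, z) tuple. Returns (rmsd, number_of_matched_residues).
    Assumes the two coordinate sets are already superimposed.
    """
    shared = sorted(set(ca_a) & set(ca_b))
    if not shared:
        raise ValueError("no equivalent residues between the two structures")
    squared = sum(math.dist(ca_a[key], ca_b[key]) ** 2 for key in shared)
    return math.sqrt(squared / len(shared)), len(shared)


reference = {("A", 1): (0.0, 0.0, 0.0),
             ("A", 2): (3.8, 0.0, 0.0),
             ("A", 3): (7.6, 0.0, 0.0)}
mobile = {("A", 2): (3.8, 1.0, 0.0),
          ("A", 3): (7.6, -1.0, 0.0),
          ("A", 4): (11.4, 0.0, 0.0)}
rmsd, n_matched = ca_rmsd(reference, mobile)
print(f"RMSD over {n_matched} matched CA atoms: {rmsd:.2f} Å")
```

Residues present in only one structure are dropped rather than paired by position, which is what prevents the inflated values described above.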