PDB Database Complete

Query and analyze 3D structural data from the RCSB Protein Data Bank (PDB), the US data center of the Worldwide PDB (wwPDB) archive of macromolecular structures. This skill covers structure search, coordinate retrieval, sequence-structure analysis, visualization preparation, and structural bioinformatics workflows.

When to Use This Skill

Choose PDB Database Complete when you need to:

Search for protein or nucleic acid structures by sequence, name, or function
Retrieve atomic coordinates and structural metadata for analysis
Perform structure-based drug design or binding site analysis
Compare structures through alignment and RMSD calculations

Consider alternatives when:

You need to predict structures from sequence (use AlphaFold or ESMFold)
You need molecular dynamics simulations (use GROMACS or OpenMM)
You need cryo-EM density map analysis (use EMDB directly)

Quick Start


pip install biotite requests pandas


import biotite.database.rcsb as rcsb
import biotite.structure.io.pdb as pdb
import biotite.structure as struc

# Search for structures
query = rcsb.BasicQuery("insulin")
pdb_ids = rcsb.search(query)
print(f"Found {len(pdb_ids)} insulin structures")

# Download and parse a structure
file = rcsb.fetch("4HHB", "pdb")  # Hemoglobin
structure = pdb.PDBFile.read(file).get_structure(model=1)

print(f"Atoms: {structure.array_length()}")
print(f"Chains: {struc.get_chains(structure)}")
print(f"Residues: {struc.get_residue_count(structure)}")

Core Concepts

Search Query Types

Query Type	Description	Example
`BasicQuery`	Text search across all fields	`"kinase inhibitor"`
`SequenceQuery`	BLAST-like sequence search	Protein/DNA sequence
`StructureQuery`	3D structure similarity	PDB ID reference
`FieldQuery`	Filter by metadata fields	Resolution, method, organism
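Chemical search	Search by ligand or compound through the RCSB Search API `chemical` service (no dedicated Biotite query class that we are aware of)	SMILES, InChI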
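To make the query types above more concrete, here is a minimal sketch of how a text search and two metadata filters might be combined into one request body for the RCSB Search API. It uses only the standard library and sends nothing over the network. The helper names (`terminal`, `full_text`, `all_of`, `build_request`) are hypothetical, and the attribute names and operators are assumptions based on the public RCSB search schema, so verify them against the current schema before relying on them.

```python
import json

def terminal(attribute, operator, value):
    """One metadata filter (a "terminal" node of the text service)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

def full_text(term):
    """Free-text search across all fields."""
    return {"type": "terminal", "service": "full_text", "parameters": {"value": term}}

def all_of(*nodes):
    """Combine nodes with a logical AND."""
    return {"type": "group", "logical_operator": "and", "nodes": list(nodes)}

def build_request(query, rows=25):
    """Wrap a query tree in a request that returns PDB entry IDs."""
    return {
        "query": query,
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": rows}},
    }

# Assumed attribute names; check them against the RCSB search schema
request = build_request(all_of(
    full_text("kinase inhibitor"),
    terminal("rcsb_entry_info.resolution_combined", "less_or_equal", 2.5),
    terminal("exptl.method", "exact_match", "X-RAY DIFFRACTION"),
))
print(json.dumps(request, indent=2))
```

In Biotite, the equivalent composition is normally written by joining query objects with `&` (logical AND) or `|` (logical OR) and passing the result to `rcsb.search`.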

Structure Analysis


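import biotite.database.rcsb as rcsb
import biotite.structure.io.pdbx as pdbx  # Biotite's mmCIF reader lives in the pdbx module
import biotite.structure as struc
import numpy as np

def analyze_structure(pdb_id):
    """Comprehensive structural analysis of a PDB entry."""
    # Download mmCIF format (more complete than PDB)
    file = rcsb.fetch(pdb_id, "cif")
    # Parse with the module-level get_structure(); CIFFile has no such method
    # (older Biotite releases name the file class PDBxFile instead of CIFFile)
    structure = pdbx.get_structure(pdbx.CIFFile.read(file), model=1)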

    # Basic statistics
    info = {
        "pdb_id": pdb_id,
        "total_atoms": structure.array_length(),
        "chains": list(struc.get_chains(structure)),
        "residue_count": struc.get_residue_count(structure),
    }

    # Separate protein from ligands and water
    protein = structure[struc.filter_amino_acids(structure)]
    water = structure[structure.res_name == "HOH"]
    ligands = structure[
        ~struc.filter_amino_acids(structure) &
        (structure.res_name != "HOH") &
        structure.hetero  # boolean annotation, no comparison needed
    ]

    info["protein_residues"] = struc.get_residue_count(protein)
    info["water_molecules"] = len(set(zip(water.chain_id, water.res_id)))
    info["ligand_names"] = list(set(ligands.res_name))

    # Secondary structure composition
    sse = struc.annotate_sse(protein)
    helix_frac = np.sum(sse == 'a') / len(sse)
    sheet_frac = np.sum(sse == 'b') / len(sse)
    info["helix_fraction"] = f"{helix_frac:.1%}"
    info["sheet_fraction"] = f"{sheet_frac:.1%}"

    return info

info = analyze_structure("4HHB")
for key, val in info.items():
    print(f"{key}: {val}")

Binding Site Analysis


import biotite.structure as struc
import numpy as np

def find_binding_site(structure, ligand_name, distance_cutoff=5.0):
    """Identify residues within distance of a ligand."""
    protein = structure[struc.filter_amino_acids(structure)]
    ligand = structure[structure.res_name == ligand_name]

    if len(ligand) == 0:
        raise ValueError(f"Ligand {ligand_name} not found")

    # Find protein atoms within cutoff of any ligand atom
    binding_site_mask = np.zeros(len(protein), dtype=bool)
    for lig_atom in range(len(ligand)):
        distances = struc.distance(
            protein, ligand[lig_atom]
        )
        binding_site_mask |= (distances <= distance_cutoff)

    binding_residues = protein[binding_site_mask]

    # Get unique residues
    residue_ids = set(zip(
        binding_residues.chain_id,
        binding_residues.res_id,
        binding_residues.res_name
    ))

    print(f"Binding site for {ligand_name}:")
    print(f"  Contact residues: {len(residue_ids)}")
    for chain, resid, resname in sorted(residue_ids):
        print(f"  {chain}:{resname}{resid}")

    return binding_residues, residue_ids

# Find heme-contacting residues; `structure` is the hemoglobin (4HHB)
# AtomArray loaded in the Quick Start example
binding, residues = find_binding_site(structure, "HEM")

Configuration

Parameter	Description	Default
`format`	Download format (pdb, cif, xml)	`"cif"`
`model`	Model number for NMR structures	`1`
`distance_cutoff`	Binding site distance threshold (Å)	`5.0`
`sequence_identity`	Redundancy filter threshold	`0.9`
`resolution_max`	Maximum resolution filter (Å)	`3.0`
`experimental_method`	Method filter (X-ray, NMR, cryo-EM)	All
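One way to carry these parameters through a workflow is a small validated settings object. The sketch below is illustrative only: the `PDBConfig` class is hypothetical and not part of any library. It encodes the defaults from the table, uses `None` to stand for "all methods", and assumes that `sequence_identity` is a fraction between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PDBConfig:
    """Hypothetical container for the parameters in the table above."""
    format: str = "cif"
    model: int = 1
    distance_cutoff: float = 5.0      # Angstrom
    sequence_identity: float = 0.9    # assumed to be a fraction in (0, 1]
    resolution_max: float = 3.0       # Angstrom
    experimental_method: Optional[str] = None  # None means no method filter

    def __post_init__(self):
        if self.format not in ("pdb", "cif", "xml"):
            raise ValueError(f"Unsupported format: {self.format!r}")
        if self.model < 1:
            raise ValueError("model numbers start at 1")
        if not 0.0 < self.sequence_identity <= 1.0:
            raise ValueError("sequence_identity must be in (0, 1]")
        if self.distance_cutoff <= 0 or self.resolution_max <= 0:
            raise ValueError("distance_cutoff and resolution_max must be positive")

default_config = PDBConfig()
strict_config = PDBConfig(resolution_max=2.5, experimental_method="X-ray")
print(default_config)
print(strict_config)
```

Freezing the dataclass keeps a configuration from being changed halfway through an analysis, so every step sees the same cutoffs.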

Best Practices

Use mmCIF format over legacy PDB format: The PDB flat file format has hard limits (99,999 atoms and 62 single-character chain IDs per file). mmCIF has no such limits, supports multi-character chain IDs, and carries richer metadata. mmCIF is the standard format of the PDB archive.
Filter by resolution for structure quality: For structure-based drug design, use structures with resolution ≤2.5 Å. For general analysis, ≤3.0 Å is acceptable. For crystal structures, always check the R-free value as an independent quality indicator; values above 0.35 suggest potential issues.
Handle multiple models in NMR structures: NMR structures contain an ensemble of models (typically 20). Either analyze all models and report statistics, or use model 1, which is conventionally treated as the representative. Don't average coordinates across NMR models, because the averaged coordinates form physically impossible structures.
Remove alternate conformations before analysis: Many crystal structures contain alternate side-chain conformations (altlocs A, B). Pick the highest-occupancy conformer, or conformer A by default. Keeping both creates duplicate atoms that break distance calculations and structure comparisons.
Use sequence-based search for finding homologs: Text search finds structures by annotation, which can miss unannotated or differently named entries. Use `SequenceQuery` with your protein sequence to find structural homologs regardless of how they were annotated in the database.
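The alternate-conformation rule above can be sketched in plain Python. This is a minimal illustration rather than a library routine: the atom records are hypothetical dictionaries, `select_altlocs` is an invented name, and the choice is made per residue by highest mean occupancy, with ties going to the alphabetically first altloc ID (usually A).

```python
from collections import defaultdict
from statistics import mean

def select_altlocs(atoms):
    """Keep a single conformer per residue.

    `atoms` is a list of dicts with keys chain, res_id, name, altloc and
    occupancy (a hypothetical layout; altloc is "" when there is no alternate).
    """
    # Collect occupancies per residue and altloc ID
    occupancies = defaultdict(lambda: defaultdict(list))
    for atom in atoms:
        if atom["altloc"]:
            key = (atom["chain"], atom["res_id"])
            occupancies[key][atom["altloc"]].append(atom["occupancy"])

    # Highest mean occupancy wins; ties go to the alphabetically first ID
    chosen = {
        key: min(alts, key=lambda alt: (-mean(alts[alt]), alt))
        for key, alts in occupancies.items()
    }

    # Atoms without an altloc are always kept
    return [
        atom for atom in atoms
        if not atom["altloc"]
        or chosen[(atom["chain"], atom["res_id"])] == atom["altloc"]
    ]

atoms = [
    {"chain": "A", "res_id": 10, "name": "CA", "altloc": "", "occupancy": 1.0},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "A", "occupancy": 0.4},
    {"chain": "A", "res_id": 10, "name": "CB", "altloc": "B", "occupancy": 0.6},
    {"chain": "A", "res_id": 11, "name": "CB", "altloc": "A", "occupancy": 0.5},
    {"chain": "A", "res_id": 11, "name": "CB", "altloc": "B", "occupancy": 0.5},
]
kept = select_altlocs(atoms)
print([(a["res_id"], a["name"], a["altloc"]) for a in kept])
```

Biotite's structure readers appear to cover the same need through an `altloc` argument on `get_structure`; check the documentation for your installed version before depending on its default.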

Common Issues

Structure has missing residues (gaps in sequence): Many crystal structures have disordered loops that could not be resolved. Check for gaps in residue numbering and consult the `_pdbx_unobs_or_zero_occ_residues` category in the mmCIF file. If you need a complete model, rebuild the missing regions with a modeling tool, for example by using an AlphaFold prediction of the same sequence as a guide.
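A quick first check for unresolved regions is to scan one chain's residue numbers for jumps. The sketch below is a simple heuristic with a hypothetical function name, `find_gaps`. It assumes integer residue numbers from a single chain, and a jump can also come from the author's numbering scheme rather than from missing density, so confirm any hit against the mmCIF records.

```python
def find_gaps(res_ids):
    """Return (last_before, first_after, n_missing) for each numbering jump.

    `res_ids` holds the residue numbers of ONE chain; duplicates (one entry
    per atom) are fine because they are collapsed first.
    """
    ordered = sorted(set(res_ids))
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > 1:
            gaps.append((prev, curr, curr - prev - 1))
    return gaps

# Residues 4-6 and 9-11 are absent in this toy chain
for before, after, missing in find_gaps([1, 2, 3, 7, 8, 12]):
    print(f"Gap between {before} and {after}: {missing} residues missing")
```

With a Biotite `AtomArray`, the input for one chain could be taken from its `res_id` annotation after filtering on `chain_id`.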

Ligand coordinates are in a separate entity: In mmCIF files, ligands may belong to a different entity or assembly than the protein chains. Locate them with the `structure.hetero` mask combined with `res_name` filtering rather than chain-based selection.

RMSD calculation gives unexpectedly high values: RMSD is only meaningful between equivalent atoms of superimposed structures. Use sequence alignment to establish residue correspondence, extract the matching Cα atoms from both structures, and superimpose them before computing RMSD. Computing RMSD over all atoms without this matching and superposition gives meaningless numbers.
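The atom-matching step can be illustrated with a small standard-library sketch. It makes two simplifying assumptions that should be stated loudly: residues are paired by identical (chain, residue number) keys instead of by a real sequence alignment, and the two coordinate sets are already superimposed. The function name `matched_ca_rmsd` and the dictionary layout are hypothetical.

```python
import math

def matched_ca_rmsd(ca_a, ca_b):
    """RMSD over CA atoms present in both structures.

    Each argument maps (chain, res_id) -> (x, y, z) of that residue's CA atom.
    Assumes shared numbering and coordinates that are already superimposed.
    Returns (rmsd, number_of_matched_residues).
    """
    shared = sorted(set(ca_a) & set(ca_b))
    if not shared:
        raise ValueError("No equivalent residues found")
    squared = sum(
        sum((p - q) ** 2 for p, q in zip(ca_a[key], ca_b[key]))
        for key in shared
    )
    return math.sqrt(squared / len(shared)), len(shared)

ca_a = {("A", 1): (0.0, 0.0, 0.0), ("A", 2): (1.0, 0.0, 0.0), ("A", 3): (2.0, 0.0, 0.0)}
ca_b = {("A", 1): (0.0, 0.0, 1.0), ("A", 2): (1.0, 0.0, 1.0), ("A", 4): (5.0, 5.0, 5.0)}

rmsd, n_matched = matched_ca_rmsd(ca_a, ca_b)
print(f"RMSD over {n_matched} matched CA atoms: {rmsd:.2f} Å")
```

Residues 3 and 4 have no partner and are ignored, which is the point: only equivalent atoms enter the calculation. For real structures, run a superposition routine on the matched atoms first, otherwise the rigid-body offset between the two coordinate frames dominates the result.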
