Advanced Uniprot Database

Query and analyze protein data from UniProt, the world's most comprehensive protein sequence and functional annotation database. This skill covers REST API queries, programmatic sequence retrieval, protein feature extraction, cross-reference mapping, and batch processing for proteomics research.

When to Use This Skill

Choose Advanced Uniprot Database when you need to:

Retrieve protein sequences, functions, and annotations programmatically
Map between protein identifiers (UniProt, PDB, RefSeq, Gene names)
Search for proteins by function, taxonomy, subcellular location, or disease association
Download and process bulk protein data for bioinformatics analysis

Consider alternatives when:

You need protein 3D structure analysis (use PDB Database Complete)
You need protein-protein interaction networks (use STRING or IntAct)
You need gene-level information rather than protein (use NCBI Gene or Ensembl)

Quick Start


pip install requests pandas biopython


import requests
import pandas as pd

UNIPROT_API = "https://rest.uniprot.org"

def search_uniprot(query, format="json", size=10, fields=None):
    """Search UniProt with the REST API."""
    params = {
        "query": query,
        "format": format,
        "size": size,
    }
    if fields:
        params["fields"] = ",".join(fields)

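    response = requests.get(f"{UNIPROT_API}/uniprotkb/search", params=params, timeout=30)
    response.raise_for_status()
    # Only JSON can be decoded into a dict; return raw text for tsv, fasta, xml, gff
    return response.json() if format == "json" else response.text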

# Search for human insulin-related proteins
results = search_uniprot(
    query="insulin AND organism_id:9606 AND reviewed:true",
    fields=["accession", "protein_name", "gene_names", "length",
            "cc_function", "cc_subcellular_location"],
    size=5
)

for entry in results["results"]:
    acc = entry["primaryAccession"]
    name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
    genes = entry.get("genes", [{}])
    gene = genes[0].get("geneName", {}).get("value", "N/A") if genes else "N/A"
    length = entry["sequence"]["length"]
    print(f"{acc} | {gene} | {name} | {length} aa")

# Get a single protein by accession (returns the full JSON entry)
def get_protein(accession):
    response = requests.get(f"{UNIPROT_API}/uniprotkb/{accession}", timeout=30)
    response.raise_for_status()
    return response.json()

p53 = get_protein("P04637")
print(f"\nTP53: {p53['proteinDescription']['recommendedName']['fullName']['value']}")
print(f"Length: {p53['sequence']['length']} aa")
print(f"Sequence: {p53['sequence']['value'][:60]}...")

Core Concepts

UniProt Query Fields

Field	Description	Example Query
`protein_name`	Protein name	`protein_name:kinase`
`gene`	Gene name	`gene:TP53`
`organism_id`	NCBI taxonomy ID	`organism_id:9606` (human)
`reviewed`	Swiss-Prot (reviewed) only	`reviewed:true`
`ec`	Enzyme classification	`ec:2.7.11.1`
`keyword`	UniProt keywords	`keyword:phosphoprotein`
`cc_subcellular_location`	Cellular location	`cc_subcellular_location:nucleus`
`cc_disease`	Disease association	`cc_disease:cancer`
`length`	Sequence length range	`length:[100 TO 500]`
`ft_domain`	Protein domains	`ft_domain:kinase`
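
The fields above combine with uppercase boolean operators. As a minimal sketch, a small helper can assemble a query string from keyword arguments. The helper name `build_query` and its conventions (tuples become ranges, values with spaces are quoted) are illustrative assumptions, not part of UniProt's API.

```python
def build_query(**filters):
    """Join field:value clauses with AND (hypothetical helper, not a UniProt API)."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, bool):
            # UniProt expects lowercase true/false
            text = "true" if value else "false"
        elif isinstance(value, tuple):
            # A (low, high) tuple becomes a range such as [100 TO 500]
            low, high = value
            text = f"[{low} TO {high}]"
        else:
            text = str(value)
            if " " in text:
                # Quote multi-word values so they stay a single term
                text = f'"{text}"'
        clauses.append(f"{field}:{text}")
    return " AND ".join(clauses)


query = build_query(gene="TP53", organism_id=9606, reviewed=True, length=(100, 500))
print(query)
# gene:TP53 AND organism_id:9606 AND reviewed:true AND length:[100 TO 500]
```

The resulting string can be passed as the `query` argument of `search_uniprot` from the Quick Start.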

Batch ID Mapping


import requests
import time

def map_ids(from_db, to_db, ids):
    """Map between database identifiers via UniProt ID Mapping."""
    # Submit mapping job
    response = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={"from": from_db, "to": to_db, "ids": ",".join(ids)}
    )
    response.raise_for_status()
    job_id = response.json()["jobId"]

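    # Poll until the job finishes; fail fast instead of looping forever on errors
    while True:
        status = requests.get(
            f"https://rest.uniprot.org/idmapping/status/{job_id}"
        )
        status.raise_for_status()
        status_data = status.json()

        # A finished job exposes results (or failed IDs) or a redirect to them
        if any(key in status_data for key in ("results", "failedIds", "redirectURL")):
            break
        job_status = status_data.get("jobStatus")
        if job_status == "FINISHED":
            break
        if job_status not in (None, "NEW", "RUNNING"):
            raise RuntimeError(f"ID mapping job {job_id} ended with status {job_status}")
        time.sleep(2)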

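    # Get results (this endpoint is paginated; only the first page is read here)
    results_url = f"https://rest.uniprot.org/idmapping/uniprotkb/results/{job_id}"
    results_response = requests.get(results_url)
    results_response.raise_for_status()
    results = results_response.json()

    # One source ID can map to several entries (a gene name is shared across
    # species), so collect every target accession instead of overwriting
    mapping = {}
    for result in results.get("results", []):
        from_id = result["from"]
        to_id = result["to"]["primaryAccession"]
        mapping.setdefault(from_id, []).append(to_id)

    return mapping

# Map gene names to UniProt accessions
genes = ["TP53", "BRCA1", "EGFR", "KRAS", "MYC"]
mapping = map_ids("Gene_Name", "UniProtKB-Swiss-Prot", genes)
for gene, accessions in mapping.items():
    print(f"  {gene} → {', '.join(accessions)}")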
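
ID lists that exceed the per-job limit have to be split before submission. The sketch below shows one way to do that. The helper name `chunked` is hypothetical, and the default of 100,000 is taken from the limit quoted under Best Practices rather than checked against the live service.

```python
def chunked(ids, batch_size=100_000):
    """Yield consecutive slices of ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Small demonstration with a tiny batch size
accessions = [f"ID{i}" for i in range(7)]
batches = list(chunked(accessions, batch_size=3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each batch can then be submitted as its own mapping job, with the per-batch results merged afterwards.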

Configuration

Parameter	Description	Default
`base_url`	UniProt REST API endpoint	`"https://rest.uniprot.org"`
`format`	Response format (json, tsv, fasta, xml, gff)	`"json"`
`size`	Results per page	`25`
`reviewed`	Swiss-Prot only (higher quality)	`true`
`fields`	Specific columns to return	All
`compressed`	gzip-compressed responses	`false`
`organism_id`	Filter by NCBI taxonomy ID	None
`include_isoform`	Include alternative protein isoforms	`false`

Best Practices

Filter by reviewed:true by default: Swiss-Prot (reviewed) entries are manually curated with verified annotations. TrEMBL (unreviewed) entries are automatically annotated and may contain errors. Fall back to unreviewed entries only when Swiss-Prot coverage is insufficient for your organism.
Request only the fields you need: Specify fields in your search to reduce response size and API load. A full entry carries every annotation, feature, and cross-reference; requesting only accession, gene name, and sequence keeps large result sets small and fast to transfer.
Use batch ID mapping for identifier conversion: Don't scrape or manually map identifiers. UniProt's ID Mapping service handles conversions between 100+ database identifiers (GenBank, PDB, RefSeq, Ensembl, GO). Submit batches of up to 100,000 IDs per request.
Implement pagination for large result sets: UniProt returns paginated results with a Link header pointing to the next page. Follow the next link until no more pages remain. Don't rely on size=500 to return all results, because the total may exceed the page size.
Cache frequently accessed protein data locally: Protein entries change only between releases. Cache responses as JSON files keyed by accession and date to reduce API calls and speed up iterative analysis. UniProt publishes a new release roughly every 8 weeks.
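
The pagination advice above can be sketched as follows. This version uses only the standard library so it runs without extra dependencies. The function names `next_page_url` and `iter_results` are illustrative, and the shape of the Link header (`<url>; rel="next"`) is an assumption based on the usual HTTP convention.

```python
import json
import re
import urllib.parse
import urllib.request

# Matches the URL of the entry marked rel="next" in a Link header
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Return the URL marked rel="next", or None when there is no next page."""
    if not link_header:
        return None
    match = NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def iter_results(query, fields, page_size=500):
    """Yield every matching entry, following Link headers page by page."""
    params = urllib.parse.urlencode({
        "query": query,
        "format": "json",
        "size": page_size,
        "fields": ",".join(fields),
    })
    url = f"https://rest.uniprot.org/uniprotkb/search?{params}"
    while url:
        with urllib.request.urlopen(url, timeout=60) as response:
            payload = json.load(response)
            link_header = response.headers.get("Link")
        yield from payload["results"]
        # The next link already carries the query and cursor
        url = next_page_url(link_header)


# Example (performs network requests):
# for entry in iter_results("gene:TP53 AND reviewed:true", ["accession"]):
#     print(entry["primaryAccession"])
```

With requests, `response.links` exposes the same parsed Link header, so the regular expression is not needed there.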

Common Issues

API returns 400 Bad Request for complex queries — UniProt's query syntax requires specific field names and operators. Use field names from the documentation (e.g., organism_id not organism). Boolean operators must be uppercase: AND, OR, NOT. Test queries in the web interface first.

ID mapping returns no results — Check that the database names are correct (case-sensitive): "Gene_Name", "UniProtKB-Swiss-Prot", "PDB", "RefSeq_Protein". Also verify that the IDs are valid and match the specified source database. Gene names must match official symbols.

Large downloads time out or fail — For bulk downloads (>10,000 entries), use the streaming endpoint with compressed responses. Set Accept-Encoding: gzip in headers and process the response stream incrementally rather than loading it all into memory.
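
A minimal sketch of incremental processing, assuming the `/uniprotkb/stream` endpoint and the `compressed` parameter from the Configuration table (used here instead of an Accept-Encoding header). The helper names `iter_fasta` and `stream_fasta` are hypothetical. Records are yielded one at a time, so the full download never sits in memory.

```python
import gzip
import urllib.parse
import urllib.request


def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stream_fasta(query):
    """Stream gzip-compressed FASTA for a query and yield records lazily."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "fasta", "compressed": "true"}
    )
    url = f"https://rest.uniprot.org/uniprotkb/stream?{params}"
    with urllib.request.urlopen(url, timeout=300) as response:
        # gzip.open decompresses while reading, one line at a time
        with gzip.open(response, mode="rt", encoding="utf-8") as text:
            yield from iter_fasta(text)


# Example (performs a network request):
# for header, sequence in stream_fasta("organism_id:9606 AND reviewed:true"):
#     print(header, len(sequence))
```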
