Advanced Uniprot Database
Comprehensive skill for direct REST access to UniProt. Includes structured workflows, validation checks, and reusable patterns for scientific work.
Query and analyze protein data from UniProt, the world's most comprehensive protein sequence and functional annotation database. This skill covers REST API queries, programmatic sequence retrieval, protein feature extraction, cross-reference mapping, and batch processing for proteomics research.
When to Use This Skill
Choose Advanced Uniprot Database when you need to:
- Retrieve protein sequences, functions, and annotations programmatically
- Map between protein identifiers (UniProt, PDB, RefSeq, Gene names)
- Search for proteins by function, taxonomy, subcellular location, or disease association
- Download and process bulk protein data for bioinformatics analysis
Consider alternatives when:
- You need protein 3D structure analysis (use PDB Database Complete)
- You need protein-protein interaction networks (use STRING or IntAct)
- You need gene-level information rather than protein-level information (use NCBI Gene or Ensembl)
Quick Start
```bash
pip install requests pandas biopython
```

```python
import requests

UNIPROT_API = "https://rest.uniprot.org"


def search_uniprot(query, format="json", size=10, fields=None):
    """Search UniProtKB with the REST API."""
    params = {
        "query": query,
        "format": format,
        "size": size,
    }
    if fields:
        params["fields"] = ",".join(fields)
    response = requests.get(
        f"{UNIPROT_API}/uniprotkb/search", params=params, timeout=30
    )
    response.raise_for_status()
    return response.json()


# Search for human insulin-related proteins
results = search_uniprot(
    query="insulin AND organism_id:9606 AND reviewed:true",
    fields=["accession", "protein_name", "gene_names", "length",
            "cc_function", "cc_subcellular_location"],
    size=5,
)

for entry in results["results"]:
    acc = entry["primaryAccession"]
    name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
    genes = entry.get("genes", [{}])
    gene = genes[0].get("geneName", {}).get("value", "N/A") if genes else "N/A"
    length = entry["sequence"]["length"]
    print(f"{acc} | {gene} | {name} | {length} aa")


# Get a single protein by accession
def get_protein(accession):
    """Fetch one UniProtKB entry as JSON."""
    response = requests.get(
        f"{UNIPROT_API}/uniprotkb/{accession}.json", timeout=30
    )
    response.raise_for_status()
    return response.json()


p53 = get_protein("P04637")
print(f"\nTP53: {p53['proteinDescription']['recommendedName']['fullName']['value']}")
print(f"Length: {p53['sequence']['length']} aa")
print(f"Sequence: {p53['sequence']['value'][:60]}...")
```
Core Concepts
UniProt Query Fields
| Field | Description | Example Query |
|---|---|---|
| `protein_name` | Protein name | `protein_name:kinase` |
| `gene` | Gene name | `gene:TP53` |
| `organism_id` | NCBI taxonomy ID | `organism_id:9606` (human) |
| `reviewed` | Swiss-Prot (reviewed) only | `reviewed:true` |
| `ec` | Enzyme classification | `ec:2.7.11.1` |
| `keyword` | UniProt keywords | `keyword:phosphoprotein` |
| `cc_subcellular_location` | Cellular location | `cc_subcellular_location:nucleus` |
| `cc_disease` | Disease association | `cc_disease:cancer` |
| `length` | Sequence length range | `length:[100 TO 500]` |
| `ft_domain` | Protein domains | `ft_domain:kinase` |
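The fields above combine into a single query string with uppercase boolean operators. The following is a minimal sketch of how such a string might be assembled; `build_query` is a hypothetical helper introduced here for illustration and is not part of any UniProt client library. It assumes every filter should be joined with `AND` and that values contain no spaces (phrases with spaces would need quoting).

```python
def build_query(*terms, **fields):
    """Join free-text terms and field:value filters with AND.

    Hypothetical helper for illustration only.
    """
    parts = list(terms)
    for field, value in fields.items():
        if isinstance(value, bool):
            # Boolean filters are written as lowercase true/false.
            value = str(value).lower()
        elif isinstance(value, (tuple, list)) and len(value) == 2:
            # Two-element sequences become a range, e.g. [100 TO 500].
            value = f"[{value[0]} TO {value[1]}]"
        parts.append(f"{field}:{value}")
    return " AND ".join(parts)


query = build_query("insulin", organism_id=9606, reviewed=True, length=(100, 500))
print(query)
# insulin AND organism_id:9606 AND reviewed:true AND length:[100 TO 500]
```

The resulting string can be passed as the `query` argument of `search_uniprot` from the Quick Start.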
Batch ID Mapping
```python
import time

import requests

UNIPROT_API = "https://rest.uniprot.org"


def map_ids(from_db, to_db, ids):
    """Map between database identifiers via UniProt ID Mapping."""
    # Submit the mapping job
    response = requests.post(
        f"{UNIPROT_API}/idmapping/run",
        data={"from": from_db, "to": to_db, "ids": ",".join(ids)},
        timeout=30,
    )
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Poll until the job finishes; stop on any unexpected status
    # instead of looping forever.
    while True:
        status = requests.get(
            f"{UNIPROT_API}/idmapping/status/{job_id}", timeout=30
        )
        status.raise_for_status()
        status_data = status.json()
        job_status = status_data.get("jobStatus")
        finished = any(
            key in status_data for key in ("results", "failedIds", "redirectURL")
        )
        if finished or job_status == "FINISHED":
            break
        if job_status not in ("NEW", "RUNNING"):
            raise RuntimeError(f"ID mapping job {job_id} failed: {status_data}")
        time.sleep(2)

    # Get results (first page only; large jobs need pagination)
    results_url = f"{UNIPROT_API}/idmapping/uniprotkb/results/{job_id}"
    results_response = requests.get(results_url, timeout=30)
    results_response.raise_for_status()
    results = results_response.json()

    # One source ID can map to several entries; this keeps the last one seen.
    mapping = {}
    for result in results.get("results", []):
        from_id = result["from"]
        to_id = result["to"]["primaryAccession"]
        mapping[from_id] = to_id
    return mapping


# Map gene names to UniProt accessions
genes = ["TP53", "BRCA1", "EGFR", "KRAS", "MYC"]
mapping = map_ids("Gene_Name", "UniProtKB-Swiss-Prot", genes)
for gene, accession in mapping.items():
    print(f"  {gene} → {accession}")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `base_url` | UniProt REST API endpoint | `"https://rest.uniprot.org"` |
| `format` | Response format (json, tsv, fasta, xml, gff) | `"json"` |
| `size` | Results per page | 25 |
| `reviewed` | Swiss-Prot only (higher quality) | true |
| `fields` | Specific columns to return | All |
| `compressed` | gzip-compressed responses | false |
| `organism_id` | Filter by NCBI taxonomy ID | None |
| `include_isoform` | Include alternative protein isoforms | false |
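One way to carry these settings through a script is a small container that turns them into request parameters. This is a sketch under stated assumptions: `UniProtConfig` and `to_params` are hypothetical names, and the `compressed` and `includeIsoform` query-parameter spellings are assumptions that should be checked against the current UniProt API reference before use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UniProtConfig:
    """Hypothetical settings container mirroring the table above."""

    base_url: str = "https://rest.uniprot.org"
    format: str = "json"
    size: int = 25
    reviewed: bool = True
    fields: List[str] = field(default_factory=list)  # empty list = all fields
    compressed: bool = False
    organism_id: Optional[int] = None
    include_isoform: bool = False

    def to_params(self, query: str) -> dict:
        """Build query-string parameters for a search request."""
        filters = [f"({query})"]
        if self.reviewed:
            filters.append("reviewed:true")
        if self.organism_id is not None:
            filters.append(f"organism_id:{self.organism_id}")
        params = {
            "query": " AND ".join(filters),
            "format": self.format,
            "size": self.size,
        }
        if self.fields:
            params["fields"] = ",".join(self.fields)
        # Assumed parameter names; verify against the API reference.
        if self.compressed:
            params["compressed"] = "true"
        if self.include_isoform:
            params["includeIsoform"] = "true"
        return params


config = UniProtConfig(organism_id=9606, fields=["accession", "length"])
params = config.to_params("gene:TP53")
print(params["query"])
# (gene:TP53) AND reviewed:true AND organism_id:9606
```

Folding `reviewed` and `organism_id` into the query string keeps them consistent with the field syntax in the query table, since both are search filters rather than separate request options.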
Best Practices
- Always filter by `reviewed:true` for curated data: Swiss-Prot (reviewed) entries are manually curated with verified annotations. TrEMBL (unreviewed) entries are automatically annotated and may contain errors. Use unreviewed entries only when Swiss-Prot coverage is insufficient for your organism.
- Request only the fields you need: Specify `fields` in your search to reduce response size and API load. A full UniProt entry can be several KB; requesting only accession, gene name, and sequence is much faster for large result sets.
- Use batch ID mapping for identifier conversion: Don't scrape or manually map identifiers. UniProt's ID Mapping service handles conversions between 100+ database identifiers (GenBank, PDB, RefSeq, Ensembl, GO). Submit batches of up to 100,000 IDs per request.
- Implement pagination for large result sets: UniProt returns paginated results with a `Link` header pointing to the next page. Follow the `next` link until no more pages remain. Don't rely on `size=500` to return all results, because the total may exceed the page size.
- Cache frequently accessed protein data locally: Protein entries change infrequently. Cache responses as JSON files keyed by accession and date. This reduces API calls and speeds up iterative analysis. UniProt publishes a new release about every 8 weeks.
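The pagination advice above can be sketched as a generator that follows the `next` link until it disappears. This is a minimal sketch, not a definitive client: `iter_results` is a hypothetical helper, and it assumes `session` is a `requests.Session()` (or any object with a compatible `get` method), relying on the `links` attribute that `requests` derives from the `Link` response header.

```python
def iter_results(session, url, params=None, timeout=30):
    """Yield every entry from a paginated UniProt search.

    `session` is expected to behave like `requests.Session()`.
    Hypothetical helper for illustration only.
    """
    while url:
        response = session.get(url, params=params, timeout=timeout)
        response.raise_for_status()
        yield from response.json().get("results", [])
        # requests parses the Link header into `response.links`.
        url = response.links.get("next", {}).get("url")
        # The next-page URL already carries the query and cursor.
        params = None


# Example usage (performs network requests):
# import requests
# entries = iter_results(
#     requests.Session(),
#     "https://rest.uniprot.org/uniprotkb/search",
#     {"query": "organism_id:9606 AND reviewed:true", "size": 500},
# )
# for entry in entries:
#     print(entry["primaryAccession"])
```

Because the function is a generator, entries are processed page by page and the full result set never has to sit in memory at once.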
Common Issues
API returns 400 Bad Request for complex queries — UniProt's query syntax requires specific field names and operators. Use field names from the documentation (e.g., organism_id not organism). Boolean operators must be uppercase: AND, OR, NOT. Test queries in the web interface first.
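Since lowercase operators are an easy mistake to make, a small pre-flight step can uppercase them before the request is sent. The sketch below uses a hypothetical `normalize_operators` helper; it leaves quoted phrases untouched, but it would also uppercase a bare free-text search term such as `and`, so treat it as a convenience rather than a full query parser.

```python
import re

_BOOLEAN = re.compile(r"\b(and|or|not)\b", re.IGNORECASE)


def normalize_operators(query):
    """Uppercase AND/OR/NOT outside double-quoted phrases.

    Hypothetical helper for illustration only.
    """
    # With a capturing group, re.split keeps the quoted phrases:
    # even indexes are unquoted text, odd indexes are quoted phrases.
    chunks = re.split(r'("[^"]*")', query)
    for i in range(0, len(chunks), 2):
        chunks[i] = _BOOLEAN.sub(lambda m: m.group(1).upper(), chunks[i])
    return "".join(chunks)


print(normalize_operators("insulin and organism_id:9606 not reviewed:false"))
# insulin AND organism_id:9606 NOT reviewed:false
```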
ID mapping returns no results — Check that the database names are correct (case-sensitive): "Gene_Name", "UniProtKB-Swiss-Prot", "PDB", "RefSeq_Protein". Also verify that the IDs are valid and match the specified source database. Gene names must match official symbols.
Large downloads time out or fail — For bulk downloads (>10,000 entries), use the streaming endpoint with compressed responses. Set Accept-Encoding: gzip in headers and process the response stream incrementally rather than loading it all into memory.
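Incremental processing of a bulk download might look like the following sketch. `parse_fasta` and `stream_fasta` are hypothetical helpers; the sketch assumes `session` is a `requests.Session()`, that the streaming endpoint lives at `/uniprotkb/stream`, and that the server honors `Accept-Encoding: gzip` so that `requests` decompresses the body transparently while iterating.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stream_fasta(session, query,
                 url="https://rest.uniprot.org/uniprotkb/stream",
                 timeout=300):
    """Stream FASTA records for a query without buffering the whole body.

    `session` is expected to behave like `requests.Session()`.
    """
    params = {"query": query, "format": "fasta"}
    headers = {"Accept-Encoding": "gzip"}
    with session.get(url, params=params, headers=headers,
                     stream=True, timeout=timeout) as response:
        response.raise_for_status()
        # iter_lines yields bytes; decode each line as it arrives.
        lines = (raw.decode("utf-8") for raw in response.iter_lines())
        yield from parse_fasta(lines)


# Example usage (performs a network request):
# import requests
# for header, seq in stream_fasta(requests.Session(), "organism_id:9606 AND reviewed:true"):
#     print(header, len(seq))
```

Keeping the parser separate from the network call means it can be reused on a local FASTA file opened in text mode.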