Advanced Uniprot Database

Comprehensive skill for direct REST access to UniProt. Includes structured workflows, validation checks, and reusable patterns for scientific work.

Skill | Cliptics | scientific | v1.0.0 | MIT

Query and analyze protein data from UniProt, the world's most comprehensive protein sequence and functional annotation database. This skill covers REST API queries, programmatic sequence retrieval, protein feature extraction, cross-reference mapping, and batch processing for proteomics research.

When to Use This Skill

Choose Advanced Uniprot Database when you need to:

  • Retrieve protein sequences, functions, and annotations programmatically
  • Map between protein identifiers (UniProt, PDB, RefSeq, Gene names)
  • Search for proteins by function, taxonomy, subcellular location, or disease association
  • Download and process bulk protein data for bioinformatics analysis

Consider alternatives when:

  • You need protein 3D structure analysis (use PDB Database Complete)
  • You need protein-protein interaction networks (use STRING or IntAct)
  • You need gene-level information rather than protein (use NCBI Gene or Ensembl)

Quick Start

    pip install requests pandas biopython

    import requests
    import pandas as pd

    UNIPROT_API = "https://rest.uniprot.org"

    def search_uniprot(query, format="json", size=10, fields=None):
        """Search UniProt with the REST API."""
        params = {
            "query": query,
            "format": format,
            "size": size,
        }
        if fields:
            params["fields"] = ",".join(fields)
        response = requests.get(f"{UNIPROT_API}/uniprotkb/search", params=params)
        response.raise_for_status()
        # Only the JSON format can be parsed; return raw text for tsv, fasta, etc.
        return response.json() if format == "json" else response.text

    # Search for human insulin-related proteins
    results = search_uniprot(
        query="insulin AND organism_id:9606 AND reviewed:true",
        fields=["accession", "protein_name", "gene_names", "length",
                "cc_function", "cc_subcellular_location"],
        size=5,
    )

    for entry in results["results"]:
        acc = entry["primaryAccession"]
        # Not every entry has a recommended name, so fall back to "N/A"
        name = (entry.get("proteinDescription", {})
                     .get("recommendedName", {})
                     .get("fullName", {})
                     .get("value", "N/A"))
        genes = entry.get("genes", [{}])
        gene = genes[0].get("geneName", {}).get("value", "N/A") if genes else "N/A"
        length = entry.get("sequence", {}).get("length", "N/A")
        print(f"{acc} | {gene} | {name} | {length} aa")

    # Get single protein by accession
    def get_protein(accession, format="json"):
        # The format is selected by the file extension, e.g. P04637.json
        response = requests.get(f"{UNIPROT_API}/uniprotkb/{accession}.{format}")
        response.raise_for_status()
        return response.json() if format == "json" else response.text

    p53 = get_protein("P04637")
    print(f"\nTP53: {p53['proteinDescription']['recommendedName']['fullName']['value']}")
    print(f"Length: {p53['sequence']['length']} aa")
    print(f"Sequence: {p53['sequence']['value'][:60]}...")

Core Concepts

UniProt Query Fields

| Field | Description | Example Query |
| --- | --- | --- |
| protein_name | Protein name | protein_name:kinase |
| gene | Gene name | gene:TP53 |
| organism_id | NCBI taxonomy ID | organism_id:9606 (human) |
| reviewed | Swiss-Prot (reviewed) only | reviewed:true |
| ec | Enzyme classification | ec:2.7.11.1 |
| keyword | UniProt keywords | keyword:phosphoprotein |
| cc_subcellular_location | Cellular location | cc_subcellular_location:nucleus |
| cc_disease | Disease association | cc_disease:cancer |
| length | Sequence length range | length:[100 TO 500] |
| ft_domain | Protein domains | ft_domain:kinase |
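
The fields in the table can be combined into a single query string. The sketch below shows one way to do that; `build_query` is a hypothetical helper, not part of UniProt or of any library, and it passes field names through unchecked, so they must already be valid UniProt query fields.

```python
def build_query(**filters):
    """Join field:value pairs with AND into a UniProt query string.

    Hypothetical helper: field names are passed through unchanged.
    """
    clauses = []
    for field, value in filters.items():
        if isinstance(value, bool):
            # UniProt queries use lowercase true/false
            clauses.append(f"{field}:{str(value).lower()}")
        elif isinstance(value, tuple) and len(value) == 2:
            # A (low, high) tuple becomes a range such as length:[100 TO 500]
            low, high = value
            clauses.append(f"{field}:[{low} TO {high}]")
        else:
            text = str(value)
            # Quote multi-word values so they are treated as one phrase
            if " " in text:
                text = f'"{text}"'
            clauses.append(f"{field}:{text}")
    return " AND ".join(clauses)


query = build_query(gene="TP53", organism_id=9606, reviewed=True, length=(100, 500))
print(query)
```

The resulting string can be passed as the `query` argument of `search_uniprot` from the Quick Start.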

Batch ID Mapping

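    import time

    import requests

    ID_MAPPING_API = "https://rest.uniprot.org/idmapping"

    def map_ids(from_db, to_db, ids, poll_seconds=2):
        """Map between database identifiers via UniProt ID Mapping."""
        # Submit mapping job
        response = requests.post(
            f"{ID_MAPPING_API}/run",
            data={"from": from_db, "to": to_db, "ids": ",".join(ids)},
        )
        response.raise_for_status()
        job_id = response.json()["jobId"]

        # Poll until the job finishes; stop with an error on any
        # status other than NEW or RUNNING instead of looping forever
        while True:
            status = requests.get(f"{ID_MAPPING_API}/status/{job_id}")
            status.raise_for_status()
            status_data = status.json()
            job_status = status_data.get("jobStatus")
            finished = (
                job_status == "FINISHED"
                or "results" in status_data
                or "failedIds" in status_data
                or "redirectURL" in status_data
            )
            if finished:
                break
            if job_status not in ("NEW", "RUNNING"):
                raise RuntimeError(
                    f"ID mapping job {job_id} ended with status {job_status!r}"
                )
            time.sleep(poll_seconds)

        # Get results (this endpoint applies when the target is a UniProtKB database).
        # Results are paginated, so only the first page is read here.
        results_url = f"{ID_MAPPING_API}/uniprotkb/results/{job_id}"
        results_response = requests.get(results_url, params={"size": 500})
        results_response.raise_for_status()
        results = results_response.json()

        # One accession is kept per input ID; if an ID maps to several
        # entries (e.g. the same gene name in several species), later hits
        # overwrite earlier ones
        mapping = {}
        for result in results.get("results", []):
            from_id = result["from"]
            to_id = result["to"]["primaryAccession"]
            mapping[from_id] = to_id
        return mapping

    # Map gene names to UniProt accessions
    genes = ["TP53", "BRCA1", "EGFR", "KRAS", "MYC"]
    mapping = map_ids("Gene_Name", "UniProtKB-Swiss-Prot", genes)
    for gene, accession in mapping.items():
        print(f"  {gene} -> {accession}")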

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| base_url | UniProt REST API endpoint | "https://rest.uniprot.org" |
| format | Response format (json, tsv, fasta, xml, gff) | "json" |
| size | Results per page | 25 |
| reviewed | Swiss-Prot only (higher quality) | true |
| fields | Specific columns to return | All |
| compressed | gzip-compressed responses | false |
| organism_id | Filter by NCBI taxonomy ID | None |
| include_isoform | Include alternative protein isoforms | false |
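
As a rough illustration of how these settings might translate into a search request, the sketch below merges overrides into the defaults from the table and returns a URL plus query parameters. `DEFAULTS` and `build_request` are hypothetical names. The mapping of `compressed` and `include_isoform` to the request parameters `compressed` and `includeIsoform` is an assumption to check against the UniProt API documentation.

```python
DEFAULTS = {
    "base_url": "https://rest.uniprot.org",
    "format": "json",
    "size": 25,
    "reviewed": True,
    "fields": None,           # None means "return all fields"
    "compressed": False,
    "organism_id": None,
    "include_isoform": False,
}


def build_request(query, **overrides):
    """Return (url, params) for a UniProtKB search using the settings above."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides}

    # reviewed and organism_id are expressed as extra query clauses
    clauses = [f"({query})"]
    if cfg["reviewed"]:
        clauses.append("reviewed:true")
    if cfg["organism_id"] is not None:
        clauses.append(f"organism_id:{cfg['organism_id']}")

    params = {
        "query": " AND ".join(clauses),
        "format": cfg["format"],
        "size": cfg["size"],
    }
    if cfg["fields"]:
        params["fields"] = ",".join(cfg["fields"])
    # Assumed request parameter names; verify against the API documentation
    if cfg["compressed"]:
        params["compressed"] = "true"
    if cfg["include_isoform"]:
        params["includeIsoform"] = "true"
    return f"{cfg['base_url']}/uniprotkb/search", params


url, params = build_request("insulin", organism_id=9606, fields=["accession", "length"])
print(url)
print(params["query"])
```

The returned pair can be handed to `requests.get(url, params=params)`.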

Best Practices

  1. Always filter by reviewed:true for curated data — Swiss-Prot (reviewed) entries are manually curated with verified annotations. TrEMBL (unreviewed) entries are automatically predicted and may contain errors. Use unreviewed entries only when Swiss-Prot coverage is insufficient for your organism.

  2. Request only the fields you need: Specify fields in your search to reduce response size and API load. A full UniProt entry in JSON can run to tens of kilobytes or more, so requesting only accession, gene name, and sequence makes large result sets much faster to download and parse.

  3. Use batch ID mapping for identifier conversion — Don't scrape or manually map identifiers. UniProt's ID Mapping service handles conversions between 100+ database identifiers (GenBank, PDB, RefSeq, Ensembl, GO). Submit batches of up to 100,000 IDs per request.

  4. Implement pagination for large result sets — UniProt returns paginated results with a Link header pointing to the next page. Follow the next link until no more pages remain. Don't rely on size=500 to return all results — the total may exceed the page size.

  5. Cache frequently accessed protein data locally: Protein entries change infrequently. Cache responses as JSON files keyed by accession and date. This reduces API calls and speeds up iterative analysis. UniProt publishes a new release about every eight weeks, so cached entries can be refreshed on that schedule.
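
The pagination advice in item 4 can be sketched as follows. This is a minimal illustration and not an official client: `next_page_url`, `iter_results`, and `fetch_with_requests` are hypothetical names, and the `fetch` callable is an interface invented here so the paging logic can be exercised without network access.

```python
import re

# Matches the URL in a Link header entry such as: <https://...>; rel="next"
_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Return the URL marked rel="next" in a Link header, or None."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def iter_results(first_url, fetch):
    """Yield every entry across all pages.

    `fetch` takes a URL and returns (parsed_json, link_header_or_None).
    """
    url = first_url
    while url:
        payload, link_header = fetch(url)
        yield from payload.get("results", [])
        url = next_page_url(link_header)


def fetch_with_requests(url):
    """One possible `fetch` implementation built on the requests library."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json(), response.headers.get("Link")
```

In use, `for entry in iter_results(first_url, fetch_with_requests): ...` walks every page, where `first_url` is a full search URL including its query string. The requests library also exposes the parsed header as `response.links`, which could replace the regular expression.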

Common Issues

API returns 400 Bad Request for complex queries — UniProt's query syntax requires specific field names and operators. Use field names from the documentation (e.g., organism_id not organism). Boolean operators must be uppercase: AND, OR, NOT. Test queries in the web interface first.
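
A quick local check can catch the two mistakes named above before a request is sent. The sketch below is a rough heuristic, not a parser for UniProt's query grammar. `lint_query` and `KNOWN_FIELDS` are hypothetical names, and the field set only covers the fields listed in this document, so extend it for real use.

```python
import re

# Only the query fields mentioned in this document; not the full UniProt list
KNOWN_FIELDS = {
    "protein_name", "gene", "organism_id", "reviewed", "ec", "keyword",
    "cc_subcellular_location", "cc_disease", "length", "ft_domain",
}


def lint_query(query, known_fields=KNOWN_FIELDS):
    """Return a list of likely problems in a UniProt query string."""
    problems = []
    # Blank out quoted phrases so words inside them are not flagged
    unquoted = re.sub(r'"[^"]*"', '""', query)
    for word in re.findall(r"\b(and|or|not)\b", unquoted):
        problems.append(f"lowercase operator '{word}' should be '{word.upper()}'")
    for field in re.findall(r"\b([A-Za-z_]+):", unquoted):
        if field not in known_fields:
            problems.append(f"unrecognized field '{field}'")
    return problems


for problem in lint_query("insulin and organism:9606"):
    print(problem)
```

An empty list means nothing suspicious was found, not that the server will accept the query.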

ID mapping returns no results — Check that the database names are correct (case-sensitive): "Gene_Name", "UniProtKB-Swiss-Prot", "PDB", "RefSeq_Protein". Also verify that the IDs are valid and match the specified source database. Gene names must match official symbols.
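
When a mapping looks incomplete, it can help to split the submitted IDs into mapped, unmapped, and ambiguous groups. The sketch below works on an already downloaded results payload. `summarize_mapping` is a hypothetical helper, the `failedIds` key is assumed to be present as in UniProt's results JSON, and the sample payload is hand-written for illustration, not real API output.

```python
from collections import defaultdict


def summarize_mapping(payload, submitted_ids):
    """Split submitted IDs into mapped, unmapped, and ambiguous groups."""
    hits = defaultdict(list)
    for item in payload.get("results", []):
        target = item["to"]
        # UniProtKB targets are entry objects; other targets are plain strings
        if isinstance(target, dict):
            target = target.get("primaryAccession")
        hits[item["from"]].append(target)

    failed = set(payload.get("failedIds", []))
    unmapped = [i for i in submitted_ids if i in failed or i not in hits]
    # IDs with more than one hit, e.g. the same gene name in several species
    ambiguous = {k: v for k, v in hits.items() if len(v) > 1}
    return dict(hits), unmapped, ambiguous


# Hand-written sample payload for illustration only
sample_payload = {
    "results": [
        {"from": "TP53", "to": {"primaryAccession": "P04637"}},
        {"from": "EGFR", "to": {"primaryAccession": "P00533"}},
        {"from": "EGFR", "to": {"primaryAccession": "Q01279"}},
    ],
    "failedIds": ["NOTAGENE"],
}
mapped, unmapped, ambiguous = summarize_mapping(
    sample_payload, ["TP53", "EGFR", "NOTAGENE"]
)
print("unmapped:", unmapped)
print("ambiguous:", ambiguous)
```

Unmapped IDs point to a wrong source database or a misspelled identifier, while ambiguous IDs usually need a narrower query such as an organism restriction.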

Large downloads time out or fail — For bulk downloads (>10,000 entries), use the streaming endpoint with compressed responses. Set Accept-Encoding: gzip in headers and process the response stream incrementally rather than loading it all into memory.
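
Incremental processing of a compressed download might look like the sketch below. It assumes the bytes arrive still gzip-compressed, for example from `response.iter_content(chunk_size=...)` on a request whose body is a gzip file. If the HTTP client has already decompressed the stream, this step is unnecessary. `iter_lines_from_gzip` is a hypothetical helper, and the sample data is hand-made.

```python
import gzip
import zlib


def iter_lines_from_gzip(chunks):
    """Yield decoded text lines from an iterable of gzip-compressed chunks.

    Only one chunk plus a partial line is held in memory at a time.
    Handles a single gzip member, which is assumed here.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    for chunk in chunks:
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8")
    pending += decompressor.flush()
    remaining = pending.split(b"\n")
    if remaining[-1] == b"":
        remaining.pop()  # drop the empty piece after a trailing newline
    for line in remaining:
        yield line.decode("utf-8")


# Offline demonstration with hand-made data split into small chunks
sample = gzip.compress(b"Entry\tLength\nP04637\t393\n")
chunks = [sample[i:i + 8] for i in range(0, len(sample), 8)]
rows = list(iter_lines_from_gzip(chunks))
print(rows)
```

Because the function is a generator, each line can be parsed or written to disk as it arrives without holding the whole download in memory.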
