Advanced bioRxiv Database

All-in-one skill covering efficient database search and retrieval tooling. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Advanced bioRxiv Database

A scientific computing skill for searching and retrieving preprints from bioRxiv — the preprint server for biology. Advanced bioRxiv Database helps you search by keywords, authors, date ranges, and subject areas, and retrieve full metadata including abstracts, DOIs, and publication status.

When to Use This Skill

Choose Advanced bioRxiv Database when:

  • Searching for recent preprints by topic, author, or date range
  • Monitoring new preprints in specific subject areas
  • Retrieving preprint metadata (abstract, DOI, dates, publication status)
  • Building literature review pipelines that include preprints

Consider alternatives when:

  • You need peer-reviewed papers only (use PubMed)
  • You need full-text access (use publisher APIs)
  • You're searching chemistry preprints (use ChemRxiv)
  • You need physics/math preprints (use arXiv)

Quick Start

```
claude "Search bioRxiv for recent CRISPR gene therapy preprints"
```

```python
import requests
from datetime import datetime, timedelta

# bioRxiv API — content detail endpoint
base_url = "https://api.biorxiv.org/details/biorxiv"

# Search by date range
end_date = datetime.now().strftime("%Y-%m-%d")
start_date = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d")

response = requests.get(f"{base_url}/{start_date}/{end_date}/0/25")
data = response.json()

print(f"Total preprints in range: {data['messages'][0]['total']}")
for paper in data["collection"]:
    print(f"\n{paper['title']}")
    print(f"  DOI: {paper['doi']}")
    print(f"  Category: {paper['category']}")
    print(f"  Date: {paper['date']}")
    print(f"  Published: {paper.get('published', 'Not yet')}")
```

Core Concepts

bioRxiv API Endpoints

| Endpoint | Purpose | Format |
|---|---|---|
| `/details/biorxiv/{start}/{end}` | Preprints by date range | YYYY-MM-DD |
| `/details/biorxiv/{start}/{end}/{cursor}/{count}` | Paginated results | cursor=0, count=100 |
| `/pubs/biorxiv/{start}/{end}` | Published preprints only | YYYY-MM-DD |
| `/publisher/{prefix}/{start}/{end}` | By publisher DOI prefix | e.g., 10.1038 |
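
As a quick illustration of how the path segments in the table compose into a request URL, here is a minimal helper. The function name `details_url` and the `server` parameter default are our own conventions, not part of the API itself:

```python
API_BASE = "https://api.biorxiv.org"

def details_url(start, end, cursor=0, count=100, server="biorxiv"):
    """Compose a paginated details-endpoint URL for a YYYY-MM-DD date range."""
    return f"{API_BASE}/details/{server}/{start}/{end}/{cursor}/{count}"

print(details_url("2024-01-01", "2024-01-31"))
# → https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31/0/100
```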

Keyword Search Implementation

```python
import requests
from datetime import datetime, timedelta

def search_biorxiv(keyword, days=30, max_results=100):
    """Search bioRxiv preprints by keyword in title/abstract."""
    end = datetime.now().strftime("%Y-%m-%d")
    start = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    results = []
    cursor = 0
    batch_size = 100
    while len(results) < max_results:
        url = f"https://api.biorxiv.org/details/biorxiv/{start}/{end}/{cursor}/{batch_size}"
        resp = requests.get(url)
        data = resp.json()
        if not data.get("collection"):
            break
        # The API has no full-text search, so filter client-side on metadata
        for paper in data["collection"]:
            text = f"{paper['title']} {paper['abstract']}".lower()
            if keyword.lower() in text:
                results.append(paper)
        cursor += batch_size
        total = int(data["messages"][0]["total"])
        if cursor >= total:
            break
    return results[:max_results]

# Search for CRISPR papers
crispr_papers = search_biorxiv("CRISPR", days=30)
print(f"Found {len(crispr_papers)} CRISPR preprints in last 30 days")
```

Publication Tracking

```python
import requests

def check_publication_status(dois):
    """Check if bioRxiv preprints have been published in journals."""
    published = []
    for doi in dois:
        # Use the pubs endpoint to check publication status
        url = f"https://api.biorxiv.org/pubs/biorxiv/{doi}"
        resp = requests.get(url)
        data = resp.json()
        if data.get("collection"):
            pub = data["collection"][0]
            if pub.get("published_doi"):
                published.append({
                    "preprint_doi": doi,
                    "journal": pub.get("published_journal", "Unknown"),
                    "published_doi": pub["published_doi"],
                    "published_date": pub.get("published_date", "Unknown"),
                })
    return published
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `api_base_url` | bioRxiv API base URL | `https://api.biorxiv.org` |
| `batch_size` | Results per API call (max 100) | 100 |
| `date_range_days` | Default search window | 30 |
| `include_abstracts` | Include full abstracts in results | true |
| `server` | `biorxiv` or `medrxiv` | `biorxiv` |
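
A minimal sketch of these parameters as a Python settings dict; the `CONFIG` name is hypothetical, so adapt it to however your pipeline stores configuration:

```python
# Hypothetical configuration mapping for the parameters above
CONFIG = {
    "api_base_url": "https://api.biorxiv.org",
    "batch_size": 100,          # the API returns at most 100 per call
    "date_range_days": 30,      # default search window
    "include_abstracts": True,
    "server": "biorxiv",        # or "medrxiv"
}
```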

Best Practices

  1. Paginate through all results. The bioRxiv API returns at most 100 results per request. Use the cursor parameter to fetch subsequent pages. Check the total field in the response to know when you've retrieved everything.

  2. Cache results for repeated searches. bioRxiv API doesn't enforce strict rate limits, but repeated queries for the same date range waste bandwidth. Cache responses locally with timestamps and only re-fetch when the date range extends beyond your cache.

  3. Use date ranges strategically. Narrow date ranges return results faster. For monitoring new preprints, query the last 1-7 days rather than large date ranges. For comprehensive literature reviews, query month by month to manage result volumes.

  4. Check publication status for citing preprints. Before citing a bioRxiv preprint, check if the peer-reviewed version has been published using the /pubs endpoint. Citing the published version is preferred when available.

  5. Combine with PubMed for comprehensive coverage. bioRxiv only has preprints. For a complete literature review, search both bioRxiv (for recent, unpublished work) and PubMed (for peer-reviewed published work). Deduplicate by DOI.
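
Practice 1 (pagination) can be sketched as a generator that advances the cursor until the reported total is exhausted. The `fetch_page` argument is a hypothetical seam, not part of the API: in production it would wrap `requests.get` on a details-endpoint URL, while tests can pass a stub:

```python
def iter_preprints(fetch_page, batch_size=100):
    """Yield preprints across pages, advancing the cursor until `total` is reached.

    `fetch_page(cursor)` is any callable returning one parsed API response:
    a dict with "collection" and "messages" keys, as the details endpoint returns.
    """
    cursor = 0
    while True:
        data = fetch_page(cursor)
        collection = data.get("collection") or []
        if not collection:
            break
        yield from collection
        cursor += batch_size
        total = int(data["messages"][0]["total"])
        if cursor >= total:
            break
```

In production, `fetch_page` might be `lambda cursor: requests.get(f"{base_url}/{start}/{end}/{cursor}/100").json()`.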
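
Practice 2 (caching) might look like the following file-backed sketch; the cache directory name and the 24-hour TTL are assumptions, not anything mandated by the bioRxiv API:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path(".biorxiv_cache")   # hypothetical local cache location
CACHE_TTL = 24 * 3600                # re-fetch after one day

def cached_fetch(key, fetch, ttl=CACHE_TTL):
    """Return a cached JSON response if fresh; otherwise call `fetch` and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < ttl:
        return json.loads(path.read_text())
    data = fetch()
    path.write_text(json.dumps(data))
    return data
```

A natural cache key is the date range itself, e.g. `"2024-01-01_2024-01-31_0"`.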
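
Practice 5's deduplication step can be sketched as a DOI-keyed merge that prefers the second source's record on collision (here, the peer-reviewed one). It assumes each record carries a `doi` field; DOI comparison is case-insensitive, since DOI case is not significant:

```python
def merge_by_doi(preprints, pubmed_records):
    """Merge two result sets, preferring the peer-reviewed record when DOIs collide."""
    merged = {p["doi"].lower(): p for p in preprints}
    merged.update({r["doi"].lower(): r for r in pubmed_records})
    return list(merged.values())
```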

Common Issues

API returns empty collection for valid date ranges. The API has a maximum date range per request (typically 1 month). Split longer ranges into monthly chunks and combine results. Also verify the date format is YYYY-MM-DD.
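
Splitting a long range into monthly chunks can be done with a small helper like this sketch (`monthly_ranges` is our own name, not an API feature):

```python
from datetime import date, timedelta

def monthly_ranges(start, end):
    """Split [start, end] into calendar-month chunks the API will accept."""
    ranges = []
    chunk_start = start
    while chunk_start <= end:
        # First day of the following month, then step back one day
        next_month = (chunk_start.replace(day=1) + timedelta(days=32)).replace(day=1)
        chunk_end = min(next_month - timedelta(days=1), end)
        ranges.append((chunk_start.isoformat(), chunk_end.isoformat()))
        chunk_start = next_month
    return ranges
```

Each `(start, end)` pair can then be fed to the details endpoint and the results concatenated.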

Keyword search misses relevant preprints. The bioRxiv API doesn't support full-text search — you can only filter client-side on the returned metadata. Use broad date ranges to capture more papers, then filter locally by matching keywords against title and abstract text.

Rate limiting during bulk downloads. While bioRxiv's API is generally permissive, rapid-fire requests may be throttled. Add a 1-second delay between paginated requests for bulk operations. For very large downloads, use the bioRxiv data dumps available on their FTP server.
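
The one-second delay can be sketched as a generic wrapper that spaces out successive calls; this is a general throttling pattern rather than anything bioRxiv-specific, and the `throttled` name is ours:

```python
import time

def throttled(fn, min_interval=1.0):
    """Wrap `fn` so successive calls are at least `min_interval` seconds apart."""
    last_call = [0.0]
    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return fn(*args, **kwargs)
    return wrapper
```

For example, `get = throttled(requests.get)` and then call `get(url)` inside the pagination loop.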
