OpenAlex Database Elite

Boost productivity with this skill for querying and analyzing scholarly literature. Includes structured workflows, validation checks, and reusable patterns for scientific research.

Skill · Cliptics · scientific · v1.0.0 · MIT

Query and analyze scholarly research data from OpenAlex, an open catalog of 240M+ academic works, authors, institutions, and topics. This skill covers API querying, citation analysis, research trend tracking, author profiling, and bibliometric analysis at scale.

When to Use This Skill

Choose OpenAlex Database Elite when you need to:

  • Search and filter academic publications by topic, author, institution, or date
  • Analyze citation networks and track research impact metrics
  • Map research trends and identify emerging topics in a field
  • Build bibliometric profiles for authors, institutions, or journals

Consider alternatives when:

  • You need full-text article content (use publisher APIs or Unpaywall)
  • You need grant funding data (use NIH Reporter or NSF Award Search)
  • You need patent literature (use Google Patents or Lens.org)

Quick Start

```shell
pip install pyalex pandas matplotlib
```

```python
from pyalex import Works, Authors, Institutions
import pandas as pd

# Search for recent AI papers
works = (
    Works()
    .search("large language models")
    .filter(publication_year=2024)
    .sort(cited_by_count="desc")
    .get(per_page=25)
)

for work in works[:5]:
    print(f"{work['title']}")
    print(f"  Citations: {work['cited_by_count']}, "
          f"Year: {work['publication_year']}")
    print()
```

Core Concepts

API Entities

| Entity       | Count | Description                           |
|--------------|-------|---------------------------------------|
| Works        | 240M+ | Scholarly papers, books, preprints    |
| Authors      | 90M+  | Researcher profiles with affiliations |
| Institutions | 100K+ | Universities, research institutes     |
| Sources      | 250K+ | Journals, repositories, conferences   |
| Topics       | 65K+  | Hierarchical research topics          |
| Funders      | 30K+  | Research funding organizations        |
| Publishers   | 10K+  | Publishing companies                  |
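Each entity type in the table above is served from its own REST endpoint under https://api.openalex.org; pyalex wraps these, but the raw URLs are handy for debugging queries in a browser. A minimal sketch (the `entity_url` helper is illustrative, not part of any library; the author ID is the example ID used in the OpenAlex docs):

```python
# Map entity types from the table above to their REST endpoints.
BASE_URL = "https://api.openalex.org"

ENDPOINTS = {
    "works": "works",
    "authors": "authors",
    "institutions": "institutions",
    "sources": "sources",
    "topics": "topics",
    "funders": "funders",
    "publishers": "publishers",
}

def entity_url(entity, openalex_id=None):
    """Build the REST URL for an entity listing or a single record."""
    path = ENDPOINTS[entity.lower()]
    if openalex_id:
        return f"{BASE_URL}/{path}/{openalex_id}"
    return f"{BASE_URL}/{path}"

print(entity_url("works"))                   # list endpoint
print(entity_url("authors", "A5023888391"))  # single record
```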

Advanced Querying and Analysis

```python
from pyalex import Works
import pandas as pd

def research_landscape(query, years=(2020, 2024), top_n=100):
    """Map the research landscape for a topic."""
    all_works = []
    for year in range(years[0], years[1] + 1):
        works = (
            Works()
            .search(query)
            .filter(publication_year=year, type="article")
            .sort(cited_by_count="desc")
            .get(per_page=top_n)
        )
        for w in works:
            # Guard against null locations and empty author/institution lists
            source = (w.get("primary_location") or {}).get("source") or {}
            authorships = w.get("authorships") or []
            institutions = (authorships[0].get("institutions") or [{}]) if authorships else [{}]
            all_works.append({
                "title": w["title"],
                "year": w["publication_year"],
                "citations": w["cited_by_count"],
                "doi": w.get("doi") or "",
                "source": source.get("display_name", "Unknown"),
                "institution": institutions[0].get("display_name", "Unknown"),
                "open_access": (w.get("open_access") or {}).get("is_oa", False),
                "topics": [t["display_name"] for t in (w.get("topics") or [])[:3]],
            })

    df = pd.DataFrame(all_works)

    # Summary statistics
    print(f"Total papers: {len(df)}")
    print("\nTop venues:")
    print(df["source"].value_counts().head(10))
    print("\nTop institutions:")
    print(df["institution"].value_counts().head(10))
    print(f"\nOpen access rate: {df['open_access'].mean():.1%}")
    print("\nPapers by year:")
    print(df.groupby("year")["citations"].agg(["count", "sum", "mean"]))
    return df

df = research_landscape("CRISPR gene editing")
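The DataFrame returned above also feeds directly into matplotlib (installed in the Quick Start). A minimal plotting sketch; `plot_yearly_output` is an illustrative helper, not part of pyalex:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs in scripts
import matplotlib.pyplot as plt

def plot_yearly_output(df, out_path="papers_per_year.png"):
    """Bar-chart paper counts per year from a research_landscape() DataFrame."""
    counts = df.groupby("year").size()
    ax = counts.plot(kind="bar", title="Papers per year")
    ax.set_xlabel("Year")
    ax.set_ylabel("Papers")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
    return counts

# Usage: plot_yearly_output(research_landscape("CRISPR gene editing"))
```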

Citation Network Analysis

```python
from pyalex import Works
import networkx as nx

def build_citation_network(seed_doi, depth=2, max_refs=20):
    """Build a citation network from a seed paper."""
    G = nx.DiGraph()
    visited = set()
    queue = [(seed_doi, 0)]

    while queue:
        doi, level = queue.pop(0)
        if doi in visited or level >= depth:
            continue
        visited.add(doi)

        works = Works().filter(doi=doi).get()
        if not works:
            continue
        work = works[0]

        node_id = work["id"]
        G.add_node(node_id,
                   title=work["title"],
                   citations=work["cited_by_count"],
                   year=work["publication_year"])

        # Get references (papers this work cites)
        refs = work.get("referenced_works", [])[:max_refs]
        for ref_id in refs:
            G.add_edge(node_id, ref_id)
            if ref_id not in visited:
                # Resolve the referenced work's DOI so it can be expanded
                ref_work = Works()[ref_id]
                ref_doi = ref_work.get("doi") or ""
                if ref_doi:
                    queue.append((ref_doi, level + 1))

    print(f"Network: {G.number_of_nodes()} papers, "
          f"{G.number_of_edges()} citations")

    # Find the most cited papers within the network
    in_degrees = sorted(G.in_degree(), key=lambda x: -x[1])[:10]
    for node, deg in in_degrees:
        title = G.nodes[node].get("title", "Unknown")
        print(f"  [{deg} citations] {title}")
    return G

network = build_citation_network("https://doi.org/10.1038/s41586-020-2649-2")
```
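Once built, the graph works with any networkx algorithm. Because citation edges point from citing to cited work, PageRank mass flows toward foundational papers; `rank_influential` below is an illustrative helper, not part of this skill's API:

```python
import networkx as nx

def rank_influential(G, top_n=5):
    """Rank nodes in a citation DiGraph by PageRank score."""
    scores = nx.pagerank(G, alpha=0.85)
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
    for node, score in ranked:
        print(f"  {score:.4f}  {G.nodes[node].get('title', node)}")
    return ranked

# Usage: rank_influential(build_citation_network("https://doi.org/..."))
```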

Configuration

| Parameter   | Description                         | Default           |
|-------------|-------------------------------------|-------------------|
| per_page    | Results per API page                | 25 (max 200)      |
| email       | Polite pool email for faster access | Recommended       |
| api_key     | Premium API key for higher limits   | Optional          |
| sort        | Sort field and direction            | "relevance_score" |
| select      | Fields to return (reduces payload)  | All fields        |
| sample_size | Random sample size                  | None              |
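The email and API-key parameters above map onto pyalex's module-level config object. A configuration sketch; the retry attribute names follow pyalex's documented config, but verify them against your installed version:

```python
import pyalex

# Polite pool: identify yourself for faster, more consistent responses
pyalex.config.email = "[email protected]"

# Optional premium key for higher limits
# pyalex.config.api_key = "your-key"

# Retry transient failures automatically (verify attribute names
# against your installed pyalex version)
pyalex.config.max_retries = 3
pyalex.config.retry_backoff_factor = 0.5
pyalex.config.retry_http_codes = [429, 500, 503]
```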

Best Practices

  1. Set your email for the polite pool — Configure pyalex.config.email = "[email protected]" to join the polite pool, which gets faster and more consistent response times than anonymous access. This is free and requires only a valid email address.

  2. Use select to reduce response size — If you only need titles and citation counts, use .select(["title", "cited_by_count"]) to reduce payload size by 90%. This significantly speeds up large queries and reduces memory usage.

  3. Paginate through large result sets — The API returns at most 200 results per page. Use cursor-based pagination (paginate(per_page=200) in pyalex, or the cursor parameter in raw API calls) for queries that match thousands of papers rather than relying on the first page of results.

  4. Filter by type for cleaner results — Add .filter(type="article") to exclude book chapters, datasets, and other non-article types from search results. Mixed types skew bibliometric statistics and citation counts.

  5. Cache API responses for iterative analysis — Save query results to local JSON or parquet files before doing analysis. This avoids re-fetching the same data when adjusting your analysis code and respects the API's rate limits.
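Practice 5 can be as simple as a JSON file keyed by query. `cached_query` below is an illustrative helper (not part of pyalex) that fetches once and replays the local copy on later runs:

```python
import json
import os

def cached_query(cache_path, fetch):
    """Run fetch() once and cache its JSON-serializable result locally."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    results = fetch()
    with open(cache_path, "w") as f:
        json.dump(results, f)
    return results

# Usage (fetch is only called on a cache miss):
# works = cached_query(
#     "llm_2024.json",
#     lambda: Works().search("large language models")
#                    .filter(publication_year=2024).get(per_page=200),
# )
```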

Common Issues

Search returns unexpected results — OpenAlex search uses full-text matching, which can return tangentially related papers. Use .filter() with specific fields (title, abstract, topic) in addition to .search() for more precise results. Combine multiple filters to narrow results effectively.

Author disambiguation issues — Authors with common names may be merged or split in OpenAlex. Verify author identity by checking their ORCID (author.orcid), affiliated institutions, and publication history. Use the author's OpenAlex ID rather than name-based search for reliable profiling.

Rate limiting on large queries — The API allows roughly 10 requests per second. Add delays between pagination calls with time.sleep(0.1), and set a polite-pool email for more consistent response times. For very large analyses (100K+ papers), consider using OpenAlex's full data snapshot download instead of the API.
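When a long job does hit transient 429/5xx errors mid-pagination, an exponential backoff wrapper keeps it alive. `with_backoff` is an illustrative helper, not a pyalex feature (pyalex also has its own retry config):

```python
import time

def with_backoff(call, retries=4, base_delay=0.5):
    """Retry call() with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Usage:
# page = with_backoff(lambda: Works().search("microbiome").get(per_page=200))
```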
