Comprehensive GEO Module

A comprehensive skill for accessing NCBI gene expression data. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

A scientific computing skill for accessing NCBI's Gene Expression Omnibus (GEO) — the public repository for gene expression, genomics, and other high-throughput functional genomics data. Comprehensive GEO Module helps you search experiments, download processed datasets, and retrieve sample metadata for reanalysis.

When to Use This Skill

Choose Comprehensive GEO Module when:

  • Searching for gene expression datasets by disease, tissue, or platform
  • Downloading processed expression matrices (Series Matrix files)
  • Retrieving experiment metadata and sample annotations
  • Building meta-analyses from multiple GEO datasets

Consider alternatives when:

  • You need raw sequencing data (use NCBI SRA or ENA)
  • You need single-cell data specifically (use CellxGene)
  • You need microarray probe-level analysis (use Bioconductor's GEOquery)
  • You need clinical trial data (use ClinicalTrials.gov)

Quick Start

claude "Search GEO for breast cancer RNA-seq datasets and download one"

from Bio import Entrez
import GEOparse

# NCBI requires a contact email; replace this placeholder with your own address
Entrez.email = "you@example.com"

# Search GEO DataSets
handle = Entrez.esearch(
    db="gds",
    term="breast cancer[title] AND RNA-seq[description] AND Homo sapiens[organism]",
    retmax=10
)
results = Entrez.read(handle)
handle.close()
print(f"Found {results['Count']} datasets")

# Download a GEO Series
gse = GEOparse.get_GEO(geo="GSE96058", destdir="./geo_data")

# Explore metadata
print(f"Title: {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
print(f"Platforms: {list(gse.gpls.keys())}")

# Preview the first three samples
for gsm_name, gsm in list(gse.gsms.items())[:3]:
    print(f"\n{gsm_name}: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata.get('source_name_ch1', ['N/A'])[0]}")

Core Concepts

GEO Data Structure

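| Entity   | Prefix | Description                          |
|----------|--------|--------------------------------------|
| Platform | GPL    | Array/sequencing platform definition |
| Sample   | GSM    | Individual sample data and metadata  |
| Series   | GSE    | Collection of related samples        |
| Dataset  | GDS    | Curated, normalized dataset          |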
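
The prefix scheme in the table above lends itself to a small validation helper. The following is a minimal sketch, not part of GEOparse or Biopython; the names `GEO_ENTITY_TYPES` and `classify_accession` are hypothetical and introduced only for illustration.

```python
import re

# Prefix -> entity type, following the table above.
GEO_ENTITY_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "Dataset",
}

# A GEO accession is one of the four prefixes followed by digits.
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO entity type for an accession such as 'GSE96058'.

    Raises ValueError when the string is not a known prefix plus digits.
    """
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a GEO accession: {accession!r}")
    return GEO_ENTITY_TYPES[match.group(1)]
```

Running a check like this before calling `GEOparse.get_GEO` catches typos in accession lists before any download starts.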

Downloading Expression Data

import GEOparse
import pandas as pd

# Download the Series record (GEOparse fetches the SOFT family file)
gse = GEOparse.get_GEO(geo="GSE12345", destdir="./data")

# Extract expression table (one row per probe/feature, one column per sample)
expression = gse.pivot_samples("VALUE")
print(f"Expression matrix: {expression.shape}")

# Extract sample metadata
metadata = []
for name, gsm in gse.gsms.items():
    meta = {
        "sample": name,
        "title": gsm.metadata["title"][0],
        "source": gsm.metadata.get("source_name_ch1", [""])[0],
    }
    # Extract characteristics ("key: value" strings; entries without ": " keep an empty value)
    for char in gsm.metadata.get("characteristics_ch1", []):
        key, val = char.split(": ", 1) if ": " in char else (char, "")
        meta[key.strip()] = val.strip()
    metadata.append(meta)

meta_df = pd.DataFrame(metadata)
print(f"Metadata: {meta_df.shape}")

Batch Download

import GEOparse


def download_geo_series(gse_ids, destdir="./geo_data"):
    """Download multiple GEO Series"""
    datasets = {}
    for gse_id in gse_ids:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)
            datasets[gse_id] = {
                "title": gse.metadata["title"][0],
                "samples": len(gse.gsms),
                "platform": list(gse.gpls.keys()),
                "data": gse
            }
            print(f"OK {gse_id}: {datasets[gse_id]['title'][:60]}...")
        except Exception as e:
            # Report the failure and continue with the remaining series
            print(f"FAIL {gse_id}: {e}")
    return datasets


studies = download_geo_series(["GSE96058", "GSE81538", "GSE62944"])

Configuration

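| Parameter    | Description                                                   | Default            |
|--------------|---------------------------------------------------------------|--------------------|
| Entrez.email | Contact email sent with NCBI API requests                     | None (must be set) |
| destdir      | Download directory                                            | Current directory  |
| how          | Download format (full, brief)                                 | full               |
| annotate_gpl | Download the annotated GPL (gene annotations) when available  | False              |
| silent       | Suppress download messages                                    | False              |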

Best Practices

  1. Use GDS (DataSets) for curated data, GSE (Series) for submitter-supplied data. GDS datasets are curated and normalized by NCBI, which makes them ideal for quick reanalysis. GSE series contain data as submitted by researchers, with less standardization but far broader coverage.

  2. Check the platform before cross-study comparisons. Different platforms (Affymetrix, Illumina, RNA-seq) produce non-comparable expression values. Only merge datasets from the same platform, or use cross-platform normalization methods.

  3. Extract sample metadata systematically. GEO metadata is stored as free-text in characteristics_ch1 fields. Parse these consistently using key-value splitting and standardize labels across datasets for clean annotations.

  4. Cache downloads locally. GEO files can be large. Download to a persistent directory and check for existing files before re-downloading. GEOparse supports local file loading to avoid redundant downloads.

  5. Verify data quality before analysis. Check for batch effects (PCA colored by processing date), outlier samples (correlation heatmaps), and expression distribution consistency. Remove or flag problematic samples before statistical analysis.
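
The platform check in practice 2 can be sketched as a grouping step that runs before any merge. This is an illustrative helper, not a GEOparse feature: `group_by_platform` and `directly_mergeable` are hypothetical names, and the input is assumed to be a mapping from series ID to the platform IDs returned by `list(gse.gpls.keys())`.

```python
from collections import defaultdict


def group_by_platform(study_platforms):
    """Group study IDs by the exact set of platforms each study uses.

    study_platforms maps a series ID to a list of GPL IDs.
    """
    groups = defaultdict(list)
    for study_id, platforms in study_platforms.items():
        key = tuple(sorted(platforms))
        groups[key].append(study_id)
    return dict(groups)


def directly_mergeable(study_platforms):
    """Return {platform: [studies]} for single-platform studies that share
    a platform with at least one other study."""
    return {
        key[0]: sorted(ids)
        for key, ids in group_by_platform(study_platforms).items()
        if len(key) == 1 and len(ids) > 1
    }


# Placeholder study labels, not real accessions.
example = {
    "study_a": ["GPL570"],
    "study_b": ["GPL570"],
    "study_c": ["GPL10558"],
    "study_d": ["GPL570", "GPL96"],
}
print(directly_mergeable(example))
```

Studies that fall outside these groups, such as the multi-platform `study_d` here, are the ones that would need cross-platform normalization before they can be combined.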

Common Issues

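GEOparse fails to download large Series. Very large GSE entries (>1000 samples) may time out. Download the SOFT family file (GSExxxxx_family.soft.gz) directly from the GEO FTP site with wget, then load it with GEOparse.get_GEO(filepath="./GSExxxxx_family.soft.gz"). Keep the GSE prefix in the file name, because GEOparse infers the record type from it when only a file path is given.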
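
For the direct download, the file location can be derived from the accession. The sketch below assumes the usual GEO FTP layout, in which series are grouped into directories named by replacing the last three digits of the accession with "nnn"; `geo_series_url` is a hypothetical helper, not a GEOparse function.

```python
def geo_series_url(gse_id, kind="soft"):
    """Build the download URL for a series file on the GEO FTP site.

    kind="soft"   -> SOFT family file
    kind="matrix" -> Series Matrix file (single-platform naming only;
                     multi-platform series use one matrix file per GPL)
    """
    gse_id = gse_id.strip().upper()
    digits = gse_id[3:]
    if not gse_id.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"Not a GSE accession: {gse_id!r}")

    # GSE96058 -> GSE96nnn, GSE123 -> GSEnnn
    stub = "GSE" + digits[:-3] + "nnn"
    base = f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{gse_id}"

    if kind == "soft":
        return f"{base}/soft/{gse_id}_family.soft.gz"
    if kind == "matrix":
        return f"{base}/matrix/{gse_id}_series_matrix.txt.gz"
    raise ValueError(f"Unknown kind: {kind!r}")


print(geo_series_url("GSE96058"))
```

The printed URL can be passed to wget or any other downloader, which makes it easier to resume an interrupted transfer than re-running the download from Python.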

Expression matrix has probe IDs instead of gene symbols. The raw matrix uses platform-specific probe IDs. Use the GPL annotation to map probes to gene symbols. For multi-probe genes, take the mean or maximum expression across probes.
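
The probe-to-gene collapse described above can be sketched with plain dictionaries. This is a simplified illustration using the standard library only; in practice the same step is usually a pandas groupby on the annotated matrix. `collapse_probes` and the probe and gene labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expression, probe_to_gene, how="mean"):
    """Collapse probe-level rows to gene-level rows.

    expression:    {probe_id: [value per sample, ...]}
    probe_to_gene: {probe_id: gene_symbol}, e.g. built from a GPL table
    how:           "mean" or "max" across probes of the same gene
    """
    reducers = {"mean": mean, "max": max}
    if how not in reducers:
        raise ValueError(f"how must be one of {sorted(reducers)}")
    reduce_fn = reducers[how]

    rows_by_gene = defaultdict(list)
    for probe_id, values in expression.items():
        gene = probe_to_gene.get(probe_id)
        if gene:  # probes without an annotation are dropped
            rows_by_gene[gene].append(values)

    # zip(*rows) walks the samples column by column
    return {
        gene: [reduce_fn(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


# Placeholder probe IDs and gene labels, two samples per probe.
expression = {
    "probe_1": [1.0, 2.0],
    "probe_2": [3.0, 6.0],
    "probe_3": [5.0, 5.0],
    "probe_4": [9.0, 9.0],  # no annotation, will be dropped
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_level = collapse_probes(expression, probe_to_gene)
```

Whether mean or maximum is the better choice depends on the platform and the downstream analysis, so it is worth recording which one was used.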

Metadata fields are inconsistent across samples. Different submitters use different field names and formats. Standardize metadata programmatically: map "tissue: liver" and "organ: liver" and "source: human liver" to a consistent "tissue" column.
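
The standardization described above can be sketched as two lookup tables, one for field names and one for values. The tables below are assumptions covering only the example in the text; `KEY_SYNONYMS`, `VALUE_SYNONYMS`, and `standardize_characteristics` are hypothetical names and would need extending for real studies.

```python
# Hypothetical synonym tables; extend them for the fields in your own studies.
KEY_SYNONYMS = {
    "tissue": "tissue",
    "organ": "tissue",
    "source": "tissue",
}
VALUE_SYNONYMS = {
    "tissue": {"human liver": "liver"},
}


def standardize_characteristics(characteristics):
    """Turn 'key: value' strings into a dict with canonical keys and values.

    Entries without a ': ' separator are skipped. Unknown keys and values
    pass through lowercased.
    """
    record = {}
    for entry in characteristics:
        if ": " not in entry:
            continue
        key, value = entry.split(": ", 1)
        key = key.strip().lower()
        value = value.strip().lower()
        canonical_key = KEY_SYNONYMS.get(key, key)
        record[canonical_key] = VALUE_SYNONYMS.get(canonical_key, {}).get(value, value)
    return record


for raw in (["tissue: liver"], ["organ: liver"], ["source: human liver"]):
    print(standardize_characteristics(raw))
```

Keeping the mappings in data rather than in branching code makes it easy to review them and to log any key that was not recognized.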
