PathML Kit

Process and analyze whole-slide pathology images using PathML, a Python toolkit for computational pathology. This skill covers slide preprocessing, tissue detection, tile extraction, stain normalization, feature extraction, and machine learning workflows for digital pathology.

When to Use This Skill

Choose PathML Kit when you need to:

Preprocess whole-slide images (WSI) for machine learning pipelines
Extract and normalize tissue tiles from H&E or IHC-stained slides
Build computational pathology workflows with consistent preprocessing
Apply pre-trained pathology models or train custom classifiers on slide data

Consider alternatives when:

You need radiology image analysis (use MONAI or TorchIO)
You need basic image processing without pathology context (use scikit-image or OpenCV)
You need manual slide annotation without automation (use QuPath)

Quick Start


# Install PathML
pip install pathml


from pathml.core import SlideData
from pathml.preprocessing import Pipeline, BoxBlur, TissueDetectionHE

# Load a whole-slide image
slide = SlideData("tumor_sample.svs", name="tumor_001")

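# slide.slide is PathML's backend wrapper. For OpenSlide-readable files,
# slide.slide.slide is assumed to be the underlying openslide.OpenSlide handle.
print(f"Slide shape at level 0: {slide.slide.get_image_shape(level=0)}")
print(f"Objective power: {slide.slide.slide.properties.get('openslide.objective-power')}")
print(f"Num levels: {slide.slide.level_count}")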

# Create preprocessing pipeline
pipeline = Pipeline([
    BoxBlur(kernel_size=15),
    TissueDetectionHE(
        mask_name="tissue",
        min_region_size=5000,
        threshold=30
    )
])

# Run pipeline on the slide
slide.run(pipeline, tile_size=256, level=0)
print(f"Extracted {len(slide.tiles)} tissue tiles")

Core Concepts

Pipeline Components

Component	Purpose	Parameters
`TissueDetectionHE`	Detect tissue in H&E slides	`threshold`, `min_region_size`
`StainNormalizationHE`	Normalize H&E staining variations	`target`, `stain_estimation_method`
`BoxBlur`	Box (mean filter) smoothing	`kernel_size`
`BinaryThreshold`	Binary mask creation	`threshold`
`MorphOpen`/`MorphClose`	Morphological operations	`kernel_size`
`TileExtractor`	Extract tiles from regions	`tile_size`, `stride`
`ForegroundDetection`	General foreground segmentation	`min_region_size`
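
To make two of the building blocks in the table concrete, here is a minimal NumPy sketch of a box blur (a uniform mean filter, as opposed to a Gaussian-weighted one) followed by a binary threshold. This is not PathML's implementation. The function names `box_blur` and `binary_threshold` are hypothetical, and edge padding is an assumption made for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def box_blur(gray, kernel_size=3):
    """Mean filter over a kernel_size x kernel_size window (hypothetical helper).

    Borders are handled by repeating edge pixels, which is an assumption here.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # One (kernel_size, kernel_size) window per output pixel
    windows = sliding_window_view(padded, (kernel_size, kernel_size))
    return windows.mean(axis=(-2, -1))


def binary_threshold(gray, threshold):
    """Return a 0/1 uint8 mask of pixels strictly above threshold."""
    return (gray > threshold).astype(np.uint8)


# A single bright pixel is spread evenly over its 3x3 neighbourhood
img = np.zeros((5, 5))
img[2, 2] = 9.0
blurred = box_blur(img, kernel_size=3)
mask = binary_threshold(blurred, threshold=0.5)
print(blurred)
print(mask)
```

Blurring before thresholding suppresses isolated noisy pixels, which is why the Quick Start pipeline places `BoxBlur` ahead of tissue detection.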

Complete Pathology Workflow


from pathml.core import SlideData
from pathml.preprocessing import Pipeline
from pathml.preprocessing.transforms import (
    TissueDetectionHE,
    StainNormalizationHE,
    MedianBlur
)
import numpy as np

def process_slide(slide_path, tile_size=256, target_mpp=0.5):
    """Full pathology preprocessing pipeline."""
    slide = SlideData(slide_path)

    # Choose a pyramid level for the target resolution.
    # Assumes the backend exposes `mpp` (otherwise level 0 is used) and that
    # each level downsamples by a factor of 2, which is not true for every scanner.
    level = 0
    mpp = getattr(slide.slide, "mpp", None)
    if mpp:
        scale = target_mpp / mpp
        level = int(np.log2(scale)) if scale > 1 else 0
        # Never request a level the slide does not have
        level = min(level, slide.slide.level_count - 1)

    # Build preprocessing pipeline
    pipeline = Pipeline([
        MedianBlur(kernel_size=5),
        TissueDetectionHE(
            mask_name="tissue",
            min_region_size=10000,
            threshold=25
        ),
        # Normalizes toward the transform's built-in reference stain matrix;
        # fitting to a specific reference slide is not shown here.
        StainNormalizationHE(
            target="normalize",
            stain_estimation_method="macenko"
        )
    ])

    # Run pipeline
    slide.run(
        pipeline,
        tile_size=tile_size,
        level=level,
        overwrite_existing=True
    )

    # Keep only tiles with sufficient tissue content
    tissue_tiles = []
    for tile in slide.tiles:
        mask = tile.masks.get("tissue", None)
        if mask is not None:
            # count_nonzero works whether the mask is stored as 0/1 or 0/255
            tissue_fraction = np.count_nonzero(mask) / mask.size
            if tissue_fraction > 0.5:  # >50% tissue
                tissue_tiles.append(tile)

    print(f"Total tiles: {len(slide.tiles)}")
    print(f"Tissue tiles (>50%): {len(tissue_tiles)}")
    return slide, tissue_tiles

slide, tiles = process_slide("specimen.svs")
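
The level choice in `process_slide` relies on a log2 calculation, which implicitly assumes every pyramid level downsamples by exactly 2. A hedged alternative, sketched here with a hypothetical helper `choose_level`, takes the per-level downsample factors explicitly (OpenSlide exposes these as `level_downsamples`) and picks the coarsest level that is still at least as fine as the target resolution. The numeric inputs are made-up illustration values, not measurements from a real slide.

```python
def choose_level(base_mpp, target_mpp, level_downsamples):
    """Return the coarsest level whose resolution is still >= the target.

    base_mpp: microns per pixel at level 0.
    target_mpp: desired microns per pixel.
    level_downsamples: per-level downsample factors, ascending, starting at 1.
    """
    if base_mpp <= 0 or target_mpp <= 0:
        raise ValueError("mpp values must be positive")
    wanted = target_mpp / base_mpp  # downsample factor we can afford
    best = 0
    for idx, downsample in enumerate(level_downsamples):
        # Small tolerance because stored factors are often not exact (e.g. 4.0001)
        if downsample <= wanted * (1 + 1e-3):
            best = idx
    return best


# A slide scanned at 0.25 mpp with 4x steps between levels
print(choose_level(0.25, 0.5, [1.0, 4.0, 16.0]))
print(choose_level(0.25, 1.0, [1.0, 4.0001, 16.0]))
```

With 4x steps, a 0.5 mpp target stays at level 0 because level 1 would already be coarser than requested; any remaining resize to the exact target resolution would have to happen per tile.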

Feature Extraction for ML


import torch
import torchvision.models as models
import torchvision.transforms as T
import numpy as np

def extract_tile_features(tiles, model_name="resnet50"):
    """Extract deep learning features from tissue tiles."""
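    # Load a pre-trained model. `weights="DEFAULT"` replaces the deprecated
    # `pretrained=True` flag (requires torchvision >= 0.13).
    model = getattr(models, model_name)(weights="DEFAULT")
    # Drop the final layer; for ResNet-style models this leaves the pooled features
    model = torch.nn.Sequential(*list(model.children())[:-1])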
    model.eval()

    transform = T.Compose([
        T.ToPILImage(),
        T.Resize(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225])
    ])

    features = []
    with torch.no_grad():
        for tile in tiles:
            img = tile.image
            tensor = transform(img).unsqueeze(0)
            feat = model(tensor).squeeze().numpy()
            features.append(feat)

    feature_matrix = np.stack(features)
    print(f"Feature matrix: {feature_matrix.shape}")
    return feature_matrix

# Extract features for downstream classification
features = extract_tile_features(tiles)

Configuration

Parameter	Description	Default
`tile_size`	Tile dimensions in pixels	`256`
`level`	Pyramid level for processing	`0` (highest resolution)
`stride`	Tile extraction stride	Equal to `tile_size`
`tissue_threshold`	Minimum tissue fraction per tile	`0.5`
`stain_method`	Normalization method (Macenko, Vahadane)	`"macenko"`
`target_mpp`	Target microns per pixel	`0.5`
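
How `tile_size`, `stride`, and `tissue_threshold` interact can be sketched without any slide file. The example below is an illustration under stated assumptions, not PathML's tiling code: the helpers `tile_grid`, `tissue_fraction`, and `select_tiles` are hypothetical, the tissue mask is assumed to be stored at a lower resolution than level 0 by an integer factor `mask_downsample`, and a tile is kept when its tissue fraction is at least the threshold.

```python
import numpy as np


def tile_grid(width, height, tile_size=256, stride=None):
    """Top-left (x, y) level-0 coordinates of all tiles that fit fully inside the slide."""
    stride = tile_size if stride is None else stride
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, stride)
        for x in range(0, width - tile_size + 1, stride)
    ]


def tissue_fraction(mask, x, y, tile_size, mask_downsample=1):
    """Fraction of nonzero mask pixels under a level-0 tile."""
    x0, y0 = x // mask_downsample, y // mask_downsample
    # Cover at least one mask pixel even when the tile is smaller than one mask pixel
    x1 = max(x0 + 1, (x + tile_size) // mask_downsample)
    y1 = max(y0 + 1, (y + tile_size) // mask_downsample)
    region = mask[y0:y1, x0:x1]
    return np.count_nonzero(region) / region.size if region.size else 0.0


def select_tiles(mask, width, height, tile_size=256, stride=None,
                 tissue_threshold=0.5, mask_downsample=1):
    """Coordinates of tiles whose tissue fraction meets the threshold."""
    return [
        (x, y)
        for (x, y) in tile_grid(width, height, tile_size, stride)
        if tissue_fraction(mask, x, y, tile_size, mask_downsample) >= tissue_threshold
    ]


# Synthetic 1024 x 512 slide whose left half is tissue; mask stored at 1/32 scale
mask = np.zeros((512 // 32, 1024 // 32), dtype=np.uint8)
mask[:, :16] = 1
kept = select_tiles(mask, width=1024, height=512, mask_downsample=32)
print(kept)
```

Returning coordinates rather than pixel data keeps memory use low and preserves each tile's position for later heatmaps. A `stride` smaller than `tile_size` produces overlapping tiles and a correspondingly larger dataset.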

Best Practices

Start at a lower resolution for tissue detection: Run tissue detection at a lower-resolution pyramid level (for example level 2 or 3, if the slide has them) to save memory and time, then map the tissue mask back to level 0 to extract tiles at full resolution. Tissue boundaries rarely need pixel-level precision.
Normalize staining before feature extraction: H&E staining intensity varies significantly between labs and even between slides from the same lab. Apply Macenko or Vahadane stain normalization against a consistent reference before training ML models. Without normalization, models can pick up staining variation instead of morphology.
Filter tiles by tissue content: Many tiles near slide edges contain mostly background (white space). Set a minimum tissue fraction threshold (typically 50-70%) to exclude low-information tiles. This reduces dataset size and keeps the model from spending capacity on classifying background.
Use multiple instance learning for slide-level labels: Most clinical labels apply to the entire slide, not to individual tiles. Use multiple instance learning (MIL) approaches such as attention-based pooling to aggregate tile-level features into slide-level predictions.
Store tile coordinates for spatial analysis: When extracting tiles, record the (x, y) coordinates of each tile in the original slide. This enables spatial analysis, heatmap generation, and mapping predictions back to specific regions of the slide.
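
The attention-based pooling mentioned under multiple instance learning can be sketched in a few lines of NumPy. This is a simplified illustration, not a trained model: the function `attention_pool` is hypothetical, the projection parameters `V` and `w` are random here and would be learned (for example in PyTorch) in a real workflow, and the feature matrix is synthetic.

```python
import numpy as np


def attention_pool(features, V, w):
    """Aggregate tile features into one slide-level vector.

    features: (n_tiles, d) tile feature matrix.
    V: (d, h) projection matrix, w: (h,) scoring vector (both learned in practice).
    Returns the pooled (d,) vector and the per-tile attention weights.
    """
    scores = np.tanh(features @ V) @ w           # one scalar score per tile
    scores = scores - scores.max()               # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()            # attention weights sum to 1
    return weights @ features, weights


rng = np.random.default_rng(0)
tile_features = rng.normal(size=(6, 8))          # 6 tiles, 8-dim features (synthetic)
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)

slide_vector, attn = attention_pool(tile_features, V, w)
print(slide_vector.shape, attn.round(3))
```

Because the attention weights are per tile, pairing them with the stored (x, y) tile coordinates gives a simple way to render a heatmap of which regions drove the slide-level prediction.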

Common Issues

Slide loading fails with "unsupported format": PathML relies on OpenSlide to read common WSI formats (.svs, .ndpi, .mrxs). Install the OpenSlide system library with brew install openslide (macOS) or apt-get install openslide-tools (Debian/Ubuntu Linux). If OpenSlide cannot read the format at all, convert the slide to TIFF first using Bio-Formats.

Memory errors on large whole-slide images: WSIs at full resolution can be 100,000+ pixels wide, so never load the entire slide at once. Use PathML's tile-based processing pipeline, which loads and processes one tile at a time, or work at a lower pyramid level for initial analysis.

Stain normalization changes tissue appearance dramatically: If the normalized output looks wrong (inverted colors, purple tissue turning blue), verify that the reference slide has typical H&E staining. The normalization target must be a representative, high-quality slide. Also check that the input slide is actually H&E and not IHC or a special stain.
