Umap Learn Dynamic
Battle-tested skill for fast dimensionality reduction with UMAP. Includes structured workflows, validation checks, and reusable patterns for scientific work.
Perform dimensionality reduction and data visualization using UMAP (Uniform Manifold Approximation and Projection), a manifold learning technique for embedding high-dimensional data into 2D or 3D spaces. This skill covers parameter tuning, supervised and semi-supervised UMAP, integration with clustering algorithms, and large-dataset optimization.
When to Use This Skill
Choose Umap Learn Dynamic when you need to:
- Visualize high-dimensional datasets (gene expression, embeddings, images) in 2D/3D
- Preserve both local and global data structure in low-dimensional representations
- Use dimensionality reduction as a preprocessing step before clustering
- Create interactive or publication-quality scatter plots of complex data
Consider alternatives when:
- You need linear dimensionality reduction with interpretable components (use PCA)
- You need guaranteed distance preservation (use MDS)
- You need probabilistic embeddings with uncertainty (use parametric t-SNE or PHATE)
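Before reaching for UMAP, a quick linear baseline can tell you whether the data even needs nonlinear reduction. A minimal sketch, using scikit-learn's PCA on the digits data as an illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# How much variance do two linear components capture?
# A low number suggests nonlinear methods like UMAP may reveal more structure.
pca = PCA(n_components=2).fit(X)
print(f"Explained variance (2 PCs): {pca.explained_variance_ratio_.sum():.1%}")
```

If two principal components already explain most of the variance, PCA's interpretable axes may be preferable to a UMAP embedding.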
Quick Start
```bash
pip install umap-learn scikit-learn matplotlib numpy
```
```python
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load high-dimensional data
digits = load_digits()
X, y = digits.data, digits.target
print(f"Data shape: {X.shape}")  # (1797, 64)

# Fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(X)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(embedding[:, 0], embedding[:, 1], c=y,
                     cmap='Spectral', s=5, alpha=0.7)
plt.colorbar(scatter, label='Digit')
ax.set_title('UMAP of Handwritten Digits')
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
fig.tight_layout()
fig.savefig('umap_digits.pdf')
print(f"Embedding shape: {embedding.shape}")
```
Core Concepts
Key Parameters
| Parameter | Effect | Range | Default |
|---|---|---|---|
| `n_neighbors` | Local vs global structure balance | 5-200 | 15 |
| `min_dist` | Cluster tightness in embedding | 0.0-1.0 | 0.1 |
| `n_components` | Output dimensions | 2-100 | 2 |
| `metric` | Distance function for input space | See below | "euclidean" |
| `spread` | Scale of embedded points | 0.5-3.0 | 1.0 |
| `n_epochs` | Optimization iterations | 200-1000 | Auto |
| `learning_rate` | SGD step size | 0.1-10 | 1.0 |
UMAP + Clustering Pipeline
```python
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import HDBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate data
X, y_true = make_blobs(n_samples=2000, n_features=50, centers=8, random_state=42)
X = StandardScaler().fit_transform(X)

# Step 1: UMAP for dimensionality reduction (higher dims for clustering)
reducer_cluster = umap.UMAP(
    n_neighbors=30,
    n_components=10,  # Higher dims preserve more structure
    min_dist=0.0,
    metric='euclidean',
    random_state=42
)
X_reduced = reducer_cluster.fit_transform(X)

# Step 2: Cluster in UMAP space
clusterer = HDBSCAN(min_cluster_size=30, min_samples=10)
labels = clusterer.fit_predict(X_reduced)
print(f"Found {len(set(labels)) - (1 if -1 in labels else 0)} clusters")
print(f"Noise points: {(labels == -1).sum()}")

# Step 3: UMAP for visualization (2D)
reducer_viz = umap.UMAP(
    n_neighbors=15,
    n_components=2,
    min_dist=0.1,
    random_state=42
)
X_2d = reducer_viz.fit_transform(X)

# Plot clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True labels
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=y_true, cmap='tab10', s=5, alpha=0.7)
axes[0].set_title('True Labels')

# HDBSCAN clusters
scatter = axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', s=5, alpha=0.7)
axes[1].set_title('HDBSCAN Clusters (on UMAP embedding)')
fig.tight_layout()
fig.savefig('umap_clustering.pdf')
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `n_neighbors` | Number of nearest neighbors for graph | 15 |
| `min_dist` | Minimum distance between embedded points | 0.1 |
| `n_components` | Embedding dimensionality | 2 |
| `metric` | Input space distance (euclidean, cosine, manhattan, correlation) | "euclidean" |
| `target_metric` | Supervised UMAP target metric | "categorical" |
| `random_state` | Seed for reproducibility | None |
| `low_memory` | Use less memory (slower) | True |
| `n_jobs` | Parallel workers for neighbor search | -1 |
Best Practices
- **Set `n_neighbors` based on your analysis goal** — Small values (5-15) emphasize local clusters and fine structure. Large values (50-200) preserve more global topology. For visualization, 15-30 works well. For pre-clustering, use 30-50 to capture broader neighborhood relationships.
- **Use `min_dist=0.0` when UMAP precedes clustering** — Zero minimum distance allows UMAP to pack cluster members tightly, which helps downstream clustering algorithms find clear boundaries. For visualization where you want to see point density, use `min_dist=0.1-0.3` to spread points out.
- **Scale features before running UMAP** — UMAP uses distance metrics that are sensitive to feature scale. Apply `StandardScaler` or similar normalization first. Without scaling, high-magnitude features dominate the distance computation and distort the embedding.
- **Use `n_components > 2` when UMAP is a preprocessing step** — 2D is for visualization, not for analysis. When UMAP feeds into clustering or classification, use 10-50 components to preserve more information. Only reduce to 2D for the final visualization step.
- **Fix `random_state` for reproducible results** — UMAP uses stochastic optimization, so different runs produce different embeddings. Set `random_state=42` (or any fixed value) for reproducibility. However, remember that reproducible doesn't mean the embedding is uniquely correct — multiple valid embeddings exist.
Common Issues
**UMAP embedding looks like a single blob with no structure** — The `n_neighbors` parameter is too large relative to the dataset, causing UMAP to oversmooth. Reduce `n_neighbors` (try 5-10). Also check that your data actually has meaningful clusters — PCA can verify this quickly.
**Clusters in UMAP look clear but aren't real** — UMAP can create visual clusters from continuous data that has no natural groupings. Don't interpret UMAP cluster shapes as evidence of distinct groups. Validate clusters with statistical tests (silhouette score, gap statistic) on the original high-dimensional data, not the embedding.
**UMAP is slow on large datasets (>100K samples)** — Use `low_memory=True` (default in recent versions) and install `pynndescent` for faster nearest neighbor computation. For very large datasets, subsample to 50K-100K points, fit UMAP, then use `reducer.transform(remaining_data)` to project the rest.