
Umap Learn Dynamic

Battle-tested skill for fast dimensionality reduction with UMAP. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Perform dimensionality reduction and data visualization using UMAP (Uniform Manifold Approximation and Projection), a manifold learning technique for embedding high-dimensional data into 2D or 3D spaces. This skill covers parameter tuning, supervised and semi-supervised UMAP, integration with clustering algorithms, and large-dataset optimization.

When to Use This Skill

Choose Umap Learn Dynamic when you need to:

  • Visualize high-dimensional datasets (gene expression, embeddings, images) in 2D/3D
  • Preserve both local and global data structure in low-dimensional representations
  • Use dimensionality reduction as a preprocessing step before clustering
  • Create interactive or publication-quality scatter plots of complex data

Consider alternatives when:

  • You need linear dimensionality reduction with interpretable components (use PCA)
  • You need guaranteed distance preservation (use MDS)
  • You need probabilistic embeddings with uncertainty (use parametric t-SNE or PHATE)

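When deciding between UMAP and a linear method, a quick PCA pass is a cheap sanity check: it shows how much variance a linear projection already captures before you commit to a nonlinear embedding. A minimal sketch on the digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

# Linear, interpretable 2D projection as a baseline
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (1797, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```

If the explained variance of the first few components is already high and the PCA scatter separates your groups, PCA may be all you need.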
Quick Start

```bash
pip install umap-learn scikit-learn matplotlib numpy
```

```python
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load high-dimensional data
digits = load_digits()
X, y = digits.data, digits.target
print(f"Data shape: {X.shape}")  # (1797, 64)

# Fit UMAP
reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(X)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(embedding[:, 0], embedding[:, 1], c=y,
                     cmap='Spectral', s=5, alpha=0.7)
plt.colorbar(scatter, label='Digit')
ax.set_title('UMAP of Handwritten Digits')
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
fig.tight_layout()
fig.savefig('umap_digits.pdf')
print(f"Embedding shape: {embedding.shape}")
```

Core Concepts

Key Parameters

| Parameter | Effect | Range | Default |
|---|---|---|---|
| `n_neighbors` | Local vs global structure balance | 5-200 | 15 |
| `min_dist` | Cluster tightness in embedding | 0.0-1.0 | 0.1 |
| `n_components` | Output dimensions | 2-100 | 2 |
| `metric` | Distance function for input space | See Configuration | `"euclidean"` |
| `spread` | Scale of embedded points | 0.5-3.0 | 1.0 |
| `n_epochs` | Optimization iterations | 200-1000 | Auto |
| `learning_rate` | SGD step size | 0.1-10 | 1.0 |

UMAP + Clustering Pipeline

```python
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import HDBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate data
X, y_true = make_blobs(n_samples=2000, n_features=50, centers=8, random_state=42)
X = StandardScaler().fit_transform(X)

# Step 1: UMAP for dimensionality reduction (higher dims for clustering)
reducer_cluster = umap.UMAP(
    n_neighbors=30,
    n_components=10,  # Higher dims preserve more structure
    min_dist=0.0,
    metric='euclidean',
    random_state=42
)
X_reduced = reducer_cluster.fit_transform(X)

# Step 2: Cluster in UMAP space
clusterer = HDBSCAN(min_cluster_size=30, min_samples=10)
labels = clusterer.fit_predict(X_reduced)
print(f"Found {len(set(labels)) - (1 if -1 in labels else 0)} clusters")
print(f"Noise points: {(labels == -1).sum()}")

# Step 3: UMAP for visualization (2D)
reducer_viz = umap.UMAP(
    n_neighbors=15,
    n_components=2,
    min_dist=0.1,
    random_state=42
)
X_2d = reducer_viz.fit_transform(X)

# Plot clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True labels
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=y_true, cmap='tab10', s=5, alpha=0.7)
axes[0].set_title('True Labels')

# HDBSCAN clusters
scatter = axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10',
                          s=5, alpha=0.7)
axes[1].set_title('HDBSCAN Clusters (on UMAP embedding)')
fig.tight_layout()
fig.savefig('umap_clustering.pdf')
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `n_neighbors` | Number of nearest neighbors for graph | 15 |
| `min_dist` | Minimum distance between embedded points | 0.1 |
| `n_components` | Embedding dimensionality | 2 |
| `metric` | Input space distance (`euclidean`, `cosine`, `manhattan`, `correlation`) | `"euclidean"` |
| `target_metric` | Supervised UMAP target metric | `"categorical"` |
| `random_state` | Seed for reproducibility | None |
| `low_memory` | Use less memory (slower) | True |
| `n_jobs` | Parallel workers for neighbor search | -1 |

Best Practices

  1. Set n_neighbors based on your analysis goal — Small values (5-15) emphasize local clusters and fine structure. Large values (50-200) preserve more global topology. For visualization, 15-30 works well. For pre-clustering, use 30-50 to capture broader neighborhood relationships.

  2. Use min_dist=0.0 when UMAP precedes clustering — Zero minimum distance allows UMAP to pack cluster members tightly, which helps downstream clustering algorithms find clear boundaries. For visualization where you want to see point density, use min_dist=0.1-0.3 to spread points out.

  3. Scale features before running UMAP — UMAP uses distance metrics that are sensitive to feature scale. Apply StandardScaler or similar normalization first. Without scaling, high-magnitude features dominate the distance computation and distort the embedding.

  4. Use n_components > 2 when UMAP is a preprocessing step — 2D is for visualization, not for analysis. When UMAP feeds into clustering or classification, use 10-50 components to preserve more information. Only reduce to 2D for the final visualization step.

  5. Fix random_state for reproducible results — UMAP uses stochastic optimization, so different runs produce different embeddings. Set random_state=42 (or any fixed value) for reproducibility. However, remember that reproducible doesn't mean the embedding is uniquely correct — multiple valid embeddings exist.

Common Issues

UMAP embedding looks like a single blob with no structure — The n_neighbors parameter is too large relative to the dataset, causing UMAP to oversmooth. Reduce n_neighbors (try 5-10). Also check that your data actually has meaningful clusters — PCA can verify this quickly.

Clusters in UMAP look clear but aren't real — UMAP can create visual clusters from continuous data that has no natural groupings. Don't interpret UMAP cluster shapes as evidence of distinct groups. Validate clusters with statistical tests (silhouette score, gap statistic) on the original high-dimensional data, not the embedding.

UMAP is slow on large datasets (>100K samples) — Use low_memory=True (default in recent versions) and install pynndescent for faster nearest neighbor computation. For very large datasets, subsample to 50K-100K points, fit UMAP, then use reducer.transform(remaining_data) to project the rest.
