Advanced Matchms Platform
Enterprise-grade skill for mass spectrometry analysis and processing. Includes structured workflows, validation checks, and reusable patterns for scientific work.
Perform mass spectrometry data processing and spectral matching using matchms, a Python library for importing, processing, and comparing mass spectral data. This skill covers spectral similarity scoring, molecular networking, library matching, and metadata harmonization for metabolomics research.
When to Use This Skill
Choose Advanced Matchms Platform when you need to:
- Compare experimental mass spectra against spectral libraries (GNPS, MassBank, HMDB)
- Build molecular networks from untargeted metabolomics MS/MS data
- Clean, normalize, and harmonize spectral metadata from diverse sources
- Calculate spectral similarity scores using cosine, modified cosine, or other metrics
Consider alternatives when:
- You need LC-MS feature detection and alignment (use MZmine or XCMS first)
- You need structural elucidation from NMR data (use NMR-specific tools)
- You need protein identification from MS data (use MaxQuant or MSFragger)
Quick Start
# Install matchms with optional dependencies
pip install matchms[chemistry]
import numpy as np
from matchms import Spectrum
from matchms.importing import load_from_mgf
from matchms.similarity import CosineGreedy

# Load spectra from MGF file
spectra = list(load_from_mgf("experimental_spectra.mgf"))
print(f"Loaded {len(spectra)} spectra")

# Create a reference spectrum (Spectrum expects float NumPy arrays, not lists)
reference = Spectrum(
    mz=np.array([100.0, 150.0, 200.0, 250.0], dtype="float"),
    intensities=np.array([0.5, 1.0, 0.3, 0.8], dtype="float"),
    metadata={"compound_name": "Caffeine", "precursor_mz": 195.08},
)

# Calculate similarity; the result has "score" and "matches" fields
cosine = CosineGreedy(tolerance=0.3)
score = cosine.pair(spectra[0], reference)
print(f"Cosine similarity: {float(score['score']):.3f}")
print(f"Matched peaks: {int(score['matches'])}")
Core Concepts
Similarity Metrics
| Metric | Description | Best For |
|---|---|---|
| CosineGreedy | Standard cosine similarity with greedy peak matching | General library matching |
| ModifiedCosine | Accounts for precursor mass differences | Analog searching |
| CosineHungarian | Optimal peak matching via Hungarian algorithm | High-accuracy comparisons |
| FingerprintSimilarity | Molecular fingerprint comparison | Structural similarity |
| NeutralLossesCosine | Compares neutral loss patterns | Structural class matching |
| MetadataMatch | Exact metadata field comparison | Filtering by metadata |
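To make the difference between the cosine and modified cosine rows concrete, here is a minimal pure-Python sketch of greedy peak matching. It is a simplified illustration, not matchms's implementation: the function name `greedy_cosine`, the `(mz, intensity)` tuple representation, and the `shift` parameter are all assumptions made for this example.

```python
import math


def greedy_cosine(peaks_a, peaks_b, tolerance=0.3, shift=0.0):
    """Simplified greedy cosine score between two peak lists.

    Each peak list holds (mz, intensity) tuples. With shift=0.0 only direct
    m/z matches count. Passing the precursor m/z difference (a minus b) as
    `shift` also accepts shifted matches, in the spirit of a modified cosine.
    Returns (score, number_of_matched_peaks).
    """
    # Collect every candidate pair that agrees within the tolerance,
    # either directly or after accounting for the precursor shift.
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            diff = mz_a - mz_b
            if abs(diff) <= tolerance or abs(diff - shift) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Greedy assignment: strongest intensity products first, each peak used once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total, matches = 0.0, 0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += product
        matches += 1

    norm_a = math.sqrt(sum(inten ** 2 for _, inten in peaks_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return total / (norm_a * norm_b), matches


# Hypothetical analog pair: every fragment of `analog` is 14 Da heavier.
parent = [(100.0, 1.0), (150.0, 0.5)]
analog = [(114.0, 1.0), (164.0, 0.5)]
plain = greedy_cosine(analog, parent, tolerance=0.3)
shifted = greedy_cosine(analog, parent, tolerance=0.3, shift=14.0)
```

In this toy case the plain score finds no matched peaks, while the shifted score matches both fragments, which is why the table recommends ModifiedCosine for analog searching.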
Spectral Processing Pipeline
import numpy as np
from matchms.importing import load_from_mgf
from matchms.filtering import (
    default_filters,
    normalize_intensities,
    select_by_mz,
    select_by_relative_intensity,
    add_precursor_mz,
)
from matchms.similarity import CosineGreedy

def process_spectrum(spectrum):
    """Apply standard processing filters to a spectrum."""
    spectrum = default_filters(spectrum)
    spectrum = add_precursor_mz(spectrum)
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_mz(spectrum, mz_from=10.0, mz_to=1000.0)
    spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
    return spectrum

# Load and process spectra; filters can return None for unusable spectra
raw_spectra = list(load_from_mgf("raw_data.mgf"))
processed = [process_spectrum(s) for s in raw_spectra]
processed = [s for s in processed if s is not None]
print(f"Processed {len(processed)}/{len(raw_spectra)} spectra")

# Library matching: drop None results after processing, not before
library = [process_spectrum(s) for s in load_from_mgf("reference_library.mgf")]
library = [s for s in library if s is not None]

cosine = CosineGreedy(tolerance=0.3)
# matrix() returns a structured array with "score" and "matches" fields
scores = cosine.matrix(processed, library)
score_values = scores["score"]

# Find best matches
for i, spectrum in enumerate(processed):
    best_idx = int(np.argmax(score_values[i]))
    best_score = score_values[i][best_idx]
    if best_score > 0.7:
        match_name = library[best_idx].get("compound_name", "Unknown")
        print(f"Spectrum {i}: {match_name} (score: {best_score:.3f})")
Molecular Networking
import networkx as nx
from matchms.similarity import ModifiedCosine

def build_molecular_network(spectra, min_score=0.6, min_matches=4):
    """Build a molecular network from MS/MS spectra."""
    mod_cosine = ModifiedCosine(tolerance=0.3)

    # Calculate all-vs-all similarity; the result is a structured array
    # with "score" and "matches" fields
    scores_matrix = mod_cosine.matrix(spectra, spectra)
    score_values = scores_matrix["score"]
    match_counts = scores_matrix["matches"]

    # Build network: one node per spectrum, one edge per pair above both thresholds
    G = nx.Graph()
    for i in range(len(spectra)):
        G.add_node(i, **spectra[i].metadata)
        for j in range(i + 1, len(spectra)):
            if score_values[i, j] >= min_score and match_counts[i, j] >= min_matches:
                G.add_edge(i, j, weight=float(score_values[i, j]))

    # Find connected components (molecular families)
    families = list(nx.connected_components(G))
    print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    print(f"Molecular families: {len(families)}")
    return G, families

# `processed` is the list of filtered spectra from the processing pipeline
network, families = build_molecular_network(processed)
Configuration
| Parameter | Description | Default |
|---|---|---|
| tolerance | m/z tolerance for peak matching (Da) | 0.3 |
| min_score | Minimum similarity score threshold | 0.7 |
| min_matched_peaks | Minimum number of matched peaks | 4 |
| mz_range | m/z range for analysis | [10, 1000] |
| intensity_threshold | Minimum relative intensity to keep | 0.01 |
| fingerprint_type | Molecular fingerprint algorithm | "daylight" |
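One way to keep these parameters together in a script is a small validated container. The sketch below is only an illustration: `MatchingConfig` is a hypothetical class, not part of matchms, and the validation rules are assumptions derived from the table's descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingConfig:
    """Hypothetical container for the parameters in the table above."""

    tolerance: float = 0.3              # m/z tolerance in Da
    min_score: float = 0.7              # similarity threshold, 0 to 1
    min_matched_peaks: int = 4
    mz_range: tuple = (10.0, 1000.0)    # (lower, upper) m/z bounds
    intensity_threshold: float = 0.01   # relative to the base peak
    fingerprint_type: str = "daylight"

    def __post_init__(self):
        # Reject values that cannot describe a meaningful matching run.
        if self.tolerance <= 0:
            raise ValueError("tolerance must be positive")
        if not 0.0 <= self.min_score <= 1.0:
            raise ValueError("min_score must lie between 0 and 1")
        if self.min_matched_peaks < 0:
            raise ValueError("min_matched_peaks must not be negative")
        if self.mz_range[0] >= self.mz_range[1]:
            raise ValueError("mz_range must be (lower, upper) with lower < upper")
        if not 0.0 <= self.intensity_threshold <= 1.0:
            raise ValueError("intensity_threshold must lie between 0 and 1")


defaults = MatchingConfig()
# Example override for high-resolution data; 0.02 Da is an illustrative value.
high_res = MatchingConfig(tolerance=0.02)
```

Freezing the dataclass keeps a run's settings from changing midway, which makes results easier to reproduce.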
Best Practices
- Always normalize intensities before comparison: Raw intensity scales vary between instruments and runs. Apply normalize_intensities() to scale the base peak to 1.0 before calculating similarity scores. Without normalization, intensity differences dominate over fragmentation pattern differences.
- Set appropriate m/z tolerance for your instrument: Use 0.01-0.05 Da for high-resolution data (Orbitrap, Q-TOF) and 0.3-0.5 Da for unit-resolution data (ion traps). Too-large tolerance causes false peak matches; too-small misses real matches due to calibration drift.
- Filter low-intensity noise peaks: Remove peaks below 1% relative intensity before matching. Noise peaks inflate the total peak count and reduce cosine similarity scores by introducing spurious mismatches.
- Use Modified Cosine for analog searching: Standard cosine fails when comparing spectra with different precursor masses. Modified Cosine shifts fragment peaks by the precursor mass difference, enabling discovery of structural analogs with mass shifts from modifications.
- Validate top hits manually: Automated spectral matching produces false positives, especially at similarity scores between 0.6-0.8. Always inspect mirror plots of the query vs. reference spectrum for your top identifications before reporting them as confirmed matches.
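The normalization and noise-filtering advice above can be sketched in plain Python. This is a conceptual illustration of base-peak scaling followed by a relative-intensity cut, not the code behind matchms's filters; `normalize_and_filter` and the `(mz, intensity)` tuples are names assumed for this example.

```python
def normalize_and_filter(peaks, min_relative_intensity=0.01):
    """Scale intensities so the base peak is 1.0, then drop weak peaks.

    `peaks` is a list of (mz, intensity) tuples with raw intensities.
    """
    if not peaks:
        return []
    base_peak = max(intensity for _, intensity in peaks)
    if base_peak <= 0:
        return []
    scaled = [(mz, intensity / base_peak) for mz, intensity in peaks]
    # Keep only peaks at or above the relative-intensity threshold.
    return [(mz, rel) for mz, rel in scaled if rel >= min_relative_intensity]


# Hypothetical raw spectrum: the 150.0 peak is 0.5% of the base peak.
raw_peaks = [(100.0, 2000.0), (150.0, 10.0), (200.0, 500.0)]
cleaned = normalize_and_filter(raw_peaks)
```

Here the peak at m/z 150.0 falls below the 1% threshold and is removed, while the remaining peaks end up on a 0 to 1 scale that is comparable across instruments.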
Common Issues
All similarity scores are zero — This usually means spectra have no overlapping peaks within the tolerance window. Check that both query and reference spectra have been processed with the same filters and that the m/z tolerance is appropriate for your mass resolution. Also verify that spectra actually contain peaks after filtering.
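A quick way to test the "no overlapping peaks" explanation is to count how many query peaks have any reference peak within the tolerance. The helper below is a hypothetical diagnostic written for this purpose; it works on plain lists of m/z values and is not a matchms function.

```python
from bisect import bisect_left


def count_overlapping_peaks(mz_a, mz_b, tolerance):
    """Count peaks in mz_a that have at least one peak in mz_b within tolerance."""
    sorted_b = sorted(mz_b)
    count = 0
    for mz in mz_a:
        # First reference peak at or above the lower edge of the window.
        idx = bisect_left(sorted_b, mz - tolerance)
        if idx < len(sorted_b) and sorted_b[idx] <= mz + tolerance:
            count += 1
    return count


# Illustrative m/z lists with small calibration offsets between them.
query_mz = [100.02, 150.0, 200.0]
reference_mz = [100.0, 150.2, 300.0]
tight = count_overlapping_peaks(query_mz, reference_mz, tolerance=0.01)
loose = count_overlapping_peaks(query_mz, reference_mz, tolerance=0.3)
```

If the count is zero at your chosen tolerance but positive at a wider one, the tolerance (or the calibration) is the likely culprit rather than the spectra themselves.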
Memory errors with large spectral libraries — Computing an all-vs-all similarity matrix for 10,000+ spectra requires significant memory. Use cosine.matrix() with sparse output or calculate similarities in batches. For molecular networking, consider pre-filtering spectra by precursor mass ranges before computing the full matrix.
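The precursor pre-filtering idea can be sketched as a generator that yields only the spectrum pairs worth scoring, so the full all-vs-all matrix is never materialized. The function name `candidate_pairs` and the `max_difference` window are assumptions for this illustration; choose the window to suit the mass shifts you expect between analogs.

```python
def candidate_pairs(precursor_mzs, max_difference):
    """Yield index pairs (i, j), i < j, whose precursor m/z values differ
    by at most max_difference.
    """
    # Sort indices by precursor m/z so close masses become neighbours.
    order = sorted(range(len(precursor_mzs)), key=lambda k: precursor_mzs[k])
    for pos in range(len(order)):
        i = order[pos]
        for nxt in range(pos + 1, len(order)):
            j = order[nxt]
            if precursor_mzs[j] - precursor_mzs[i] > max_difference:
                break  # all later spectra are even further away
            yield (min(i, j), max(i, j))


# Hypothetical precursor m/z values for four spectra.
precursors = [195.08, 500.2, 209.09, 196.0]
pairs = sorted(candidate_pairs(precursors, max_difference=20.0))
```

Each yielded pair can then be scored individually, for example with the `pair()` method of a similarity object, instead of allocating an N x N score array.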
Metadata fields missing after import — Different MGF file formats use different field names (e.g., PEPMASS vs PRECURSOR_MZ). Use default_filters() which harmonizes common field name variants, and check spectrum.metadata.keys() to see what fields were actually imported from your specific file format.
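To see what field-name harmonization means in practice, the sketch below maps a few variant keys onto canonical names. The alias table is a small hypothetical example, not the mapping matchms uses internally, and `harmonize_keys` is a name invented for this illustration.

```python
# Hypothetical alias table: variant field name -> canonical field name.
KEY_ALIASES = {
    "pepmass": "precursor_mz",
    "precursormz": "precursor_mz",
    "name": "compound_name",
}


def harmonize_keys(metadata):
    """Return a copy of `metadata` with lowercased, canonical field names.

    If two variant keys map to the same canonical name, the first one wins.
    """
    harmonized = {}
    for key, value in metadata.items():
        normalized = key.strip().lower()
        canonical = KEY_ALIASES.get(normalized, normalized)
        harmonized.setdefault(canonical, value)
    return harmonized


imported = {"PEPMASS": 195.08, "NAME": "Caffeine", "Charge": 1}
clean = harmonize_keys(imported)
```

Comparing the keys before and after a step like this makes it easy to spot fields that were imported under an unexpected name.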