Advanced Matchms Platform

Enterprise-grade skill for mass spectrometry analysis and processing. Includes structured workflows, validation checks, and reusable patterns for scientific work.

Skill | Cliptics | scientific | v1.0.0 | MIT

Perform mass spectrometry data processing and spectral matching using matchms, a Python library for importing, processing, and comparing mass spectral data. This skill covers spectral similarity scoring, molecular networking, library matching, and metadata harmonization for metabolomics research.

When to Use This Skill

Choose Advanced Matchms Platform when you need to:

  • Compare experimental mass spectra against spectral libraries (GNPS, MassBank, HMDB)
  • Build molecular networks from untargeted metabolomics MS/MS data
  • Clean, normalize, and harmonize spectral metadata from diverse sources
  • Calculate spectral similarity scores using cosine, modified cosine, or other metrics

Consider alternatives when:

  • You need LC-MS feature detection and alignment (use MZmine or XCMS first)
  • You need structural elucidation from NMR data (use NMR-specific tools)
  • You need protein identification from MS data (use MaxQuant or MSFragger)

Quick Start

```bash
# Install matchms with optional dependencies
pip install "matchms[chemistry]"
```
```python
import numpy as np
from matchms import Spectrum
from matchms.importing import load_from_mgf
from matchms.similarity import CosineGreedy

# Load spectra from MGF file
spectra = list(load_from_mgf("experimental_spectra.mgf"))
print(f"Loaded {len(spectra)} spectra")

# Create a reference spectrum
# (Spectrum expects NumPy float arrays with m/z values sorted ascending, not lists)
reference = Spectrum(
    mz=np.array([100.0, 150.0, 200.0, 250.0]),
    intensities=np.array([0.5, 1.0, 0.3, 0.8]),
    metadata={"compound_name": "Caffeine", "precursor_mz": 195.08},
)

# Calculate similarity; pair() returns a structured value with
# "score" and "matches" fields
cosine = CosineGreedy(tolerance=0.3)
score = cosine.pair(spectra[0], reference)
print(f"Cosine similarity: {score['score']:.3f}")
print(f"Matched peaks: {score['matches']}")
```

Core Concepts

Similarity Metrics

| Metric | Description | Best For |
|---|---|---|
| CosineGreedy | Standard cosine similarity with greedy peak matching | General library matching |
| ModifiedCosine | Accounts for precursor mass differences | Analog searching |
| CosineHungarian | Optimal peak matching via Hungarian algorithm | High-accuracy comparisons |
| FingerprintSimilarity | Molecular fingerprint comparison | Structural similarity |
| NeutralLossesCosine | Compares neutral loss patterns | Structural class matching |
| MetadataMatch | Exact metadata field comparison | Filtering by metadata |
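
To make the difference between CosineGreedy and ModifiedCosine concrete, here is a minimal pure-Python sketch of greedy peak matching. It is an illustration only and not how matchms implements these scores: the helper names (`candidate_pairs`, `greedy_score`, `cosine_greedy`, `modified_cosine`), the toy peak lists, and the precursor values are all assumptions made for this example.

```python
import math


def candidate_pairs(peaks_a, peaks_b, tolerance, shift=0.0):
    """List (intensity product, i, j) for peaks whose m/z agree within tolerance.

    Peaks of B are compared at mz_b + shift; shift=0.0 gives plain matching.
    """
    pairs = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            if abs(mz_a - (mz_b + shift)) <= tolerance:
                pairs.append((int_a * int_b, i, j))
    return pairs


def greedy_score(peaks_a, peaks_b, pairs):
    """Greedily take the strongest pairs, using each peak at most once."""
    used_a, used_b = set(), set()
    total, n_matches = 0.0, 0
    for product, i, j in sorted(pairs, reverse=True):
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        total += product
        n_matches += 1
    norm_a = math.sqrt(sum(x * x for _, x in peaks_a))
    norm_b = math.sqrt(sum(x * x for _, x in peaks_b))
    score = total / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return score, n_matches


def cosine_greedy(peaks_a, peaks_b, tolerance=0.3):
    """Cosine score using only unshifted peak matches."""
    return greedy_score(peaks_a, peaks_b, candidate_pairs(peaks_a, peaks_b, tolerance))


def modified_cosine(peaks_a, peaks_b, precursor_a, precursor_b, tolerance=0.3):
    """Cosine score that also allows matches shifted by the precursor difference."""
    pairs = candidate_pairs(peaks_a, peaks_b, tolerance)
    pairs += candidate_pairs(peaks_a, peaks_b, tolerance, shift=precursor_a - precursor_b)
    return greedy_score(peaks_a, peaks_b, pairs)


# Toy data (hypothetical): the analog carries a +14.016 Da modification
# on two of its three fragments.
parent = [(100.0, 1.0), (150.0, 0.5), (200.0, 0.8)]
analog = [(100.0, 1.0), (164.016, 0.5), (214.016, 0.8)]

plain = cosine_greedy(parent, analog)
shifted = modified_cosine(parent, analog, precursor_a=250.0, precursor_b=264.016)
print("cosine:", plain)
print("modified cosine:", shifted)
```

In this toy pair the plain cosine sees only one shared peak, while the shifted matching recovers all three fragments, which is why the modified score suits analog searching.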

Spectral Processing Pipeline

```python
import numpy as np
from matchms.importing import load_from_mgf
from matchms.filtering import (
    default_filters,
    normalize_intensities,
    select_by_mz,
    select_by_relative_intensity,
    add_precursor_mz,
)
from matchms.similarity import CosineGreedy


def process_spectrum(spectrum):
    """Apply standard processing filters to a spectrum."""
    spectrum = default_filters(spectrum)
    spectrum = add_precursor_mz(spectrum)
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_mz(spectrum, mz_from=10.0, mz_to=1000.0)
    spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
    return spectrum


# Load and process spectra (some matchms filters return None for unusable spectra)
raw_spectra = list(load_from_mgf("raw_data.mgf"))
processed = [process_spectrum(s) for s in raw_spectra]
processed = [s for s in processed if s is not None]
print(f"Processed {len(processed)}/{len(raw_spectra)} spectra")

# Library matching: apply the same filters to the library, then drop None results
library = [process_spectrum(s) for s in load_from_mgf("reference_library.mgf")]
library = [s for s in library if s is not None]

cosine = CosineGreedy(tolerance=0.3)
# matrix() returns a structured array with "score" and "matches" fields
scores = cosine.matrix(processed, library)
score_values = scores["score"]

# Find best matches
for i, spectrum in enumerate(processed):
    best_idx = int(np.argmax(score_values[i]))
    best_score = score_values[i][best_idx]
    if best_score > 0.7:
        match_name = library[best_idx].get("compound_name", "Unknown")
        print(f"Spectrum {i}: {match_name} (score: {best_score:.3f})")
```
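
The matching loop above accepts a hit on score alone, yet a high score that rests on one or two shared peaks is weak evidence. The following sketch shows one way to require a minimum number of matched peaks as well. It works on plain Python lists, and `best_library_hit` is a hypothetical helper written for this example, not a matchms function; the thresholds reuse the defaults listed under Configuration.

```python
def best_library_hit(scores, matches, min_score=0.7, min_matches=4):
    """Return (library index, score) of the best acceptable hit, or None.

    scores and matches are parallel sequences for one query spectrum:
    the similarity score and the matched-peak count per library entry.
    """
    best = None
    for idx, (score, n_matched) in enumerate(zip(scores, matches)):
        # Reject hits that are too weak or supported by too few peaks
        if score < min_score or n_matched < min_matches:
            continue
        if best is None or score > best[1]:
            best = (idx, score)
    return best


# Toy values (hypothetical): entry 0 scores highest but shares only 2 peaks
query_scores = [0.95, 0.82, 0.40]
query_matches = [2, 6, 9]

hit = best_library_hit(query_scores, query_matches)
print(hit)
```

With a structured score array, the two input rows would correspond to the "score" and "matches" fields for one query.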

Molecular Networking

```python
import networkx as nx
from matchms.similarity import ModifiedCosine


def build_molecular_network(spectra, min_score=0.6, min_matches=4):
    """Build a molecular network from MS/MS spectra."""
    # ModifiedCosine needs "precursor_mz" in each spectrum's metadata
    mod_cosine = ModifiedCosine(tolerance=0.3)

    # Calculate all-vs-all similarity; the result is a structured array
    # with "score" and "matches" fields
    scores_matrix = mod_cosine.matrix(spectra, spectra)
    score_values = scores_matrix["score"]
    match_counts = scores_matrix["matches"]

    # Build network: one node per spectrum, one edge per qualifying pair
    G = nx.Graph()
    for i in range(len(spectra)):
        G.add_node(i, **spectra[i].metadata)
        for j in range(i + 1, len(spectra)):
            if score_values[i, j] >= min_score and match_counts[i, j] >= min_matches:
                G.add_edge(i, j, weight=float(score_values[i, j]))

    # Find connected components (molecular families)
    families = list(nx.connected_components(G))
    print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    print(f"Molecular families: {len(families)}")
    return G, families


network, families = build_molecular_network(processed)
```
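
For readers who want to see what "connected components" means here without networkx, this is a small stdlib-only sketch that thresholds a score matrix into edges and groups spectra with a union-find structure. The function names (`edges_from_scores`, `molecular_families`) and the toy matrices are assumptions for illustration, not part of matchms or networkx.

```python
def edges_from_scores(scores, matches, min_score=0.6, min_matches=4):
    """Return (i, j) pairs above both thresholds, upper triangle only."""
    n = len(scores)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if scores[i][j] >= min_score and matches[i][j] >= min_matches
    ]


def molecular_families(n_spectra, edges):
    """Group node indices into connected components using union-find."""
    parent = list(range(n_spectra))

    def find(x):
        # Follow parent links to the root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for node in range(n_spectra):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=len, reverse=True)


# Toy all-vs-all results for four spectra (hypothetical values)
toy_scores = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.65, 0.0],
    [0.1, 0.65, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
toy_matches = [
    [9, 6, 1, 0],
    [6, 9, 3, 0],
    [1, 3, 9, 2],
    [0, 0, 2, 9],
]

strict = molecular_families(4, edges_from_scores(toy_scores, toy_matches))
relaxed = molecular_families(4, edges_from_scores(toy_scores, toy_matches, min_matches=3))
print(strict)
print(relaxed)
```

The pair (1, 2) scores 0.65 but shares only three peaks, so it joins the first family only when `min_matches` is relaxed to 3. This shows how strongly the matched-peak threshold shapes the resulting families.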

Configuration

| Parameter | Description | Default |
|---|---|---|
| tolerance | m/z tolerance for peak matching (Da) | 0.3 |
| min_score | Minimum similarity score threshold | 0.7 |
| min_matched_peaks | Minimum number of matched peaks | 4 |
| mz_range | m/z range for analysis | [10, 1000] |
| intensity_threshold | Minimum relative intensity to keep | 0.01 |
| fingerprint_type | Molecular fingerprint algorithm | "daylight" |
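
One possible way to carry these parameters through a workflow is a small validated settings object. The sketch below is an assumption about how a project might organize them; `MatchingConfig` is a hypothetical class, not something matchms provides, and the validation rules are reasonable guesses rather than library requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingConfig:
    """Hypothetical container for the parameters in the table above."""

    tolerance: float = 0.3              # m/z tolerance in Da
    min_score: float = 0.7              # minimum similarity score
    min_matched_peaks: int = 4          # minimum number of matched peaks
    mz_range: tuple = (10.0, 1000.0)    # (low, high) m/z bounds
    intensity_threshold: float = 0.01   # minimum relative intensity
    fingerprint_type: str = "daylight"  # fingerprint algorithm name

    def __post_init__(self):
        # Fail early on values that cannot produce meaningful matches
        if self.tolerance <= 0:
            raise ValueError("tolerance must be positive")
        if not 0.0 <= self.min_score <= 1.0:
            raise ValueError("min_score must lie in [0, 1]")
        if self.min_matched_peaks < 1:
            raise ValueError("min_matched_peaks must be at least 1")
        low, high = self.mz_range
        if low >= high:
            raise ValueError("mz_range must be (low, high) with low < high")
        if not 0.0 <= self.intensity_threshold < 1.0:
            raise ValueError("intensity_threshold must lie in [0, 1)")


defaults = MatchingConfig()
high_res = MatchingConfig(tolerance=0.02)  # e.g. for high-resolution data
print(defaults)
print(high_res.tolerance)
```

Freezing the dataclass keeps one run's settings from being changed halfway through a pipeline, which makes results easier to reproduce.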

Best Practices

  1. Always normalize intensities before comparison: Raw intensity scales vary between instruments and runs. Apply normalize_intensities() to scale the base peak to 1.0 before calculating similarity scores. Without a common scale, intensity thresholds and filters behave differently from spectrum to spectrum.

  2. Set appropriate m/z tolerance for your instrument: Use 0.01-0.05 Da for high-resolution data (Orbitrap, Q-TOF) and 0.3-0.5 Da for unit-resolution data (ion traps). A tolerance that is too large causes false peak matches; one that is too small misses real matches because of calibration drift.

  3. Filter low-intensity noise peaks: Remove peaks below 1% relative intensity before matching. Noise peaks inflate the total peak count and reduce cosine similarity scores by introducing spurious mismatches.

  4. Use Modified Cosine for analog searching: Standard cosine performs poorly when comparing spectra with different precursor masses. Modified Cosine also allows fragment peaks to match after shifting them by the precursor mass difference, enabling discovery of structural analogs whose masses differ because of modifications.

  5. Validate top hits manually: Automated spectral matching produces false positives, especially at similarity scores between 0.6 and 0.8. Always inspect mirror plots of the query vs. reference spectrum for your top identifications before reporting them as confirmed matches.
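
The effect of the m/z tolerance in practice 2 can be seen with a few numbers. The sketch below is a simplified, stdlib-only illustration under stated assumptions: `count_matches` is a hypothetical helper, it does not enforce one-to-one peak assignment the way a real scorer would, and the m/z values are invented for the example.

```python
def count_matches(mz_a, mz_b, tolerance):
    """Count peaks in mz_a that have at least one peak in mz_b within tolerance (Da).

    Simplification: a peak in mz_b may be counted for several peaks in mz_a.
    """
    return sum(any(abs(a - b) <= tolerance for b in mz_b) for a in mz_a)


# Invented high-resolution peak lists: the first and third peaks agree to
# within 1 mDa, while the middle peaks are different fragments ~24 mDa apart.
query_mz = [110.0713, 138.0662, 195.0877]
reference_mz = [110.0718, 138.0900, 195.0870]

for tol in (0.0001, 0.005, 0.3):
    print(f"tolerance {tol} Da: {count_matches(query_mz, reference_mz, tol)} matched")
```

At 0.0001 Da the small calibration offsets hide every true match, at 0.005 Da the two genuine matches are found, and at 0.3 Da the unrelated middle peaks are also counted as a match.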

Common Issues

All similarity scores are zero — This usually means spectra have no overlapping peaks within the tolerance window. Check that both query and reference spectra have been processed with the same filters and that the m/z tolerance is appropriate for your mass resolution. Also verify that spectra actually contain peaks after filtering.

Memory errors with large spectral libraries — Computing an all-vs-all similarity matrix for 10,000+ spectra requires significant memory. Use cosine.matrix() with sparse output or calculate similarities in batches. For molecular networking, consider pre-filtering spectra by precursor mass ranges before computing the full matrix.
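
Batching can be sketched as follows: score the queries against one slice of the library at a time and keep only the running best hit per query, so the full matrix never exists in memory. This is a minimal illustration, not a matchms feature. `best_hits_in_batches` and `toy_score_fn` are hypothetical names, and `score_fn` stands in for any callable that returns a queries-by-batch block of plain scores (in a real workflow it might wrap a similarity object's matrix call and extract the score field).

```python
def best_hits_in_batches(queries, library, score_fn, batch_size=1000):
    """Return one (library index, score) per query, scanning the library in slices.

    An index of -1 means no library entry scored above 0.0 for that query.
    """
    best = [(-1, 0.0)] * len(queries)
    for start in range(0, len(library), batch_size):
        batch = library[start:start + batch_size]
        block = score_fn(queries, batch)  # len(queries) rows x len(batch) columns
        for q_idx, row in enumerate(block):
            for b_idx, score in enumerate(row):
                if score > best[q_idx][1]:
                    best[q_idx] = (start + b_idx, score)
    return best


def toy_score_fn(queries, batch):
    """Stand-in scorer on plain numbers: closer values give higher scores."""
    return [[1.0 / (1.0 + abs(q - b)) for b in batch] for q in queries]


toy_queries = [1.0, 5.0]
toy_library = [0.0, 1.0, 4.0, 5.5, 9.0]

hits = best_hits_in_batches(toy_queries, toy_library, toy_score_fn, batch_size=2)
print(hits)
```

Peak memory then scales with `len(queries) * batch_size` instead of `len(queries) * len(library)`, at the cost of keeping only the top hit rather than every score.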

Metadata fields missing after import — Different MGF file formats use different field names (e.g., PEPMASS vs PRECURSOR_MZ). Use default_filters() which harmonizes common field name variants, and check spectrum.metadata.keys() to see what fields were actually imported from your specific file format.
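
To illustrate what field-name harmonization involves, here is a small stdlib-only sketch that maps a few variant keys onto one naming scheme. It is a simplified assumption of the idea, not the mapping matchms actually applies: `KEY_ALIASES` and `harmonize_metadata` are hypothetical, and the alias list is deliberately short.

```python
# Hypothetical alias table: variant field name -> preferred field name
KEY_ALIASES = {
    "pepmass": "precursor_mz",
    "precursormz": "precursor_mz",
    "name": "compound_name",
}


def harmonize_metadata(raw):
    """Return a copy of raw with normalized keys and a numeric precursor_mz."""
    clean = {}
    for key, value in raw.items():
        key = key.strip().lower().replace(" ", "_")
        clean[KEY_ALIASES.get(key, key)] = value
    if "precursor_mz" in clean:
        # MGF PEPMASS lines may hold "mz intensity"; keep the m/z part only
        clean["precursor_mz"] = float(str(clean["precursor_mz"]).split()[0])
    return clean


record_a = harmonize_metadata({"PEPMASS": "195.0877 1200.5", "NAME": "Caffeine"})
record_b = harmonize_metadata({"PRECURSOR_MZ": 195.0877, "compound_name": "Caffeine"})
print(record_a)
print(record_b)
```

Both toy records end up with the same keys, which is the property that later steps such as precursor-based filtering or ModifiedCosine depend on.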
