
Molfeat Dynamic

Battle-tested skill for molecular featurization with Molfeat: fingerprints (ECFP, MACCS), pre-trained embeddings, and descriptors. Includes structured workflows, validation checks, and reusable patterns for scientific computing.



Convert chemical structures into numerical representations for machine learning using Molfeat, a unified molecular featurization library. This skill covers fingerprint generation, pre-trained embeddings, descriptor calculation, and building featurization pipelines for drug discovery and cheminformatics.

When to Use This Skill

Choose Molfeat Dynamic when you need to:

  • Generate molecular fingerprints (Morgan, MACCS, ECFP) for similarity searching
  • Use pre-trained molecular embeddings (ChemBERTa, MolBERT) for ML models
  • Calculate physicochemical descriptors for QSAR/QSPR modeling
  • Build standardized featurization pipelines that combine multiple representations

Consider alternatives when:

  • You need 3D molecular descriptors from crystal structures (use RDKit 3D or e3nn)
  • You need protein-ligand interaction fingerprints (use ProLIF or PLIP)
  • You need graph neural network representations (use PyTorch Geometric directly)

Quick Start

```shell
# Install molfeat with all optional dependencies
pip install "molfeat[all]"
```

```python
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained import PretrainedMolTransformer

# Generate Morgan fingerprints
morgan = FPVecTransformer(kind="morgan", n_bits=2048)
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
fps = morgan(smiles)
print(f"Fingerprint shape: {fps.shape}")

# Use a pre-trained embedding
chemberta = PretrainedMolTransformer(kind="ChemBERTa-77M-MLM")
embeddings = chemberta(smiles)
print(f"Embedding shape: {embeddings.shape}")
```

Core Concepts

Available Featurizers

| Category | Name | Dimension | Description |
|---|---|---|---|
| Fingerprint | morgan | 2048 | Circular fingerprint (ECFP equivalent) |
| Fingerprint | maccs | 167 | MACCS structural keys |
| Fingerprint | topological | 2048 | Topological path fingerprint |
| Fingerprint | avalon | 512 | Avalon fingerprint |
| Descriptor | desc2d | ~200 | RDKit 2D descriptors |
| Descriptor | mordred | ~1600 | Mordred descriptor set |
| Pretrained | ChemBERTa-77M-MLM | 384 | Transformer embedding |
| Pretrained | gin_supervised | 300 | GIN graph neural network |

ML Pipeline with Molecular Features

```python
from molfeat.trans.fp import FPVecTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Example: predict aqueous solubility class
data = pd.read_csv("solubility_data.csv")
smiles = data["smiles"].tolist()
labels = data["soluble"].values

# Try different featurizations
featurizers = {
    "Morgan 2048": FPVecTransformer(kind="morgan", n_bits=2048),
    "MACCS": FPVecTransformer(kind="maccs"),
    "RDKit 2D": FPVecTransformer(kind="desc2d"),
}

for name, featurizer in featurizers.items():
    features = featurizer(smiles)
    # Handle NaN values from failed molecules
    valid_mask = ~np.isnan(features).any(axis=1)
    X = features[valid_mask]
    y = labels[valid_mask]
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name:15s}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```

Combined Feature Representations

```python
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.concat import FeatConcat
from sklearn.metrics.pairwise import cosine_similarity

# Combine multiple featurizations
combined = FeatConcat([
    FPVecTransformer(kind="morgan", n_bits=1024),
    FPVecTransformer(kind="maccs"),
    FPVecTransformer(kind="desc2d"),
])

smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
features = combined(smiles)
print(f"Combined feature shape: {features.shape}")

# Similarity search with combined features
query = combined(["CC(=O)Oc1ccccc1C(=O)O"])  # Aspirin
similarities = cosine_similarity(query, features)[0]
for smi, sim in zip(smiles, similarities):
    print(f"{smi:20s} similarity: {sim:.3f}")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| kind | Featurizer type (morgan, maccs, desc2d, etc.) | Required |
| n_bits | Fingerprint bit vector length | 2048 |
| radius | Morgan fingerprint radius | 2 |
| dtype | Output data type | np.float32 |
| batch_size | Molecules per processing batch | 256 |
| n_jobs | Parallel workers | 1 |

Best Practices

  1. Benchmark multiple representations — No single molecular representation is best for all tasks. Compare fingerprints (Morgan, MACCS), descriptors (RDKit 2D), and pre-trained embeddings on your specific dataset. The best featurization depends on the property you're predicting.

  2. Handle invalid molecules gracefully — Some SMILES strings fail to parse or produce NaN descriptors. Use try/except around featurization calls and filter out rows with NaN values before training ML models. Track the failure rate — if more than 5% fail, investigate your SMILES quality.

  3. Standardize molecules before featurizing — Different tautomers or charge states of the same molecule produce different fingerprints. Use RDKit's MolStandardize to canonicalize structures before computing features for consistent results.

  4. Scale descriptor features but not fingerprints — Physicochemical descriptors (molecular weight, LogP) have different scales and benefit from StandardScaler normalization. Binary fingerprints (Morgan, MACCS) should not be scaled — they're already on a consistent 0/1 scale.

  5. Cache featurizations for large datasets — Computing pre-trained embeddings on 100K+ molecules takes significant time. Save computed features to disk (NumPy .npy or pickle) and reload them for subsequent model training iterations rather than recomputing each time.
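The scaling and caching practices above can be sketched with NumPy and scikit-learn. The arrays here are synthetic stand-ins for real Molfeat output, and the file name `features.npy` is arbitrary:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for featurizer output: two physicochemical
# descriptors (e.g. MW, LogP) on very different scales, plus a
# binary fingerprint matrix.
rng = np.random.default_rng(42)
descriptors = rng.normal(loc=[300.0, 2.5], scale=[80.0, 1.2], size=(100, 2))
fingerprints = rng.integers(0, 2, size=(100, 2048)).astype(np.float32)

# Scale descriptors only; fit the scaler on training data and reuse
# the same fitted object for any held-out set.
scaler = StandardScaler().fit(descriptors)
descriptors_scaled = scaler.transform(descriptors)

# Fingerprints stay unscaled (already on a 0/1 scale); concatenate.
X = np.hstack([descriptors_scaled, fingerprints])
print(X.shape)

# Cache the computed features to disk and reload on later runs
np.save("features.npy", X)
X_reloaded = np.load("features.npy")
```

Fitting the scaler once and persisting the combined matrix means later training iterations skip both featurization and re-fitting.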

Common Issues

Pre-trained model download fails — Molfeat downloads model weights on first use, which can fail behind corporate firewalls or proxies. Set MOLFEAT_CACHE_DIR to a writable directory, or download models manually and point to the local path. Check network access to the HuggingFace model hub.
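One way to apply this fix, assuming the `MOLFEAT_CACHE_DIR` variable named above and an arbitrary cache path:

```shell
# Redirect model downloads to a known-writable location
# before launching Python
export MOLFEAT_CACHE_DIR="$HOME/.cache/molfeat"
```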

Feature dimension mismatch between train and test — This happens when train and test sets contain molecules that produce different descriptor counts (e.g., some descriptors undefined for certain structures). Always fit your featurizer on training data first, then transform test data with the same featurizer object to ensure consistent dimensions.
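A minimal sketch of this fit-then-transform discipline, using a NaN column mask learned from the training split only (synthetic arrays stand in for real featurizer output):

```python
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(50, 10))
test_feats = rng.normal(size=(20, 10))
# Simulate a descriptor that is undefined for the training molecules
train_feats[:, 3] = np.nan

# "Fit": choose which columns to keep based on the training set alone
keep = ~np.isnan(train_feats).any(axis=0)

# "Transform": apply the exact same column mask to both splits,
# so train and test always have matching dimensions
X_train = train_feats[:, keep]
X_test = test_feats[:, keep]
print(X_train.shape, X_test.shape)
```

The key point is that `keep` is derived once from training data and reused, never recomputed on the test set.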

Out of memory with large molecule batches — Computing embeddings for 100K+ molecules at once exhausts RAM. Process in batches using featurizer.batch_size = 1000 or manually split your SMILES list and concatenate results. Pre-trained transformer models are especially memory-intensive.
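The manual split-and-concatenate approach can be sketched as follows; `featurize` here is a hypothetical stand-in for an expensive call such as a pre-trained transformer:

```python
import numpy as np

def featurize(batch):
    # Hypothetical stand-in for an expensive featurizer call;
    # returns one 8-dimensional vector per input string.
    return np.ones((len(batch), 8), dtype=np.float32)

smiles = ["C" * (i % 5 + 1) for i in range(2500)]  # dummy SMILES-like strings
batch_size = 1000

# Featurize in fixed-size chunks so only one batch is in memory
# at a time, then stack the per-batch results.
chunks = [
    featurize(smiles[i:i + batch_size])
    for i in range(0, len(smiles), batch_size)
]
features = np.concatenate(chunks, axis=0)
print(features.shape)
```

For very large datasets, each chunk can also be written to disk as it is computed instead of held in the `chunks` list.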
