Molfeat Dynamic
Convert chemical structures into numerical representations for machine learning using Molfeat, a unified molecular featurization library. This skill covers fingerprint generation, pre-trained embeddings, descriptor calculation, and building featurization pipelines for drug discovery and cheminformatics.
When to Use This Skill
Choose Molfeat Dynamic when you need to:
- Generate molecular fingerprints (Morgan, MACCS, ECFP) for similarity searching
- Use pre-trained molecular embeddings (ChemBERTa, MolBERT) for ML models
- Calculate physicochemical descriptors for QSAR/QSPR modeling
- Build standardized featurization pipelines that combine multiple representations
Consider alternatives when:
- You need 3D molecular descriptors from crystal structures (use RDKit 3D or e3nn)
- You need protein-ligand interaction fingerprints (use ProLIF or PLIP)
- You need graph neural network representations (use PyTorch Geometric directly)
Quick Start
```bash
# Install molfeat
pip install "molfeat[all]"
```

```python
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.pretrained import PretrainedMolTransformer

# Generate Morgan fingerprints
morgan = FPVecTransformer(kind="morgan", n_bits=2048)
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
fps = morgan(smiles)
print(f"Fingerprint shape: {fps.shape}")

# Use a pre-trained embedding
chembert = PretrainedMolTransformer(kind="ChemBERTa-77M-MLM")
embeddings = chembert(smiles)
print(f"Embedding shape: {embeddings.shape}")
```
Core Concepts
Available Featurizers
| Category | Name | Dimension | Description |
|---|---|---|---|
| Fingerprint | morgan | 2048 | Circular fingerprint (ECFP equivalent) |
| Fingerprint | maccs | 167 | MACCS structural keys |
| Fingerprint | topological | 2048 | Topological path fingerprint |
| Fingerprint | avalon | 512 | Avalon fingerprint |
| Descriptor | desc2d | ~200 | RDKit 2D descriptors |
| Descriptor | mordred | ~1600 | Mordred descriptor set |
| Pretrained | ChemBERTa-77M-MLM | 384 | Transformer embedding |
| Pretrained | gin_supervised | 300 | GIN graph neural network |
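For intuition on how the binary fingerprints in this table are compared, here is a minimal Tanimoto similarity in plain NumPy; the short bit vectors are illustrative stand-ins, not real fingerprints:

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union else 0.0

fp1 = [1, 1, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1]
print(tanimoto(fp1, fp2))  # 2 shared bits / 4 set bits = 0.5
```

The same idea scales directly to the 2048-bit Morgan vectors produced above.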
ML Pipeline with Molecular Features
```python
from molfeat.trans.fp import FPVecTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Example: predict aqueous solubility class
data = pd.read_csv("solubility_data.csv")
smiles = data["smiles"].tolist()
labels = data["soluble"].values

# Try different featurizations
featurizers = {
    "Morgan 2048": FPVecTransformer(kind="morgan", n_bits=2048),
    "MACCS": FPVecTransformer(kind="maccs"),
    "RDKit 2D": FPVecTransformer(kind="desc2d"),
}

for name, featurizer in featurizers.items():
    features = featurizer(smiles)
    # Handle NaN values from failed molecules
    valid_mask = ~np.isnan(features).any(axis=1)
    X = features[valid_mask]
    y = labels[valid_mask]
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name:15s}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```
Combined Feature Representations
```python
from molfeat.trans.fp import FPVecTransformer
from molfeat.trans.concat import FeatConcat
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Combine multiple featurizations
combined = FeatConcat([
    FPVecTransformer(kind="morgan", n_bits=1024),
    FPVecTransformer(kind="maccs"),
    FPVecTransformer(kind="desc2d"),
])

smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
features = combined(smiles)
print(f"Combined feature shape: {features.shape}")

# Similarity search with combined features
query = combined(["CC(=O)Oc1ccccc1C(=O)O"])  # Aspirin
similarities = cosine_similarity(query, features)[0]
for smi, sim in zip(smiles, similarities):
    print(f"{smi:20s} similarity: {sim:.3f}")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `kind` | Featurizer type (`morgan`, `maccs`, `desc2d`, etc.) | Required |
| `n_bits` | Fingerprint bit vector length | 2048 |
| `radius` | Morgan fingerprint radius | 2 |
| `dtype` | Output data type | `np.float32` |
| `batch_size` | Molecules per processing batch | 256 |
| `n_jobs` | Parallel workers | 1 |
Best Practices
- Benchmark multiple representations — No single molecular representation is best for all tasks. Compare fingerprints (Morgan, MACCS), descriptors (RDKit 2D), and pre-trained embeddings on your specific dataset. The best featurization depends on the property you're predicting.
- Handle invalid molecules gracefully — Some SMILES strings fail to parse or produce NaN descriptors. Use `try/except` around featurization calls and filter out rows with NaN values before training ML models. Track the failure rate — if more than 5% fail, investigate your SMILES quality.
- Standardize molecules before featurizing — Different tautomers or charge states of the same molecule produce different fingerprints. Use RDKit's `MolStandardize` to canonicalize structures before computing features for consistent results.
- Scale descriptor features but not fingerprints — Physicochemical descriptors (molecular weight, LogP) have different scales and benefit from StandardScaler normalization. Binary fingerprints (Morgan, MACCS) should not be scaled — they're already on a consistent 0/1 scale.
- Cache featurizations for large datasets — Computing pre-trained embeddings on 100K+ molecules takes significant time. Save computed features to disk (NumPy `.npy` or pickle) and reload them for subsequent model training iterations rather than recomputing each time.
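The caching advice can be sketched in a few lines; the cache path and the `featurizer` callable here are placeholders for whatever transformer you use:

```python
import os
import numpy as np

def get_features(smiles, featurizer, cache_path="features.npy"):
    """Compute features once, then reload from disk on later runs."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = np.asarray(featurizer(smiles))
    np.save(cache_path, features)
    return features
```

The first call pays the full featurization cost; every later call is a fast `np.load`. Invalidate the cache file yourself if the molecule list or featurizer changes.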
Common Issues
Pre-trained model download fails — Molfeat downloads model weights on first use, which can fail behind corporate firewalls or proxies. Set MOLFEAT_CACHE_DIR to a writable directory, or download models manually and point to the local path. Check network access to the HuggingFace model hub.
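One way to point molfeat at a writable cache, assuming the `MOLFEAT_CACHE_DIR` variable mentioned above works as described; set it before importing molfeat (the path is illustrative):

```python
import os

# Must be set before molfeat is imported, or the default cache location is used.
os.environ["MOLFEAT_CACHE_DIR"] = "/data/molfeat_cache"  # illustrative path
```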
Feature dimension mismatch between train and test — This happens when train and test sets contain molecules that produce different descriptor counts (e.g., some descriptors undefined for certain structures). Always fit your featurizer on training data first, then transform test data with the same featurizer object to ensure consistent dimensions.
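One way to keep train and test widths consistent when some descriptor columns come back NaN, sketched with stand-in arrays: derive a column mask from the training features only, then apply the same mask to the test features:

```python
import numpy as np

def fit_column_mask(train_features):
    """Keep only columns that are finite for every training molecule."""
    return np.isfinite(train_features).all(axis=0)

train = np.array([[1.0, np.nan, 3.0],
                  [2.0, 5.0, 6.0]])
test = np.array([[7.0, 8.0, 9.0]])

mask = fit_column_mask(train)   # derived from train only
X_train = train[:, mask]
X_test = test[:, mask]          # same mask -> same width
print(X_train.shape, X_test.shape)  # (2, 2) (1, 2)
```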
Out of memory with large molecule batches — Computing embeddings for 100K+ molecules at once exhausts RAM. Process in batches using featurizer.batch_size = 1000 or manually split your SMILES list and concatenate results. Pre-trained transformer models are especially memory-intensive.
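The manual-splitting approach can be sketched as follows; `featurizer` stands in for any callable that returns an array per batch:

```python
import numpy as np

def featurize_in_batches(smiles, featurizer, batch_size=1000):
    """Featurize in fixed-size chunks to bound peak memory, then stack."""
    chunks = []
    for start in range(0, len(smiles), batch_size):
        batch = smiles[start:start + batch_size]
        chunks.append(np.asarray(featurizer(batch)))
    return np.concatenate(chunks, axis=0)
```

Peak memory is now bounded by one batch of inputs plus the accumulated output array, rather than the whole dataset's intermediate state at once.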