Advanced String Platform
Process, analyze, and transform text strings with advanced pattern matching, fuzzy matching, natural language parsing, and string distance algorithms using Python. This skill covers regex patterns, string similarity metrics, text normalization, template rendering, structured extraction, and high-performance string operations.
When to Use This Skill
Choose Advanced String Platform when you need to:
- Match, extract, or transform text using complex regex patterns
- Compute string similarity and fuzzy matching for deduplication or search
- Parse and normalize messy text data (names, addresses, identifiers)
- Build text processing pipelines with high-performance string operations
Consider alternatives when:
- You need full NLP analysis (use spaCy or NLTK)
- You need machine learning on text (use transformers or scikit-learn TF-IDF)
- You need document parsing from PDFs or HTML (use Beautiful Soup or pdfplumber)
Quick Start
```
pip install python-Levenshtein rapidfuzz regex
```

```python
import re
from difflib import SequenceMatcher
from rapidfuzz import fuzz, process

# Pattern matching with named groups
text = "Order #ORD-2025-4892 placed on 2025-03-13 for $149.99"
pattern = r"Order #(?P<order_id>ORD-\d{4}-\d+) placed on (?P<date>\d{4}-\d{2}-\d{2}) for \$(?P<amount>[\d.]+)"
match = re.search(pattern, text)
if match:
    print(f"Order: {match.group('order_id')}")
    print(f"Date: {match.group('date')}")
    print(f"Amount: ${match.group('amount')}")

# Fuzzy matching
company_names = ["Google LLC", "Apple Inc.", "Microsoft Corporation", "Amazon.com Inc.", "Meta Platforms"]
query = "googl"
matches = process.extract(query, company_names, scorer=fuzz.WRatio, limit=3)
for name, score, idx in matches:
    print(f"  {name}: {score:.0f}%")

# String similarity
s1 = "kitten"
s2 = "sitting"
ratio = SequenceMatcher(None, s1, s2).ratio()
print(f"\nSimilarity '{s1}' vs '{s2}': {ratio:.3f}")
```
Core Concepts
String Distance Metrics
| Metric | Best For | Library |
|---|---|---|
| Levenshtein distance | Typo correction, spell checking | rapidfuzz, python-Levenshtein |
| Jaro-Winkler | Name matching (emphasizes prefix) | rapidfuzz |
| Token sort ratio | Word-order-independent matching | rapidfuzz.fuzz |
| Token set ratio | Subset matching (handles extra words) | rapidfuzz.fuzz |
| SequenceMatcher | Diff-style similarity (Ratcliff/Obershelp), stdlib-only environments | difflib (stdlib) |
| Cosine similarity | Document similarity (TF-IDF vectors) | sklearn |
| Soundex / Metaphone | Phonetic matching | jellyfish |
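To make the Levenshtein row concrete, here is a minimal pure-Python implementation of edit distance for illustration; in practice you would use rapidfuzz, which implements the same metric in C++:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Classic dynamic programming over a rolling row:
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions plus one insertion), matching the classic pair used in the Quick Start.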
Text Normalization Pipeline
```python
import re
import unicodedata
from typing import Callable, List


class TextNormalizer:
    """Configurable text normalization pipeline."""

    def __init__(self):
        self.steps: List[Callable[[str], str]] = []

    def add_step(self, func: Callable[[str], str]):
        self.steps.append(func)
        return self

    def normalize(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

    # Built-in normalization steps
    @staticmethod
    def lowercase(text: str) -> str:
        return text.lower()

    @staticmethod
    def strip_accents(text: str) -> str:
        nfkd = unicodedata.normalize('NFKD', text)
        return ''.join(c for c in nfkd if not unicodedata.combining(c))

    @staticmethod
    def collapse_whitespace(text: str) -> str:
        return re.sub(r'\s+', ' ', text).strip()

    @staticmethod
    def remove_punctuation(text: str) -> str:
        return re.sub(r'[^\w\s]', '', text)

    @staticmethod
    def normalize_unicode(text: str) -> str:
        return unicodedata.normalize('NFC', text)


# Usage
normalizer = TextNormalizer()
normalizer.add_step(TextNormalizer.normalize_unicode)
normalizer.add_step(TextNormalizer.strip_accents)
normalizer.add_step(TextNormalizer.lowercase)
normalizer.add_step(TextNormalizer.collapse_whitespace)

dirty = "  Café  résumé — hello   WORLD  "
clean = normalizer.normalize(dirty)
print(f"'{dirty}' → '{clean}'")
# Output: '  Café  résumé — hello   WORLD  ' → 'cafe resume — hello world'


# Entity extraction with regex
def extract_entities(text):
    """Extract common entities from text."""
    patterns = {
        # Note: [A-Za-z]{2,} for the TLD — a literal '|' inside a character
        # class would incorrectly match addresses ending in '|'.
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'url': r'https?://[^\s<>"{}|\\^`\[\]]+',
        'date': r'\b\d{4}-\d{2}-\d{2}\b',
        'ip': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
    }
    results = {}
    for entity_type, pattern in patterns.items():
        matches = re.findall(pattern, text, re.IGNORECASE)
        if matches:
            results[entity_type] = matches
    return results


sample = "Contact [email protected] or call 555-123-4567. Visit https://example.com"
print(extract_entities(sample))
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| scorer | Fuzzy matching algorithm | fuzz.WRatio |
| score_cutoff | Minimum similarity score (0-100) | 80 |
| limit | Maximum number of fuzzy matches returned | 5 |
| case_sensitive | Whether matching is case-sensitive | false |
| unicode_normalize | Unicode normalization form (NFC, NFKD) | "NFC" |
| regex_flags | Default regex flags | re.IGNORECASE |
| max_distance | Maximum edit distance for matching | 2 |
| processor | Pre-processing function for fuzzy matching | rapidfuzz.utils.default_process |
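To show how score_cutoff, limit, and processor interact, here is a hypothetical stdlib-only stand-in that mimics the (choice, score, index) return shape of rapidfuzz's process.extract; it is a sketch for illustrating the parameters, not a replacement for the real, much faster library call:

```python
from difflib import SequenceMatcher


def extract(query, choices, score_cutoff=80, limit=5, processor=str.lower):
    """Toy version of fuzzy extraction: score every choice, drop those
    below score_cutoff, and return the top `limit` as (choice, score, index)."""
    q = processor(query)
    scored = []
    for idx, choice in enumerate(choices):
        # Scores are scaled to 0-100 to match the table's score_cutoff units.
        score = SequenceMatcher(None, q, processor(choice)).ratio() * 100
        if score >= score_cutoff:
            scored.append((choice, score, idx))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:limit]


print(extract("googl", ["Google LLC", "Apple Inc.", "Meta Platforms"], score_cutoff=50))
```

Lowering score_cutoff admits weaker matches; the processor (here a simple lowercase) runs on both query and choices before scoring, which is why case_sensitive defaults to false in the table above.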
Best Practices
- Use rapidfuzz instead of fuzzywuzzy for performance — rapidfuzz is a drop-in replacement that is 10-100x faster because it is implemented in C++. The API is identical (fuzz.ratio, fuzz.WRatio, process.extract), so switching requires only changing the import statement.
- Choose the right fuzzy matching scorer for your use case — fuzz.ratio for whole-string similarity, fuzz.partial_ratio for substring matching, fuzz.token_sort_ratio when word order varies ("John Smith" vs "Smith, John"), fuzz.token_set_ratio when one string has extra words, and fuzz.WRatio as a general-purpose scorer that tries multiple strategies.
- Compile regex patterns that are used repeatedly — re.compile(pattern) creates a reusable pattern object that skips the parsing step. In loops processing thousands of strings, compiled patterns are noticeably faster. Store compiled patterns as module-level constants or class attributes.
- Normalize text before comparing — lowercasing, stripping accents, collapsing whitespace, and removing punctuation should happen before any similarity computation. Without normalization, "Café" and "cafe" appear different despite representing the same word. Build a normalization pipeline and apply it consistently.
- Use named groups in regex for maintainable patterns — (?P<name>pattern) makes extracted data self-documenting: match.group('name') instead of match.group(3). Named groups survive pattern refactoring and are essential when patterns have many capture groups.
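Combining two of the practices above — a compiled, module-level pattern and named groups — with the order pattern from the Quick Start:

```python
import re

# Compiled once at import time; named groups make each field self-documenting.
ORDER_RE = re.compile(
    r"Order #(?P<order_id>ORD-\d{4}-\d+) "
    r"placed on (?P<date>\d{4}-\d{2}-\d{2}) "
    r"for \$(?P<amount>[\d.]+)"
)


def parse_order(line):
    """Return the order fields as a dict, or None if the line doesn't match."""
    m = ORDER_RE.search(line)
    return m.groupdict() if m else None


print(parse_order("Order #ORD-2025-4892 placed on 2025-03-13 for $149.99"))
```

groupdict() returns every named group at once, so downstream code never touches positional group numbers and survives pattern refactoring.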
Common Issues
Regex matches too much or too little — Quantifiers like .* are greedy by default, matching as much as possible. Use .*? for non-greedy matching when you want the shortest match. For alternation, put longer alternatives first: (https|http) not (http|https) — the latter matches http even when https was intended.
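Both pitfalls in a few lines:

```python
import re

html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.*>', html)   # greedy .* runs to the LAST '>'
lazy = re.findall(r'<.*?>', html)    # non-greedy .*? stops at the FIRST '>'
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Alternation is tried left to right, so order longer alternatives first:
text = 'visit https://a and http://b'
print(re.findall(r'http|https', text))  # ['http', 'http'] — 'https' never wins
print(re.findall(r'https|http', text))  # ['https', 'http']
```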
Fuzzy matching returns false positives for short strings — Short strings have high similarity scores by chance: "a" vs "ab" is 67% similar. Set a score_cutoff appropriate for your string lengths, or use a minimum length filter. For strings under 5 characters, consider exact matching or Soundex instead of Levenshtein.
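A minimal guard against short-string false positives, using stdlib SequenceMatcher (the hypothetical safe_match thresholds are illustrative, not canonical):

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale."""
    return SequenceMatcher(None, a, b).ratio() * 100


def safe_match(a, b, min_len=5, cutoff=85):
    # Below min_len, similarity scores are dominated by chance,
    # so fall back to exact comparison instead of a fuzzy score.
    if min(len(a), len(b)) < min_len:
        return a == b
    return similarity(a, b) >= cutoff


print(similarity("a", "ab"))          # ~66.7 despite sharing one character
print(safe_match("a", "ab"))          # False — exact comparison kicks in
print(safe_match("kitten", "kittens"))  # True — long enough to score fuzzily
```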
Unicode comparison fails for visually identical strings — Characters like "é" can be represented as a single code point or as "e" + combining accent mark. Normalize with unicodedata.normalize('NFC', text) before comparison. Also handle full-width vs half-width characters with NFKC normalization for CJK text.
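The two representations of "é" really do compare unequal until normalized:

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent (U+0301)

print(composed == decomposed)                                # False: code points differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after NFC

# NFKC additionally folds compatibility forms, e.g. full-width Latin:
print(unicodedata.normalize("NFKC", "Ａｂｃ"))  # 'Abc'
```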