
Advanced String Platform

Production-ready skill that handles pattern matching, fuzzy matching, text normalization, and string distance computation. Includes structured workflows, validation checks, and reusable patterns for scientific text processing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Process, analyze, and transform text strings with advanced pattern matching, fuzzy matching, natural language parsing, and string distance algorithms using Python. This skill covers regex patterns, string similarity metrics, text normalization, template rendering, structured extraction, and high-performance string operations.

When to Use This Skill

Choose Advanced String Platform when you need to:

  • Match, extract, or transform text using complex regex patterns
  • Compute string similarity and fuzzy matching for deduplication or search
  • Parse and normalize messy text data (names, addresses, identifiers)
  • Build text processing pipelines with high-performance string operations

Consider alternatives when:

  • You need full NLP analysis (use spaCy or NLTK)
  • You need machine learning on text (use transformers or scikit-learn TF-IDF)
  • You need document parsing from PDFs or HTML (use Beautiful Soup or pdfplumber)

Quick Start

```shell
pip install python-Levenshtein rapidfuzz regex
```

```python
import re
from difflib import SequenceMatcher
from rapidfuzz import fuzz, process

# Pattern matching with named groups
text = "Order #ORD-2025-4892 placed on 2025-03-13 for $149.99"
pattern = r"Order #(?P<order_id>ORD-\d{4}-\d+) placed on (?P<date>\d{4}-\d{2}-\d{2}) for \$(?P<amount>[\d.]+)"
match = re.search(pattern, text)
if match:
    print(f"Order: {match.group('order_id')}")
    print(f"Date: {match.group('date')}")
    print(f"Amount: ${match.group('amount')}")

# Fuzzy matching
company_names = ["Google LLC", "Apple Inc.", "Microsoft Corporation",
                 "Amazon.com Inc.", "Meta Platforms"]
query = "googl"
matches = process.extract(query, company_names, scorer=fuzz.WRatio, limit=3)
for name, score, idx in matches:
    print(f"  {name}: {score:.0f}%")

# String similarity
s1 = "kitten"
s2 = "sitting"
ratio = SequenceMatcher(None, s1, s2).ratio()
print(f"\nSimilarity '{s1}' vs '{s2}': {ratio:.3f}")
```

Core Concepts

String Distance Metrics

| Metric | Best For | Library |
|---|---|---|
| Levenshtein distance | Typo correction, spell checking | rapidfuzz, python-Levenshtein |
| Jaro-Winkler | Name matching (emphasizes prefix) | rapidfuzz |
| Token sort ratio | Word-order-independent matching | rapidfuzz.fuzz |
| Token set ratio | Subset matching (handles extra words) | rapidfuzz.fuzz |
| SequenceMatcher | Longest contiguous matching blocks (Ratcliff/Obershelp) | difflib (stdlib) |
| Cosine similarity | Document similarity (TF-IDF vectors) | sklearn |
| Soundex / Metaphone | Phonetic matching | jellyfish |

Text Normalization Pipeline

```python
import re
import unicodedata
from typing import Callable, List


class TextNormalizer:
    """Configurable text normalization pipeline."""

    def __init__(self):
        self.steps: List[Callable[[str], str]] = []

    def add_step(self, func: Callable[[str], str]):
        self.steps.append(func)
        return self

    def normalize(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

    # Built-in normalization steps
    @staticmethod
    def lowercase(text: str) -> str:
        return text.lower()

    @staticmethod
    def strip_accents(text: str) -> str:
        nfkd = unicodedata.normalize('NFKD', text)
        return ''.join(c for c in nfkd if not unicodedata.combining(c))

    @staticmethod
    def collapse_whitespace(text: str) -> str:
        return re.sub(r'\s+', ' ', text).strip()

    @staticmethod
    def remove_punctuation(text: str) -> str:
        return re.sub(r'[^\w\s]', '', text)

    @staticmethod
    def normalize_unicode(text: str) -> str:
        return unicodedata.normalize('NFC', text)


# Usage
normalizer = TextNormalizer()
normalizer.add_step(TextNormalizer.normalize_unicode)
normalizer.add_step(TextNormalizer.strip_accents)
normalizer.add_step(TextNormalizer.lowercase)
normalizer.add_step(TextNormalizer.collapse_whitespace)

dirty = "  Café résumé — hello WORLD  "
clean = normalizer.normalize(dirty)
print(f"'{dirty}' → '{clean}'")
# Output: '  Café résumé — hello WORLD  ' → 'cafe resume — hello world'


# Entity extraction with regex
def extract_entities(text):
    """Extract common entities from text."""
    patterns = {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'url': r'https?://[^\s<>"{}|\\^`\[\]]+',
        'date': r'\b\d{4}-\d{2}-\d{2}\b',
        'ip': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
    }
    results = {}
    for entity_type, pattern in patterns.items():
        matches = re.findall(pattern, text, re.IGNORECASE)
        if matches:
            results[entity_type] = matches
    return results


sample = "Contact [email protected] or call 555-123-4567. Visit https://example.com"
print(extract_entities(sample))
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| scorer | Fuzzy matching algorithm | fuzz.WRatio |
| score_cutoff | Minimum similarity score (0-100) | 80 |
| limit | Maximum number of fuzzy matches returned | 5 |
| case_sensitive | Whether matching is case-sensitive | false |
| unicode_normalize | Unicode normalization form (NFC, NFKD) | "NFC" |
| regex_flags | Default regex flags | re.IGNORECASE |
| max_distance | Maximum edit distance for matching | 2 |
| processor | Pre-processing function for fuzzy matching | rapidfuzz.utils.default_process |

Best Practices

  1. Use rapidfuzz instead of fuzzywuzzy for performance — rapidfuzz is a drop-in replacement that's 10-100x faster because it's implemented in C++. The API is identical (fuzz.ratio, fuzz.WRatio, process.extract), so switching requires only changing the import statement.

  2. Choose the right fuzzy matching scorer for your use case — Use fuzz.ratio for overall similarity of whole strings, fuzz.token_sort_ratio when word order varies ("John Smith" vs "Smith, John"), fuzz.token_set_ratio when one string has extra words, and fuzz.WRatio as a general-purpose scorer that tries multiple strategies.

  3. Compile regex patterns that are used repeatedly — re.compile(pattern) creates a reusable pattern object that skips the parsing step. In loops processing thousands of strings, compiled patterns are noticeably faster. Store compiled patterns as module-level constants or class attributes.
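
For example (ORDER_ID and find_order_ids are illustrative names):

```python
import re

# Compiled once at module level, reused across every call
ORDER_ID = re.compile(r"ORD-\d{4}-\d+")

def find_order_ids(lines):
    """Return all order IDs found across an iterable of strings."""
    return [m.group(0) for line in lines for m in ORDER_ID.finditer(line)]

print(find_order_ids(["Order ORD-2025-4892 shipped", "no id here"]))
# → ['ORD-2025-4892']
```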

  4. Normalize text before comparing — Lowercasing, stripping accents, collapsing whitespace, and removing punctuation should happen before any similarity computation. Without normalization, "Café" and "cafe" appear different despite representing the same word. Build a normalization pipeline and apply it consistently.

  5. Use named groups in regex for maintainable patterns — (?P<name>pattern) makes extracted data self-documenting: match.group('name') instead of match.group(3). Named groups survive pattern refactoring and are essential when patterns have many capture groups.

Common Issues

Regex matches too greedily or too little — Quantifiers like .* are greedy by default, matching as much as possible. Use .*? for non-greedy matching when you want the shortest match. For alternation, put longer alternatives first: (https|http) not (http|https) — the latter may match http when https was intended.
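
A short demonstration of both pitfalls:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, spanning both tags
print(re.findall(r'<.*>', html))   # → ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first closing angle bracket
print(re.findall(r'<.*?>', html))  # → ['<b>', '</b>', '<i>', '</i>']

# Alternation tries branches left to right: put the longer one first
print(re.match(r'(https|http)', 'https://x').group(1))  # → 'https'
print(re.match(r'(http|https)', 'https://x').group(1))  # → 'http'
```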

Fuzzy matching returns false positives for short strings — Short strings have high similarity scores by chance: "a" vs "ab" is 67% similar. Set a score_cutoff appropriate for your string lengths, or use a minimum length filter. For strings under 5 characters, consider exact matching or Soundex instead of Levenshtein.

Unicode comparison fails for visually identical strings — Characters like "é" can be represented as a single code point or as "e" + combining accent mark. Normalize with unicodedata.normalize('NFC', text) before comparison. Also handle full-width vs half-width characters with NFKC normalization for CJK text.
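
A quick demonstration:

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)  # → False, despite looking identical

# NFC composes both forms to the same code points
print(unicodedata.normalize('NFC', composed) ==
      unicodedata.normalize('NFC', decomposed))  # → True

# NFKC also folds compatibility forms, e.g. full-width characters
print(unicodedata.normalize('NFKC', "ＡＢＣ"))  # → 'ABC'
```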
