
NLP Engineer Guru

An agent for designing and implementing production NLP systems: text preprocessing, transformer fine-tuning, multilingual support, and scalable natural language processing pipelines for real-time applications.

When to Use This Agent

Choose NLP Engineer Guru when:

  • Fine-tuning transformer models for text classification, NER, or generation
  • Building text processing pipelines with proper tokenization and normalization
  • Implementing multilingual NLP systems with cross-lingual transfer
  • Designing RAG (Retrieval-Augmented Generation) architectures
  • Optimizing NLP model inference for production latency requirements

Consider alternatives when:

  • Building computer vision systems (use a vision-focused agent)
  • Working with structured data without text (use a data science agent)
  • Using pre-built LLM APIs without model training (use an API integration agent)

Quick Start

```yaml
# .claude/agents/nlp-engineer-guru.yml
name: NLP Engineer Guru
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior NLP engineer. Design and build production NLP systems
  covering text preprocessing, model fine-tuning, and inference optimization.
  Prioritize accuracy, latency, and multilingual support.
```

Example invocation:

```bash
claude --agent nlp-engineer-guru "Fine-tune a BERT model for multi-label intent classification on our customer support tickets with 15 intent categories and deploy with < 30ms latency"
```

Core Concepts

NLP Task Selection Guide

| Task | Models | Typical Approach |
| --- | --- | --- |
| Text Classification | BERT, RoBERTa, DistilBERT | Fine-tune on labeled data |
| Named Entity Recognition | BERT-NER, spaCy, Flair | Sequence labeling + CRF |
| Sentiment Analysis | BERT, XLNet | Fine-tune or prompt-based |
| Text Summarization | BART, T5, Pegasus | Abstractive fine-tuning |
| Question Answering | BERT-QA, DeBERTa | Extractive or generative |
| Translation | mBART, NLLB, MarianMT | Fine-tune for language pair |
| Semantic Search | Sentence-BERT, E5 | Embedding + vector similarity |
| RAG | Retriever + Generator | Chunking + retrieval + generation |

Text Processing Pipeline

```python
import re
import unicodedata

from transformers import AutoTokenizer

# Minimal helper implementations; swap in ftfy or a proper HTML parser
# for heavier-duty cleaning.
def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)

def fix_encoding_errors(text: str) -> str:
    # Drop replacement characters left behind by bad decoding
    return text.replace("\ufffd", "")

def normalize_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

# Standard NLP preprocessing pipeline
def preprocess_text(text: str) -> str:
    text = text.strip()
    text = normalize_unicode(text)      # NFKC normalization
    text = remove_html_tags(text)       # Strip HTML
    text = fix_encoding_errors(text)    # Handle mojibake
    text = normalize_whitespace(text)   # Collapse whitespace
    return text

# Tokenization with handling for production edge cases
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_safely(text: str, max_length: int = 512):
    return tokenizer(
        preprocess_text(text),
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```

RAG Architecture

```
Query → Embed → Retrieve → Rerank → Generate → Response
  │       │        │          │         │
  Text  Model    Vector     Cross-    LLM with
  clean  embed    search    encoder   retrieved
  norm   (E5)     (FAISS)   (score)   context
```
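The stages above can be sketched end to end. This toy version substitutes a bag-of-words embedding and brute-force cosine similarity for E5 and FAISS, and skips the cross-encoder rerank; the document texts are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model such as E5
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Refunds are processed within five business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available 24/7 via live chat.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Stand-in for a FAISS vector search; a cross-encoder would rerank here
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# The retrieved chunk becomes context for the generator LLM
context = retrieve("how fast are refunds processed?")[0]
prompt = f"Answer using this context:\n{context}\n\nQ: how fast are refunds processed?"
print(prompt)
```

In production each stand-in is replaced by its real counterpart, but the data flow stays the same: clean and embed the query, search the vector store, rerank, then generate with retrieved context.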

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `base_model` | Pre-trained model for fine-tuning | `bert-base-uncased` |
| `max_seq_length` | Maximum token sequence length | `512` |
| `batch_size` | Training/inference batch size | `32` |
| `learning_rate` | Fine-tuning learning rate | `2e-5` |
| `embedding_model` | Model for text embeddings | `all-MiniLM-L6-v2` |
| `vector_store` | Vector database for search | `FAISS` |
| `language_support` | Target languages | English |

Best Practices

  1. Preprocess text consistently between training and inference. The most common NLP bug is different text preprocessing in training versus production. Apply the same normalization, tokenization, and cleaning steps in both paths. Build a shared preprocessing module and use it everywhere. Even subtle differences like Unicode normalization (NFC vs NFKC) or whitespace handling can degrade model performance.

  2. Start with the smallest model that meets accuracy requirements. DistilBERT runs 2x faster than BERT with 97% of its accuracy on most tasks. For simple classification, even TF-IDF with logistic regression may suffice and runs 100x faster. Benchmark smaller models first and only scale up when there's a measurable accuracy gap that justifies the latency and cost increase.

  3. Use stratified sampling for imbalanced text datasets. NLP datasets often have severe class imbalance (1,000 positive reviews vs 50 negative). Without stratified splits, your test set may not contain enough minority class examples to evaluate properly. Use stratified train/test splits, and consider oversampling, undersampling, or class weights during training to handle imbalance.

  4. Chunk documents thoughtfully for RAG applications. Splitting documents by fixed token count creates chunks that break mid-sentence or mid-paragraph, losing context. Use semantic chunking that respects document structure: split on paragraph boundaries, section headers, or semantic shifts. Overlap chunks by 10-20% to preserve context at boundaries. Test retrieval quality with different chunk sizes.

  5. Evaluate with task-appropriate metrics, not just accuracy. For classification, use F1-score (especially macro-F1 for imbalanced classes). For NER, use entity-level F1 rather than token-level. For generation, use ROUGE and human evaluation. For search, use MRR or NDCG. Accuracy alone hides poor performance on minority classes and edge cases that matter most in production.
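A quick stdlib check makes the NFC/NFKC point in practice 1 concrete; the sample strings are illustrative:

```python
import unicodedata

# Ligatures (U+FB01 "fi") and fullwidth letters (U+FF21 "A") are common in
# scraped or PDF-extracted text. NFC preserves them; NFKC folds them to
# their ASCII equivalents, changing what the tokenizer sees.
for raw in ["\ufb01le", "\uff21pple"]:
    print(repr(unicodedata.normalize("NFC", raw)),
          repr(unicodedata.normalize("NFKC", raw)))
```

If training data went through NFKC but the inference path applies NFC (or nothing), these strings tokenize differently in the two paths.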
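The TF-IDF baseline from practice 2 can be sketched with scikit-learn; the dataset and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented dataset; a real benchmark needs a held-out split
texts = [
    "refund my order", "cancel my subscription",
    "great product, love it", "works perfectly, thanks",
] * 5
labels = ["complaint", "complaint", "praise", "praise"] * 5

# Word + bigram TF-IDF into a linear classifier: a fast, strong baseline
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["please cancel my order"]))
```

Only move to a fine-tuned transformer when this kind of baseline leaves a measurable accuracy gap that justifies the extra latency and cost.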
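For practice 3, a stratified split is one argument away in scikit-learn; the 90/10 toy labels are illustrative:

```python
from sklearn.model_selection import train_test_split

texts = [f"doc {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10          # 90/10 imbalance

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# The 10% minority rate is preserved in both splits
print(sum(y_tr), len(y_tr))  # 8 of 80
print(sum(y_te), len(y_te))  # 2 of 20
```

To add class weights on top, most scikit-learn classifiers accept `class_weight="balanced"`, which reweights the loss inversely to class frequency.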
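Practice 4's paragraph-aware chunking with overlap might look like this minimal sketch; the sizes and overlap fraction are arbitrary defaults, not tuned values:

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: float = 0.15) -> list[str]:
    """Paragraph-aware chunking with fractional overlap at boundaries.

    Packs whole paragraphs into chunks of at most max_chars, then carries
    the tail of each finished chunk into the next one so context survives
    the boundary. A paragraph longer than max_chars becomes its own
    oversized chunk rather than being split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + 2 + len(para) > max_chars:
            chunks.append(current)
            n = int(len(current) * overlap)   # tail carried forward
            current = (current[-n:] + " " + para) if n else para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A production version would count tokens rather than characters and also split on section headers, but the packing-plus-carryover structure is the same.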
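Practice 5 in a few lines: on an imbalanced toy set, a majority-class predictor scores high accuracy but collapses under macro-F1:

```python
from sklearn.metrics import accuracy_score, f1_score

# A degenerate classifier that always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f}  macro-F1={macro:.2f}")
```

Accuracy comes out at 0.95 while macro-F1 falls below 0.5, because the minority class contributes an F1 of zero that macro averaging refuses to hide.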

Common Issues

Fine-tuned model overfits on small datasets. NLP fine-tuning with fewer than 1,000 examples per class is prone to overfitting. Apply regularization: increase dropout (0.2-0.3), use weight decay (0.01), freeze lower transformer layers, and train for fewer epochs with early stopping. Data augmentation techniques—back-translation, synonym replacement, paraphrasing with an LLM—can effectively double small datasets.
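The early-stopping part of that advice is framework-agnostic. A minimal sketch of the mechanism (a hypothetical helper, not part of any library):

```python
class EarlyStopping:
    """Stop training when validation loss stops improving.

    patience:  epochs to wait after the last improvement
    min_delta: minimum decrease in loss that counts as an improvement
    """
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Typical overfitting curve: loss improves, then starts climbing
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}")  # → stopping at epoch 4
        break
```

Training frameworks ship equivalents (e.g. an early-stopping callback), but the patience counter above is the entire idea.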

Multilingual model performs poorly on specific languages. Multilingual models like mBERT allocate capacity unevenly across languages. Low-resource languages get worse representations. Fine-tune on target language data, even a small amount. For critical languages, consider language-specific models (CamemBERT for French, BERTje for Dutch) which outperform multilingual models significantly on their respective languages.

Inference latency exceeds production SLA. Profile the full pipeline: tokenization, model inference, and postprocessing. Apply optimizations in order: batch multiple requests, quantize to INT8 (usually < 1% accuracy loss), export to ONNX Runtime, and use dynamic sequence length instead of padding to max. For BERT-class models, these optimizations together typically achieve 4-8x speedup over naive PyTorch inference.
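The first optimization, request batching, can be sketched independently of any model framework. This is simplified: a production server would also flush a partial batch on a timeout (e.g. a few milliseconds) so a lone request is not stuck waiting:

```python
from typing import Iterable, Iterator

def microbatch(requests: Iterable[str], max_batch: int = 8) -> Iterator[list[str]]:
    """Group incoming requests so one forward pass serves many of them."""
    batch: list[str] = []
    for req in requests:
        batch.append(req)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(microbatch([f"req-{i}" for i in range(20)], max_batch=8))
print([len(b) for b in batches])  # [8, 8, 4]
```

Within each batch, padding to the longest sequence in the batch (dynamic sequence length) rather than to `max_seq_length` is what delivers the padding-related speedup mentioned above.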
