NLP Engineer Guru
An agent for designing and implementing production NLP systems covering text preprocessing, transformer fine-tuning, multilingual support, and building scalable natural language processing pipelines for real-time applications.
When to Use This Agent
Choose NLP Engineer Guru when:
- Fine-tuning transformer models for text classification, NER, or generation
- Building text processing pipelines with proper tokenization and normalization
- Implementing multilingual NLP systems with cross-lingual transfer
- Designing RAG (Retrieval-Augmented Generation) architectures
- Optimizing NLP model inference for production latency requirements
Consider alternatives when:
- Building computer vision systems (use a vision-focused agent)
- Working with structured data without text (use a data science agent)
- Using pre-built LLM APIs without model training (use an API integration agent)
Quick Start
```yaml
# .claude/agents/nlp-engineer-guru.yml
name: NLP Engineer Guru
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior NLP engineer. Design and build production NLP systems
  covering text preprocessing, model fine-tuning, and inference optimization.
  Prioritize accuracy, latency, and multilingual support.
```
Example invocation:
```bash
claude --agent nlp-engineer-guru "Fine-tune a BERT model for multi-label intent classification on our customer support tickets with 15 intent categories and deploy with < 30ms latency"
```
Core Concepts
NLP Task Selection Guide
| Task | Models | Typical Approach |
|---|---|---|
| Text Classification | BERT, RoBERTa, DistilBERT | Fine-tune on labeled data |
| Named Entity Recognition | BERT-NER, SpaCy, Flair | Sequence labeling + CRF |
| Sentiment Analysis | BERT, XLNet | Fine-tune or prompt-based |
| Text Summarization | BART, T5, Pegasus | Abstractive fine-tuning |
| Question Answering | BERT-QA, DeBERTa | Extractive or generative |
| Translation | mBART, NLLB, MarianMT | Fine-tune for language pair |
| Semantic Search | Sentence-BERT, E5 | Embedding + vector similarity |
| RAG | Retriever + Generator | Chunking + retrieval + generation |
Text Processing Pipeline
```python
import re
import unicodedata

from transformers import AutoTokenizer

# Standard NLP preprocessing pipeline
def preprocess_text(text: str) -> str:
    text = text.strip()
    text = unicodedata.normalize("NFKC", text)  # NFKC Unicode normalization
    text = re.sub(r"<[^>]+>", " ", text)        # Strip HTML tags
    # Mojibake repair needs a dedicated library, e.g. ftfy.fix_text(text)
    text = re.sub(r"\s+", " ", text).strip()    # Collapse whitespace
    return text

# Tokenization with handling for production edge cases
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_safely(text: str, max_length: int = 512):
    return tokenizer(
        preprocess_text(text),
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```
RAG Architecture
```
Query → Embed → Retrieve → Rerank  → Generate → Response
  │       │         │         │         │
 Text    Model    Vector    Cross-    LLM with
 clean   embed    search    encoder   retrieved
 norm    (E5)     (FAISS)   (score)   context
```
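The retrieve step above can be sketched framework-free: on unit-normalized embeddings, cosine similarity is just an inner product, which is what FAISS's IndexFlatIP computes at scale. The `embed` function here is a hypothetical hash-seeded stub standing in for a real sentence-embedding model such as E5.

```python
import numpy as np

def embed(texts: list[str], dim: int = 8) -> np.ndarray:
    """Hypothetical embedding stub: deterministic per-string random vectors.
    Production code would call a sentence-embedding model such as E5."""
    vecs = np.stack([
        np.random.default_rng(abs(hash(t)) % (2**32)).standard_normal(dim)
        for t in texts
    ])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # On unit vectors, cosine similarity equals the inner product,
    # the same operation FAISS's IndexFlatIP performs at scale.
    scores = embed(corpus) @ embed([query])[0]
    return [corpus[i] for i in np.argsort(-scores)[:k]]

corpus = ["refund and return policy", "shipping and delivery times"]
print(retrieve("refund and return policy", corpus, k=1))
```

Swapping the stub for a real encoder and the `@` product for a FAISS index changes nothing about the retrieval logic itself.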
Configuration
| Parameter | Description | Default |
|---|---|---|
| base_model | Pre-trained model for fine-tuning | bert-base-uncased |
| max_seq_length | Maximum token sequence length | 512 |
| batch_size | Training/inference batch size | 32 |
| learning_rate | Fine-tuning learning rate | 2e-5 |
| embedding_model | Model for text embeddings | all-MiniLM-L6-v2 |
| vector_store | Vector database for search | FAISS |
| language_support | Target languages | English |
Best Practices
- Preprocess text consistently between training and inference. The most common NLP bug is different text preprocessing in training versus production. Apply the same normalization, tokenization, and cleaning steps in both paths; build a shared preprocessing module and use it everywhere. Even subtle differences such as Unicode normalization (NFC vs NFKC) or whitespace handling can degrade model performance.
- Start with the smallest model that meets accuracy requirements. DistilBERT runs about 60% faster than BERT while retaining roughly 97% of its accuracy on most tasks. For simple classification, even TF-IDF with logistic regression may suffice and runs orders of magnitude faster. Benchmark smaller models first and scale up only when a measurable accuracy gap justifies the added latency and cost.
- Use stratified sampling for imbalanced text datasets. NLP datasets often have severe class imbalance (1,000 positive reviews vs 50 negative). Without stratified splits, your test set may not contain enough minority-class examples for reliable evaluation. Use stratified train/test splits, and consider oversampling, undersampling, or class weights during training to handle the imbalance.
- Chunk documents thoughtfully for RAG applications. Splitting documents by fixed token count creates chunks that break mid-sentence or mid-paragraph, losing context. Use semantic chunking that respects document structure: split on paragraph boundaries, section headers, or semantic shifts. Overlap chunks by 10-20% to preserve context at boundaries, and test retrieval quality with different chunk sizes.
- Evaluate with task-appropriate metrics, not just accuracy. For classification, use F1-score (especially macro-F1 for imbalanced classes). For NER, use entity-level F1 rather than token-level. For generation, use ROUGE plus human evaluation. For search, use MRR or NDCG. Accuracy alone hides poor performance on minority classes and edge cases that matter most in production.
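The chunking guidance above can be sketched as paragraph-aware packing with a 10-20% overlap; whitespace word counts stand in for a real tokenizer's token counts, which a production version would use instead.

```python
def chunk_text(text: str, max_tokens: int = 200, overlap: float = 0.15) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens words, carrying
    ~overlap of each finished chunk's tail into the next one so context
    survives the boundary. Word count stands in for a real tokenizer's
    token count; a single paragraph longer than max_tokens still becomes
    one oversized chunk and should be split further upstream."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # words accumulated for the chunk being built
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            carry = int(len(current) * overlap)   # 10-20% overlap at boundaries
            current = current[-carry:] if carry else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the splitter only breaks between paragraphs, no chunk starts mid-sentence, and the carried tail gives the retriever context from the preceding chunk.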
Common Issues
Fine-tuned model overfits on small datasets. NLP fine-tuning with fewer than 1,000 examples per class is prone to overfitting. Apply regularization: increase dropout (0.2-0.3), use weight decay (0.01), freeze lower transformer layers, and train for fewer epochs with early stopping. Data augmentation techniques—back-translation, synonym replacement, paraphrasing with an LLM—can effectively double small datasets.
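The early-stopping advice above reduces to tracking validation loss with a patience counter. This framework-agnostic sketch assumes hypothetical `train_epoch` and `eval_loss` callables rather than any specific trainer API.

```python
def train_with_early_stopping(train_epoch, eval_loss, max_epochs: int = 20,
                              patience: int = 3, min_delta: float = 1e-3):
    """Run up to max_epochs, stopping once validation loss has failed to
    improve by min_delta for `patience` consecutive epochs. Returns the
    best epoch and its loss; a real trainer would also checkpoint there."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()                 # one pass over the training data
        loss = eval_loss()            # validation loss after the epoch
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                 # plateau: stop before overfitting sets in
    return best_epoch, best_loss

# Simulated validation losses: improvement stalls after epoch 3
losses = iter([1.0, 0.8, 0.7, 0.71, 0.72, 0.73])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # → (3, 0.7)
```

Libraries such as Hugging Face Transformers expose the same logic as a built-in callback, so in practice you configure it rather than write it.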
Multilingual model performs poorly on specific languages. Multilingual models like mBERT allocate capacity unevenly across languages. Low-resource languages get worse representations. Fine-tune on target language data, even a small amount. For critical languages, consider language-specific models (CamemBERT for French, BERTje for Dutch) which outperform multilingual models significantly on their respective languages.
Inference latency exceeds production SLA. Profile the full pipeline: tokenization, model inference, and postprocessing. Apply optimizations in order: batch multiple requests, quantize to INT8 (usually < 1% accuracy loss), export to ONNX Runtime, and use dynamic sequence length instead of padding to max. For BERT-class models, these optimizations together typically achieve 4-8x speedup over naive PyTorch inference.
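The dynamic-sequence-length point above can be illustrated without any framework: sorting queued requests by length before grouping means each group pads only to its own longest member instead of the global maximum. The token counts below are hypothetical.

```python
def pad_cost(lengths: list[int]) -> int:
    """Total tokens a group occupies when padded to its longest member."""
    return max(lengths) * len(lengths) if lengths else 0

def bucket_by_length(lengths: list[int], group_size: int) -> list[list[int]]:
    """Sort requests by token length, then slice into groups so that
    near-equal lengths share a group and padding waste stays small."""
    ordered = sorted(lengths)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]

# Hypothetical token counts for six queued requests
lengths = [12, 480, 18, 460, 20, 500]
naive = pad_cost(lengths)                                          # one batch padded to 500 → 3000 tokens
bucketed = sum(pad_cost(g) for g in bucket_by_length(lengths, 3))  # 60 + 1500 → 1560 tokens
```

Nearly half the compute in the naive batch goes to padding tokens; length bucketing (plus quantization and ONNX export) is where most of the 4-8x speedup comes from.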