
NLP Engineer Guru

An agent for designing and implementing production NLP systems: text preprocessing, transformer fine-tuning, multilingual support, and scalable natural language processing pipelines for real-time applications.

When to Use This Agent

Choose NLP Engineer Guru when:

  • Fine-tuning transformer models for text classification, NER, or generation
  • Building text processing pipelines with proper tokenization and normalization
  • Implementing multilingual NLP systems with cross-lingual transfer
  • Designing RAG (Retrieval-Augmented Generation) architectures
  • Optimizing NLP model inference for production latency requirements

Consider alternatives when:

  • Building computer vision systems (use a vision-focused agent)
  • Working with structured data without text (use a data science agent)
  • Using pre-built LLM APIs without model training (use an API integration agent)

Quick Start

```yaml
# .claude/agents/nlp-engineer-guru.yml
name: NLP Engineer Guru
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior NLP engineer. Design and build production NLP systems
  covering text preprocessing, model fine-tuning, and inference optimization.
  Prioritize accuracy, latency, and multilingual support.
```

Example invocation:

```bash
claude --agent nlp-engineer-guru "Fine-tune a BERT model for multi-label intent classification on our customer support tickets with 15 intent categories and deploy with < 30ms latency"
```

Core Concepts

NLP Task Selection Guide

| Task | Models | Typical Approach |
| --- | --- | --- |
| Text Classification | BERT, RoBERTa, DistilBERT | Fine-tune on labeled data |
| Named Entity Recognition | BERT-NER, spaCy, Flair | Sequence labeling + CRF |
| Sentiment Analysis | BERT, XLNet | Fine-tune or prompt-based |
| Text Summarization | BART, T5, Pegasus | Abstractive fine-tuning |
| Question Answering | BERT-QA, DeBERTa | Extractive or generative |
| Translation | mBART, NLLB, MarianMT | Fine-tune for language pair |
| Semantic Search | Sentence-BERT, E5 | Embedding + vector similarity |
| RAG | Retriever + Generator | Chunking + retrieval + generation |

Text Processing Pipeline

```python
import re
import unicodedata

from transformers import AutoTokenizer

# Minimal helper implementations; swap in ftfy or a proper HTML parser
# for heavier-duty cleaning.
def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)

def fix_encoding_errors(text: str) -> str:
    # Drop replacement characters left behind by bad decoding
    return text.replace("\ufffd", "")

def normalize_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

# Standard NLP preprocessing pipeline
def preprocess_text(text: str) -> str:
    text = text.strip()
    text = normalize_unicode(text)      # NFKC normalization
    text = remove_html_tags(text)       # Strip HTML
    text = fix_encoding_errors(text)    # Handle mojibake
    text = normalize_whitespace(text)   # Collapse whitespace
    return text

# Tokenization with handling for production edge cases
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_safely(text: str, max_length: int = 512):
    return tokenizer(
        preprocess_text(text),
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
```

RAG Architecture

```
Query → Embed → Retrieve → Rerank → Generate → Response
  │       │        │          │         │
  Text  Model    Vector     Cross-    LLM with
  clean  embed    search    encoder   retrieved
  norm   (E5)     (FAISS)   (score)   context
```
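The stages above can be sketched end to end. This toy version substitutes a bag-of-words embedding and brute-force cosine similarity for E5 and FAISS, and skips the cross-encoder rerank; the document texts are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model such as E5
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Refunds are processed within five business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available 24/7 via live chat.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Stand-in for a FAISS vector search; a cross-encoder would rerank here
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# The retrieved chunk becomes context for the generator LLM
context = retrieve("how fast are refunds processed?")[0]
prompt = f"Answer using this context:\n{context}\n\nQ: how fast are refunds processed?"
print(prompt)
```

In production each stand-in is replaced by its real counterpart, but the data flow stays the same: clean and embed the query, search the vector store, rerank, then generate with retrieved context.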

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `base_model` | Pre-trained model for fine-tuning | `bert-base-uncased` |
| `max_seq_length` | Maximum token sequence length | `512` |
| `batch_size` | Training/inference batch size | `32` |
| `learning_rate` | Fine-tuning learning rate | `2e-5` |
| `embedding_model` | Model for text embeddings | `all-MiniLM-L6-v2` |
| `vector_store` | Vector database for search | `FAISS` |
| `language_support` | Target languages | English |

Best Practices

  1. Preprocess text consistently between training and inference. The most common NLP bug is different text preprocessing in training versus production. Apply the same normalization, tokenization, and cleaning steps in both paths. Build a shared preprocessing module and use it everywhere. Even subtle differences like Unicode normalization (NFC vs NFKC) or whitespace handling can degrade model performance.

  2. Start with the smallest model that meets accuracy requirements. DistilBERT runs 2x faster than BERT with 97% of its accuracy on most tasks. For simple classification, even TF-IDF with logistic regression may suffice and runs 100x faster. Benchmark smaller models first and only scale up when there's a measurable accuracy gap that justifies the latency and cost increase.

  3. Use stratified sampling for imbalanced text datasets. NLP datasets often have severe class imbalance (1,000 positive reviews vs 50 negative). Without stratified splits, your test set may not contain enough minority class examples to evaluate properly. Use stratified train/test splits, and consider oversampling, undersampling, or class weights during training to handle imbalance.

  4. Chunk documents thoughtfully for RAG applications. Splitting documents by fixed token count creates chunks that break mid-sentence or mid-paragraph, losing context. Use semantic chunking that respects document structure: split on paragraph boundaries, section headers, or semantic shifts. Overlap chunks by 10-20% to preserve context at boundaries. Test retrieval quality with different chunk sizes.

  5. Evaluate with task-appropriate metrics, not just accuracy. For classification, use F1-score (especially macro-F1 for imbalanced classes). For NER, use entity-level F1 rather than token-level. For generation, use ROUGE and human evaluation. For search, use MRR or NDCG. Accuracy alone hides poor performance on minority classes and edge cases that matter most in production.
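A quick stdlib check makes the NFC/NFKC point in practice 1 concrete; the sample strings are illustrative:

```python
import unicodedata

# Ligatures (U+FB01 "fi") and fullwidth letters (U+FF21 "A") are common in
# scraped or PDF-extracted text. NFC preserves them; NFKC folds them to
# their ASCII equivalents, changing what the tokenizer sees.
for raw in ["\ufb01le", "\uff21pple"]:
    print(repr(unicodedata.normalize("NFC", raw)),
          repr(unicodedata.normalize("NFKC", raw)))
```

If training data went through NFKC but the inference path applies NFC (or nothing), these strings tokenize differently in the two paths.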
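The TF-IDF baseline from practice 2 can be sketched with scikit-learn; the dataset and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented dataset; a real benchmark needs a held-out split
texts = [
    "refund my order", "cancel my subscription",
    "great product, love it", "works perfectly, thanks",
] * 5
labels = ["complaint", "complaint", "praise", "praise"] * 5

# Word + bigram TF-IDF into a linear classifier: a fast, strong baseline
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["please cancel my order"]))
```

Only move to a fine-tuned transformer when this kind of baseline leaves a measurable accuracy gap that justifies the extra latency and cost.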
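For practice 3, a stratified split is one argument away in scikit-learn; the 90/10 toy labels are illustrative:

```python
from sklearn.model_selection import train_test_split

texts = [f"doc {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10          # 90/10 imbalance

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# The 10% minority rate is preserved in both splits
print(sum(y_tr), len(y_tr))  # 8 of 80
print(sum(y_te), len(y_te))  # 2 of 20
```

To add class weights on top, most scikit-learn classifiers accept `class_weight="balanced"`, which reweights the loss inversely to class frequency.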
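Practice 4's paragraph-aware chunking with overlap might look like this minimal sketch; the sizes and overlap fraction are arbitrary defaults, not tuned values:

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: float = 0.15) -> list[str]:
    """Paragraph-aware chunking with fractional overlap at boundaries.

    Packs whole paragraphs into chunks of at most max_chars, then carries
    the tail of each finished chunk into the next one so context survives
    the boundary. A paragraph longer than max_chars becomes its own
    oversized chunk rather than being split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + 2 + len(para) > max_chars:
            chunks.append(current)
            n = int(len(current) * overlap)   # tail carried forward
            current = (current[-n:] + " " + para) if n else para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A production version would count tokens rather than characters and also split on section headers, but the packing-plus-carryover structure is the same.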
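Practice 5 in a few lines: on an imbalanced toy set, a majority-class predictor scores high accuracy but collapses under macro-F1:

```python
from sklearn.metrics import accuracy_score, f1_score

# A degenerate classifier that always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f}  macro-F1={macro:.2f}")
```

Accuracy comes out at 0.95 while macro-F1 falls below 0.5, because the minority class contributes an F1 of zero that macro averaging refuses to hide.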

Common Issues

Fine-tuned model overfits on small datasets. NLP fine-tuning with fewer than 1,000 examples per class is prone to overfitting. Apply regularization: increase dropout (0.2-0.3), use weight decay (0.01), freeze lower transformer layers, and train for fewer epochs with early stopping. Data augmentation techniques—back-translation, synonym replacement, paraphrasing with an LLM—can effectively double small datasets.
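The early-stopping part of that advice is framework-agnostic. A minimal sketch of the mechanism (a hypothetical helper, not part of any library):

```python
class EarlyStopping:
    """Stop training when validation loss stops improving.

    patience:  epochs to wait after the last improvement
    min_delta: minimum decrease in loss that counts as an improvement
    """
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Typical overfitting curve: loss improves, then starts climbing
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}")  # → stopping at epoch 4
        break
```

Training frameworks ship equivalents (e.g. an early-stopping callback), but the patience counter above is the entire idea.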

Multilingual model performs poorly on specific languages. Multilingual models like mBERT allocate capacity unevenly across languages. Low-resource languages get worse representations. Fine-tune on target language data, even a small amount. For critical languages, consider language-specific models (CamemBERT for French, BERTje for Dutch) which outperform multilingual models significantly on their respective languages.

Inference latency exceeds production SLA. Profile the full pipeline: tokenization, model inference, and postprocessing. Apply optimizations in order: batch multiple requests, quantize to INT8 (usually < 1% accuracy loss), export to ONNX Runtime, and use dynamic sequence length instead of padding to max. For BERT-class models, these optimizations together typically achieve 4-8x speedup over naive PyTorch inference.
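The first optimization, request batching, can be sketched independently of any model framework. This is simplified: a production server would also flush a partial batch on a timeout (e.g. a few milliseconds) so a lone request is not stuck waiting:

```python
from typing import Iterable, Iterator

def microbatch(requests: Iterable[str], max_batch: int = 8) -> Iterator[list[str]]:
    """Group incoming requests so one forward pass serves many of them."""
    batch: list[str] = []
    for req in requests:
        batch.append(req)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(microbatch([f"req-{i}" for i in range(20)], max_batch=8))
print([len(b) for b in batches])  # [8, 8, 4]
```

Within each batch, padding to the longest sequence in the batch (dynamic sequence length) rather than to `max_seq_length` is what delivers the padding-related speedup mentioned above.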
