
Comprehensive Data Processing with NeMo Curator

Overview

NeMo Curator is NVIDIA's GPU-accelerated toolkit for preparing high-quality training data for large language models. Data quality is the single most impactful factor in LLM performance, and NeMo Curator provides the tools to transform raw web scrapes into clean, deduplicated, high-quality training corpora at scale. It delivers 16x faster fuzzy deduplication compared to CPU-based alternatives, 40% lower total cost of ownership, and near-linear scaling across GPU nodes.

The toolkit covers the complete data curation pipeline: quality filtering with 30+ heuristic filters, three levels of deduplication (exact, fuzzy via MinHash+LSH, and semantic via embeddings), PII redaction, language identification, toxicity filtering, and classifier-based quality scoring. Beyond text, NeMo Curator supports multi-modal data curation for images (aesthetic scoring, NSFW detection, CLIP embeddings), video (scene detection, clip extraction), and audio (ASR transcription, word error rate filtering).

NVIDIA used NeMo Curator internally to prepare training data for the Nemotron-4 model family, and the toolkit has been used by the open-source community to curate datasets like RedPajama v2 and subsets of The Pile. It is the production-grade choice for teams that need to process terabytes of data on GPU infrastructure.

When to Use

  • Preparing LLM training data from web scrapes like Common Crawl or WARC files
  • Deduplicating large text corpora where CPU-based deduplication is too slow
  • Building data curation pipelines that need to scale to terabytes of data
  • Filtering low-quality, toxic, or NSFW content from training datasets
  • Redacting personally identifiable information (PII) before model training
  • Curating multi-modal datasets combining text, images, video, and audio
  • Benchmarking data quality to determine impact on downstream model performance
  • Running GPU-accelerated NLP processing across a cluster of machines

Quick Start

```bash
# Text curation with CUDA 12 (recommended for GPU users)
uv pip install "nemo-curator[text_cuda12]"

# All modalities (text + image + video + audio)
uv pip install "nemo-curator[all_cuda12]"

# CPU-only installation (slower but no GPU required)
uv pip install "nemo-curator[cpu]"

# Verify installation
python -c "import nemo_curator; print(nemo_curator.__version__)"
```
```python
# Minimal working pipeline: load, filter, deduplicate, save
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modules import ExactDuplicates
import pandas as pd

# Load raw data
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "This is a high-quality document with substantial content that provides "
        "real value to readers and contains enough words to pass filtering.",
        "bad",  # Too short, will be filtered
        "This is a high-quality document with substantial content that provides "
        "real value to readers and contains enough words to pass filtering.",  # Duplicate
        "Another excellent piece of content with diverse vocabulary and "
        "meaningful information that adds to the training corpus.",
        "short",  # Too short
    ],
})
dataset = DocumentDataset(df)

# Step 1: Filter by quality (word count)
dataset = dataset.filter(WordCountFilter(min_words=10, max_words=100000))
# Removes entries 2 and 5 (too short)

# Step 2: Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
# Removes entry 3 (duplicate of entry 1)

# Step 3: Save curated data
deduped.to_parquet("curated_output/")
print(f"Kept {len(deduped)} documents from original {len(df)}")
```

Core Concepts

The Data Curation Pipeline

A production NeMo Curator pipeline follows a consistent multi-stage pattern. Each stage reduces the dataset size while increasing average quality.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    RepeatedParagraphsFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
    SymbolToWordRatioFilter,
)
from nemo_curator.modules import (
    ExactDuplicates,
    FuzzyDuplicates,
    SemanticDuplicates,
)
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.classifiers import QualityClassifier


def build_text_curation_pipeline(dataset: DocumentDataset) -> DocumentDataset:
    """Full text curation pipeline following NVIDIA best practices."""
    # ── Stage 1: Heuristic Quality Filtering ─────────────────────
    # Remove obviously low-quality documents using fast heuristics
    dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
    dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
    dataset = dataset.filter(RepeatedParagraphsFilter(max_repeated_paragraph_fraction=0.3))
    dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
    dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
    dataset = dataset.filter(NonAlphaNumericFilter(max_non_alpha_numeric_ratio=0.5))

    # ── Stage 2: Exact Deduplication ─────────────────────────────
    # Remove byte-identical documents (fast, O(n) with hashing)
    dataset = ExactDuplicates(id_field="id", text_field="text")(dataset)

    # ── Stage 3: Fuzzy Deduplication ─────────────────────────────
    # Remove near-duplicate documents using MinHash + LSH
    # This is where GPU acceleration provides 16x speedup
    dataset = FuzzyDuplicates(
        id_field="id",
        text_field="text",
        num_hashes=260,
        num_buckets=20,
        hash_method="md5",
    )(dataset)

    # ── Stage 4: PII Redaction ───────────────────────────────────
    # Remove personally identifiable information
    pii_redactor = PIIRedactor(
        supported_entities=[
            "EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON",
            "LOCATION", "CREDIT_CARD", "SSN",
        ],
        anonymize_action="replace",
    )
    dataset = Modify(pii_redactor)(dataset)

    # ── Stage 5: Classifier-Based Quality Scoring ────────────────
    # Use trained classifiers for nuanced quality assessment
    quality_clf = QualityClassifier(
        model_path="nvidia/quality-classifier-deberta",
        batch_size=256,
        device="cuda",
    )
    dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

    return dataset
```

Deduplication Strategies

NeMo Curator offers three levels of deduplication, each catching different types of duplicates.

```python
# Level 1: Exact Deduplication
# Catches: identical documents (copy-paste, scrape duplicates)
# Speed: very fast, single-pass hash comparison
from nemo_curator.modules import ExactDuplicates

exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
)

# Level 2: Fuzzy Deduplication (MinHash + LSH)
# Catches: near-identical documents (minor edits, formatting changes)
# Speed: 16x faster on GPU vs CPU for large datasets
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,         # More hashes = more accurate, slower
    num_buckets=20,         # More buckets = more recall, slower
    hash_method="md5",
    jaccard_threshold=0.8,  # Similarity threshold (0-1)
)

# Level 3: Semantic Deduplication
# Catches: paraphrases and semantically identical content
# Speed: slowest (requires embedding computation), highest quality
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.85,  # Cosine similarity threshold
    batch_size=512,
)
```
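To build intuition for the MinHash step inside fuzzy deduplication, here is a stdlib-only toy sketch (not NeMo Curator code; the names `shingles`, `minhash_signature`, and `estimated_jaccard` are illustrative). It shows why near-duplicate documents agree on most signature positions while unrelated documents agree on almost none:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """Per seed, keep the minimum hash value over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a: str, b: str, num_hashes: int = 64) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    sa = minhash_signature(shingles(a), num_hashes)
    sb = minhash_signature(shingles(b), num_hashes)
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "completely unrelated text about gpu accelerated data curation pipelines"

print(estimated_jaccard(doc1, doc2))  # high: near-duplicates share most shingles
print(estimated_jaccard(doc1, doc3))  # near zero: no shared shingles
```

LSH bucketing (the `num_buckets` parameter) then groups documents whose signatures collide, so only candidates within a bucket are compared against `jaccard_threshold`.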

Multi-Modal Curation

```python
# ── Image Curation ───────────────────────────────────
from nemo_curator.image import AestheticFilter, NSFWFilter, CLIPEmbedder

filtered = AestheticFilter(threshold=5.0)(image_dataset)  # LAION aesthetic score
filtered = NSFWFilter(threshold=0.9)(filtered)            # Safety filtering
embeddings = CLIPEmbedder(model="openai/clip-vit-base-patch32")(filtered)

# ── Video Curation ───────────────────────────────────
from nemo_curator.video import SceneDetector, ClipExtractor, InternVideo2Embedder

scenes = SceneDetector(threshold=27.0)(video_dataset)
clips = ClipExtractor(min_duration=2.0, max_duration=10.0)(scenes)
video_embs = InternVideo2Embedder()(clips)

# ── Audio Curation ───────────────────────────────────
from nemo_curator.audio import ASRInference, WERFilter, DurationFilter

transcribed = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")(audio_dataset)
clean_audio = WERFilter(max_wer=0.3)(transcribed)
clean_audio = DurationFilter(min_duration=1.0, max_duration=30.0)(clean_audio)
```

GPU Cluster Setup

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(n_workers=8)  # 8 GPUs
client = get_client(cluster=cluster)

# All subsequent pipeline operations use the GPU cluster automatically
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| min_words | 50 | Minimum word count for document inclusion |
| max_words | 100000 | Maximum word count for document inclusion |
| num_hashes | 260 | Number of MinHash signatures for fuzzy dedup |
| num_buckets | 20 | Number of LSH buckets for fuzzy dedup |
| jaccard_threshold | 0.8 | Similarity threshold for fuzzy duplicate detection |
| semantic_threshold | 0.85 | Cosine similarity threshold for semantic dedup |
| batch_size | 256 | Batch size for classifier inference |
| max_repeated_line_fraction | 0.3 | Maximum fraction of repeated lines allowed |
| max_url_ratio | 0.2 | Maximum ratio of URL characters to total characters |
| pii_entities | all | PII entity types to detect and redact |
| anonymize_action | "replace" | PII handling: "replace", "redact", or "hash" |
| n_workers | 1 | Number of GPU workers in the cluster |

Performance Benchmarks

| Operation | CPU (16 cores) | GPU (8x A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16x |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16x |
| Quality filtering (100GB) | 2 hours | 0.2 hours | 10x |
| Semantic dedup (500GB) | 200 hours | 15 hours | 13x |

Cost comparison (8TB fuzzy deduplication on AWS):

  • CPU-based (c5.18xlarge x 10): $36/hour x 120 hours = $4,320
  • GPU-based (p4d.24xlarge x 2): $65.54/hour x 7.5 hours = $491
  • Savings: 89% ($3,829 saved)
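The savings figure follows directly from the hourly rates; a quick sanity check of the arithmetic:

```python
# Fleet cost = hourly rate for the whole fleet x wall-clock hours
cpu_cost = 36.00 * 120   # c5.18xlarge x 10 fleet, 120 hours
gpu_cost = 65.54 * 7.5   # p4d.24xlarge x 2 fleet, 7.5 hours

savings = cpu_cost - gpu_cost
savings_pct = savings / cpu_cost * 100

print(f"CPU: ${cpu_cost:,.2f}")  # $4,320.00
print(f"GPU: ${gpu_cost:,.2f}")  # $491.55
print(f"Saved: ${savings:,.2f} ({savings_pct:.0f}%)")  # $3,828.45 (89%)
```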

Best Practices

  1. Always run quality filtering before deduplication. Filtering first reduces the dataset size, which makes deduplication faster and cheaper. Processing order matters significantly for pipeline performance.

  2. Use all three deduplication levels in sequence: exact, then fuzzy, then semantic. Each catches different types of duplicates. Exact dedup is cheap and removes the obvious copies. Fuzzy dedup catches near-duplicates. Semantic dedup catches paraphrases. Skipping a level means those duplicates remain in your training data.

  3. Start with CPU installation for development, switch to GPU for production. The CPU installation is simpler to set up and sufficient for testing pipelines on small data samples. Only invest in GPU infrastructure when processing production-scale data.

  4. Monitor data reduction at each pipeline stage. Track what percentage of documents each filter removes. If a filter removes 80% of your data, investigate whether the threshold is too aggressive. If it removes 0%, it might not be needed and you should remove it to save processing time.

  5. Use Parquet format for intermediate outputs, not JSONL. Parquet is columnar, compressed, and much faster to read/write for data processing workloads. JSONL is acceptable for initial ingestion but should be converted to Parquet early in the pipeline.

  6. Test PII redaction on a sample and manually verify results. PII detection is imperfect. Run the redactor on a 1000-document sample and manually check that it catches real PII while not over-redacting (removing non-PII content that happens to look like PII).

  7. Tune fuzzy dedup parameters based on your data characteristics. The default num_hashes=260 and num_buckets=20 work well for web text but may need adjustment for other domains. More hashes increase accuracy at the cost of speed. Test on a representative sample before running on the full dataset.

  8. Use the quality classifier as a final filter, not the primary filter. Classifier inference is expensive (requires GPU). Use cheap heuristic filters first to remove obviously low-quality documents, then apply the classifier to the remaining candidates.
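The per-stage monitoring advice in point 4 can be sketched as a small helper. This is illustrative plain Python, not a NeMo Curator API; in practice the counts would come from something like `len(dataset)` between stages:

```python
def report_stage(name: str, before: int, after: int) -> float:
    """Print how much of the dataset a pipeline stage removed and flag outliers."""
    removed_pct = 100.0 * (before - after) / before if before else 0.0
    print(f"{name}: {before} -> {after} docs ({removed_pct:.1f}% removed)")
    if removed_pct > 80.0:
        print(f"  WARNING: {name} removed most of the data; check its threshold")
    if removed_pct == 0.0:
        print(f"  NOTE: {name} removed nothing; consider dropping it")
    return removed_pct

# Example: document counts observed between pipeline stages
report_stage("word_count_filter", 1_000_000, 870_000)  # 13.0% removed
report_stage("fuzzy_dedup", 870_000, 610_000)          # ~29.9% removed
```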

Troubleshooting

Problem: CUDA out-of-memory errors during fuzzy deduplication. Reduce batch_size and consider increasing n_workers to distribute the workload across more GPUs. For very large datasets, reduce num_hashes from 260 to 128 -- this trades some deduplication accuracy for lower memory consumption.

Problem: Installation fails with CUDA version mismatch. NeMo Curator requires specific CUDA versions. Check your CUDA version with nvidia-smi and install the matching package variant. For CUDA 12.x, use nemo-curator[text_cuda12]. If you have CUDA 11.x, you may need to upgrade your CUDA installation or use the CPU variant.

Problem: Pipeline is slow even on GPU. Ensure you are using the GPU-accelerated code paths by initializing a Dask CUDA cluster with get_client(cluster_type="gpu"). If you just call the modules directly without setting up a GPU client, processing falls back to CPU. Also verify your data is in Parquet format -- reading from JSONL or CSV adds significant I/O overhead.

Problem: Quality classifier rejects too many documents. The default threshold of 0.5 is calibrated for web text. For specialized domains (legal, medical, code), the classifier may score documents lower because they differ from the training distribution. Adjust the threshold downward or fine-tune the classifier on domain-specific data.

Problem: PII redactor is too aggressive, removing non-PII content. Narrow the supported_entities list to only the PII types you care about. For example, the "LOCATION" entity type often matches city names in legitimate context. Remove entity types that produce too many false positives for your data.
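One way to audit false positives before narrowing the entity list is to diff a sample before and after redaction and eyeball every change. The sketch below uses a toy regex email redactor purely as a stand-in (in practice you would call the real PIIRedactor here):

```python
import re

# Toy stand-in for a real PII redactor: handles email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def toy_redact(text: str) -> str:
    return EMAIL_RE.sub("[EMAIL_ADDRESS]", text)

sample = [
    "Contact alice@example.com for details.",
    "No PII in this sentence at all.",
]

# Print only the documents the redactor changed, for manual review
for doc in sample:
    redacted = toy_redact(doc)
    if redacted != doc:
        print(f"CHANGED: {doc!r} -> {redacted!r}")
```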
