
Comprehensive Data Processing with NeMo Curator

Overview

NeMo Curator is NVIDIA's GPU-accelerated toolkit for preparing high-quality training data for large language models. Data quality is the single most impactful factor in LLM performance, and NeMo Curator provides the tools to transform raw web scrapes into clean, deduplicated, high-quality training corpora at scale. It delivers 16x faster fuzzy deduplication compared to CPU-based alternatives, 40% lower total cost of ownership, and near-linear scaling across GPU nodes.

The toolkit covers the complete data curation pipeline: quality filtering with 30+ heuristic filters, three levels of deduplication (exact, fuzzy via MinHash+LSH, and semantic via embeddings), PII redaction, language identification, toxicity filtering, and classifier-based quality scoring. Beyond text, NeMo Curator supports multi-modal data curation for images (aesthetic scoring, NSFW detection, CLIP embeddings), video (scene detection, clip extraction), and audio (ASR transcription, word error rate filtering).

NVIDIA used NeMo Curator internally to prepare training data for the Nemotron-4 model family, and the toolkit has been used by the open-source community to curate datasets like RedPajama v2 and subsets of The Pile. It is the production-grade choice for teams that need to process terabytes of data on GPU infrastructure.

When to Use

  • Preparing LLM training data from web scrapes like Common Crawl or WARC files
  • Deduplicating large text corpora where CPU-based deduplication is too slow
  • Building data curation pipelines that need to scale to terabytes of data
  • Filtering low-quality, toxic, or NSFW content from training datasets
  • Redacting personally identifiable information (PII) before model training
  • Curating multi-modal datasets combining text, images, video, and audio
  • Benchmarking data quality to determine impact on downstream model performance
  • Running GPU-accelerated NLP processing across a cluster of machines

Quick Start

```bash
# Text curation with CUDA 12 (recommended for GPU users)
uv pip install "nemo-curator[text_cuda12]"

# All modalities (text + image + video + audio)
uv pip install "nemo-curator[all_cuda12]"

# CPU-only installation (slower but no GPU required)
uv pip install "nemo-curator[cpu]"

# Verify installation
python -c "import nemo_curator; print(nemo_curator.__version__)"
```
```python
# Minimal working pipeline: load, filter, deduplicate, save
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modules import ExactDuplicates
import pandas as pd

# Load raw data
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "This is a high-quality document with substantial content that provides "
        "real value to readers and contains enough words to pass filtering.",
        "bad",  # Too short, will be filtered
        "This is a high-quality document with substantial content that provides "
        "real value to readers and contains enough words to pass filtering.",  # Duplicate
        "Another excellent piece of content with diverse vocabulary and "
        "meaningful information that adds to the training corpus.",
        "short",  # Too short
    ],
})
dataset = DocumentDataset(df)

# Step 1: Filter by quality (word count)
dataset = dataset.filter(WordCountFilter(min_words=10, max_words=100000))
# Removes entries 2 and 5 (too short)

# Step 2: Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
# Removes entry 3 (duplicate of entry 1)

# Step 3: Save curated data
deduped.to_parquet("curated_output/")
print(f"Kept {len(deduped)} documents from original {len(df)}")
```

Core Concepts

The Data Curation Pipeline

A production NeMo Curator pipeline follows a consistent multi-stage pattern. Each stage reduces the dataset size while increasing average quality.

```python
from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    RepeatedParagraphsFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
    SymbolToWordRatioFilter,
)
from nemo_curator.modules import (
    ExactDuplicates,
    FuzzyDuplicates,
    SemanticDuplicates,
)
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.classifiers import QualityClassifier


def build_text_curation_pipeline(dataset: DocumentDataset) -> DocumentDataset:
    """Full text curation pipeline following NVIDIA best practices."""
    # ── Stage 1: Heuristic Quality Filtering ─────────────────────
    # Remove obviously low-quality documents using fast heuristics
    dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
    dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
    dataset = dataset.filter(RepeatedParagraphsFilter(max_repeated_paragraph_fraction=0.3))
    dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
    dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
    dataset = dataset.filter(NonAlphaNumericFilter(max_non_alpha_numeric_ratio=0.5))

    # ── Stage 2: Exact Deduplication ─────────────────────────────
    # Remove byte-identical documents (fast, O(n) with hashing)
    dataset = ExactDuplicates(id_field="id", text_field="text")(dataset)

    # ── Stage 3: Fuzzy Deduplication ─────────────────────────────
    # Remove near-duplicate documents using MinHash + LSH
    # This is where GPU acceleration provides 16x speedup
    dataset = FuzzyDuplicates(
        id_field="id",
        text_field="text",
        num_hashes=260,
        num_buckets=20,
        hash_method="md5",
    )(dataset)

    # ── Stage 4: PII Redaction ───────────────────────────────────
    # Remove personally identifiable information
    pii_redactor = PIIRedactor(
        supported_entities=[
            "EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON",
            "LOCATION", "CREDIT_CARD", "SSN",
        ],
        anonymize_action="replace",
    )
    dataset = Modify(pii_redactor)(dataset)

    # ── Stage 5: Classifier-Based Quality Scoring ────────────────
    # Use trained classifiers for nuanced quality assessment
    quality_clf = QualityClassifier(
        model_path="nvidia/quality-classifier-deberta",
        batch_size=256,
        device="cuda",
    )
    dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

    return dataset
```

Deduplication Strategies

NeMo Curator offers three levels of deduplication, each catching different types of duplicates.

```python
# Level 1: Exact Deduplication
# Catches: identical documents (copy-paste, scrape duplicates)
# Speed: very fast, single-pass hash comparison
from nemo_curator.modules import ExactDuplicates

exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
)

# Level 2: Fuzzy Deduplication (MinHash + LSH)
# Catches: near-identical documents (minor edits, formatting changes)
# Speed: 16x faster on GPU vs CPU for large datasets
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,         # More hashes = more accurate, slower
    num_buckets=20,         # More buckets = more recall, slower
    hash_method="md5",
    jaccard_threshold=0.8,  # Similarity threshold (0-1)
)

# Level 3: Semantic Deduplication
# Catches: paraphrases and semantically identical content
# Speed: slowest (requires embedding computation), highest quality
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.85,  # Cosine similarity threshold
    batch_size=512,
)
```
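To build intuition for the MinHash step inside fuzzy deduplication, here is a stdlib-only toy sketch (not NeMo Curator code; the names `shingles`, `minhash_signature`, and `estimated_jaccard` are illustrative). It shows why near-duplicate documents agree on most signature positions while unrelated documents agree on almost none:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """Per seed, keep the minimum hash value over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a: str, b: str, num_hashes: int = 64) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    sa = minhash_signature(shingles(a), num_hashes)
    sb = minhash_signature(shingles(b), num_hashes)
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "completely unrelated text about gpu accelerated data curation pipelines"

print(estimated_jaccard(doc1, doc2))  # high: near-duplicates share most shingles
print(estimated_jaccard(doc1, doc3))  # near zero: no shared shingles
```

LSH bucketing (the `num_buckets` parameter) then groups documents whose signatures collide, so only candidates within a bucket are compared against `jaccard_threshold`.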

Multi-Modal Curation

```python
# ── Image Curation ───────────────────────────────────
from nemo_curator.image import AestheticFilter, NSFWFilter, CLIPEmbedder

filtered = AestheticFilter(threshold=5.0)(image_dataset)  # LAION aesthetic score
filtered = NSFWFilter(threshold=0.9)(filtered)            # Safety filtering
embeddings = CLIPEmbedder(model="openai/clip-vit-base-patch32")(filtered)

# ── Video Curation ───────────────────────────────────
from nemo_curator.video import SceneDetector, ClipExtractor, InternVideo2Embedder

scenes = SceneDetector(threshold=27.0)(video_dataset)
clips = ClipExtractor(min_duration=2.0, max_duration=10.0)(scenes)
video_embs = InternVideo2Embedder()(clips)

# ── Audio Curation ───────────────────────────────────
from nemo_curator.audio import ASRInference, WERFilter, DurationFilter

transcribed = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")(audio_dataset)
clean_audio = WERFilter(max_wer=0.3)(transcribed)
clean_audio = DurationFilter(min_duration=1.0, max_duration=30.0)(clean_audio)
```

GPU Cluster Setup

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(n_workers=8)  # 8 GPUs
client = get_client(cluster=cluster)

# All subsequent pipeline operations use the GPU cluster automatically
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| min_words | 50 | Minimum word count for document inclusion |
| max_words | 100000 | Maximum word count for document inclusion |
| num_hashes | 260 | Number of MinHash signatures for fuzzy dedup |
| num_buckets | 20 | Number of LSH buckets for fuzzy dedup |
| jaccard_threshold | 0.8 | Similarity threshold for fuzzy duplicate detection |
| semantic_threshold | 0.85 | Cosine similarity threshold for semantic dedup |
| batch_size | 256 | Batch size for classifier inference |
| max_repeated_line_fraction | 0.3 | Maximum fraction of repeated lines allowed |
| max_url_ratio | 0.2 | Maximum ratio of URL characters to total characters |
| pii_entities | all | PII entity types to detect and redact |
| anonymize_action | "replace" | PII handling: "replace", "redact", or "hash" |
| n_workers | 1 | Number of GPU workers in the cluster |

Performance Benchmarks

| Operation | CPU (16 cores) | GPU (8x A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16x |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16x |
| Quality filtering (100GB) | 2 hours | 0.2 hours | 10x |
| Semantic dedup (500GB) | 200 hours | 15 hours | 13x |

Cost comparison (8TB fuzzy deduplication on AWS):

  • CPU-based (c5.18xlarge x 10): $36/hour x 120 hours = $4,320
  • GPU-based (p4d.24xlarge x 2): $65.54/hour x 7.5 hours = $491
  • Savings: 89% ($3,829 saved)
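The savings figure follows directly from the hourly rates; a quick sanity check of the arithmetic:

```python
# Fleet cost = hourly rate for the whole fleet x wall-clock hours
cpu_cost = 36.00 * 120   # c5.18xlarge x 10 fleet, 120 hours
gpu_cost = 65.54 * 7.5   # p4d.24xlarge x 2 fleet, 7.5 hours

savings = cpu_cost - gpu_cost
savings_pct = savings / cpu_cost * 100

print(f"CPU: ${cpu_cost:,.2f}")  # $4,320.00
print(f"GPU: ${gpu_cost:,.2f}")  # $491.55
print(f"Saved: ${savings:,.2f} ({savings_pct:.0f}%)")  # $3,828.45 (89%)
```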

Best Practices

  1. Always run quality filtering before deduplication. Filtering first reduces the dataset size, which makes deduplication faster and cheaper. Processing order matters significantly for pipeline performance.

  2. Use all three deduplication levels in sequence: exact, then fuzzy, then semantic. Each catches different types of duplicates. Exact dedup is cheap and removes the obvious copies. Fuzzy dedup catches near-duplicates. Semantic dedup catches paraphrases. Skipping a level means those duplicates remain in your training data.

  3. Start with CPU installation for development, switch to GPU for production. The CPU installation is simpler to set up and sufficient for testing pipelines on small data samples. Only invest in GPU infrastructure when processing production-scale data.

  4. Monitor data reduction at each pipeline stage. Track what percentage of documents each filter removes. If a filter removes 80% of your data, investigate whether the threshold is too aggressive. If it removes 0%, it might not be needed and you should remove it to save processing time.

  5. Use Parquet format for intermediate outputs, not JSONL. Parquet is columnar, compressed, and much faster to read/write for data processing workloads. JSONL is acceptable for initial ingestion but should be converted to Parquet early in the pipeline.

  6. Test PII redaction on a sample and manually verify results. PII detection is imperfect. Run the redactor on a 1000-document sample and manually check that it catches real PII while not over-redacting (removing non-PII content that happens to look like PII).

  7. Tune fuzzy dedup parameters based on your data characteristics. The default num_hashes=260 and num_buckets=20 work well for web text but may need adjustment for other domains. More hashes increase accuracy at the cost of speed. Test on a representative sample before running on the full dataset.

  8. Use the quality classifier as a final filter, not the primary filter. Classifier inference is expensive (requires GPU). Use cheap heuristic filters first to remove obviously low-quality documents, then apply the classifier to the remaining candidates.
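The per-stage monitoring advice in point 4 can be sketched as a small helper. This is illustrative plain Python, not a NeMo Curator API; in practice the counts would come from something like `len(dataset)` between stages:

```python
def report_stage(name: str, before: int, after: int) -> float:
    """Print how much of the dataset a pipeline stage removed and flag outliers."""
    removed_pct = 100.0 * (before - after) / before if before else 0.0
    print(f"{name}: {before} -> {after} docs ({removed_pct:.1f}% removed)")
    if removed_pct > 80.0:
        print(f"  WARNING: {name} removed most of the data; check its threshold")
    if removed_pct == 0.0:
        print(f"  NOTE: {name} removed nothing; consider dropping it")
    return removed_pct

# Example: document counts observed between pipeline stages
report_stage("word_count_filter", 1_000_000, 870_000)  # 13.0% removed
report_stage("fuzzy_dedup", 870_000, 610_000)          # ~29.9% removed
```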

Troubleshooting

Problem: CUDA out-of-memory errors during fuzzy deduplication. Reduce batch_size and consider increasing n_workers to distribute the workload across more GPUs. For very large datasets, reduce num_hashes from 260 to 128 -- this trades some deduplication accuracy for lower memory consumption.

Problem: Installation fails with CUDA version mismatch. NeMo Curator requires specific CUDA versions. Check your CUDA version with nvidia-smi and install the matching package variant. For CUDA 12.x, use nemo-curator[text_cuda12]. If you have CUDA 11.x, you may need to upgrade your CUDA installation or use the CPU variant.

Problem: Pipeline is slow even on GPU. Ensure you are using the GPU-accelerated code paths by initializing a Dask CUDA cluster with get_client(cluster_type="gpu"). If you just call the modules directly without setting up a GPU client, processing falls back to CPU. Also verify your data is in Parquet format -- reading from JSONL or CSV adds significant I/O overhead.

Problem: Quality classifier rejects too many documents. The default threshold of 0.5 is calibrated for web text. For specialized domains (legal, medical, code), the classifier may score documents lower because they differ from the training distribution. Adjust the threshold downward or fine-tune the classifier on domain-specific data.

Problem: PII redactor is too aggressive, removing non-PII content. Narrow the supported_entities list to only the PII types you care about. For example, the "LOCATION" entity type often matches city names in legitimate context. Remove entity types that produce too many false positives for your data.
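One way to audit false positives before narrowing the entity list is to diff a sample before and after redaction and eyeball every change. The sketch below uses a toy regex email redactor purely as a stand-in (in practice you would call the real PIIRedactor here):

```python
import re

# Toy stand-in for a real PII redactor: handles email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def toy_redact(text: str) -> str:
    return EMAIL_RE.sub("[EMAIL_ADDRESS]", text)

sample = [
    "Contact alice@example.com for details.",
    "No PII in this sentence at all.",
]

# Print only the documents the redactor changed, for manual review
for doc in sample:
    redacted = toy_redact(doc)
    if redacted != doc:
        print(f"CHANGED: {doc!r} -> {redacted!r}")
```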
