Advanced Tokenization with SentencePiece
Language-independent tokenization library that works directly on raw text without language-specific preprocessing — ideal for multilingual and CJK language models.
When to Use
Choose SentencePiece when:
- Building multilingual models (no language-specific rules needed)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need unsupervised tokenization from raw text
- Models that use SentencePiece natively (T5, ALBERT, XLNet, mBART)
Consider alternatives when:
- Building English-only models with HuggingFace → HuggingFace Tokenizers
- Need OpenAI-compatible tokenization → tiktoken
- BPE with byte-level fallback → HuggingFace ByteLevelBPE
Quick Start
Installation
```bash
pip install sentencepiece
```
Train a SentencePiece Model
```python
import sentencepiece as spm

# Train a BPE model
spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="my_tokenizer",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # for CJK, use 0.9995
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
# Creates my_tokenizer.model and my_tokenizer.vocab
```
Use the Model
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Encode
text = "Hello, world! This is a test."
token_ids = sp.encode(text, out_type=int)
tokens = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {tokens}")

# Decode
decoded = sp.decode(token_ids)
print(f"Decoded: {decoded}")
```
Core Concepts
Model Types
| Type | Description | Best For |
|---|---|---|
| bpe | Byte Pair Encoding | General purpose, most common |
| unigram | Unigram language model | Better subword sampling |
| char | Character-level | Very small vocab, CJK |
| word | Word-level | Pre-tokenized text |
Character Coverage
```python
import sentencepiece as spm

# For Latin/Cyrillic languages
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # default, covers most characters
)

# For CJK (Chinese, Japanese, Korean)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # must stay high for CJK
    byte_fallback=True,         # fall back to byte tokens for rare characters
)
```
Subword Regularization
```python
import sentencepiece as spm

# Train with the unigram model (required for subword regularization)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    model_type="unigram",
    vocab_size=32000,
)

# Inference with sampling (data augmentation)
sp = spm.SentencePieceProcessor()
sp.load("model.model")

# Multiple segmentations of the same text
for _ in range(5):
    tokens = sp.encode("machine learning", out_type=str,
                       enable_sampling=True, alpha=0.1)
    print(tokens)
# ["▁machine", "▁learn", "ing"]
# ["▁ma", "chine", "▁learning"]
# ["▁machine", "▁learning"]
```
Configuration
| Parameter | Default | Description |
|---|---|---|
| vocab_size | 8000 | Target vocabulary size |
| model_type | "unigram" | Algorithm (bpe, unigram, char, word) |
| character_coverage | 0.9995 | Character coverage rate |
| byte_fallback | False | Use byte tokens for rare characters |
| split_digits | False | Split all digits into individual tokens |
| max_sentence_length | 4192 | Max input sentence length (bytes) |
| num_threads | 16 | Training threads |
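For reference, a single training call that exercises most of these options might look like the sketch below; the corpus path, model prefix, and vocabulary size are illustrative, not prescribed values.

```python
import sentencepiece as spm

# Illustrative values; tune vocab_size and coverage for your corpus.
spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="multilingual_tok",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,
    byte_fallback=True,
    split_digits=True,
    max_sentence_length=4192,
    num_threads=16,
)
```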
Best Practices
- Use character_coverage=0.9995 for multilingual and CJK models
- Enable byte_fallback to handle any Unicode character without UNK tokens
- Set split_digits=True for math-heavy domains so each digit is a separate token
- Use unigram model type for subword regularization (data augmentation during training)
- Train on diverse data — include all languages and domains your model will encounter
- Match vocab_size to model size — 32K for 7B models, 64K+ for larger multilingual models
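After training, a quick sanity check along these lines can confirm the vocabulary size and that encoding round-trips on representative text. This is a sketch: the model path and sample strings are assumptions, and NFKC normalization can cause benign mismatches.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")
print("vocab size:", sp.get_piece_size())

# Representative samples from the domains the model should cover
samples = ["Hello, world!", "機械学習は楽しい。", "Price: 42.50"]
for text in samples:
    ids = sp.encode(text, out_type=int)
    roundtrip = sp.decode(ids)
    status = "OK" if roundtrip == text else "MISMATCH (check coverage/normalization)"
    print(f"{len(ids):>3} tokens | {status} | {text}")
```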
Common Issues
UNK tokens in output:
Increase character_coverage or enable byte_fallback, and make sure the training corpus includes all expected character sets.
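One way to measure how often this happens is to count UNK ids on held-out text, as in this sketch (heldout.txt is a hypothetical file):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

unk_id = sp.unk_id()
total = unks = 0
with open("heldout.txt", encoding="utf-8") as f:
    for line in f:
        ids = sp.encode(line.strip(), out_type=int)
        total += len(ids)
        unks += sum(1 for i in ids if i == unk_id)
print(f"UNK rate: {unks / max(total, 1):.4%}")
```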
Training too slow:
Reduce the input corpus size for initial experiments, increase num_threads, or use input_sentence_size to cap the number of sentences used for training.
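For example, sampling a capped number of sentences is often enough to speed up experiments; in this sketch the one-million-sentence cap is illustrative.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    vocab_size=32000,
    input_sentence_size=1_000_000,   # cap the sentences used for training
    shuffle_input_sentence=True,     # draw that cap at random from the corpus
    num_threads=16,
)
```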
Poor tokenization of domain terms:
Add domain-specific text to the training corpus, or use user_defined_symbols to force specific terms to be kept as single tokens.
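A minimal sketch of forcing domain terms into the vocabulary; the symbol list here is purely illustrative.

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    vocab_size=32000,
    # Each entry is kept as a single, never-split token
    user_defined_symbols=["<mask>", "SentencePiece", "mRNA"],
)
```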