Battle-tested skill for language-independent tokenization with SentencePiece. Includes structured workflows, validation checks, and reusable patterns for AI research.


Advanced Tokenization with SentencePiece

Language-independent tokenization library that works directly on raw text without language-specific preprocessing — ideal for multilingual and CJK language models.

When to Use

Choose SentencePiece when:

  • Building multilingual models (no language-specific rules needed)
  • Working with CJK languages (Chinese, Japanese, Korean)
  • Learning unsupervised tokenization directly from raw text
  • Targeting models that use SentencePiece natively (T5, ALBERT, XLNet, mBART)

Consider alternatives when:

  • Building English-only models in the HuggingFace ecosystem → HuggingFace Tokenizers
  • Matching OpenAI-compatible tokenization → tiktoken
  • Using BPE with byte-level fallback → HuggingFace ByteLevelBPE

Quick Start

Installation

pip install sentencepiece

Train a SentencePiece Model

import sentencepiece as spm

# Train a BPE model
spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="my_tokenizer",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # For CJK, use 0.9995
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
# Creates my_tokenizer.model and my_tokenizer.vocab

Use the Model

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Encode
text = "Hello, world! This is a test."
token_ids = sp.encode(text, out_type=int)
tokens = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {tokens}")

# Decode
decoded = sp.decode(token_ids)
print(f"Decoded: {decoded}")
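
A quick validation check after loading: encoding then decoding should reproduce the input exactly for text that is already NFKC-normalized and fully covered by the vocabulary. A minimal check in that spirit, with placeholder sample strings:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Round-trip check: decode(encode(x)) should give back x
for text in ["Hello, world!", "This is a test."]:
    assert sp.decode(sp.encode(text, out_type=int)) == text
print("round-trip OK")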

Core Concepts

Model Types

Type     Description             Best For
bpe      Byte Pair Encoding      General purpose, most common
unigram  Unigram language model  Better subword sampling
char     Character-level         Very small vocab, CJK
word     Word-level              Pre-tokenized text

Character Coverage

import sentencepiece as spm

# For Latin/Cyrillic languages
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # Default; covers most characters
)

# For CJK (Chinese, Japanese, Korean)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # Must stay high for large CJK character sets
    byte_fallback=True,         # Handle rare characters as byte tokens
)

Subword Regularization

import sentencepiece as spm

# Train with subword regularization support (unigram only)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    model_type="unigram",
    vocab_size=32000,
)

# Inference with sampling (data augmentation)
sp = spm.SentencePieceProcessor()
sp.load("model.model")

# Multiple segmentations of the same text
for _ in range(5):
    tokens = sp.encode("machine learning", out_type=str,
                       enable_sampling=True, alpha=0.1)
    print(tokens)
# ["▁machine", "▁learn", "ing"]
# ["▁ma", "chine", "▁learning"]
# ["▁machine", "▁learning"]
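
In a training pipeline, this is typically used by sampling a fresh segmentation each time an example is read, then switching to deterministic encoding at evaluation time. A minimal sketch; the helper function is illustrative, not part of the SentencePiece API:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("model.model")

def encode_example(line, training=True):
    # Illustrative helper: a new segmentation is sampled on every call
    # during training; evaluation uses the single best segmentation.
    if training:
        # nbest_size=-1 samples from all hypotheses; smaller alpha
        # spreads probability mass and yields more diverse samples.
        return sp.encode(line, out_type=int, enable_sampling=True,
                         alpha=0.1, nbest_size=-1)
    return sp.encode(line, out_type=int)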

Configuration

Parameter            Default    Description
vocab_size           8000       Target vocabulary size
model_type           "unigram"  Algorithm (bpe, unigram, char, word)
character_coverage   0.9995     Fraction of characters covered by the model
byte_fallback        False      Use byte tokens for rare characters
split_digits         False      Split all digits into individual tokens
max_sentence_length  4192       Max input sentence length in bytes
num_threads          16         Number of training threads
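
These settings are baked into the trained .model file and can be read back from a loaded processor, which is a quick way to confirm training picked up the intended configuration. A minimal sketch; the exact pieces in a given vocabulary will vary:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

print(sp.vocab_size())              # e.g. 32000
print(sp.pad_id(), sp.unk_id(),
      sp.bos_id(), sp.eos_id())     # special IDs set at training time
print(sp.id_to_piece(sp.unk_id()))  # "<unk>"
print(sp.piece_to_id("▁the"))       # returns unk_id if the piece is absent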

Best Practices

  1. Use character_coverage=0.9995 for multilingual and CJK models
  2. Enable byte_fallback to handle any Unicode character without UNK tokens
  3. Set split_digits=True for math-heavy domains so each digit is a separate token
  4. Use unigram model type for subword regularization (data augmentation during training)
  5. Train on diverse data — include all languages and domains your model will encounter
  6. Match vocab_size to model size — 32K for 7B models, 64K+ for larger multilingual models (several of these practices are combined in the sketch below)
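
Putting these together, a training call for a multilingual 32K vocabulary might look like the following; the corpus path and sizes are placeholders to adapt:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/multilingual_corpus.txt",  # placeholder: diverse, multi-domain corpus
    model_prefix="multilingual_32k",
    model_type="unigram",        # enables subword regularization later
    vocab_size=32000,            # ~7B-scale model; 64K+ for larger multilingual models
    character_coverage=0.9995,   # practice 1: multilingual/CJK coverage
    byte_fallback=True,          # practice 2: no UNKs for unseen characters
    split_digits=True,           # practice 3: one token per digit
)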

Common Issues

UNK tokens in output: Increase character_coverage or enable byte_fallback. Ensure training corpus includes all expected character sets.
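
Before retraining, it can help to confirm that UNKs are actually appearing by scanning encoded IDs for the processor's UNK ID, e.g.:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Count UNK tokens produced for a sample string
ids = sp.encode("sample text with a rare character: ☃", out_type=int)
n_unk = ids.count(sp.unk_id())
print(f"{n_unk} UNK token(s) out of {len(ids)}")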

Training too slow: Reduce input corpus size for initial experiments. Increase num_threads. Use input_sentence_size to limit training sentences.
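
For instance, capping training at a random sample of one million sentences; both are standard trainer options:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/large_corpus.txt",  # placeholder path
    model_prefix="experiment",
    vocab_size=32000,
    input_sentence_size=1000000,    # use at most 1M sentences
    shuffle_input_sentence=True,    # sample randomly instead of taking the head
    num_threads=32,
)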

Poor tokenization of domain terms: Add domain-specific text to training corpus. Use user_defined_symbols to force specific terms as single tokens.
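
For example, pinning a few domain terms so they always survive as single tokens; the terms and corpus path here are placeholders:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/domain_corpus.txt",         # placeholder path
    model_prefix="domain_tokenizer",
    vocab_size=32000,
    user_defined_symbols=["EGFR", "mRNA"],  # always kept as single pieces
)

# Verify: each term should come back as one piece
sp = spm.SentencePieceProcessor()
sp.load("domain_tokenizer.model")
print(sp.encode("EGFR and mRNA levels", out_type=str))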
