Battle-tested skill for language-independent tokenization with SentencePiece. Includes structured workflows, validation checks, and reusable patterns for AI research.


Advanced Tokenization with SentencePiece

Language-independent tokenization library that works directly on raw text without language-specific preprocessing — ideal for multilingual and CJK language models.

When to Use

Choose SentencePiece when:

  • Building multilingual models (no language-specific rules needed)
  • Working with CJK languages (Chinese, Japanese, Korean)
  • Learning unsupervised tokenization directly from raw text
  • Targeting models that use SentencePiece natively (T5, ALBERT, XLNet, mBART)

Consider alternatives when:

  • Building English-only models in the HuggingFace ecosystem → HuggingFace Tokenizers
  • Matching OpenAI-compatible tokenization → tiktoken
  • Using BPE with byte-level fallback → HuggingFace ByteLevelBPE

Quick Start

Installation

pip install sentencepiece

Train a SentencePiece Model

import sentencepiece as spm

# Train a BPE model
spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="my_tokenizer",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # For CJK, use 0.9995
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)
# Creates my_tokenizer.model and my_tokenizer.vocab

Use the Model

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Encode
text = "Hello, world! This is a test."
token_ids = sp.encode(text, out_type=int)
tokens = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {tokens}")

# Decode
decoded = sp.decode(token_ids)
print(f"Decoded: {decoded}")
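
A quick validation check after loading: encoding then decoding should reproduce the input exactly for text that is already NFKC-normalized and fully covered by the vocabulary. A minimal check in that spirit, with placeholder sample strings:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Round-trip check: decode(encode(x)) should give back x
for text in ["Hello, world!", "This is a test."]:
    assert sp.decode(sp.encode(text, out_type=int)) == text
print("round-trip OK")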

Core Concepts

Model Types

Type     Description             Best For
bpe      Byte Pair Encoding      General purpose, most common
unigram  Unigram language model  Better subword sampling
char     Character-level         Very small vocab, CJK
word     Word-level              Pre-tokenized text

Character Coverage

import sentencepiece as spm

# For Latin/Cyrillic languages
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # Default; covers most characters
)

# For CJK (Chinese, Japanese, Korean)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    character_coverage=0.9995,  # Must stay high for large CJK character sets
    byte_fallback=True,         # Handle rare characters as byte tokens
)

Subword Regularization

import sentencepiece as spm

# Train with subword regularization support (unigram only)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="model",
    model_type="unigram",
    vocab_size=32000,
)

# Inference with sampling (data augmentation)
sp = spm.SentencePieceProcessor()
sp.load("model.model")

# Multiple segmentations of the same text
for _ in range(5):
    tokens = sp.encode("machine learning", out_type=str,
                       enable_sampling=True, alpha=0.1)
    print(tokens)
# ["▁machine", "▁learn", "ing"]
# ["▁ma", "chine", "▁learning"]
# ["▁machine", "▁learning"]
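
In a training pipeline, this is typically used by sampling a fresh segmentation each time an example is read, then switching to deterministic encoding at evaluation time. A minimal sketch; the helper function is illustrative, not part of the SentencePiece API:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("model.model")

def encode_example(line, training=True):
    # Illustrative helper: a new segmentation is sampled on every call
    # during training; evaluation uses the single best segmentation.
    if training:
        # nbest_size=-1 samples from all hypotheses; smaller alpha
        # spreads probability mass and yields more diverse samples.
        return sp.encode(line, out_type=int, enable_sampling=True,
                         alpha=0.1, nbest_size=-1)
    return sp.encode(line, out_type=int)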

Configuration

Parameter            Default    Description
vocab_size           8000       Target vocabulary size
model_type           "unigram"  Algorithm (bpe, unigram, char, word)
character_coverage   0.9995     Fraction of characters covered by the model
byte_fallback        False      Use byte tokens for rare characters
split_digits         False      Split all digits into individual tokens
max_sentence_length  4192       Max input sentence length in bytes
num_threads          16         Number of training threads
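
These settings are baked into the trained .model file and can be read back from a loaded processor, which is a quick way to confirm training picked up the intended configuration. A minimal sketch; the exact pieces in a given vocabulary will vary:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

print(sp.vocab_size())              # e.g. 32000
print(sp.pad_id(), sp.unk_id(),
      sp.bos_id(), sp.eos_id())     # special IDs set at training time
print(sp.id_to_piece(sp.unk_id()))  # "<unk>"
print(sp.piece_to_id("▁the"))       # returns unk_id if the piece is absent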

Best Practices

  1. Use character_coverage=0.9995 for multilingual and CJK models
  2. Enable byte_fallback to handle any Unicode character without UNK tokens
  3. Set split_digits=True for math-heavy domains so each digit is a separate token
  4. Use unigram model type for subword regularization (data augmentation during training)
  5. Train on diverse data — include all languages and domains your model will encounter
  6. Match vocab_size to model size — 32K for 7B models, 64K+ for larger multilingual models (several of these practices are combined in the sketch below)
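
Putting these together, a training call for a multilingual 32K vocabulary might look like the following; the corpus path and sizes are placeholders to adapt:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/multilingual_corpus.txt",  # placeholder: diverse, multi-domain corpus
    model_prefix="multilingual_32k",
    model_type="unigram",        # enables subword regularization later
    vocab_size=32000,            # ~7B-scale model; 64K+ for larger multilingual models
    character_coverage=0.9995,   # practice 1: multilingual/CJK coverage
    byte_fallback=True,          # practice 2: no UNKs for unseen characters
    split_digits=True,           # practice 3: one token per digit
)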

Common Issues

UNK tokens in output: Increase character_coverage or enable byte_fallback. Ensure training corpus includes all expected character sets.
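
Before retraining, it can help to confirm that UNKs are actually appearing by scanning encoded IDs for the processor's UNK ID, e.g.:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("my_tokenizer.model")

# Count UNK tokens produced for a sample string
ids = sp.encode("sample text with a rare character: ☃", out_type=int)
n_unk = ids.count(sp.unk_id())
print(f"{n_unk} UNK token(s) out of {len(ids)}")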

Training too slow: Reduce input corpus size for initial experiments. Increase num_threads. Use input_sentence_size to limit training sentences.
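
For instance, capping training at a random sample of one million sentences; both are standard trainer options:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/large_corpus.txt",  # placeholder path
    model_prefix="experiment",
    vocab_size=32000,
    input_sentence_size=1000000,    # use at most 1M sentences
    shuffle_input_sentence=True,    # sample randomly instead of taking the head
    num_threads=32,
)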

Poor tokenization of domain terms: Add domain-specific text to training corpus. Use user_defined_symbols to force specific terms as single tokens.
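
For example, pinning a few domain terms so they always survive as single tokens; the terms and corpus path here are placeholders:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/domain_corpus.txt",         # placeholder path
    model_prefix="domain_tokenizer",
    vocab_size=32000,
    user_defined_symbols=["EGFR", "mRNA"],  # always kept as single pieces
)

# Verify: each term should come back as one piece
sp = spm.SentencePieceProcessor()
sp.load("domain_tokenizer.model")
print(sp.encode("EGFR and mRNA levels", out_type=str))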
