Introducing Tiny BPE Trainer
Most modern NLP models, from GPT to RoBERTa, rely on subword tokenization via Byte Pair Encoding (BPE). But what if you want to train your own vocabulary in pure C++?
Meet Tiny BPE Trainer - a blazing-fast, header-only BPE trainer written in modern C++17/20, with zero dependencies, full UTF-8 support, and HuggingFace-compatible output (vocab.txt, merges.txt).
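If you are new to BPE, the core training loop is small enough to sketch in a few lines: count every adjacent symbol pair in the corpus, merge the most frequent pair into a new symbol, and repeat until the vocabulary reaches the target size. The toy C++ below illustrates that general algorithm on three words; it is a minimal sketch of the textbook procedure, not Tiny BPE Trainer's actual code, which adds UTF-8 handling, frequency thresholds, and real I/O.

```cpp
// Toy BPE merge loop (illustration of the general algorithm, not the trainer's code).
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Each word starts as a sequence of single-character symbols.
    std::vector<std::vector<std::string>> words = {
        {"l", "o", "w"}, {"l", "o", "w", "e", "r"}, {"l", "o", "w", "e", "s", "t"}};

    for (int step = 0; step < 5; ++step) {
        // 1) Count adjacent symbol pairs across all words.
        std::map<std::pair<std::string, std::string>, int> pair_counts;
        for (const auto& w : words)
            for (std::size_t i = 0; i + 1 < w.size(); ++i)
                ++pair_counts[{w[i], w[i + 1]}];
        if (pair_counts.empty()) break;

        // 2) Pick the most frequent pair.
        auto best = pair_counts.begin();
        for (auto it = pair_counts.begin(); it != pair_counts.end(); ++it)
            if (it->second > best->second) best = it;

        // 3) Merge that pair everywhere, e.g. "l" + "o" -> "lo".
        for (auto& w : words) {
            std::vector<std::string> merged;
            for (std::size_t i = 0; i < w.size(); ++i) {
                if (i + 1 < w.size() && w[i] == best->first.first &&
                    w[i + 1] == best->first.second) {
                    merged.push_back(w[i] + w[i + 1]);
                    ++i;  // skip the second symbol of the merged pair
                } else {
                    merged.push_back(w[i]);
                }
            }
            w = std::move(merged);
        }
        std::cout << "merge " << step + 1 << ": " << best->first.first << " + "
                  << best->first.second << " (count " << best->second << ")\n";
    }
}
```

The order in which pairs get merged is exactly the information a merges.txt file records, and the resulting symbols become the vocabulary.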
Why Another BPE Trainer?
Because existing options are often:
- Python-only, with heavy runtime dependencies (Rust, Protobuf, etc.)
- Not easily embeddable in C++ applications
- Not focused on speed, simplicity, or cross-platform usage
Tiny BPE Trainer is:
- Header-only
- Cross-platform (Linux, Windows, macOS)
- HuggingFace-compatible
- And it pairs perfectly with my other project: Modern Text Tokenizer
Core Features
- Full BPE Training from plain text or JSONL datasets
- CLI and C++ API support – ideal for tooling or embedding
- HuggingFace-Compatible Output (vocab.txt, merges.txt)
- UTF-8 Safe – handles emojis, multilingual scripts, special characters (see the sketch after this list)
- Configurable – lowercase, punctuation splitting, min frequency, etc.
- Demo Mode – test everything with a one-line command
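UTF-8 safety deserves a quick illustration: a byte-oriented splitter will cut multi-byte characters (emoji, CJK, accented letters) in half, so a trainer has to walk code points, not bytes. Here is a minimal sketch of that walk using the standard UTF-8 lead-byte rules; it is generic code, not taken from this project's internals.

```cpp
// Count UTF-8 code points by reading the length encoded in each lead byte.
// Standard UTF-8 rules, not code from Tiny BPE Trainer.
#include <cstddef>
#include <iostream>
#include <string>

std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if      (b < 0x80)         i += 1;  // 0xxxxxxx: ASCII
        else if ((b >> 5) == 0x6)  i += 2;  // 110xxxxx: 2-byte sequence
        else if ((b >> 4) == 0xE)  i += 3;  // 1110xxxx: 3-byte sequence
        else if ((b >> 3) == 0x1E) i += 4;  // 11110xxx: 4-byte sequence
        else                       i += 1;  // invalid lead byte: skip defensively
        ++count;
    }
    return count;
}

int main() {
    std::string s = "héllo 👋";  // source file saved as UTF-8
    std::cout << s.size() << " bytes, " << utf8_length(s) << " code points\n";
    // Prints: 11 bytes, 7 code points
}
```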
Demo
```bash
./Tiny-BPE-Trainer --demo
```
This generates a synthetic corpus, trains a vocabulary, prints stats, and runs tokenization — all in seconds. Great for CI and smoke tests.
How to Train a BPE Tokenizer
Build it
```bash
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
```
Train on a corpus
```bash
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer
```
Or from JSONL:
```bash
./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000
```
Tokenize text
```bash
./Tiny-BPE-Trainer --test "Hello, world! This is a test."
```
Works Seamlessly with Modern Text Tokenizer
Once trained, you can use your custom vocab directly in my Modern Text Tokenizer:
```cpp
TextTokenizer tokenizer;
tokenizer.load_vocab("my_tokenizer_vocab.txt");
auto ids = tokenizer.encode("Hello world!");
```
This gives you an end-to-end C++ tokenizer pipeline with zero runtime dependencies.
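Put together, a complete program is only a few lines. The sketch below wraps the same load_vocab and encode calls in a main(); the include name modern_text_tokenizer.hpp is a placeholder (use whatever header the Modern Text Tokenizer repo actually ships), and it assumes encode returns a sequence of integer token ids.

```cpp
// Minimal end-to-end sketch; the include below is a placeholder name,
// not necessarily the real header shipped by Modern Text Tokenizer.
#include <iostream>
#include "modern_text_tokenizer.hpp"

int main() {
    TextTokenizer tokenizer;
    tokenizer.load_vocab("my_tokenizer_vocab.txt");  // vocab produced by Tiny BPE Trainer

    auto ids = tokenizer.encode("Hello world!");
    for (auto id : ids)               // assumes encode() returns integer ids
        std::cout << id << ' ';
    std::cout << '\n';
    return 0;
}
```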
Use Real Datasets
You can easily generate corpora using HuggingFace datasets:
```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
with open("corpus.txt", "w", encoding="utf-8") as f:
    for x in dataset:
        f.write(x["text"].strip().replace("\n", " ") + "\n")
```
Then train:
```bash
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o imdb_tokenizer
```
Benchmark
```
Processed:     33M characters
Unique words:  106K
Vocab size:    32000
Training time: ~30 mins (Ryzen 9, -O3)
```
Even on large corpus files (IMDB, WikiText), Tiny BPE Trainer performs efficiently and predictably — and generates vocabularies compatible with HuggingFace, SentencePiece, and your own C++ tokenizers.
Use Cases
- Training custom tokenizers for LLMs and transformers
- On-device NLP where Python isn’t available
- Building high-performance preprocessors
- Training domain-specific vocabularies (legal, medical, code)
Try It Now
```bash
git clone https://github.com/Mecanik/Tiny-BPE-Trainer
cd Tiny-BPE-Trainer
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
./Tiny-BPE-Trainer --demo
```
Check out the full README on GitHub for advanced options, CLI flags, and integration tips.
Built with ❤️ for the NLP and C++ community.
If you like it, star it on GitHub , use it in your projects, or contribute!