Introducing Tiny BPE Trainer

Most modern NLP models, from GPT to RoBERTa, rely on subword tokenization with Byte Pair Encoding (BPE). But what if you want to train your own vocabulary in pure C++?

Meet Tiny BPE Trainer - a blazing-fast, header-only BPE trainer written in modern C++17/20, with zero dependencies, full UTF-8 support, and HuggingFace-compatible output (vocab.txt, merges.txt).

Why Another BPE Trainer?

Because existing options are often:

  • Python-only, with heavy dependencies under the hood (Rust, Protobuf, etc.)
  • Not easily embeddable in C++ applications
  • Not focused on speed, simplicity, or cross-platform usage

Tiny BPE Trainer is:

  • Header-only
  • Cross-platform (Linux, Windows, macOS)
  • HuggingFace-compatible
  • And it pairs perfectly with my other project: Modern Text Tokenizer

Core Features

  • Full BPE Training from plain text or JSONL datasets (see the sketch after this list)
  • CLI and C++ API support – ideal for tooling or embedding
  • HuggingFace-Compatible Output (vocab.txt, merges.txt)
  • UTF-8 Safe – handles emojis, multilingual scripts, special characters
  • Configurable – lowercase, punctuation splitting, min frequency, etc.
  • Demo Mode – test everything with a one-line command
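
For readers new to BPE, the training loop itself is conceptually simple: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new token, record that merge, and repeat until the target vocabulary size is reached. The snippet below is a minimal, illustrative C++17 version of that idea; it is not Tiny BPE Trainer's internal code, and the helper names (count_pairs, merge_pair) are made up for this example.

// Minimal BPE training loop, for illustration only (not Tiny BPE Trainer's internals).
// Each word is a sequence of symbols; training repeatedly merges the most
// frequent adjacent pair into a single new symbol.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Word = std::vector<std::string>;
using Pair = std::pair<std::string, std::string>;

// Count how often each adjacent symbol pair occurs, weighted by word frequency.
std::map<Pair, int> count_pairs(const std::map<Word, int>& words) {
    std::map<Pair, int> pairs;
    for (const auto& [word, freq] : words)
        for (std::size_t i = 0; i + 1 < word.size(); ++i)
            pairs[{word[i], word[i + 1]}] += freq;
    return pairs;
}

// Rewrite every word, replacing occurrences of `best` with the concatenated symbol.
std::map<Word, int> merge_pair(const std::map<Word, int>& words, const Pair& best) {
    std::map<Word, int> merged;
    for (const auto& [word, freq] : words) {
        Word out;
        for (std::size_t i = 0; i < word.size(); ++i) {
            if (i + 1 < word.size() && word[i] == best.first && word[i + 1] == best.second) {
                out.push_back(word[i] + word[i + 1]);
                ++i;  // skip the second half of the merged pair
            } else {
                out.push_back(word[i]);
            }
        }
        merged[out] += freq;
    }
    return merged;
}

int main() {
    // Toy corpus: "low" x5 and "lower" x2, pre-split into characters plus an end marker.
    std::map<Word, int> words = {
        {{"l", "o", "w", "</w>"}, 5},
        {{"l", "o", "w", "e", "r", "</w>"}, 2},
    };
    for (int step = 0; step < 3; ++step) {
        auto pairs = count_pairs(words);
        if (pairs.empty()) break;
        auto best = std::max_element(pairs.begin(), pairs.end(),
                                     [](const auto& a, const auto& b) { return a.second < b.second; });
        std::cout << "merge: " << best->first.first << " + " << best->first.second
                  << " (count " << best->second << ")\n";
        words = merge_pair(words, best->first);
    }
}

A production trainer adds UTF-8-aware pre-tokenization, frequency thresholds, and much more efficient pair bookkeeping; the loop above is just the core idea.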

Demo

./Tiny-BPE-Trainer --demo

This generates a synthetic corpus, trains a vocabulary, prints stats, and runs tokenization — all in seconds. Great for CI and smoke tests.

How to Train a BPE Tokenizer

Build it

g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp

Train on a corpus

./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer

Or from JSONL:

./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000

Tokenize text

./Tiny-BPE-Trainer --test "Hello, world! This is a test."

Works Seamlessly with Modern Text Tokenizer

Once trained, you can use your custom vocab directly in my Modern Text Tokenizer:

TextTokenizer tokenizer;
tokenizer.load_vocab("my_tokenizer_vocab.txt");
auto ids = tokenizer.encode("Hello world!");

This gives you a fully C++ tokenizer pipeline, with zero runtime dependencies.
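
For a fuller picture, here is what a complete program might look like. The load_vocab and encode calls are the ones shown above; the header name and the exact return type of encode are assumptions on my part, so check the Modern Text Tokenizer README for the real include path.

// End-to-end sketch: load a vocabulary trained by Tiny BPE Trainer and encode text.
// The include path below is an assumption; adjust it to match Modern Text Tokenizer.
#include <iostream>
#include "text_tokenizer.hpp"  // hypothetical header name, see the project README

int main() {
    TextTokenizer tokenizer;
    // Vocabulary produced by: ./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer
    tokenizer.load_vocab("my_tokenizer_vocab.txt");

    // Assumes encode() returns an iterable container of integer token IDs.
    auto ids = tokenizer.encode("Hello world!");
    for (auto id : ids)
        std::cout << id << ' ';
    std::cout << '\n';
}

Since both projects are header-only, you compile this the same way you build the trainer (g++ -std=c++17 -O3 ...), with nothing extra to link.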

Use Real Datasets

You can easily generate corpora using HuggingFace datasets:

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
with open("corpus.txt", "w", encoding="utf-8") as f:
    for x in dataset:
        f.write(x["text"].strip().replace("\n", " ") + "\n")

Then train:

./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o imdb_tokenizer

Benchmark

Processed: 33M characters
Unique words: 106K
Vocab size: 32000
Training time: ~30 mins (Ryzen 9, -O3)

Even on large corpus files (IMDB, WikiText), Tiny BPE Trainer performs efficiently and predictably — and generates vocabularies compatible with HuggingFace, SentencePiece, and your own C++ tokenizers.
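
Because the output is plain text, it is also easy to inspect or load from any toolchain without extra libraries. The sketch below assumes the usual HuggingFace-style layout (one token per line in vocab.txt; one space-separated pair per line in merges.txt, possibly after a #version header) and an output prefix of my_tokenizer; verify both against the files the trainer actually writes.

// Sketch: load vocab.txt and merges.txt into plain C++ containers.
// File names and line layout are assumptions based on the usual HuggingFace
// text format; check the trainer's actual output to confirm.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // vocab.txt: one token per line, IDs assigned in file order.
    std::unordered_map<std::string, int> vocab;
    std::ifstream vf("my_tokenizer_vocab.txt");
    std::string line;
    int id = 0;
    while (std::getline(vf, line))
        if (!line.empty()) vocab[line] = id++;

    // merges.txt: ordered list of "left right" merge rules.
    std::vector<std::pair<std::string, std::string>> merges;
    std::ifstream mf("my_tokenizer_merges.txt");
    while (std::getline(mf, line)) {
        if (line.empty() || line[0] == '#') continue;  // skip blanks and a possible #version header
        std::istringstream ss(line);
        std::string left, right;
        if (ss >> left >> right) merges.emplace_back(left, right);
    }

    std::cout << "Loaded " << vocab.size() << " tokens and "
              << merges.size() << " merge rules\n";
}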

Use Cases

  • Training custom tokenizers for LLMs and transformers
  • On-device NLP where Python isn’t available
  • Building high-performance preprocessors
  • Training domain-specific vocabularies (legal, medical, code)

Try It Now

git clone https://github.com/Mecanik/Tiny-BPE-Trainer
cd Tiny-BPE-Trainer
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
./Tiny-BPE-Trainer --demo

Check out the full README on GitHub for advanced options, CLI flags, and integration tips:

Tiny BPE Trainer on GitHub

Built with ❤️ for the NLP and C++ community.

If you like it, star it on GitHub, use it in your projects, or contribute!