Introducing Tiny BPE Trainer
Most modern NLP models, from GPT to RoBERTa, rely on subword tokenization via Byte Pair Encoding (BPE). But what if you want to train your own vocabulary in pure C++?
Meet Tiny BPE Trainer - a blazing-fast, header-only BPE trainer written in modern C++17/20, with zero dependencies, full UTF-8 support, and HuggingFace-compatible output (vocab.txt, merges.txt).
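If you are new to BPE, the core training loop is small enough to sketch in a few lines: count every adjacent symbol pair in the corpus, merge the most frequent pair into a new symbol, and repeat until the vocabulary reaches the target size. The toy C++ below illustrates that general algorithm on three words; it is a minimal sketch of the textbook procedure, not Tiny BPE Trainer's actual code, which adds UTF-8 handling, frequency thresholds, and real I/O.

```cpp
// Toy BPE merge loop (illustration of the general algorithm, not the trainer's code).
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Each word starts as a sequence of single-character symbols.
    std::vector<std::vector<std::string>> words = {
        {"l", "o", "w"}, {"l", "o", "w", "e", "r"}, {"l", "o", "w", "e", "s", "t"}};

    for (int step = 0; step < 5; ++step) {
        // 1) Count adjacent symbol pairs across all words.
        std::map<std::pair<std::string, std::string>, int> pair_counts;
        for (const auto& w : words)
            for (std::size_t i = 0; i + 1 < w.size(); ++i)
                ++pair_counts[{w[i], w[i + 1]}];
        if (pair_counts.empty()) break;

        // 2) Pick the most frequent pair.
        auto best = pair_counts.begin();
        for (auto it = pair_counts.begin(); it != pair_counts.end(); ++it)
            if (it->second > best->second) best = it;

        // 3) Merge that pair everywhere, e.g. "l" + "o" -> "lo".
        for (auto& w : words) {
            std::vector<std::string> merged;
            for (std::size_t i = 0; i < w.size(); ++i) {
                if (i + 1 < w.size() && w[i] == best->first.first &&
                    w[i + 1] == best->first.second) {
                    merged.push_back(w[i] + w[i + 1]);
                    ++i;  // skip the second symbol of the merged pair
                } else {
                    merged.push_back(w[i]);
                }
            }
            w = std::move(merged);
        }
        std::cout << "merge " << step + 1 << ": " << best->first.first << " + "
                  << best->first.second << " (count " << best->second << ")\n";
    }
}
```

The order in which pairs get merged is exactly the information a merges.txt file records, and the resulting symbols become the vocabulary.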
Why Another BPE Trainer?
Because existing options are often:
- Python-only, with heavy runtime dependencies (Rust, Protobuf, etc.)
- Not easily embeddable in C++ applications
- Not focused on speed, simplicity, or cross-platform usage
Tiny BPE Trainer is:
- Header-only
- Cross-platform (Linux, Windows, macOS)
- HuggingFace-compatible
- And it pairs perfectly with my other project: Modern Text Tokenizer
Core Features
- Full BPE Training from plain text or JSONL datasets
- CLI and C++ API support – ideal for tooling or embedding
- HuggingFace-Compatible Output (vocab.txt, merges.txt)
- UTF-8 Safe – handles emojis, multilingual scripts, special characters (see the sketch after this list)
- Configurable – lowercase, punctuation splitting, min frequency, etc.
- Demo Mode – test everything with a one-line command
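UTF-8 safety deserves a quick illustration: a byte-oriented splitter will cut multi-byte characters (emoji, CJK, accented letters) in half, so a trainer has to walk code points, not bytes. Here is a minimal sketch of that walk using the standard UTF-8 lead-byte rules; it is generic code, not taken from this project's internals.

```cpp
// Count UTF-8 code points by reading the length encoded in each lead byte.
// Standard UTF-8 rules, not code from Tiny BPE Trainer.
#include <cstddef>
#include <iostream>
#include <string>

std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if      (b < 0x80)         i += 1;  // 0xxxxxxx: ASCII
        else if ((b >> 5) == 0x6)  i += 2;  // 110xxxxx: 2-byte sequence
        else if ((b >> 4) == 0xE)  i += 3;  // 1110xxxx: 3-byte sequence
        else if ((b >> 3) == 0x1E) i += 4;  // 11110xxx: 4-byte sequence
        else                       i += 1;  // invalid lead byte: skip defensively
        ++count;
    }
    return count;
}

int main() {
    std::string s = "héllo 👋";  // source file saved as UTF-8
    std::cout << s.size() << " bytes, " << utf8_length(s) << " code points\n";
    // Prints: 11 bytes, 7 code points
}
```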
Demo
```bash
./Tiny-BPE-Trainer --demo
```
This generates a synthetic corpus, trains a vocabulary, prints stats, and runs tokenization — all in seconds. Great for CI and smoke tests.
How to Train a BPE Tokenizer
Build it
```bash
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
```
Train on a corpus
```bash
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o my_tokenizer
```
Or from JSONL:
```bash
./Tiny-BPE-Trainer -i dataset.jsonl --jsonl -v 32000
```
Tokenize text
```bash
./Tiny-BPE-Trainer --test "Hello, world! This is a test."
```
Works Seamlessly with Modern Text Tokenizer
Once trained, you can use your custom vocab directly in my Modern Text Tokenizer:
```cpp
TextTokenizer tokenizer;
tokenizer.load_vocab("my_tokenizer_vocab.txt");
auto ids = tokenizer.encode("Hello world!");
```
This gives you an end-to-end C++ tokenizer pipeline with zero runtime dependencies.
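Put together, a complete program is only a few lines. The sketch below wraps the same load_vocab and encode calls in a main(); the include name modern_text_tokenizer.hpp is a placeholder (use whatever header the Modern Text Tokenizer repo actually ships), and it assumes encode returns a sequence of integer token ids.

```cpp
// Minimal end-to-end sketch; the include below is a placeholder name,
// not necessarily the real header shipped by Modern Text Tokenizer.
#include <iostream>
#include "modern_text_tokenizer.hpp"

int main() {
    TextTokenizer tokenizer;
    tokenizer.load_vocab("my_tokenizer_vocab.txt");  // vocab produced by Tiny BPE Trainer

    auto ids = tokenizer.encode("Hello world!");
    for (auto id : ids)               // assumes encode() returns integer ids
        std::cout << id << ' ';
    std::cout << '\n';
    return 0;
}
```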
Use Real Datasets
You can easily generate corpora using HuggingFace datasets:
```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
with open("corpus.txt", "w", encoding="utf-8") as f:
    for x in dataset:
        f.write(x["text"].strip().replace("\n", " ") + "\n")
```
Then train:
```bash
./Tiny-BPE-Trainer -i corpus.txt -v 16000 -o imdb_tokenizer
```
Benchmark
```
Processed:     33M characters
Unique words:  106K
Vocab size:    32000
Training time: ~30 mins (Ryzen 9, -O3)
```
Even on large corpus files (IMDB, WikiText), Tiny BPE Trainer performs efficiently and predictably — and generates vocabularies compatible with HuggingFace, SentencePiece, and your own C++ tokenizers.
Use Cases
- Training custom tokenizers for LLMs and transformers
- On-device NLP where Python isn’t available
- Building high-performance preprocessors
- Training domain-specific vocabularies (legal, medical, code)
Try It Now
```bash
git clone https://github.com/Mecanik/Tiny-BPE-Trainer
cd Tiny-BPE-Trainer
g++ -std=c++17 -O3 -o Tiny-BPE-Trainer Tiny-BPE-Trainer.cpp
./Tiny-BPE-Trainer --demo
```
Check out the full README on GitHub for advanced options, CLI flags, and integration tips.
Built with ❤️ for the NLP and C++ community.
If you like it, star it on GitHub , use it in your projects, or contribute!