Introducing Modern Text Tokenizer
Modern natural language processing (NLP) models like BERT, DistilBERT, and other transformer-based architectures rely heavily on effective tokenization. But C++ developers often face limited options: libraries with bloated dependencies, poor Unicode support, or no compatibility with vocab-based encoders.
That’s why I created Modern Text Tokenizer, a fast, header-only C++ tokenizer that’s UTF-8 aware, zero-dependency, and ML-ready out of the box.
What Makes It Unique?
- Zero Dependencies – No Boost, no ICU, no external libs.
- UTF-8 Safe – Correctly handles multilingual text, emojis, and multibyte characters.
- Header-Only – Drop it into your project and go.
- Vocabulary Encoding – Load vocab.txt from HuggingFace and generate token IDs.
- Transformer-Ready – Supports [CLS], [SEP], [PAD], and sequence formatting.
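To make the last two points concrete, here is a minimal, library-independent sketch of what vocabulary encoding and transformer-style sequence formatting involve. It is not Modern Text Tokenizer's implementation: `read_vocab` and `to_input_ids` are hypothetical names, and the `[UNK]` fallback for unknown tokens is an assumption borrowed from BERT-style vocabularies. The one firm fact it relies on is the vocab.txt layout: one token per line, with the zero-based line index as the token ID.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: builds a token -> ID map from vocab.txt-style input,
// where each token's ID is its zero-based line index.
std::unordered_map<std::string, int> read_vocab(std::istream& in) {
    std::unordered_map<std::string, int> vocab;
    std::string line;
    int id = 0;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        vocab.emplace(line, id++);
    }
    return vocab;
}

// Hypothetical helper: wraps tokens as [CLS] ... [SEP], maps them to IDs
// (unknown tokens become [UNK], an assumption), truncates to fit, and
// pads with [PAD] up to max_len.
std::vector<int> to_input_ids(const std::unordered_map<std::string, int>& vocab,
                              const std::vector<std::string>& tokens,
                              std::size_t max_len) {
    assert(max_len >= 2);  // need room for [CLS] and [SEP]
    const int pad = vocab.at("[PAD]");
    const int unk = vocab.at("[UNK]");
    const int cls = vocab.at("[CLS]");
    const int sep = vocab.at("[SEP]");

    std::vector<int> ids;
    ids.reserve(max_len);
    ids.push_back(cls);
    for (const auto& tok : tokens) {
        if (ids.size() + 1 >= max_len) break;  // keep one slot for [SEP]
        const auto it = vocab.find(tok);
        ids.push_back(it != vocab.end() ? it->second : unk);
    }
    ids.push_back(sep);
    ids.resize(max_len, pad);  // right-pad to the fixed length
    return ids;
}
```

With a toy six-line vocabulary (`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `hello`, `world`), the tokens `hello there world` padded to length 8 come out as `2 4 1 5 3 0 0 0`: the unknown word maps to `[UNK]` and the tail is filled with `[PAD]`.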
Key Features
- Fast ASCII vs Unicode branching using std::string_view
- Fluent API for configuration:
    TextTokenizer tokenizer;
    tokenizer
        .set_lowercase(true)
        .set_split_on_punctuation(true)
        .set_keep_punctuation(true);
- Load vocabularies:
    tokenizer.load_vocab("vocab.txt");
- Encode / Decode:
    auto ids = tokenizer.encode("Hello world!");
    std::string decoded = tokenizer.decode(ids);
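The ASCII vs Unicode branching mentioned in the first feature can be illustrated with a short standalone sketch. This is one plausible way to do it, not the library's actual code: the function names are hypothetical, and the slow path only reads UTF-8 lead bytes without validating continuation bytes. The idea is to scan once for any byte at or above 0x80; if there is none, every character is exactly one byte and the multibyte logic can be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Returns true when every byte is below 0x80, i.e. the text is pure ASCII.
bool is_ascii(std::string_view text) {
    for (unsigned char c : text) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Byte length of the UTF-8 sequence introduced by `lead`. Invalid lead
// bytes are reported as length 1 so a scan over bad input still advances.
std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Splits text into per-character views without copying any bytes.
std::vector<std::string_view> split_characters(std::string_view text) {
    std::vector<std::string_view> out;
    if (is_ascii(text)) {
        // Fast path: one byte per character.
        out.reserve(text.size());
        for (std::size_t i = 0; i < text.size(); ++i) {
            out.push_back(text.substr(i, 1));
        }
        return out;
    }
    // Slow path: step by the length encoded in each lead byte.
    for (std::size_t i = 0; i < text.size();) {
        std::size_t n = utf8_length(static_cast<unsigned char>(text[i]));
        assert(n >= 1 && n <= 4);
        if (n > text.size() - i) n = text.size() - i;  // clamp a truncated sequence
        out.push_back(text.substr(i, n));
        i += n;
    }
    return out;
}
```

Because `std::string_view` is only a pointer and a length, both paths hand back slices of the original buffer, which keeps per-character allocation out of the hot loop.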
Performance
    Performance test with 174000 characters

    Results:
      Tokenization: 2159 μs (22000 tokens)
      Encoding: 1900 μs
      Decoding: 430 μs
      Total time: 4.49 ms
      Throughput: 36.97 MB/s
Benchmarked on a Ryzen 9 5900X, compiled with -O3 in release mode.
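For anyone reproducing these numbers, the throughput figure appears to be the input size divided by the sum of the three stage times: 2159 + 1900 + 430 = 4489 μs, and 174000 bytes over 4489 μs is about 36.97 when "MB" is read as 2^20 bytes and each character is assumed to be one byte. The sketch below shows that calculation with `std::chrono`; the helper names are hypothetical and this is not the benchmark harness that produced the results above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Throughput in MiB/s (2^20 bytes per MiB) for `bytes` processed in `elapsed`.
double throughput_mib_per_s(std::size_t bytes,
                            std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0);
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / elapsed.count();
}

// Hypothetical timing wrapper: runs `fn` once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename Fn>
std::chrono::duration<double> time_once(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    return std::chrono::steady_clock::now() - start;
}
```

Calling `throughput_mib_per_s(174000, std::chrono::microseconds(4489))` gives a value just under 37. A single timed run is noisy, so averaging several runs after a warm-up pass is the usual practice.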
How to Use
Add the header file:
    #include "Modern-Text-Tokenizer.hpp"
Then compile:
    g++ -std=c++17 -O3 -o tokenizer_demo main.cpp
Want to use it with BERT or DistilBERT? Just download the vocab file:
    curl -O https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt
Cross-Platform CI Builds
OS | Status |
---|---|
Ubuntu | ✅ |
Windows | ✅ |
GitHub Actions |
Use Cases
- Text pre-processing for ML models in C++
- On-device NLP (no Python overhead)
- High-performance CLI tools
- Embedded systems with no runtime dependencies
Try It Now
Modern Text Tokenizer is live and ready for your projects.
Clone, compile, and tokenize in seconds: Modern Text Tokenizer