Introducing Modern Text Tokenizer

Modern natural language processing (NLP) models like BERT, DistilBERT, and other transformer-based architectures rely heavily on effective tokenization. But C++ developers often face limited options: bloated dependencies, poor Unicode support, or no compatibility with vocabulary-based encoders.

That’s why I created Modern Text Tokenizer, a blazing-fast, header-only C++ tokenizer that’s UTF-8 aware, dependency-free, and ML-ready out of the box.

What Makes It Unique?

  • Zero Dependencies – No Boost, no ICU, no external libs.
  • UTF-8 Safe – Correctly handles multilingual text, emojis, and multibyte characters.
  • Header-Only – Drop it into your project and go.
  • Vocabulary Encoding – Load vocab.txt from HuggingFace and generate token IDs.
  • Transformer-Ready – Supports [CLS], [SEP], [PAD], and sequence formatting.
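
To make the last bullet concrete, here is a minimal sketch of BERT-style sequence formatting, written independently of the library. The format_sequence helper and the SpecialIds struct are hypothetical names, not the library's API, and the special-token IDs are passed in by the caller because they depend on the vocabulary in use.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for special-token IDs (not part of the library).
// The right values depend on your vocab.txt.
struct SpecialIds {
    std::int32_t cls;  // ID of [CLS]
    std::int32_t sep;  // ID of [SEP]
    std::int32_t pad;  // ID of [PAD]
};

// Wraps token IDs as [CLS] tokens... [SEP] and pads with [PAD] up to max_len.
// Tokens that do not fit are truncated so that [SEP] is always present.
std::vector<std::int32_t> format_sequence(const std::vector<std::int32_t>& token_ids,
                                          const SpecialIds& special,
                                          std::size_t max_len) {
    std::vector<std::int32_t> out;
    if (max_len < 2) {
        return out;  // no room for both [CLS] and [SEP]
    }
    out.reserve(max_len);
    out.push_back(special.cls);

    // Leave two slots for [CLS] and [SEP].
    const std::size_t keep = std::min(token_ids.size(), max_len - 2);
    out.insert(out.end(), token_ids.begin(),
               token_ids.begin() + static_cast<std::ptrdiff_t>(keep));

    out.push_back(special.sep);
    out.resize(max_len, special.pad);  // right-pad to a fixed length
    return out;
}
```

Assuming the IDs commonly found in BERT uncased vocabularies ([CLS] = 101, [SEP] = 102, [PAD] = 0), format_sequence({7592, 2088}, SpecialIds{101, 102, 0}, 6) yields 101 7592 2088 102 0 0. Verify those IDs against your own vocab.txt before relying on them.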

Key Features

  • Fast ASCII vs Unicode branching using std::string_view
  • Fluent API for configuration:
    TextTokenizer tokenizer;
    tokenizer
      .set_lowercase(true)
      .set_split_on_punctuation(true)
      .set_keep_punctuation(true);
  • Load vocabularies:
    tokenizer.load_vocab("vocab.txt");
  • Encode / Decode:
    auto ids = tokenizer.encode("Hello world!");
    std::string decoded = tokenizer.decode(ids);
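
The ASCII vs Unicode branching mentioned in the first bullet can be illustrated with a small standalone sketch. This is not the library's internal code: is_ascii, utf8_length, and split_code_points are hypothetical helpers that only show the general idea of taking a cheap byte-per-character path when the input is pure ASCII and falling back to UTF-8 lead-byte decoding otherwise.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// True when every byte is below 0x80, i.e. the text is pure ASCII.
bool is_ascii(std::string_view text) {
    for (unsigned char c : text) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Length in bytes of a UTF-8 sequence, derived from its lead byte.
// Invalid lead bytes count as length 1 so that a scan always advances.
std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  // stray continuation byte or invalid lead byte
}

// Splits text into one view per code point without copying any bytes.
// The returned views point into the caller's buffer.
std::vector<std::string_view> split_code_points(std::string_view text) {
    std::vector<std::string_view> out;
    if (is_ascii(text)) {
        // Fast path: one byte per code point, no decoding needed.
        out.reserve(text.size());
        for (std::size_t i = 0; i < text.size(); ++i) {
            out.push_back(text.substr(i, 1));
        }
        return out;
    }
    // Slow path: walk the text one UTF-8 sequence at a time.
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = utf8_length(static_cast<unsigned char>(text[i]));
        if (i + len > text.size()) len = text.size() - i;  // truncated tail
        out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}
```

Because std::string_view is only a pointer and a length, both paths return slices of the original buffer. Note that this splits on code points, so an emoji built from several code points comes back as several views.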

Performance

Performance test with 174000 characters

Results:
  Tokenization: 2159 μs (22000 tokens)
  Encoding:     1900 μs
  Decoding:     430 μs
  Total time:   4.49 ms
  Throughput:   36.97 MB/s

Benchmarked on a Ryzen 9 5900X, compiled with -O3 (release build).
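
The totals above can be checked with a little arithmetic: 2159 + 1900 + 430 = 4489 μs, which rounds to 4.49 ms. The sketch below reproduces the throughput figure under two assumptions that the output does not state explicitly: each of the 174000 characters is one byte, and "MB" means 1024 × 1024 bytes. The throughput_mib_per_s function is a hypothetical helper, not part of the library.

```cpp
#include <cstdint>

// Assumption: 1 MB in the benchmark output means 1024 * 1024 bytes.
constexpr double kBytesPerMiB = 1024.0 * 1024.0;

// Throughput in MiB/s for `bytes` processed in `micros` microseconds.
double throughput_mib_per_s(std::uint64_t bytes, std::uint64_t micros) {
    const double seconds = static_cast<double>(micros) / 1e6;
    return (static_cast<double>(bytes) / kBytesPerMiB) / seconds;
}
```

Under those assumptions, throughput_mib_per_s(174000, 4489) comes out just under 36.97, which agrees with the reported figure.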

How to Use

Add the header file:

#include "Modern-Text-Tokenizer.hpp"

Then compile:

g++ -std=c++17 -O3 -o tokenizer_demo main.cpp

Want to use it with BERT or DistilBERT? Just download the vocab file:

curl -O https://huggingface.co/distilbert/distilbert-base-uncased/raw/main/vocab.txt

Cross-Platform CI Builds

GitHub Actions CI builds run on:

  • Ubuntu
  • Windows

Use Cases

  • Text pre-processing for ML models in C++
  • On-device NLP (no Python overhead)
  • High-performance CLI tools
  • Embedded systems with no runtime dependencies

Try It Now

Modern Text Tokenizer is live and ready for your projects.

Clone, compile, and tokenize in seconds: Modern Text Tokenizer