Tokenizer

Tools and algorithms for splitting text into tokens for NLP, search engines, and language modeling.

Introducing Tiny BPE TrainerMost modern NLP models today from GPT to RoBERTa, rely on subword tokenization using Byte Pair Encoding (BPE). But what if you want to train your own vocabulary in pure C++? Meet Tiny BPE Trainer - a blazing-fast, header-only BPE trainer written in modern C++17/20, with zero dependencies,...

Introducing Modern Text TokenizerModern natural language processing (NLP) models like BERT, DistilBERT, and other transformer-based architectures rely heavily on effective tokenization. But C++ developers often face limited options like bloated dependencies, poor Unicode support, or lack of compatibility with...