Text Processing

Methods and techniques for analyzing, transforming, and managing text data for applications such as search, NLP, and automation.

Introducing Tiny BPE Trainer
Most modern NLP models, from GPT to RoBERTa, rely on subword tokenization using Byte Pair Encoding (BPE). But what if you want to train your own vocabulary in pure C++? Meet Tiny BPE Trainer, a blazing-fast, header-only BPE trainer written in modern C++17/20 with zero dependencies,...

Introducing Modern Text Tokenizer
Modern natural language processing (NLP) models such as BERT, DistilBERT, and other transformer-based architectures rely heavily on effective tokenization. But C++ developers often face limited options, hampered by bloated dependencies, poor Unicode support, or lack of compatibility with...