Tokenization
Research papers, repositories, and articles about tokenization
Towards Scalable Pre-training of Visual Tokenizers for Generation
Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens better suited to generation, and the models are released on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, so a focused treatment is timely.
Back to Bytes: Revisiting Tokenization Through UTF-8
The authors propose UTF8Tokenizer, which maps each UTF-8 byte directly to a token ID and encodes control signals using old-school ASCII control bytes instead of extra special tokens. This keeps the embedding table tiny (one row per possible byte value), speeds up tokenization, and can be bolted onto existing models to improve convergence without changing how you run them.
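The byte-to-ID idea is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the paper's UTF8Tokenizer API: the function names `encode` and `decode` are hypothetical, and the choice of STX (0x02) and ETX (0x03) as start and end markers is an assumption made for the example.

```python
# Minimal sketch of byte-level tokenization: token ID == UTF-8 byte value.
# BOS/EOS below are illustrative assumptions, not the paper's assignments.
BOS = 0x02  # ASCII STX (start of text), assumed here as a sequence-start marker
EOS = 0x03  # ASCII ETX (end of text), assumed here as a sequence-end marker


def encode(text: str) -> list[int]:
    """Map text to token IDs in the range 0..255, wrapped in control bytes."""
    return [BOS, *text.encode("utf-8"), EOS]


def decode(ids: list[int]) -> str:
    """Strip one leading BOS and one trailing EOS, then decode the bytes."""
    if ids and ids[0] == BOS:
        ids = ids[1:]
    if ids and ids[-1] == EOS:
        ids = ids[:-1]
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    print(encode("hi"))            # [2, 104, 105, 3]
    print(decode(encode("héllo")))  # héllo
```

Because every ID is a byte value, the vocabulary never exceeds 256 entries, and non-ASCII characters simply expand into several tokens ("é" becomes two).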
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
Trains a tokenizer and autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Hits state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Introduces CaTok, a one-dimensional image tokenizer that adapts mean flows so its tokens follow a causal order. Aims at sharper, more stable video and streaming image generation.