Tokenization

Research papers, repositories, and articles about tokenization

Towards Scalable Pre-training of Visual Tokenizers for Generation

Studies how to pre-train visual tokenizers at scale specifically for generative models, rather than piggybacking on CLIP-like encoders. The paper explores architectures and training setups that produce discrete visual tokens that are more generation-friendly, with released models on GitHub. Visual tokenization is increasingly the bottleneck for efficient, high-fidelity image and video generation, which makes a dedicated study of tokenizer pre-training useful.

Jingfeng Yao, Yuda Song

Back to Bytes: Revisiting Tokenization Through UTF-8

The authors propose UTF8Tokenizer, which maps each UTF-8 byte directly to a token ID and encodes control signals with C0 control bytes (the ASCII range 0x00 to 0x1F) instead of extra special tokens. This keeps the embedding table at 256 entries, speeds up tokenization, and can be bolted onto existing models to improve convergence without changing how you run them.

Amit Moryossef, Clara Meister
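
To make the byte-level idea in the UTF8Tokenizer summary concrete, here is a minimal Python sketch. It is not the authors' implementation or API: the function names `encode` and `decode` and the choice of which control bytes mark padding and text boundaries (NUL, STX, ETX) are assumptions made for illustration only.

```python
from typing import List, Optional

# Assumed control-byte assignments (illustrative, not taken from the paper).
PAD = 0x00  # NUL, used here as padding
BOS = 0x02  # STX, used here as start-of-text
EOS = 0x03  # ETX, used here as end-of-text


def encode(text: str, max_len: Optional[int] = None) -> List[int]:
    """Map text to token IDs, where each ID is simply a UTF-8 byte value."""
    ids = [BOS, *text.encode("utf-8"), EOS]
    if max_len is not None:
        if len(ids) > max_len:
            raise ValueError("text does not fit in max_len")
        # Right-pad with the assumed padding byte.
        ids += [PAD] * (max_len - len(ids))
    return ids


def decode(ids: List[int]) -> str:
    """Drop the assumed control bytes and decode the rest as UTF-8.

    Note: text that itself contains these control bytes would lose them here.
    """
    payload = bytes(i for i in ids if i not in (PAD, BOS, EOS))
    return payload.decode("utf-8")


if __name__ == "__main__":
    ids = encode("hé", max_len=8)
    print(ids)
    print(decode(ids))
```

Every ID falls in the range 0 to 255, so the vocabulary never grows beyond 256 entries regardless of language or script. The trade-off is that non-ASCII characters take several tokens each.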

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Trains a tokenizer and an autoregressive image model together, letting generation feedback directly improve the tokenization scheme. Reaches state-of-the-art ImageNet 256×256 scores without guidance. If you build discrete image generators, this supports fusing tokenizer and generator into one training pipeline. ([huggingface.co](https://huggingface.co/papers/2605.00503))

Wenda Chu, Bingliang Zhang

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Introduces CaTok, a one-dimensional causal image tokenizer built on mean flows, so that tokens follow a fixed order over time or space. Aims at sharper, more stable video and streaming image generation.

Yitong Chen, Zuxuan Wu