I’ve been teaching myself Chinese over the last few months while also absorbing the ambient LLM content online. Consequently, I’ve been thinking about how the compounding / stacking of Chinese characters is strikingly similar to tokenization methods in modern LLMs.
LLMs typically process text by first splitting it into pieces (called “tokens”) that are then fed into a transformer model. Every major model has its own slightly different way of tokenizing text, but most are built on top of the Byte Pair Encoding (BPE) algorithm. BPE essentially consists of scanning through a corpus of text to find the most frequently occurring pair of bytes (or tokens) and “merging” them into a new token that economically represents that pair. This merge step is repeated many times to build up a vocabulary of composite tokens that can succinctly represent any new text.
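To make the merge loop concrete, here is a toy sketch in Python (my own illustration, not how any production tokenizer is actually implemented): start from raw UTF-8 bytes, count adjacent pairs, and repeatedly replace the most frequent pair with a newly minted token id.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Toy BPE trainer: start from raw bytes, repeatedly merge the most frequent pair."""
    seq = list(text.encode("utf-8"))   # base "alphabet" is the 256 possible byte values
    merges = {}                        # (left, right) -> new token id
    next_id = 256                      # ids 0..255 are reserved for the raw bytes

    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                      # nothing recurs; no point merging further
        merges[(a, b)] = next_id

        # Replace every occurrence of the pair (a, b) with the new token id.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq, next_id = merged, next_id + 1

    return merges, seq

# With enough merges over a big corpus, frequent words collapse into single ids:
# merges, encoded = bpe_train("the cat sat on the mat " * 100, num_merges=30)
```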
The Chinese script frequently combines characters to form new compounds too. While the six formation methods outlined under 六书 (liùshū, the traditional classification of how characters are built) are quite different from how LLMs do it, the 会意字 (huìyìzì) and 形声字 (xíngshēngzì) methods stood out to me as somewhat similar to BPE.
会意字 involves forming a character by combining two or more meaningful components, with the semantic relationship between them dictating the resulting meaning. For example, 休 (“rest”) = 人 (person) + 木 (tree), i.e., “a person leaning against a tree.” If you squint enough, you can think of this as the combination of two meaningful “tokens” into a larger semantic token.
形声字 involves forming a character by combining a semantic component with a phonetic component. This one resembles tokenization only tenuously; it is more a means of disambiguating between homophones by adding a semantic clue for context. For example, both 河 and 荷 are pronounced hé, but the former has the “water” radical (氵), indicating the word means river, and the latter has “plant” (艹), indicating the word is likely lotus¹. An English equivalent would be disambiguations like bank (financial) vs. bank (river).
In both LLMs and 六书, the result is the ability to create a practically unlimited number of tokens by composing from a finite set of tokens or characters. In the case of Chinese, it’s ~200 components such as 氵 (water), 言 (speech), 心 (heart), and some simple phonetic characters; in the case of BPE, it’s the 256 UTF-8 bytes, some pre-merge special characters (¡, •£, º, ƒ, ®, ´, ∆, etc.), and extensions to support non-Latin scripts.
I initially misunderstood tokens to be more like ligatures in fonts, since the most naive implementations of BPE tend to merge and encode short character sequences. But when you iterate over 50k or 100k merges, entire words or phrases will likely have turned into a single token. This is where tokenization and Chinese character compounding resemble each other the most. It’s slippery when you try to define the nature of a 汉字 “character” — is it a syllable? Not always. Is it a word? Not quite, because a 詞 (“word”) can consist of more than one character. How about a pictogram? A morpheme? There’s a similar elusiveness when trying to pinpoint what a “token” in an LLM really is. Statistical tokenization methods blur the line between semantic roots, words, and phrases in much the same way it was already blurred in Chinese.
For example, the word “for” is common enough in the training data to have its own token in the cl100k_base vocab: 2000. “get” is 456. The two words together could hence be encoded as [2000, 456], but since this pair recurs often enough in the training data, it is merged into a single token: “forget” is 41119. An important thing to note here is that “forget” isn’t exactly a semantic combination of the words “for” and “get” (as opposed to something like “notebook”, which means whatever the words “note” and “book” together mean).
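You can poke at this yourself with OpenAI’s tiktoken library, which ships the cl100k_base vocabulary (the exact ids depend on details like leading spaces and capitalization, so treat the numbers above as illustrative):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Print the ids and the decoded pieces for each string,
# including a leading-space variant to show how ids shift.
for word in ["for", "get", "forget", " forget"]:
    ids = enc.encode(word)
    print(repr(word), ids, [enc.decode([i]) for i in ids])
```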
I don’t know Chinese well enough to give a deep-cut example, but a loose parallel is 息, which means “to rest” or “to catch a breath”. The character is a compound, built by stacking two other characters — 自, historically meaning “nose”, and 心, meaning “heart”/“spirit”. The combination, 息, is not simply “nose-heart” but a deeper semantic compound (with a metaphorical gestalt) that relates to the self and the spirit, or well-being.
It’s worth mentioning that radicals (like 氵) never appear in isolation in contemporary usage, and are always part of another character. In comparison, tokens in LLMs are more atomic; at their core, the base tokens are UTF-8 bytes that can appear on their own. Tokenizers are not without their quirks, though: the gpt-4o tokenizer encodes “notebook” as [“not”, “ebook”] rather than [“note”, “book”].
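If you want to check that quirk for yourself, the same tiktoken library can resolve the gpt-4o encoding:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding
ids = enc.encode("notebook")
print([enc.decode([i]) for i in ids])  # prints the pieces the tokenizer actually uses
```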
The strategies under 六书 evolved over hundreds of years, and though there were likely efficiency gains (as with a stenographer’s shorthand), it seems to me that they were primarily a way to evolve a system that addressed the prevailing divergence between written and oral language traditions at the time.
In contrast, the merging of tokens to create new ones is strictly driven by statistics. BPE is a 1990s technology developed for data compression. Its provenance can be traced back to variable-length encoding (Huffman coding, Lempel-Ziv, etc.). The use of statistical methods to encode and analyze information is itself a Cold War-era technique stemming from Claude Shannon’s work on information theory. BPE became popular in natural language processing in the 2010s and was crucial in machine translation research. LLM tokenizers are simply a logical progression of this².
I’m tempted to compare our current stage of tokenizer systems to the “Warring States” era of Chinese script prior to Qin unification: a period when semantic and phonetic combinations were both normal, and character compounding was starting to be viewed as a matter of efficiency for bureaucrats (revealed in their preference for single-syllable words at the time). LLM tokenizers today are similarly sophisticated but not standardized. OpenAI’s tiktoken tokenizes text differently from SentencePiece or unigram-LM tokenizers, resulting in very different token vocabularies.
We’re certainly not at Qin/Han-era standardization yet (LLMs still perform poorly in certain languages). That said, it is reasonable to expect tokenization standards to materialize soon – potentially building on UTF-8 as the basis for text and achieving better multilingual robustness. The emergence of a “vocab schema” that can support a variety of models with backwards compatibility seems plausible too. One could also look forward to the Buddhist era of LLMs someday, though there’s still a ways to go in terms of integrating loan words as they evolve out of human internet/literary usage (I can’t wait for Gemini to throw the kanji for the loss meme at me someday: :.|:;).
I’m far from an expert in modern language models and Chinese, and so all of this is based on a very loose analogy that starts to fall apart under close inspection. There are good reasons not to take this too seriously just yet.
▪︎
Thanks to Alex Yang and Stephen Koo for helping me get the nuances around Chinese characters and NLP history right, even at the cost of going down some rabbit holes.
¹ It’s interesting that Mandarin has come to collapse a ton of characters into the same pronunciation (诗 (poem) and 师 (teacher) are both shī; 石 (stone), 时 (time), and 食 (food/eat) are all shí). This is in stark contrast to a language like Tamil, where several phonetic sounds are collapsed into the same character (ப can make either the p or b sound, and ட can be t or d; you can only infer which from context). This makes written Tamil records very ambiguous over time, and Tamil transliterations of Sanskrit writings a huge pain to read.
² To nobody’s surprise, China had its own independent history with the information theory of its script! The history of the encoding methods for the Chinese telegraph system is fascinating. You could argue that Pinyin is a lookup system that influenced text-encoding standards like GBK and Big5. I’m currently reading Kingdom of Characters by Jing Tsu, which goes over a lot of this.