trailtoken
Not all tokenizers are equally good at tokenizing every kind of text. For example, most of them are rather inefficient at tokenizing emojis such as 👩‍👦‍👦 👩‍👧‍👦 👩‍👧‍👧 👩‍👩‍👦 👩‍👩‍👧, but the Llama 3 tokenizer is much better at tokenizing non-Latin-script languages such as Chinese, where each character typically takes no more than one token.
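The emoji inefficiency has a simple root cause: a ZWJ family emoji is several Unicode code points joined by zero-width joiners, and each of those code points is 3–4 UTF-8 bytes. A byte-level BPE tokenizer with few emoji merges can therefore spend many tokens on a single visible glyph. A minimal sketch of this byte-count disparity, using only the Python standard library (the sample strings are illustrative, not taken from the playground):

```python
# Compare the UTF-8 byte footprint of one visible "character" in each script.
# A byte-level tokenizer's worst case is one token per byte, so these counts
# bound how expensive each glyph can get without dedicated merges.
samples = {
    "family emoji": "\U0001F469\u200D\U0001F467\u200D\U0001F466",  # woman + ZWJ + girl + ZWJ + boy
    "chinese char": "\u4E2D",  # a single CJK character
    "latin letter": "a",
}

for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} code points, {n_bytes} UTF-8 bytes")
```

The family emoji weighs in at 18 bytes (three 4-byte emoji code points plus two 3-byte joiners), versus 3 bytes for the Chinese character and 1 for the Latin letter, which is why one emoji can cost more tokens than a whole Chinese phrase.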
Built by Augustas Macijauskas and Laurynas Lopata