Are Llama 3 and GPT-4 tokenizers the same?

deep learning
LLMs
tokenization
There seems to be a lot of overlap between the tokenizers of Llama 3 and GPT-4. How similar are they?
Author

Augustas Macijauskas

Published

May 6, 2024

# Autoreload modules
%load_ext autoreload
%autoreload 2
from concurrent.futures import ThreadPoolExecutor
import multiprocessing as mp

import tiktoken
from transformers import AutoTokenizer

# Number of parallel threads (adjust as needed)
NUM_CPUS = mp.cpu_count()
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
gpt_4_tokenizer = AutoTokenizer.from_pretrained("Xenova/gpt-4")

llama_vocab = llama_tokenizer.get_vocab()
gpt_4_vocab = gpt_4_tokenizer.get_vocab()

len(llama_vocab), len(gpt_4_vocab)
(128256, 100263)
# Check whether the two vocabularies are exactly equal, comparing tokens in id order
vocab_1 = [x[0] for x in sorted(llama_vocab.items(), key=lambda x: x[1])]
vocab_2 = [x[0] for x in sorted(gpt_4_vocab.items(), key=lambda x: x[1])]
print(len(vocab_1), len(vocab_2))

for idx, (token_1, token_2) in enumerate(zip(vocab_1, vocab_2)):
    if token_1 != token_2:
        print(f"Token mismatch at {idx}: {token_1} != {token_2}")
128256 100263
Token mismatch at 100256: ĠÙ != <|endoftext|>
Token mismatch at 100257: ا٠ != <|fim_prefix|>
Token mismatch at 100258: าภ != <|fim_middle|>
Token mismatch at 100259: ÑŁ != <|fim_suffix|>
Token mismatch at 100260: ÑŁÑŁ != <|im_start|>
Token mismatch at 100261: Ġภ != <|im_end|>
Token mismatch at 100262: à¹Ģภ != <|endofprompt|>
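
The mismatches are confined to the last seven positions of the GPT-4 vocabulary, which hold control tokens, while the Llama 3 vocabulary continues with ordinary byte-level BPE tokens there (Llama 3 appears to keep its own special tokens much higher up, around id 128000). As a quick sanity check, tiktoken exposes the specials registered for cl100k_base via special_tokens_set; my expectation (not verified here) is that it lists five of the seven mismatched tokens, with <|im_start|> and <|im_end|> presumably being additions in the Hugging Face port.

# Sketch: list the special tokens that cl100k_base itself registers.
# Expected (assumption): <|endoftext|>, <|fim_prefix|>, <|fim_middle|>,
# <|fim_suffix|> and <|endofprompt|>.
cl100k = tiktoken.get_encoding("cl100k_base")
print(sorted(cl100k.special_tokens_set))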

Compare GPT-4 to tiktoken

To sanity-check the Hugging Face GPT-4 tokenizer (Xenova/gpt-4) against OpenAI's own tiktoken implementation (cl100k_base), encode a mix of English, Korean, and Python code with both and compare the resulting token ids.

from datasets import load_dataset

split = "train"
english_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split=split)
korean_dataset = load_dataset("lcw99/wikipedia-korean-20221001", split=split)
code_dataset = load_dataset("code_search_net", "python", split=split, trust_remote_code=True)
code_dataset = code_dataset.rename_column("whole_func_string", "text")  # Use the full function source as the "text" column
print(len(english_dataset), len(korean_dataset), len(code_dataset))
print(len(english_dataset) + len(korean_dataset) + len(code_dataset))

# Take up to n samples from each dataset and concatenate their texts into one list
n = 100000
final_dataset = (
    english_dataset.shuffle(42).select(range(min(n, len(english_dataset))))["text"] +
    korean_dataset.shuffle(42).select(range(min(n, len(korean_dataset))))["text"] +
    code_dataset.shuffle(42).select(range(min(n, len(code_dataset))))["text"]
)
print(f"{len(final_dataset)=}")
1801350 607256 412178
2820784
len(final_dataset)=300000
final_dataset[min(n, len(english_dataset))][:50]  # Spot-check: the first Korean sample in the mix
'왕종린(, 1961년 ~ )은 미국의 중국계 물리학자로 중국 과학원 외국계 원사이다. 해양'
gpt_4_tiktoken_tokenizer = tiktoken.get_encoding("cl100k_base")


def check_tokenizers_worker(test_string):
    # True if the Hugging Face and tiktoken encodings of the string are identical
    hf_output = gpt_4_tokenizer.encode(test_string)
    tiktoken_output = gpt_4_tiktoken_tokenizer.encode(test_string)
    return hf_output == tiktoken_output


with ThreadPoolExecutor(max_workers=NUM_CPUS) as executor:
    # Apply the function to each item in parallel
    results = list(executor.map(check_tokenizers_worker, final_dataset))

all(results)
True

So the conclusions (at least for the data sampled above) seem to be:

  1. The first 100256 entries of the Hugging Face Llama 3 and GPT-4 vocabularies are identical; the two only diverge at the tail, where the GPT-4 vocabulary holds its special tokens (see the sketch below).
  2. Hugging Face’s GPT-4 tokenizer (Xenova/gpt-4) produces exactly the same token ids as tiktoken’s cl100k_base on the sampled English, Korean, and code texts.
  3. So Llama 3 and GPT-4 share most of their vocabulary, even though OpenAI never published details on how their tokenizer was trained.
  4. What’s going on???
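
As a follow-up to conclusion 1, here is a minimal sketch that checks the token-to-id mapping directly rather than relying on the sorted order above. It assumes (consistent with the outputs above) that the regular, non-special tokens occupy ids 0 through 100255 in both vocabularies:

# Sketch: verify that every regular GPT-4 token maps to the same id in the
# Llama 3 vocabulary (assumes regular tokens sit at ids < 100256 in both).
shared_mismatches = {
    token: (gpt_4_id, llama_vocab.get(token))
    for token, gpt_4_id in gpt_4_vocab.items()
    if gpt_4_id < 100256 and llama_vocab.get(token) != gpt_4_id
}
print(len(shared_mismatches))  # expected: 0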