Are Llama 3 and GPT-4 tokenizers the same?
deep learning
LLMs
tokenization
There seems to be a lot of overlap between the tokenizers of Llama 3 and GPT-4. How similar are they?
from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
gpt_4_tokenizer = AutoTokenizer.from_pretrained("Xenova/gpt-4")
llama_vocab = llama_tokenizer.get_vocab()
gpt_4_vocab = gpt_4_tokenizer.get_vocab()
len(llama_vocab), len(gpt_4_vocab)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(128256, 100263)
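Before checking token-by-token, a coarse way to quantify how much two vocabularies of different sizes share is plain set overlap. The sketch below uses toy dictionaries for illustration; with the real tokenizers you would pass the `get_vocab()` outputs instead (the `vocab_overlap` helper is hypothetical, not part of any library).

```python
def vocab_overlap(vocab_a, vocab_b):
    """Jaccard-style overlap between the token sets of two vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy stand-ins for get_vocab() outputs: 3 shared tokens out of 5 total
v1 = {"low": 0, "er": 1, "new": 2, "est": 3}
v2 = {"low": 0, "er": 1, "new": 2, "<|endoftext|>": 3}
print(vocab_overlap(v1, v2))  # → 0.6
```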
# Check the two vocabularies are exactly equal, token by token
vocab_1 = [x[0] for x in sorted(llama_vocab.items(), key=lambda x: x[1])]
vocab_2 = [x[0] for x in sorted(gpt_4_vocab.items(), key=lambda x: x[1])]
print(len(vocab_1), len(vocab_2))
for idx, (token_1, token_2) in enumerate(zip(vocab_1, vocab_2)):
    if token_1 != token_2:
        print(f"Token mismatch at {idx}: {token_1} != {token_2}")
128256 100263
Token mismatch at 100256: ĠÙ != <|endoftext|>
Token mismatch at 100257: ا٠!= <|fim_prefix|>
Token mismatch at 100258: าภ!= <|fim_middle|>
Token mismatch at 100259: ÑŁ != <|fim_suffix|>
Token mismatch at 100260: ÑŁÑŁ != <|im_start|>
Token mismatch at 100261: Ġภ!= <|im_end|>
Token mismatch at 100262: à¹Ģภ!= <|endofprompt|>
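The oddly garbled-looking tokens above (`ĠÙ`, `ÑŁ`, …) are not corruption: byte-level BPE vocabularies are stored through GPT-2's byte-to-unicode mapping, which shifts unprintable bytes into a printable Unicode range (so the space byte `0x20` renders as `Ġ`). A sketch of that mapping, reconstructed from the original GPT-2 tokenizer (not taken from this post's code):

```python
def bytes_to_unicode():
    """GPT-2's byte -> printable-unicode mapping used when dumping
    byte-level BPE vocabularies to text."""
    # Bytes that are already printable map to themselves
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Everything else (control chars, space, ...) is shifted above U+0100
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # → "Ġ", the leading-space marker seen in dumped tokens
```

So a token like `ĠÙ` is a space followed by the byte `0xD9`, the first byte of many Arabic characters in UTF-8.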
Compare GPT-4 to tiktoken
from datasets import load_dataset
split = "train"
english_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split=split)
korean_dataset = load_dataset("lcw99/wikipedia-korean-20221001", split=split)
code_dataset = load_dataset("code_search_net", "python", split=split, trust_remote_code=True)
code_dataset = code_dataset.rename_column("whole_func_string", "text")  # Match the column name used by the other datasets
print(len(english_dataset), len(korean_dataset), len(code_dataset))
print(len(english_dataset) + len(korean_dataset) + len(code_dataset))
n = 100000
final_dataset = (
    english_dataset.shuffle(42).select(range(min(n, len(english_dataset))))["text"] +
    korean_dataset.shuffle(42).select(range(min(n, len(korean_dataset))))["text"] +
    code_dataset.shuffle(42).select(range(min(n, len(code_dataset))))["text"]
)
print(f"{len(final_dataset)=}")
1801350 607256 412178
2820784
len(final_dataset)=300000
'왕종린(, 1961년 ~ )은 미국의 중국계 물리학자로 중국 과학원 외국계 원사이다. 해양'
import tiktoken

gpt_4_tiktoken_tokenizer = tiktoken.get_encoding("cl100k_base")

def check_tokenizers_worker(test_string):
    hf_output = gpt_4_tokenizer.encode(test_string)
    tiktoken_output = gpt_4_tiktoken_tokenizer.encode(test_string)
    return hf_output == tiktoken_output
import os
from concurrent.futures import ThreadPoolExecutor

NUM_CPUS = os.cpu_count()

with ThreadPoolExecutor(max_workers=NUM_CPUS) as executor:
    # Check every document in parallel
    results = list(executor.map(check_tokenizers_worker, final_dataset))
all(results)
True
So the conclusions (at least for the dataset sampled above) seem to be:
- The first 100256 tokens of the Hugging Face implementations of the Llama 3 and GPT-4 tokenizers seem to be the same; the two vocabularies diverge only where GPT-4 places its special tokens.
- Hugging Face's GPT-4 tokenizer is identical to the one from tiktoken, at least for the dataset sampled above.
- So Llama 3 and GPT-4 have very similar vocabularies, even though OpenAI never detailed how they trained their tokenizer.
- What's going on???