Visualising large language model embeddings

Dimensionality reduction techniques reveal interesting structure in the embeddings learnt by large language models.
deep learning
LLMs
visualisation
Author

Augustas Macijauskas

Published

April 4, 2024

In a recent post, I shared a tool developed alongside a dear friend of mine, designed for visualizing the tokenizers of language models. Following that post, I received inquiries about my views on visualizing the embeddings learned by Large Language Models (LLMs) during their pre-training phase. This post aims to serve as a guide detailing one possible way to visualize the high-dimensional embeddings of LLMs, while also delving into the interesting patterns that become apparent through such visualizations. The findings are quite intriguing, so continue reading to discover more!

Imports

Toggle cells below if you want to see what imports are being made.

Code
%load_ext autoreload
%autoreload 2
Code
import re

import plotly.graph_objects as go
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

# Ensures we can render plotly plots with quarto
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
Code
# Put default plotly colors into a variable
import plotly.express as px
DEFAULT_PLOTLY_COLORS = px.colors.qualitative.Plotly

# Put full month names into a constant
import calendar
MONTHS = [month for month in calendar.month_name if month]

COUNTRY_TOKENS = [
    "state", "states", "international", "world", "united", "washington",
    "california", "uk", "america", "american", "british", "australia",
    "australian", "canada", "english", "french", "german", "russian",
    "european", "europe", "france", "germany", "england", "london",
    "york", "japanese", "chinese", "japan", "china", "indian", "india"
]

Dimensionality reduction

As you can see below, the code uses very standard libraries and functions. We begin by loading a model and extracting the tensor containing the embedding vectors:

model_name = "google-bert/bert-base-cased"
model = AutoModel.from_pretrained(model_name)
embedding_vectors = model.embeddings.word_embeddings.weight.data
embedding_vectors.shape
torch.Size([28996, 768])

We also extract the unique tokens in the tokenizer:

tokenizer = AutoTokenizer.from_pretrained(model_name)
vocab = tokenizer.get_vocab()

tokens = sorted(vocab.items(), key=lambda item: item[1])
tokens = [item[0] for item in tokens]
tokens[:5]
['[PAD]', '[unused1]', '[unused2]', '[unused3]', '[unused4]']

The essence of the code is below. We use the t-SNE dimensionality reduction technique to reduce the dimensionality of the embedding vectors to 2.

Alternatively, one could use the UMAP library to reduce the dimensionality of the embedding vectors. It runs much faster for large embedding matrices, but since we are only going to use a subset of the embedding matrix in this post, t-SNE will do just fine. Nevertheless, I will leave the code for the interested reader to try out:

import umap

# NUM_TOKENS_TO_SKIP and NUM_TOKENS_TO_VISUALISE are defined in the next cell
reducer = umap.UMAP()
data_2d = reducer.fit_transform(embedding_vectors[NUM_TOKENS_TO_SKIP:NUM_TOKENS_TO_SKIP+NUM_TOKENS_TO_VISUALISE])

NUM_TOKENS_TO_VISUALISE = 2000
NUM_TOKENS_TO_SKIP = 100
# Performing t-SNE to reduce the dataset to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
data_2d = tsne.fit_transform(
    embedding_vectors[NUM_TOKENS_TO_SKIP:NUM_TOKENS_TO_SKIP+NUM_TOKENS_TO_VISUALISE]
)
data_2d.shape
(2000, 2)

The code to visualise the embeddings is below. The first cell contains mundane code to color the various token groups; toggle the cell if you are interested in seeing it.

Code
relevant_tokens = tokens[NUM_TOKENS_TO_SKIP:NUM_TOKENS_TO_SKIP + NUM_TOKENS_TO_VISUALISE]

# Define default colors
colors = [
    (DEFAULT_PLOTLY_COLORS[0], token) for token in relevant_tokens
]

# Color numbers
colors = [
    (DEFAULT_PLOTLY_COLORS[1] if token.isdigit() else color, token) for color, token in colors
]

# Color tokens that start with ##
colors = [
    (DEFAULT_PLOTLY_COLORS[2] if token.startswith("##") else color, token) for color, token in colors
]

# Color months
colors = [
    (DEFAULT_PLOTLY_COLORS[3] if token in MONTHS else color, token) for color, token in colors
]

# Color special tokens
colors = [
    (DEFAULT_PLOTLY_COLORS[4] if token in tokenizer.special_tokens_map.values() else color, token)
    for color, token in colors
]

# Color single letters
colors = [
    (DEFAULT_PLOTLY_COLORS[5] if re.match(r"^[a-zA-Z]$", token) else color, token) for color, token in colors
]

# Color country tokens
colors = [
    (DEFAULT_PLOTLY_COLORS[6] if token.lower() in COUNTRY_TOKENS or token == "US" else color, token)
    for color, token in colors
]

# Leave just the colors
colors = [color for color, _ in colors]

# Define cluster names
cluster_names = [
    "default", "numbers", "tokens that start with ##", "months",
    "special tokens", "single letters", "countries",
]
unique_colors = DEFAULT_PLOTLY_COLORS[: len(cluster_names)]
fig = go.Figure(
    data=go.Scatter(
        x=data_2d[:, 0],
        y=data_2d[:, 1],
        mode="markers",
        marker=dict(color=colors),
        hovertext=relevant_tokens,
        hoverinfo="text",
        showlegend=False
    )
)

# Create dummy traces to have a nice legend
for name, color in zip(cluster_names, unique_colors):
    fig.add_trace(go.Scatter(
        x=[None], y=[None],
        mode="markers", marker=dict(color=color), name=name
    ))

fig.update_layout(title="Token embeddings", legend=dict(title="Clusters"))

fig.show()

We can see that there is a lot of semantic structure in the learnt embeddings! For instance, I have highlighted the clusters for names of the months, countries/nationalities, numbers, etc. Of course, one could have expected this given the theory behind methods like word2vec, but it is still exciting that this semantic structure is present in the embedding layer even though the model is much deeper and the LLM pre-training objective is completely different from word2vec's!
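
To sanity-check that these clusters reflect genuine semantic similarity and not just an artefact of t-SNE, we can also look at nearest neighbours in the original 768-dimensional space. Below is a minimal sketch using cosine similarity; the query token "France" and the find_nearest helper are just illustrative choices of mine, not part of the code above:

import torch
import torch.nn.functional as F

def find_nearest(query_token, k=5):
    # Cosine similarity between the query embedding and every other embedding
    query_id = tokenizer.convert_tokens_to_ids(query_token)
    similarities = F.cosine_similarity(
        embedding_vectors[query_id].unsqueeze(0), embedding_vectors
    )
    # Skip the first hit, which is the query token itself
    top_ids = similarities.topk(k + 1).indices[1:].tolist()
    return [tokens[i] for i in top_ids]

find_nearest("France")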

Conclusion

I was very pleased to see that, as expected, the learnt embeddings have a nice semantic structure. I encourage the interested reader to explore the ideas in this post in more depth, e.g. try a different pre-trained LLM, try the UMAP dimensionality reduction method instead of t-SNE, or just play around with the number of embedding vectors visualised and try to find more semantic clusters.
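
For example, swapping in a different pre-trained model only takes a couple of lines. The sketch below uses roberta-base purely as an illustration; note that the attribute path to the embedding matrix can differ between architectures, so it may need adjusting:

other_model_name = "FacebookAI/roberta-base"
other_model = AutoModel.from_pretrained(other_model_name)
other_tokenizer = AutoTokenizer.from_pretrained(other_model_name)

# This attribute path works for BERT/RoBERTa-style models; other architectures
# may expose their embedding matrix differently
other_embedding_vectors = other_model.embeddings.word_embeddings.weight.data
other_embedding_vectors.shape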

Thank you for reading! I hope you enjoyed the post and am looking forward to hearing your feedback.

Next steps

I would like to explore the idea of clustering these embeddings in the future to see what clusters emerge and which tokens get grouped together; a rough sketch of a possible starting point is given below. Again, please be encouraged to try it yourself and let me know about the results!
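
The sketch runs k-means from scikit-learn directly on the same subset of the 768-dimensional embedding vectors; the number of clusters is an arbitrary guess and purely illustrative:

from collections import defaultdict

from sklearn.cluster import KMeans

# Cluster the same subset of embeddings in the original 768-dimensional space
kmeans = KMeans(n_clusters=20, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(
    embedding_vectors[NUM_TOKENS_TO_SKIP:NUM_TOKENS_TO_SKIP + NUM_TOKENS_TO_VISUALISE]
)

# Group tokens by cluster and peek at a few of the groups
clusters = defaultdict(list)
for token, label in zip(relevant_tokens, cluster_labels):
    clusters[label].append(token)

for label in sorted(clusters)[:5]:
    print(label, clusters[label][:10])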