Visualising contextualised large language model embeddings with context
deep learning
LLMs
visualisation
Large language models (predictably) learn to represent the semantic meaning of sentences.
A follow-up to this post.
Imports
Toggle the cells below to see which imports are made.
Utils
Use [CLS] pooling, according to this:
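For reference, [CLS] pooling simply takes the final hidden state of the first token (the [CLS] token) as the sentence embedding. A minimal sketch of just the pooling step, using a random tensor in place of real model output (the actual `compute_sentence_embedding` also handles tokenisation and the forward pass):

```python
import torch


def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """[CLS] pooling: the first token's hidden state stands in for the sentence."""
    # last_hidden_state has shape (batch, seq_len, hidden_dim).
    return last_hidden_state[:, 0, :]


# Toy stand-in for model output: batch of 1, 4 tokens, hidden size 768.
hidden = torch.randn(1, 4, 768)
print(cls_pool(hidden).shape)  # torch.Size([1, 768])
```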
import torch
import torch.nn.functional as F


def perform_distance_comparison(s1, s2, s3):
    # Euclidean distances between the sentence embeddings.
    euclidean_dist_1 = torch.linalg.vector_norm(s1 - s2).item()
    euclidean_dist_2 = torch.linalg.vector_norm(s1 - s3).item()
    print(f"|s1 - s2| = {euclidean_dist_1:.3f}")
    print(f"|s1 - s3| = {euclidean_dist_2:.3f}")
    print(f"|s1 - s2| < |s1 - s3| = {euclidean_dist_1 < euclidean_dist_2}")
    # Cosine similarities; add a batch dimension, as F.cosine_similarity
    # expects batched inputs.
    cosine_sim_1 = F.cosine_similarity(s1[None, :], s2[None, :])[0].item()
    cosine_sim_2 = F.cosine_similarity(s1[None, :], s3[None, :])[0].item()
    print(f"sim(s1, s2) = {cosine_sim_1:.3f}")
    print(f"sim(s1, s3) = {cosine_sim_2:.3f}")
    print(f"sim(s1, s2) > sim(s1, s3) = {cosine_sim_1 > cosine_sim_2}")
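As a sanity check on the helper above, `F.cosine_similarity` is just the dot product of the two vectors divided by the product of their norms; a quick sketch with random 768-dimensional vectors confirms the two agree:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a, b = torch.randn(768), torch.randn(768)

# Batched call as used in perform_distance_comparison vs. the manual formula.
sim = F.cosine_similarity(a[None, :], b[None, :])[0]
manual = torch.dot(a, b) / (a.norm() * b.norm())
print(torch.allclose(sim, manual, atol=1e-6))  # True
```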
Easier example
sentence_1_transformers = compute_sentence_embedding(sentence_1, model, tokenizer)
sentence_2_transformers = compute_sentence_embedding(sentence_2, model, tokenizer)
sentence_3_transformers = compute_sentence_embedding(sentence_3, model, tokenizer)
sentence_1_transformers.shape, sentence_2_transformers.shape, sentence_3_transformers.shape
Num tokens: 4
Num tokens: 4
Num tokens: 4
(torch.Size([768]), torch.Size([768]), torch.Size([768]))
Harder example
sentence_1_transformers = compute_sentence_embedding(sentence_1, model, tokenizer)
sentence_2_transformers = compute_sentence_embedding(sentence_2, model, tokenizer)
sentence_3_transformers = compute_sentence_embedding(sentence_3, model, tokenizer)
sentence_1_transformers.shape, sentence_2_transformers.shape, sentence_3_transformers.shape
Num tokens: 11
Num tokens: 9
Num tokens: 8
(torch.Size([768]), torch.Size([768]), torch.Size([768]))
Try the same with a text embedding model
sentence_1_transformers = compute_sentence_embedding(sentence_1, model, tokenizer)
sentence_2_transformers = compute_sentence_embedding(sentence_2, model, tokenizer)
sentence_3_transformers = compute_sentence_embedding(sentence_3, model, tokenizer)
sentence_1_transformers.shape, sentence_2_transformers.shape, sentence_3_transformers.shape
Num tokens: 11
Num tokens: 9
Num tokens: 8
(torch.Size([1024]), torch.Size([1024]), torch.Size([1024]))