Explaining the cosine similarity of 2 vectors: an important measure of language proximity in Natural Language Processing (NLP)


Cosine similarity is a widely used measure of language proximity or ‘closeness’ in Natural Language Processing (NLP). This tutorial explains what cosine similarity is by setting out some of the background mathematics with worked examples and Python code. It then covers common applications of cosine similarity in NLP, again with worked code examples.

1. Background: the relevance of Vector Mathematics to NLP

Natural language processing (NLP) is a field within machine learning which deals with ‘natural language’, i.e. human language or speech. This language typically takes the form of text: word parts, complete words, sentences, paragraphs and whole documents. In order to have value to us in machine learning – whether via classification, prediction or generation of new text – the text must be turned into a numerical format or representation that can be used in machine learning models. That numerical format is a high-dimensional vector.

In all the examples of NLP applications we give in this tutorial, the operations are carried out on text embeddings and, if we are comparing two sets of text embeddings, their cosine similarity.

1.1 Text Embeddings

A text embedding is the representation of text as a vector, where that vector is an extracted layer of a trained machine learning model. This unit of text might be a part of a word (a ‘word piece’), a whole word, a sentence, a paragraph or a whole document, depending on the context and technique being used.

The act of converting the relevant unit of text into a vector is frequently called encoding or embedding. This is slightly confusing in that those two verbs (encode, embed) are frequently used interchangeably, but ‘an encoding‘ is different from ‘an embedding‘.

Text embeddings are extracted from a trained language model, for example those shown in Section 5. These models are intended to group similar concepts (often words) together, identify relationships between words and capture syntactic elements of text. The extracted layer is a lower-order (lower-dimensional) vector representation of the original text and can be conceptualized as a slightly more compact ‘thought vector’. That said, the vector representations of text arising from a trained model (text embeddings) still have a fairly high number of dimensions in order to capture the complexity of meaning of the language they represent (for example, 384 dimensions, as we will see in the examples below).

1.2 Text Encodings

Text encodings (and likewise label encodings) are, in contrast, not trained but are a mapping of text pieces (e.g. single words) to indexed positions in a vocabulary. For example, the ‘bag of words’ (‘BOW’) method of representing text for modelling purposes does not preserve word order, but instead records the presence or absence of certain words in the text we send to a model. If a word is in our ‘vocabulary’ (collected from the most important words across all the training samples), then a ‘1’ is placed at that word’s indexed position in the vector; otherwise the vector contains a zero at that position. A minimal sketch of this idea is given below.
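To make this concrete, here is a minimal, illustrative sketch of a binary bag-of-words encoding in plain Python; the tiny vocabulary, the helper name bow_encode and the example sentence are invented for this illustration rather than taken from any library.

vocabulary = ["cat", "dog", "turnip", "harvester", "trousers"]

def bow_encode(text, vocabulary):
    # binary bag of words: 1 if the vocabulary word appears in the text, otherwise 0
    words = text.lower().split()
    return [1 if word in words else 0 for word in vocabulary]

bow_encode("the cat chased the dog", vocabulary)
# [1, 1, 0, 0, 0]

Note that such an encoding captures nothing about meaning: ‘cat’ is no closer to ‘dog’ than it is to ‘trousers’, which is why (as discussed below) cosine similarity is applied to trained embeddings rather than encodings.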

More confusingly, the deep learning package Keras has what it terms an ‘embedding layer‘ which in fact encodes text, i.e. it “turns positive integers (indexes) into dense vectors of fixed size” [1] https://keras.io/api/layers/core_layers/embedding/

1.3 Embeddings and Encodings as Vectors

So, what has all this got to do with vectors? Both encodings and embeddings are vectors, but since text embeddings have been trained to capture the meaning of the text, for the purposes of calculating and applying cosine similarity we consider only (trained) text embeddings. Text encodings are not relevant here: running cosine similarity on them would only measure how much the words present in the two texts overlap, not how close the texts are in meaning.

Since we are looking at text embeddings (which are vectors), we therefore need to understand what vectors are. Additionally, we will be comparing pairs of text embeddings (two vectors), so we also need to know how to calculate their cosine similarity with each other.

A vector is a quantity which is determined by both its magnitude and direction; it can therefore be depicted by an arrow or a directed line segment. Velocity, acceleration and displacement are all examples of vectors. For example, velocity gives both the speed and the direction of motion, and displacement gives the distance travelled and its direction.

2. Calculating the Dot Product and Cosine Similarity of two Vectors

Getting the cosine of the angle between two vectors starts with calculating the dot product of those vectors. The ‘Dot Product’ [2] Chapter 8.2, pages 408-409, Advanced Engineering Mathematics, E. Kreyszig is also referred to as the ‘Scalar Product’ or ‘Inner Product’.

It is called the ‘Scalar Product‘ because, whatever the size (number of dimensions) of the vectors, the result of the dot product is always a scalar – a single value, not a vector.

Its ‘Inner Product‘ name is explained by the approach described in the section Approach 3: Vector dot product as a special case of matrix multiplication. There are three ways to think about calculating the dot product, and these are explained below.

2.1 Approach 1: dot product by calculating Vector lengths and the angle between them

The inner product, dot product or scalar product $ A · B $ (read as “A dot B”) is defined as shown in equation 1:

$ A · B = |A| * |B| * \cos(θ) \hspace{10em} (1)$

In equation 1, |A| and |B| represent the magnitude (length) of each vector and $θ$ represents the angle between the vectors. The dot product of two vectors A and B is therefore the product of their lengths times the cosine of the angle between them. Computing the dot product assumes (and is only possible if) both vectors start from the same point, as they do at the origin $(0, 0)$ and $(0, 0, 0)$ in Figure 3 and Figure 4 respectively.

2.2 Approach 2: Vector dot product in Cartesian coordinate form

When you express vectors in Cartesian coordinate form (i.e. with axes at 90 degrees to one another, as shown in Figure 3), the dot product can be calculated by multiplying the corresponding components of the two vectors and then adding the results of all the multiplications. Note that in order to calculate the dot product, the number of dimensions of each vector must be the same.

For example, if we have vector A, where the size of its component in the x axis is $a_1$, its size in the y axis is $a_2$ and its size in the z axis is $a_3$ (note that you can have many more dimensions than the three given here as an illustration), then vector A can be expressed by equation 2. Similarly, for vector B, its components in each dimension are denoted by $b_1$ and so on (equation 3). The dot product of A and B is therefore shown by equation 4, up to $n$ dimensions.

$A = [ a_1, a_2, …, a_n ] \hspace{14em} (2)$

$ B = [ b_1, b_2, …, b_n ]\hspace{14em} (3)$

$A · B = a_1*b_1 + a_2*b_2 + … + a_n*b_n\hspace{6em} (4)$

2.3 Approach 3: Vector dot product as a special case of matrix multiplication

The third approach is really a special case of matrix multiplication [3] Section 6.2, page 315, Erwin Kreyszig, Advanced Engineering Mathematics and is very similar to the Cartesian co-ordinate approach above. It treats the vectors as $(1, n)$ matrices and, in order to take the dot product (i.e. multiply them together), one of them must be transposed so that it becomes an $(n, 1)$ matrix. Here, B is transposed to become a column vector, whilst A remains a row vector.

Matrix multiplication of a $(1, n)$ row vector by an $(n, 1)$ column vector gives a $1×1$ matrix, i.e. a scalar (a single value). The transpose simply turns B’s row into a column, as shown in equation 5:

$$ B= \left( \begin{matrix} b_1 & b_2 & b_3\end{matrix} \right) \rightarrow B^T = \left( \begin{matrix} b_1 \\b_2\\b_3 \end{matrix} \right) \hspace{6em}(5)$$

The matrix multiplication carried out is therefore $A * B^T$. Examples of three-dimensional vectors written as matrices are shown in Figure 1, although the approach applies to vectors of any number of dimensions.

Figure 1: Showing Vectors A and B (transposed) as matrices so they can be multiplied in the form $A * B^T$

It has been mentioned above that the dot product is also called the ‘inner product’; the multiplication rules shown in Figure 2 (illustrated here for the example matrices in Figure 1) explain where that name comes from: the inner dimensions of the two matrices must match for the multiplication to be possible, and the outer dimensions determine the size of the result.

Figure 2: Dimensions for matrix multiplication: inner dimensions must match to enable multiplication; outer dimensions determine the resulting matrix size

In effect, this matrix multiplication results in the same set of multiplication operations, with a final addition, as described in the section Vector dot product in Cartesian co-ordinate form. Hence, multiplying the two vectors as matrices gives exactly the same result as equation 4.

2.4 Vector Dimension and linear independence

2.4.1 Vector algebraic operations (addition and multiplication)

Suppose we have a set of vectors, for example unit vectors $a_1, a_2, a_3, …, a_n$. Any of these vectors, e.g. $a_2$, can be scaled, i.e. multiplied by a scalar. These arbitrary scalars can be denoted $c_1, c_2, c_3, …, c_n$, so a scaled vector is written, for example, $c_1a_1$ or $c_2a_2$.

As well as being scaled, the vectors $a_1, a_2, a_3, …, a_n$ can be added together to make a new vector, for example $c_1a_1 + c_2a_2$. [4] Chapter 8, page 406, Kreyszig, Advanced Engineering Mathematics

Accordingly, equation (6) is a linear combination of a set of vectors (e.g. $a_2$), each scaled by a scalar (e.g. $c_2$), and this linear combination is itself a vector:

$ c_1a_1 + c_2a_2 + c_3a_3 + … + c_na_n \hspace{6em}(6) $

2.4.2 Linear Dependence and Independence of vector components

2.4.2.1 Linear Independence

The vectors $a_1, a_2, a_3, …$ described in 2.4.1 are called a linearly independent set if and only if the linear combination in equation (6) equals zero only when all the scalars in it are zero, i.e. $c_i = 0$ for all $i = 1, …, n$.

Therefore, for the vectors to be linearly independent, no vector in the set can be represented as a linear combination of the remaining vectors. [5] https://en.wikipedia.org/wiki/Linear_independence#Evaluating_linear_independence

2.4.2.2 Linear dependence

Conversely, a set of vectors, such as that in equation (6), is linearly dependent if one or more of the vectors in the set is zero or is a linear combination of the others. Equivalently, if equation (6) equals zero while one or more of its scalars $c_1, c_2, …$ is non-zero, then the set of vectors is linearly dependent. A quick numerical check of linear independence is sketched below.
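As an aside, a practical way to test a set of vectors for linear independence is to stack them as the rows of a matrix and check the matrix rank; this is a minimal illustrative sketch using NumPy, with made-up example vectors.

import numpy as np

# three 3-dimensional vectors stacked as the rows of a matrix
independent_set = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 0, 1]])

# here the third vector is the sum of the first two, so the set is linearly dependent
dependent_set = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [1, 1, 0]])

# a set is linearly independent if the matrix rank equals the number of vectors
print(np.linalg.matrix_rank(independent_set))   # 3 -> linearly independent
print(np.linalg.matrix_rank(dependent_set))     # 2 -> linearly dependent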

2.4.3 Dimension and Vector Space

Consider, for example, a linear combination of three vectors:

$ c_1a_1 + c_2a_2 + c_3a_3 \hspace{6em}(7) $

The set of all vectors in equation (7) forms a real three dimensional vector space denoted as $R^3$. An example of this three dimensional vector space using $x, y, z$ as the dimensions, is shown in Section 4.

$R^3$ contains linearly independent sets of 3 vectors, so the maximum possible number of vectors in a linearly independent set in $R^3$ is 3. A set of 4 or more vectors in $R^3$, for example $c_1a_1 + c_2a_2 + c_3a_3 + c_4a_4$, would therefore be linearly dependent. [6] Chapter 8, page 406, Erwin Kreyszig, Advanced Engineering Mathematics

Text embeddings are column vectors extracted from a trained model, and the number of components (the number of rows in the vector) is determined by the trained model used. For example, the trained ‘all-MiniLM-L6-v2’ model in Section 5 provides text embeddings of shape $(384, )$. This corresponds to a linear combination of 384 vectors, i.e. $c_1a_1 + c_2a_2 + c_3a_3 + … + c_{384}a_{384}$. Assuming the 384 underlying vectors are linearly independent, the vector space is $R^{384}$ and has 384 dimensions.

2.5 Cosine Similarity of Two Vectors

We have covered the dot product in sections 2.1 – 2.3. The cosine similarity of two vectors [7] page 591, Engineering Mathematics, 5th Edition, K.A. Stroud with additions by Dexter J. Booth is the cosine of the angle ‘between’ the two vectors – although in more than three dimensions this angle cannot be visualized.

It is notable that the cosine similarity is a scaled version of the dot product, as can be seen by inspection of equation 8. The scaling factor (the denominator) is the product of the lengths of the two vectors. In effect, this removes the effect of the vectors’ lengths, since we only care about the cosine of the angle between them.

$$\cos (θ) = \frac{a · b}{| a | * | b | }\hspace{8em}(8)$$

When the two vectors are text embeddings, this cosine of the angle (which will be in the range -1 to 1) is a measure of how close in meaning the two pieces of text are to each other. Note that cosine similarity only takes into account the similarity in direction or orientation of the two vectors, irrespective of their magnitudes. The closer the cosine of the angle between them is to +1, the more similar the vectors.

Any vectors created as language embeddings, whether they are embeddings of individual words, sentences, paragraphs or whole documents, can be directly compared using cosine similarity. For example, a word or phrase’s vector could be compared directly with a vector representing a whole document (as in semantic search), provided that the two vectors have the same number of dimensions.
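To make equation 8 concrete, here is a minimal sketch of a cosine similarity function using NumPy; the function name cosine_similarity is our own choice for illustration rather than an import from any particular library.

import numpy as np

def cosine_similarity(a, b):
    # dot product scaled by the product of the two vector lengths (equation 8)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# two small example vectors (the 3-dimensional vectors used again in Section 4)
cosine_similarity([3, 4, 5], [6, 6, 8])
# approximately 0.994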

3. Example 1: Calculating the Dot Product and cosine similarity of Two 2-Dimensional Vectors

Figure 3: Two 2-dimensional vectors marked ‘A’ and ‘B’ in the two dimensional X-Y plane

The arrows on vector A and vector B in Figure 3 indicate the direction of the vector, and the length of the line represents that vector’s magnitude (size).

3.1 Vector dot product using the vector lengths and angle between them

The dot product of the two vectors shown in Figure 3 is calculated in equation (9):

$A · B = 6 * 7 * \cos(60°) = 21\hspace{8em}(9)$

import numpy as np

a_length = 6
b_length = 7

# numpy cosine takes radians not degrees as an input

dot_product = a_length * b_length * np.cos(np.pi*(60/180))
# output dot_product variable
dot_product
21

3.2 Dot product using vector Cartesian co-ordinates

In Figure 3, the lengths of the vectors and the angles they make with the $x$ axis are given. Using the $\sin$ and $\cos$ trigonometric identities, we can calculate the components of A and B in each of the X and Y axes (dimensions). Once the vector components have been calculated, the dot product is calculated using the NumPy package [8] https://numpy.org/doc/stable/reference/generated/numpy.dot.html, and alternatively natively in Python:

import numpy as np
# np.cos and np.sin functions require radians rather than degrees
# using cos and sin as trig identities to calculate the x and y axis components of the vectors:
# A (length 6) makes an angle of 70 degrees with the x axis and B (length 7) makes 10 degrees,
# so the angle between them is 60 degrees, as before

a1 = 6*np.cos(70*np.pi/180)
a2 = 6*np.sin(70*np.pi/180)
b1 = 7*np.cos(10*np.pi/180)
b2 = 7*np.sin(10*np.pi/180)

a_cartesian = [a1, a2]
b_cartesian = [b1, b2]

# dot product built into numpy
a_dot_b_cartesian = np.dot(a_cartesian,b_cartesian)

# print out the dot product answer
a_dot_b_cartesian
21.0

# alternatively the product can be calculated for each dimension then the result 
# summed
a_dot_b_cartesian_2 = a1*b1 + a2*b2
a_dot_b_cartesian_2
21.0

3.3 Dot product using matrix multiplication

# a1, a2, b1, b2 are the components of the two vectors calculated above

# treat A and B as (1, 2) row vectors
a_row = np.array([[a1, a2]])
b_row = np.array([[b1, b2]])

# transpose b so it becomes a (2, 1) column vector before matrix multiplication
b_transposed = np.transpose(b_row)

# (1, 2) x (2, 1) matrix multiplication gives a 1x1 matrix, i.e. a scalar
np.matmul(a_row, b_transposed)
array([[21.]])

4. Example 2: Calculating the Dot Product and Cosine similarity of Two 3-Dimensional Vectors

Note that, for the three-dimensional example of two vectors, the angle between the two vectors is measured in the plane within $x, y, z$ space on which both vectors lie. In Figure 4, the angle between the two vectors $A$ and $B$, called $\theta$, is marked, and the plane on which both vectors lie is shown in light grey.

Figure 4: Two 3-dimensional vectors marked ‘A’ and ‘B’ in the three dimensional X-Y-Z plane using right handed Cartesian coordinates and showing $\theta$ the angle between them

We can now put numbers to vectors A and B in the three-dimensional diagram in Figure 4.

Let A be $3i + 4j + 5k$ (unit vector components in the $x, y$ and $z$ directions respectively), such that $a_1 = 3$, $a_2 = 4$ and $a_3 = 5$. Vector A can then be expressed in the form $[3, 4, 5]$, where the unit vector notation has been removed. Similarly, let B be $6i + 6j + 8k$, expressed as $[6, 6, 8]$.

4.1 Dot product using vector Cartesian co-ordinates

Expressing the vectors in three dimensions is easiest in Cartesian co-ordinate form. The following code calculates the three-dimensional dot product via the built-in NumPy dot product and also natively in Python:

import numpy as np

a_cartesian_3dim = [3, 4, 5]
b_cartesian_3dim = [6, 6, 8]

# dot product built into numpy
a_dot_b_cartesian_3dim = np.dot(a_cartesian_3dim,b_cartesian_3dim)

# print out the dot product answer
print(a_dot_b_cartesian_3dim)
82


# alternatively the product can be calculated for each dimension then the result 
# summed
a_dot_b_cart_3dim = sum([a*b for a,b in zip(a_cartesian_3dim,b_cartesian_3dim)])

print(a_dot_b_cart_3dim)
82

4.2 Calculating vector lengths and angle between them

Once you get beyond two dimensions (x, y), the method of calculating the dot product via the vector lengths and the angle between the vectors is still possible, but not particularly convenient. For the two 3-dimensional vectors referred to in Section 4, their magnitudes are calculated as follows:

# magnitude i.e. length of a vector calculated by numpy's linear algebra
import numpy as np

a_cartesian_3dim = [3, 4, 5]
b_cartesian_3dim = [6, 6, 8]

a_3dim = np.array(a_cartesian_3dim)
b_3dim = np.array(b_cartesian_3dim)

# calculate length of vectors using numpy's linear algebra
a_3dim_magnitude = np.linalg.norm(a_3dim)
b_3dim_magnitude = np.linalg.norm(b_3dim)

print(a_3dim_magnitude, b_3dim_magnitude)
7.0710678118654755 11.661903789690601

# calculate vector length via the square root of the sum of the squares of the components
a_magnitude = np.sqrt(np.sum(np.square(a_3dim)))
b_magnitude = np.sqrt(np.sum(np.square(b_3dim)))
print(a_magnitude, b_magnitude)
7.0710678118654755 11.661903789690601

The cosine of the angle between the two vectors (which is not stated for the 3-dimensional example) is in fact the cosine similarity (applied to text embeddings in Section 5) and can be calculated from equation (8):

# get the cosine of the angle between the 3-dimensional vectors
angle_cosine_3dim = a_dot_b_cartesian_3dim/(a_3dim_magnitude * b_3dim_magnitude)

# print the cosine of the angle
angle_cosine_3dim
# approximately 0.994

# get the angle in the range of 0 to pi radians
np.arccos(angle_cosine_3dim)
# approximately 0.106 radians

5. Language Embeddings are High Dimensional Vectors

5.1 Language Embeddings

There are a variety of pre-trained models and Python packages capable of creating word, sentence, paragraph and document embeddings. For example, embeddings can be created with Word2Vec [9] https://radimrehurek.com/gensim/models/word2vec.html , Doc2Vec [10] https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec , and Huggingface Transformers [11] https://huggingface.co/blog/getting-started-with-embeddings .

The SBERT Sentence Transformers package [12] https://www.sbert.net/index.html generates universal sentence embeddings directly using siamese BERT networks. The BERT model has been used elsewhere via the Huggingface Transformers package in this School for Engineering post.

There is a range of pre-trained sentence transformers available from the SBERT Sentence Transformers package, all hosted on the Huggingface Model Hub [13] https://www.sbert.net/docs/pretrained_models.html . The number of embedding dimensions depends on the trained model used to generate the embeddings. For example, the sentence transformer model used below, ‘all-MiniLM-L6-v2’ [14] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 , is based on a six-layer version of Microsoft’s MiniLM [15] https://huggingface.co/microsoft/MiniLM-L12-H384-uncased .

The embeddings created from the sentences are themselves high-dimensional column vectors, with each element of the array representing the length of the vector’s component along one axis. Just as we had $ai + bj + ck$ with the coefficients $a, b$ and $c$ representing the length of the vector along each axis, for the ‘all-MiniLM-L6-v2’ model there are 384 different axes, giving a vector space of 384 dimensions.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

#Sentences are encoded by calling model.encode()
# note the term used here is 'encode' but the result is 
# an embedding, not an encoding

embedding1 = model.encode("This is a cat.")
embedding2 = model.encode("I love mostly eating turnips")
print('embedding dimensions', embedding1.shape, embedding2.shape)

# the output from the print function
# embedding dimensions (384,) (384,)

Here a different trained model, ‘all-mpnet-base-v2‘, is used to create embeddings; note that its embeddings are column vectors with a shape of (768, ):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

embedding3 = model.encode("I have got a new combine harvester")
embedding4 = model.encode("My trousers are too long")
print('embedding dimensions', embedding3.shape, embedding4.shape)

# the output from the print function - 768 dimensions
# embedding dimensions (768,) (768,)

5.2 Cosine Similarity of Sentence Embeddings

These embeddings can then be compared with cosine similarity to find sentences with a similar meaning. This can be useful for applications such as semantic search, paraphrase mining and clustering [16] https://www.sbert.net/examples/applications/computing-embeddings/README.html :

# dot-product (util.dot_score), cosine-similarity (util.cos_sim)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

embedding1 = model.encode("This is a cat.")
embedding2 = model.encode("I love mostly eating turnips")
embedding3 = model.encode("I have got a new combine harvester")
embedding4 = model.encode("My trousers are too long")

cos_sim_1 = util.cos_sim(embedding1, embedding2)
dot_score_1 = util.dot_score(embedding1, embedding2)
print("Cosine-Similarity 1:", cos_sim_1, "dot product 1:", dot_score_1)
# Cosine-Similarity 1: tensor([[0.0579]]) dot product 1: tensor([[0.0579]])

cos_sim_2 = util.cos_sim(embedding3, embedding4)
dot_score_2 = util.dot_score(embedding3, embedding4)

print("Cosine-Similarity 2:", cos_sim_2, "dot product 2:", dot_score_2)
# Cosine-Similarity 2: tensor([[0.1031]]) dot product 2: tensor([[0.1031]])
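Note that the cosine similarity and the dot product are identical in the outputs above. This is because these particular pre-trained sentence-transformer models output embeddings normalized to unit length, so the denominator in equation (8) is 1 and the dot product equals the cosine similarity; for unnormalized vectors the two values would differ.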

6. What else can you do with high dimensional text embeddings?

This tutorial has considered how to calculate and use cosine similarity between high dimensional vectors (text embeddings) to determine a measure of similarity between the meaning of those pieces of text. Word and sentence embeddings can be useful in other contexts too, such as in clustering and in visualization. Clustering and visualization are covered very briefly below.

6.1 Clustering of embeddings

Clustering is a set of unsupervised methods which group and detect patterns in data. The text embeddings can be clustered into classes for unsupervised learning, without the need to have those vectors labelled as belonging to a certain class.

There are two main types of clustering algorithm: agglomerative clustering links similar data points together, whereas centroidal clustering finds centres or partitions in the data [17] https://www.scikit-yb.org/en/latest/api/cluster/index.html . A minimal clustering sketch is given below.
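As a minimal illustrative sketch, sentence embeddings like those in Section 5 can be grouped with a centroidal algorithm such as k-means from scikit-learn; the sentences and the choice of two clusters below are made up purely for this example.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ["This is a cat.",
             "Dogs and cats are pets",
             "I love mostly eating turnips",
             "Parsnips are a root vegetable"]

# encode the sentences into 384-dimensional embeddings
embeddings = model.encode(sentences)

# group the embeddings into two clusters (hopefully animals vs. vegetables)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
print(labels)
# e.g. [0 0 1 1] - the exact cluster numbering may vary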

6.2 Dimensionality Reduction and Visualization of text embeddings

Clustering can be performed on high-dimensional text embeddings (as vectors) but may perform better if the vectors are transformed into a lower-dimensional space, whilst retaining the dimensions containing the most information. This dimensionality reduction can be carried out, for example, via PCA (principal component analysis) or t-SNE (t-distributed stochastic neighbour embedding). A brief PCA sketch follows.
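Continuing from the clustering sketch above (and again only as an illustrative sketch), scikit-learn’s PCA can reduce the 384-dimensional embeddings to two dimensions so that they can be plotted:

from sklearn.decomposition import PCA

# reduce the 384-dimensional sentence embeddings (from the clustering sketch) to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

print(embeddings_2d.shape)
# (4, 2) - one 2-dimensional point per sentence, ready for plotting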

The vectors, or clusters of vectors, cannot be visualized unless the dimension of the vectors is reduced to three or fewer; this visualization will be covered in separate tutorials.

We hope you find this tutorial useful; please share it if you do!