Word2Vec, short for “word to vector,” is a technique for representing words as numerical vectors that capture the relationships between them. This technology is widely used in machine learning for embedding and text analysis.
Google introduced Word2Vec and patented the algorithm, along with several subsequent updates, in 2013. This collection of interconnected algorithms was developed by a team led by Tomas Mikolov.
In this article, we will explore the notion and mechanics of generating embeddings with Word2Vec.
What is a word embedding?
If you ask someone which word is more similar to “king” — “ruler” or “worker” — most people would say “ruler” makes more sense, right? But how do we teach this intuition to a computer? That’s where word embeddings come in handy.
A word embedding is a representation of a word used in text analysis. It usually takes the form of a vector, which encodes the word’s meaning in such a way that words closer in the vector space are expected to be similar in meaning. Language modeling and feature learning techniques are typically used to obtain word embeddings, where words or phrases from the vocabulary are mapped to vectors of real numbers.
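The “closer in vector space means similar in meaning” idea can be made concrete with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real Word2Vec embeddings typically have 100–300 dimensions), but the comparison works the same way:

```python
import numpy as np

# Toy 3-dimensional embeddings (values made up for illustration;
# real Word2Vec vectors are learned from text and much longer).
embeddings = {
    "king":   np.array([0.90, 0.80, 0.10]),
    "ruler":  np.array([0.85, 0.75, 0.20]),
    "worker": np.array([0.10, 0.30, 0.90]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1.0 mean
    # the vectors point in nearly the same direction (similar words).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_ruler = cosine_similarity(embeddings["king"], embeddings["ruler"])
sim_worker = cosine_similarity(embeddings["king"], embeddings["worker"])

# "king" sits much closer to "ruler" than to "worker" in this space.
print(sim_ruler > sim_worker)
```

This mirrors the “king vs. ruler vs. worker” intuition above: the model never needs to be told the words are related, the geometry of the learned vectors encodes it.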
The meaning of a term is determined by its context: the words that come before and after it, known as the context window. A common choice is a window of four words on each side of the target term. To create vector representations of words, we look at how often they appear together within that window.
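The (target, context) pairs that fall inside such a window are the raw material the model trains on. A minimal sketch of extracting them (the sentence and window size here are arbitrary examples, not fixed by the algorithm):

```python
def context_pairs(tokens, window=2):
    # For each target word, collect every word within `window`
    # positions to its left and right as a (target, context) pair.
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = context_pairs(sentence, window=2)
for target, context in pairs:
    print(target, "->", context)
```

Counting how often each pair occurs across a large corpus yields the co-occurrence statistics described above; Word2Vec goes a step further and learns dense vectors that predict these pairs rather than storing the raw counts.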
Word embeddings are one of the most fascinating concepts in machine learning. If you’ve ever used virtual assistants like Siri, Google Assistant, or Alexa, or even a smartphone keyboard with predictive text, you’ve already interacted with a natural language processing model based on embeddings.