Hashing for large scale similarity
Similarity computation is a very common task in real-world machine learning and data mining problems such as recommender systems, spam detection, and online advertising. Consider a tweet recommendation problem where one has to find tweets similar to the tweets a user previously clicked. This problem becomes extremely challenging when there are billions of tweets created each day.
In this post, we will discuss the two most common similarity metrics, namely Jaccard similarity and cosine similarity, and Locality Sensitive Hashing (LSH) based approximations of those metrics.
Jaccard Similarity
Jaccard similarity is one of the most popular metrics used to compute the similarity between two sets. Given sets $A$ and $B$, the Jaccard similarity is defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Let’s define two example tweets as

$$T_1 = \{u_1, u_2, u_3\}, \qquad T_2 = \{u_2, u_3, u_4, u_5\}$$

where each tweet is represented by the set of users who clicked the tweet. The Jaccard similarity between $T_1$ and $T_2$ is given by

$$J(T_1, T_2) = \frac{|\{u_2, u_3\}|}{|\{u_1, u_2, u_3, u_4, u_5\}|} = \frac{2}{5} = 0.4$$
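As a quick illustration, here is a minimal Python sketch of the set-based Jaccard computation above; the tweet sets and user ids are the toy values from this example, not real data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy tweets, each represented by the set of users who clicked it.
t1 = {"u1", "u2", "u3"}
t2 = {"u2", "u3", "u4", "u5"}
print(jaccard(t1, t2))  # 2 shared users out of 5 total -> 0.4
```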
Cosine similarity
Cosine similarity is a vector space metric where the similarity between two vectors $u$ and $v$ is defined as

$$\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

where $u \cdot v$ is the dot product and $\|\cdot\|$ is the norm operator.
Jaccard similarity ignores weights, e.g. the number of user clicks, since it is a set-based metric. Let’s define

$$T_1 = \{u_1\!: 2,\; u_2\!: 1,\; u_3\!: 1\}, \qquad T_2 = \{u_2\!: 1,\; u_3\!: 1,\; u_4\!: 1,\; u_5\!: 1\}$$

where the number indicates how many times a user clicked the tweet. Now, the tweets can be represented in the user vector space $(u_1, u_2, u_3, u_4, u_5)$ as

$$v_1 = (2, 1, 1, 0, 0), \qquad v_2 = (0, 1, 1, 1, 1)$$

and the cosine similarity is given as

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|} = \frac{2}{\sqrt{6} \cdot 2} \approx 0.41$$
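And a corresponding Python sketch for the weighted case, with each tweet represented as a click-count vector over the users $(u_1, \dots, u_5)$; the counts are again illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity: (u . v) / (||u|| * ||v||)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Click-count vectors over users (u1, ..., u5); values are illustrative.
v1 = [2, 1, 1, 0, 0]
v2 = [0, 1, 1, 1, 1]
print(round(cosine(v1, v2), 2))  # ~0.41
```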
Limitations
Computing pairwise similarities scales quadratically with the number of items, both in time and space. Hence, finding similar items is very challenging when the number of items is large. In the following section, we discuss locality sensitive hashing, which addresses these limitations.
Locality Sensitive Hashing (LSH)
Hashing is a very widely used technique that assigns a pseudo-random value/bucket to each object. A hash function should be uniform, i.e., each bucket should be equally likely. Locality Sensitive Hashing (LSH) is a hashing-based dimensionality reduction method that preserves item similarity. More precisely, LSH hashes items to buckets such that similar items map to the same bucket with high probability. Such hash signatures can be used for efficient neighborhood search as well as to compute the similarity between items on the fly.
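As a rough sketch of how such signatures enable neighborhood search (assuming some bucket function `lsh_bucket`, a hypothetical stand-in for the minhash/simhash signatures defined below), items are grouped by bucket and a query is only compared against items in its own bucket rather than against every item:

```python
from collections import defaultdict

def build_index(items, lsh_bucket):
    """Group item ids by their LSH bucket; similar items tend to share a bucket."""
    index = defaultdict(list)
    for item_id, item in items.items():
        index[lsh_bucket(item)].append(item_id)
    return index

def candidates(query, index, lsh_bucket):
    """Return only the items in the query's bucket, avoiding an all-pairs scan."""
    return index.get(lsh_bucket(query), [])
```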
Minhash
First, let’s define a set of items

$$S = \{s_1, s_2, \dots, s_n\}$$

Let $h$ be a hash function that maps an object to a positive integer. The minhash is defined as

$$\operatorname{minhash}_h(S) = \operatorname*{arg\,min}_{s \in S} h(s)$$

a function that returns the item with the smallest hash value. Now, a $k$-dimensional minhash signature is defined by $k$ independent hash functions $h_1, h_2, \dots, h_k$:

$$\operatorname{sig}(S) = \big[\operatorname{minhash}_{h_1}(S), \operatorname{minhash}_{h_2}(S), \dots, \operatorname{minhash}_{h_k}(S)\big]$$
Given the $k$-dimensional signatures, the Jaccard similarity of items $A$ and $B$ can be estimated as

$$J(A, B) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\big[\operatorname{minhash}_{h_i}(A) = \operatorname{minhash}_{h_i}(B)\big]$$

where $\mathbb{1}[\cdot]$ is an indicator function.
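Here is a minimal Python sketch of these definitions; the seeded `blake2b` hashes standing in for the family $h_1, \dots, h_k$ are one convenient choice, not something prescribed by the math above.

```python
import hashlib

def h(item: str, seed: int) -> int:
    """A seeded hash function mapping an item to a positive integer."""
    digest = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(s, seed):
    """Return the element of s with the smallest hash value under the seeded hash."""
    return min(s, key=lambda item: h(item, seed))

def signature(s, k):
    """k-dimensional minhash signature from k independent (seeded) hash functions."""
    return [minhash(s, seed) for seed in range(k)]

def estimated_jaccard(sig_a, sig_b):
    """Average of the indicator: fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```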
Minhash explained
Although the theoretical proof is quite rigorous, the key idea is very intuitive. First, let’s define a small set

$$S = \{a, b, c, d\}$$

a) How many ways can we shuffle $S$ such that $a$ is the first element?
Ans: $3!$
b) More generally, how many ways can we shuffle a set of $n$ elements such that a particular element is the first one?
Ans: $(n-1)!$
Hashing a set of items is equivalent to generating a random permutation of the items. So, $\operatorname{minhash}_h(A) = \operatorname{minhash}_h(B)$ holds only if the element of $A \cup B$ with the smallest hash value belongs to $A \cap B$, i.e., a shared element comes first in the random permutation of $A \cup B$. From b) above, $|A \cap B| \, (n-1)!$ of the $n!$ permutations of the $n = |A \cup B|$ elements put a shared element first, so we can conclude

$$P\big[\operatorname{minhash}_h(A) = \operatorname{minhash}_h(B)\big] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$

Hence, minhash approximates Jaccard similarity.
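A quick sanity check, reusing the minhash helpers sketched earlier, illustrates this: as $k$ grows, the fraction of matching minhash values converges to the exact Jaccard similarity.

```python
t1 = {"u1", "u2", "u3"}
t2 = {"u2", "u3", "u4", "u5"}

exact = len(t1 & t2) / len(t1 | t2)   # 0.4
approx = estimated_jaccard(signature(t1, 512), signature(t2, 512))
print(exact, approx)                  # approx should be close to 0.4
```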
Simhash
Let $r$ be a random vector passing through the origin. Let’s define a simhash function for a tweet vector $v$ as

$$\operatorname{simhash}_r(v) = \operatorname{sign}(r \cdot v)$$

where

$$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Given $k$ simhash functions $r_1, r_2, \dots, r_k$, the probability that two vectors $u$ and $v$ hash to the same value can be estimated as

$$P\big[\operatorname{simhash}_r(u) = \operatorname{simhash}_r(v)\big] \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\big[\operatorname{simhash}_{r_i}(u) = \operatorname{simhash}_{r_i}(v)\big]$$

Now, the angle between $u$ and $v$ can be estimated as

$$\theta(u, v) = \pi \left(1 - P\big[\operatorname{simhash}_r(u) = \operatorname{simhash}_r(v)\big]\right)$$
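A minimal Python sketch of these definitions; drawing the random vectors from a standard Gaussian (via `numpy`) is a common choice and an assumption here, not something spelled out above.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_planes(dim, k):
    """k random vectors through the origin, one per simhash function."""
    return rng.standard_normal((k, dim))

def simhash(v, planes):
    """k-bit signature: 1 where r . v >= 0, else 0."""
    return (planes @ v >= 0).astype(int)

def collision_rate(sig_u, sig_v):
    """Fraction of simhash functions on which the two signatures agree."""
    return float(np.mean(sig_u == sig_v))
```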
Simhash explained
In this section, we discuss the intuition behind the approximation of cosine similarity using simhash.
In the figure above, for the vector $u$, the pink shaded area corresponds to the half-space where $r \cdot u \geq 0$, i.e., where $\operatorname{simhash}_r(u) = 1$. On the other hand, the white region corresponds to the half-space where $r \cdot u < 0$, i.e., where $\operatorname{simhash}_r(u) = 0$.
Let’s consider two vectors $u$ and $v$, where $\theta$ is the angle between them as shown in the figure above. For a randomly drawn vector $r$ passing through the origin,

$$\operatorname{simhash}_r(u) = \operatorname{simhash}_r(v)$$

holds only if $r$ lies in the purple or white shaded area (i.e., anywhere other than the pink and blue shaded areas), which covers a fraction $1 - \theta/\pi$ of all directions. From this observation, we can define

$$P\big[\operatorname{simhash}_r(u) = \operatorname{simhash}_r(v)\big] = 1 - \frac{\theta}{\pi}$$
Hence, simhash approximates cosine similarity.
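Continuing the simhash sketch above, the estimated angle $\theta \approx \pi(1 - P)$ recovers an approximate cosine similarity; the vectors are the illustrative click-count vectors from earlier.

```python
import math
import numpy as np

v1 = np.array([2.0, 1.0, 1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 1.0, 1.0, 1.0])

planes = random_planes(dim=5, k=2048)
p_same = collision_rate(simhash(v1, planes), simhash(v2, planes))
theta = math.pi * (1.0 - p_same)   # estimated angle between v1 and v2
print(math.cos(theta))             # close to the exact cosine similarity (~0.41)
```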
In the next post, I will discuss the implementation of minhash and simhash in more detail.