Recommender System with Node2vec Graph Embeddings
A tutorial on building a movie recommender system that learns user-item representations with graph embeddings and compares their performance against other methods such as matrix factorization
- Data gathering and exploration
- Neighborhood method
- Factorization method
- Graph Embedding method
- Enriched network with additional information: genres
import pandas as pd

rating_df = pd.read_csv('ml100k_ratings.csv', sep=',', header=0)
rating_df.head()
movie_df = pd.read_csv('ml100k_movies.csv', sep=',', header=0)
movie_df.head()
If we ignore the ratings that the users have given to the movies and consider only which movies each user has watched, we get a set of users for every movie (and a set of movies for every user). Think of this formulation as a bipartite graph of users and movies, with an edge between a user and a movie if the user has watched the movie; all edges have the same weight.
Create a dictionary with movies as keys and, as values, the sets of users that have rated them, as sketched below.
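The construction of this dictionary is not shown in the extract; a minimal sketch consistent with the description and with how movie_sets is used later:

# movie_sets: movieId -> set of userIds who have rated that movie
movie_sets = {movie: set(users) for movie, users in rating_df.groupby('movieId')['userId']}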
Since each movie is characterized by a set of users, we can compute the similarity of two movies with the Jaccard index, which, for two sets, is the ratio of the number of elements in their intersection to the number of elements in their union.
def jaccard(movie1, movie2, movie_sets):
    a = movie_sets[movie1]
    b = movie_sets[movie2]
    intersection = float(len(a.intersection(b)))
    return intersection / (len(a) + len(b) - intersection)
Let's explore the similarity between some movies qualitatively. We use the movies dataframe to look up movie titles by their IDs.
movie_df[movie_df.movieId == 260].title.values[0]
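For example, we can compare Star Wars: Episode IV (movieId 260) with the other Star Wars episodes (1196 and 1210 in MovieLens) and with Toy Story (movieId 1); these are illustrative calls we add here, not cells from the original notebook:

print(jaccard(260, 1196, movie_sets))  # Star Wars IV vs. Star Wars V
print(jaccard(260, 1210, movie_sets))  # Star Wars IV vs. Star Wars VI
print(jaccard(260, 1, movie_sets))     # Star Wars IV vs. Toy Story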
The movie Star Wars: Episode IV has a higher similarity score with the other Star Wars movies than with Toy Story.
Using the Jaccard index, we can retrieve the top-k movies most similar to a given movie. This provides a way to recommend movies to a user that are similar to the movies the user has already watched.
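The helper get_similar_movies_jaccard is not defined in the extract; a minimal sketch consistent with how it is called below:

def get_similar_movies_jaccard(movieid, movie_sets, top_n=5):
    # rank all other movies by their Jaccard similarity to the query movie
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    jaccard_scores = [(other, jaccard(movieid, other, movie_sets))
                      for other in movie_sets if other != movieid]
    jaccard_scores.sort(key=lambda x: x[1], reverse=True)
    sim_movies = [movie_df[movie_df.movieId == mid].title.values[0]
                  for mid, _ in jaccard_scores[:top_n]]
    return {'movie': movie, 'sim_movies': sim_movies}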
get_similar_movies_jaccard(260, movie_sets)
get_similar_movies_jaccard(1, movie_sets)
Rather than using a set-based similarity like Jaccard, we can represent every movie as a sparse vector with dimension equal to the number of users, where the entry for each user is the rating that user gave the movie, or zero if no rating exists (i.e. the user hasn't seen/rated the movie).
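The vectors and the index mappings (movies, movie2id) assumed by the next cells are not built in the extract; a minimal sketch, stored densely for simplicity and assuming user ids run from 1 to the maximum id:

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

movies = sorted(movie_sets.keys())               # row position -> movieId
movie2id = {m: i for i, m in enumerate(movies)}  # movieId -> row position
num_users = rating_df.userId.max()
movie_sparse_vecs = np.zeros((len(movies), num_users))
for row in rating_df.itertuples():
    movie_sparse_vecs[movie2id[row.movieId], row.userId - 1] = row.rating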
print(1.0 - cosine(movie_sparse_vecs[224], movie_sparse_vecs[897]))
def get_similar_movies_nbd_cosine(movieid, movie_vecs, top_n=5):
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    movie_idx = movie2id[movieid]
    query = movie_vecs[movie_idx].reshape(1, -1)
    ranking = cosine_similarity(movie_vecs, query)
    top_ids = np.argsort(ranking, axis=0)
    top_ids = top_ids[::-1][:top_n]
    top_movie_ids = [movies[j[0]] for j in top_ids]  # map row indices back to movie ids
    sim_movies = [movie_df[movie_df.movieId == id].title.values[0] for id in top_movie_ids]
    return {'movie': movie, 'sim_movies': sim_movies}
movieid = 1
movie_data = movie_sparse_vecs
get_similar_movies_nbd_cosine(movieid, movie_data, top_n=5)
movieid = 260
movie_data = movie_sparse_vecs
get_similar_movies_nbd_cosine(movieid, movie_data, top_n=5)
A very popular technique for recommender systems is matrix factorization. The idea is to reduce the dimensionality of the data before computing similar movies/users. We factorize the user-item matrix to obtain user factors and item factors, low-dimensional embeddings in which 'similar' users/items are mapped to 'nearby' points.
This kind of analysis can surface matches that are impossible to find with the techniques discussed above, as the latent factors can capture attributes that are hard to decipher from the raw data, e.g. a latent factor can correspond to the degree to which a movie is female-oriented, or to how slowly its characters develop.
Moreover, the users and the movies are embedded into the same space, which provides a direct way to compute user-movie similarity.
We will use Singular Value Decomposition (SVD) for factorizing the matrix.
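The cells below assume a dense ratings matrix ratings_mat with movies as rows and users as columns (that orientation is implied by how movie_factors and user_factors are read off the SVD, and by the movieid - 1 row indexing later). A minimal sketch to build it, assuming ids run from 1 to the maximum id:

import numpy as np
from scipy.sparse.linalg import svds

n_movies = rating_df.movieId.max()
n_users = rating_df.userId.max()
ratings_mat = np.zeros((n_movies, n_users))
for row in rating_df.itertuples():
    ratings_mat[row.movieId - 1, row.userId - 1] = row.rating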
Normalize the rating matrix
normalised_mat = ratings_mat - ratings_mat.mean(axis=1, keepdims=True)  # subtract each movie's mean rating
The number of latent factors is chosen to be 50, i.e. the top-50 singular values of the SVD are kept. The number of latent factors is a hyperparameter of the model and would normally be tuned with a more careful analysis; here we simply fix it at 50.
n_factors = 50
A = normalised_mat.T / np.sqrt(ratings_mat.shape[0] - 1)
U, S, V = svds(A, n_factors)
print(U.shape, V.shape)
movie_factors = V.T
user_factors = U
Instead of representing each movie as a sparse vector with one entry per user, after factorizing the matrix each movie is represented by a 50-dimensional dense vector.
Define a routine to get top-n movies similar to a given movie.
def get_similar_movies_matrix_factorization(data, movieid, top_n=10):
    index = movieid - 1  # movie ids start from 1
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    movie_row = data[index].reshape(1, -1)
    similarity = cosine_similarity(movie_row, data)
    sort_indexes = np.argsort(-similarity)[0]
    return {'movie': movie,
            'sim_movies': [movie_df[movie_df.movieId == id].title.values[0]
                           for id in sort_indexes[:top_n] + 1]}
movie_id = 260
get_similar_movies_matrix_factorization(movie_factors, movie_id)
movie_id = 1
get_similar_movies_matrix_factorization(movie_factors, movie_id)
Since the users and movies live in the same space, we can also compute the movies most similar to a given user. A recommendation model can then be defined as showing a user the movies most similar to them.
def get_recommendations_matrix_factorization(userid, user_factors, movie_factors, top_n=10):
    user_vec = user_factors[userid - 1].reshape(1, -1)
    similarity = cosine_similarity(user_vec, movie_factors)
    sort_indexes = np.argsort(-similarity)[0]
    return [movie_df[movie_df.movieId == id].title.values[0] for id in sort_indexes[:top_n] + 1]
top_recos = get_recommendations_matrix_factorization(1, user_factors, movie_factors)
top_recos
Create a user-movie graph with the ratings as edge weights. We will use DeepWalk to embed every node of the graph into a low-dimensional space.
user_item_edgelist = rating_df[['userId', 'movieId', 'rating']]
user_item_edgelist.head()
Create a weighted user-movie graph using the Python library networkx.
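The graph uses integer node ids via user2dict and movie2dict, which are not defined in the extract. A sketch that matches how they are used later (tuple keys such as (1, 'user'), with cnt the running node count used again further down):

import networkx as nx

user2dict, movie2dict = {}, {}
cnt = 0
for uid in rating_df.userId.unique():
    user2dict[(uid, 'user')] = cnt
    cnt += 1
for mid in rating_df.movieId.unique():
    movie2dict[(mid, 'movie')] = cnt
    cnt += 1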
user_movie_graph = nx.Graph()
for x in user_item_edgelist.values:
    usr = (x[0], 'user')
    movie = (x[1], 'movie')
    user_movie_graph.add_node(user2dict[usr])
    user_movie_graph.add_node(movie2dict[movie])
    user_movie_graph.add_edge(user2dict[usr], movie2dict[movie], weight=float(x[2]))
user_movie_graph.number_of_edges()
user_movie_graph.number_of_nodes()
We will use the implementation of DeepWalk provided in node2vec, which differs slightly from the original DeepWalk, e.g. it uses negative sampling whereas the original DeepWalk paper used hierarchical softmax for the skip-gram model.
To create embeddings from the context/target pairs, we use the Gensim Python library. One could just as easily use Google's word2vec or Facebook's fastText for this task.
import numpy as np
import networkx as nx
import random
class Graph():
    def __init__(self, nx_G, is_directed, p, q):
        self.G = nx_G
        self.is_directed = is_directed
        self.p = p
        self.q = q

    def node2vec_walk(self, walk_length, start_node):
        '''
        Simulate a random walk starting from start node.
        '''
        G = self.G
        alias_nodes = self.alias_nodes
        alias_edges = self.alias_edges

        walk = [start_node]
        while len(walk) < walk_length:
            cur = walk[-1]
            cur_nbrs = sorted(G.neighbors(cur))
            if len(cur_nbrs) > 0:
                if len(walk) == 1:
                    walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
                else:
                    prev = walk[-2]
                    next_node = cur_nbrs[alias_draw(alias_edges[(prev, cur)][0],
                                                    alias_edges[(prev, cur)][1])]
                    walk.append(next_node)
            else:
                break
        return walk

    def simulate_walks(self, num_walks, walk_length):
        '''
        Repeatedly simulate random walks from each node.
        '''
        G = self.G
        walks = []
        nodes = list(G.nodes())
        print('Walk iteration:')
        for walk_iter in range(num_walks):
            print(str(walk_iter + 1), '/', str(num_walks))
            random.shuffle(nodes)
            for node in nodes:
                walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))
        return walks

    def get_alias_edge(self, src, dst):
        '''
        Get the alias edge setup lists for a given edge.
        '''
        G = self.G
        p = self.p
        q = self.q

        unnormalized_probs = []
        for dst_nbr in sorted(G.neighbors(dst)):
            if dst_nbr == src:  # returning to the previous node
                unnormalized_probs.append(G[dst][dst_nbr]['weight'] / p)
            elif G.has_edge(dst_nbr, src):  # neighbor also connected to the previous node
                unnormalized_probs.append(G[dst][dst_nbr]['weight'])
            else:  # moving further away from the previous node
                unnormalized_probs.append(G[dst][dst_nbr]['weight'] / q)
        norm_const = sum(unnormalized_probs)
        try:
            normalized_probs = [float(u_prob) / norm_const for u_prob in unnormalized_probs]
        except ZeroDivisionError:
            normalized_probs = [0.0 for u_prob in unnormalized_probs]
        return alias_setup(normalized_probs)

    def preprocess_transition_probs(self):
        '''
        Preprocessing of transition probabilities for guiding the random walks.
        '''
        G = self.G
        is_directed = self.is_directed

        alias_nodes = {}
        for node in G.nodes():
            unnormalized_probs = [G[node][nbr]['weight'] for nbr in sorted(G.neighbors(node))]
            norm_const = sum(unnormalized_probs)
            try:
                normalized_probs = [float(u_prob) / norm_const for u_prob in unnormalized_probs]
            except ZeroDivisionError:
                print(node)
                normalized_probs = [0.0 for u_prob in unnormalized_probs]
            alias_nodes[node] = alias_setup(normalized_probs)

        alias_edges = {}
        if is_directed:
            for edge in G.edges():
                alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])
        else:
            for edge in G.edges():
                alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])
                alias_edges[(edge[1], edge[0])] = self.get_alias_edge(edge[1], edge[0])

        self.alias_nodes = alias_nodes
        self.alias_edges = alias_edges
        return

def alias_setup(probs):
    '''
    Compute utility lists for non-uniform sampling from discrete distributions.
    Refer to https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/
    for details.
    '''
    K = len(probs)
    q = np.zeros(K)
    J = np.zeros(K, dtype=int)

    smaller = []
    larger = []
    for kk, prob in enumerate(probs):
        q[kk] = K * prob
        if q[kk] < 1.0:
            smaller.append(kk)
        else:
            larger.append(kk)

    while len(smaller) > 0 and len(larger) > 0:
        small = smaller.pop()
        large = larger.pop()
        J[small] = large
        q[large] = q[large] + q[small] - 1.0
        if q[large] < 1.0:
            smaller.append(large)
        else:
            larger.append(large)

    return J, q

def alias_draw(J, q):
    '''
    Draw a sample from a non-uniform discrete distribution using alias sampling.
    '''
    K = len(J)
    kk = int(np.floor(np.random.rand() * K))
    if np.random.rand() < q[kk]:
        return kk
    else:
        return J[kk]
G = Graph(user_movie_graph, is_directed=False, p=1, q=1)
We set p = q = 1, which recovers DeepWalk, as the random walks are then completely unbiased.
G.preprocess_transition_probs()
Compute the random walks.
- 10 walks for every node.
- Each walk of length 80.
walks = G.simulate_walks(num_walks=10, walk_length=80)
len(walks)
Learn the embeddings via Gensim, which generates the (context, target) pairs from the walks and trains a skip-gram model on them.
from gensim.models import Word2Vec

def learn_embeddings(walks):
    '''
    Learn embeddings by optimizing the Skipgram objective using SGD.
    Uses Gensim Word2Vec.
    '''
    walks = [list(map(str, walk)) for walk in walks]
    # note: in Gensim >= 4.0 the `size` and `iter` arguments are named
    # `vector_size` and `epochs` instead
    model = Word2Vec(walks, size=50, window=10, min_count=0, sg=1, workers=8, iter=1)
    return model.wv
node_embeddings = learn_embeddings(walks)
Gensim returns a keyed-vectors object whose keys are the node ids as strings and whose values are the NumPy embedding arrays, each of shape (50,).
node_embeddings['0']
movie1 = str(movie2dict[(260, 'movie')])   # Star Wars: Episode IV
movie2 = str(movie2dict[(1196, 'movie')])  # Star Wars: Episode V
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie2])
movie3 = str(movie2dict[(1210, 'movie')])  # Star Wars: Episode VI
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie3])
movie4 = str(movie2dict[(1, 'movie')])     # Toy Story
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie4])
Since we worked with integer ids for the nodes, let's create reverse mappings from the integer node ids back to the original user/movie ids.
reverse_movie2dict = {k:v for v,k in movie2dict.items()}
reverse_user2dict = {k:v for v,k in user2dict.items()}
node_vecs = [node_embeddings[str(i)] for i in range(cnt)]
node_vecs = np.array(node_vecs)
node_vecs.shape
As a qualitative evaluation of the system, retrieve the movies most similar to a given movie.
def get_similar_movies_graph_embeddings(movieid, movie_embed, top_n=10):
    movie_idx = movie2dict[movieid]
    query = movie_embed[movie_idx].reshape(1, -1)
    ranking = cosine_similarity(query, movie_embed)
    top_ids = np.argsort(-ranking)[0]
    # keep only ids that correspond to movie nodes (drop user nodes)
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    sim_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return sim_movies
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]
get_similar_movies_graph_embeddings((122, 'movie'), node_vecs)[:10]
We can also define the recommendation model based on cosine similarity, i.e. for a given user, the movies are ranked by the cosine similarity of their embeddings with the user's embedding.
def get_recommended_movies_graph_embeddings(userid, node_embed, top_n=10):
    user_idx = user2dict[userid]
    query = node_embed[user_idx].reshape(1, -1)
    ranking = cosine_similarity(query, node_embed)
    top_ids = np.argsort(-ranking)[0]
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    reco_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return reco_movies
get_recommended_movies_graph_embeddings((1, 'user'), node_vecs, top_n=10)
As another evaluation, let's compare the generated recommendations for a user to the movies that the user has actually rated highly. We will get the top-10 recommendations for a user, ranked by cosine similarity, and count how many of them come from the set of movies the user has rated >= 4.5. This amounts to the Precision@10 metric. For comparison, we will also compute the precision of the recommendations produced by the matrix factorization model.
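A small helper for this check (a sketch we add here for convenience; the cells below simply inspect the intersections directly):

def precision_at_k(recos, true_pos, k=10):
    # fraction of the top-k recommendations that appear among the
    # user's highly-rated (>= 4.5) movies
    return len(recos.intersection(true_pos)) / float(k)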
idx = 1
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
idx = 2
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
idx = 3
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
The genres of the movies can be used as an additional signal for better recommendations.
movie_genre_edgelist = movie_df[['movieId', 'genres']]
movie_genre_edgelist.head()
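The genre2int mapping and the movie_genre_graph used below are not built in the extract. A sketch under two assumptions: genres are pipe-separated strings (the MovieLens format), and movie-genre edges get weight 1.0 (the original weighting is not shown). Genre ids continue the same integer id space, so cnt now also counts genre nodes:

genre2int = {}
movie_genre_graph = nx.Graph()
for mid, genres in movie_genre_edgelist.values:
    if (mid, 'movie') not in movie2dict:
        continue  # skip movies with no ratings (no node id assigned)
    for genre in genres.split('|'):
        if genre not in genre2int:
            genre2int[genre] = cnt
            cnt += 1
        movie_genre_graph.add_edge(movie2dict[(mid, 'movie')], genre2int[genre], weight=1.0)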
genre2int
Combine the user-movie and movie-genre graphs.
user_movie_genre_graph = nx.Graph()
user_movie_genre_graph.add_weighted_edges_from([(x,y,user_movie_graph[x][y]['weight']) for x,y in user_movie_graph.edges()])
user_movie_genre_graph.add_weighted_edges_from([(x,y,movie_genre_graph[x][y]['weight']) for x,y in movie_genre_graph.edges()])
user_movie_genre_graph.number_of_edges()
G_enriched = Graph(user_movie_genre_graph, is_directed=False, p=1, q=1)
G_enriched.preprocess_transition_probs()
walks_enriched = G_enriched.simulate_walks(num_walks=10, walk_length=80)
node_embeddings_enriched = learn_embeddings(walks_enriched)
node_vecs_enriched = [node_embeddings_enriched[str(i)] for i in range(cnt)]
node_vecs_enriched = np.array(node_vecs_enriched)
node_vecs_enriched.shape
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs_enriched)[:10]
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]
idx = 1
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))
ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))
ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))
idx = 8
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))
ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))
ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))
idx = 20
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))
ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))
ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))