Data gathering and exploration

rating_df = pd.read_csv('ml100k_ratings.csv', sep=',', header=0)
rating_df.head()
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
movie_df = pd.read_csv('ml100k_movies.csv', sep=',', header=0)
movie_df.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

Neighborhood method

Jaccard Similarity

If we ignore the ratings that the users have given to the movies and consider only which movies each user has watched, we get a set of users for every movie (and a set of movies for every user). Think of this formulation as a bipartite graph of users and movies, with an edge between a user and a movie if the user has watched the movie; all edges have the same weight.

Create a dictionary with movies as keys and, as values, the sets of users that have rated them.
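A minimal sketch of this step, assuming the rating_df loaded above:

# Map each movieId to the set of userIds that have rated it.
movie_sets = dict(rating_df.groupby('movieId')['userId'].apply(set))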

Since each movie is characterized by a set of users, we compute the similarity of two movies with the Jaccard index, which, for two sets, is the ratio of the size of their intersection to the size of their union. For example, if movie A was rated by users {1, 2, 3} and movie B by users {2, 3, 4}, their Jaccard index is 2/4 = 0.5.

def jaccard(movie1, movie2, movie_sets):
    # Jaccard index: |intersection| / |union| of the two user sets.
    a = movie_sets[movie1]
    b = movie_sets[movie2]
    intersection = float(len(a.intersection(b)))
    return intersection / (len(a) + len(b) - intersection)

Let's explore the similarity between some movies qualitatively. We use the movies dataframe to get movie titles from their ids.

movie_df[movie_df.movieId == 260].title.values[0]
'Star Wars: Episode IV - A New Hope (1977)'
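A sketch of how the comparisons below could be produced (assuming ids 1196 and 1210 for Episodes V and VI, consistent with their use later in this notebook):

def title_of(mid):
    return movie_df[movie_df.movieId == mid].title.values[0]

for other in [1196, 1210, 1]:   # Episode V, Episode VI, Toy Story
    print("Jaccard similarity between '%s' and '%s' is %.2f"
          % (title_of(260), title_of(other), jaccard(260, other, movie_sets)))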
Jaccard similarity between 'Star Wars: Episode IV - A New Hope (1977)' and 'Star Wars: Episode V - The Empire Strikes Back (1980)' is 0.70
Jaccard similarity between 'Star Wars: Episode IV - A New Hope (1977)' and 'Star Wars: Episode VI - Return of the Jedi (1983)' is 0.64
Jaccard similarity between 'Star Wars: Episode IV - A New Hope (1977)' and 'Toy Story (1995)' is 0.40

The movie Star Wars IV has higher similarity scores with the other Star Wars movies than with Toy Story.

Using the Jaccard index, we can retrieve the top-k movies most similar to a given movie. This gives a way to recommend to a user movies similar to the ones the user has watched.
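The helper get_similar_movies_jaccard is not shown in the source; a minimal sketch consistent with the outputs below (the query movie itself ranks first, with similarity 1.0):

def get_similar_movies_jaccard(movieid, movie_sets, top_n=5):
    # Rank every movie by its Jaccard similarity to the query movie.
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    ranked = sorted(movie_sets, key=lambda m: jaccard(movieid, m, movie_sets), reverse=True)
    sim_movies = [movie_df[movie_df.movieId == m].title.values[0] for m in ranked[:top_n]]
    return {'movie': movie, 'sim_movies': sim_movies}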

get_similar_movies_jaccard(260, movie_sets)
{'movie': 'Star Wars: Episode IV - A New Hope (1977)',
 'sim_movies': ['Star Wars: Episode IV - A New Hope (1977)',
  'Star Wars: Episode V - The Empire Strikes Back (1980)',
  'Star Wars: Episode VI - Return of the Jedi (1983)',
  'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
  'Matrix, The (1999)']}
get_similar_movies_jaccard(1, movie_sets)
{'movie': 'Toy Story (1995)',
 'sim_movies': ['Toy Story (1995)',
  'Independence Day (a.k.a. ID4) (1996)',
  'Jurassic Park (1993)',
  'Star Wars: Episode IV - A New Hope (1977)',
  'Forrest Gump (1994)']}

Cosine similarity

Rather than a set-based similarity like Jaccard, we can represent every movie as a sparse vector whose dimension equals the number of users; the entry for each user is the rating that the user gave the movie, or zero if no rating exists (i.e. the user hasn't seen/rated the movie).
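movie_sparse_vecs, movies, and movie2id are assumed rather than shown; a sketch that builds them from a pivoted ratings matrix (movies as rows, users as columns), together with the similarity imports used below:

from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# One row per movie, one column per user, 0.0 where no rating exists.
rating_matrix = rating_df.pivot(index='movieId', columns='userId', values='rating').fillna(0)
movies = list(rating_matrix.index)               # row position -> movieId
movie2id = {m: i for i, m in enumerate(movies)}  # movieId -> row position
movie_sparse_vecs = rating_matrix.values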

print(1.0 - cosine(movie_sparse_vecs[224], movie_sparse_vecs[897]))
0.8324073552233735

def get_similar_movies_nbd_cosine(movieid, movie_vecs, top_n=5):
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    movie_idx = movie2id[movieid]                      # row position of the query movie
    query = movie_vecs[movie_idx].reshape(1,-1)
    ranking = cosine_similarity(movie_vecs, query)     # similarity of every movie to the query
    top_ids = np.argsort(ranking, axis=0)              # ascending order of similarity
    top_ids = top_ids[::-1][:top_n]                    # keep the top_n most similar
    top_movie_ids = [movies[j[0]] for j in top_ids]    # row positions -> movie ids
    sim_movies = [movie_df[movie_df.movieId == id].title.values[0] for id in top_movie_ids]
    return {'movie': movie, 'sim_movies': sim_movies}
movieid = 1
movie_data = movie_sparse_vecs
get_similar_movies_nbd_cosine(movieid, movie_data, top_n=5)
{'movie': 'Toy Story (1995)',
 'sim_movies': ['Toy Story (1995)',
  'Toy Story 2 (1999)',
  'Jurassic Park (1993)',
  'Independence Day (a.k.a. ID4) (1996)',
  'Star Wars: Episode IV - A New Hope (1977)']}
movieid = 260
movie_data = movie_sparse_vecs
get_similar_movies_nbd_cosine(movieid, movie_data, top_n=5)
{'movie': 'Star Wars: Episode IV - A New Hope (1977)',
 'sim_movies': ['Star Wars: Episode IV - A New Hope (1977)',
  'Star Wars: Episode V - The Empire Strikes Back (1980)',
  'Star Wars: Episode VI - Return of the Jedi (1983)',
  'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
  'Matrix, The (1999)']}

Factorization method

Singular Value Decomposition

A very popular technique for recommendation systems is matrix factorization. The idea is to reduce the dimensionality of the data before computing similar movies/users. We factorize the user-item matrix to obtain the user factors and the item factors, low-dimensional embeddings in which 'similar' users/items are mapped to 'nearby' points.

This kind of analysis can surface matches that are impossible to find with the techniques discussed above, as the latent factors can capture attributes which are hard to decipher from the raw data, e.g. a latent factor can correspond to the degree to which a movie is female-oriented, or the degree to which its characters develop slowly.

Moreover, the users and the movies are embedded into the same space, which provides a direct way to compute user-movie similarity.

We will use Singular Value Decomposition (SVD) for factorizing the matrix.
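ratings_mat is assumed to be the dense movies-by-users matrix (the shapes printed below imply 9724 movie rows and 610 user columns); it can be built the same way as movie_sparse_vecs above:

from scipy.sparse.linalg import svds

# Dense movies-x-users ratings matrix, zeros for missing ratings.
ratings_mat = rating_df.pivot(index='movieId', columns='userId', values='rating').fillna(0).values
print(ratings_mat.shape)   # expected: (9724, 610)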

Normalize the rating matrix

normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

The number of latent factors is chosen to be 50, i.e. the top 50 singular values of the SVD are kept. The number of latent factors is a hyperparameter of the model and requires a more sophisticated analysis to tune; we provide no particular justification for 50 (but see the sketch after the factorization below).

n_factors = 50

# svds returns the top n_factors singular triplets of A.
A = normalised_mat.T / np.sqrt(ratings_mat.shape[0] - 1)
U, S, V = svds(A, n_factors)

print(U.shape, V.shape)
(610, 50) (50, 9724)
movie_factors = V.T
user_factors = U
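As mentioned above, the choice of n_factors deserves more care than we give it; one simple diagnostic (our sketch, not from the source) is to look at how quickly the singular values decay:

# Hypothetical diagnostic for choosing n_factors: plot the singular value spectrum.
# svds returns the singular values in ascending order, so sort them descending.
import matplotlib.pyplot as plt
plt.plot(sorted(S, reverse=True), marker='.')
plt.xlabel('factor index')
plt.ylabel('singular value')
plt.show()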

Instead of representing each movie as a sparse vector of the ratings of all 610 users, after factorizing the matrix each movie is represented by a 50-dimensional dense vector.

Define a routine to get top-n movies similar to a given movie.

def get_similar_movies_matrix_factorization(data, movieid, top_n=10):
    # CAVEAT: mapping movieId <-> row as movieid - 1 assumes movie ids are contiguous
    # starting from 1. In this dataset they are not, so for high ids the query row
    # belongs to a different movie; the results shown below for movie 260 reflect
    # this misalignment. A lookup table like movie2id above would be safer.
    index = movieid - 1 # Movie id starts from 1
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    movie_row = data[index].reshape(1,-1)
    similarity = cosine_similarity(movie_row, data)
    sort_indexes = np.argsort(-similarity)[0]
    return {'movie': movie, 'sim_movies': [movie_df[movie_df.movieId == id].title.values[0] for id in sort_indexes[:top_n] + 1]}
movie_id = 260
get_similar_movies_matrix_factorization(movie_factors, movie_id)
{'movie': 'Priest (1994)',
 'sim_movies': ['Priest (1994)',
  'Heidi Fleiss: Hollywood Madam (1995)',
  'Reds (1981)',
  'I Went Down (1997)',
  'Metroland (1997)',
  'Love Serenade (1996)',
  'Cold Fever (Á köldum klaka) (1995)',
  'Suture (1993)',
  'Whole Wide World, The (1996)',
  'Walking and Talking (1996)']}
movie_id = 1
get_similar_movies_matrix_factorization(movie_factors, movie_id)
{'movie': 'Toy Story (1995)',
 'sim_movies': ['Toy Story (1995)',
  'Back to the Future (1985)',
  "Bug's Life, A (1998)",
  'Babe (1995)',
  'Star Wars: Episode IV - A New Hope (1977)',
  'Who Framed Roger Rabbit? (1988)',
  'Mrs. Doubtfire (1993)',
  'When Harry Met Sally... (1989)',
  '101 Dalmatians (One Hundred and One Dalmatians) (1961)',
  'Home Alone (1990)']}

Since the users and movies are embedded in the same space, we can also compute movies similar to a user. A recommendation model can then be defined as showing a user the movies most similar to them.

def get_recommendations_matrix_factorization(userid, user_factors, movie_factors, top_n=10):
    # Same contiguous-id assumption as above (userid - 1 for rows, index + 1 for movieId).
    user_vec = user_factors[userid - 1].reshape(1,-1)
    similarity = cosine_similarity(user_vec, movie_factors)
    sort_indexes = np.argsort(-similarity)[0]
    return [movie_df[movie_df.movieId == id].title.values[0] for id in sort_indexes[:top_n] + 1]
top_recos = get_recommendations_matrix_factorization(1, user_factors, movie_factors)
top_recos
['Jungle2Jungle (a.k.a. Jungle 2 Jungle) (1997)',
 "Pete's Dragon (2016)",
 'Cellular (2004)',
 'Replacement Killers, The (1998)',
 'Rough Night (2017)',
 'Star Wars: Episode III - Revenge of the Sith (2005)',
 'Gentlemen Prefer Blondes (1953)',
 'Spanglish (2004)',
 'Sorry to Bother You (2018)',
 'Planet of the Apes (2001)']

Graph Embedding method

Create a user-movie graph with edge weights given by the ratings. We will use DeepWalk to embed every node of the graph into a low-dimensional space.

user_item_edgelist = rating_df[['userId', 'movieId', 'rating']]
user_item_edgelist.head()
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0

Create a user-movie weighted graph using the Python library networkx.
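The integer node-id mappings user2dict and movie2dict, and the counter cnt, are not shown in the source; a sketch that assigns consecutive integers, users first, then movies (ending with cnt = 10334):

import networkx as nx

# Hypothetical node-id assignment: (id, type) tuple -> consecutive integer.
cnt = 0
user2dict = {}
for u in rating_df['userId'].unique():
    user2dict[(u, 'user')] = cnt
    cnt += 1
movie2dict = {}
for m in rating_df['movieId'].unique():
    movie2dict[(m, 'movie')] = cnt
    cnt += 1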

user_movie_graph = nx.Graph()
for x in user_item_edgelist.values:
    usr = (x[0], 'user')
    movie = (x[1], 'movie')
    user_movie_graph.add_node(user2dict[usr])
    user_movie_graph.add_node(movie2dict[movie])
    user_movie_graph.add_edge(user2dict[usr], movie2dict[movie], weight=float(x[2]))
user_movie_graph.number_of_edges()
100836
user_movie_graph.number_of_nodes()
10334

DeepWalk

We will use the implementation of DeepWalk provided in the node2vec reference code, which differs a bit from the original DeepWalk; e.g. it uses negative sampling, whereas the original DeepWalk paper used hierarchical softmax for the skip-gram model.

To create embeddings from the context and non-context pairs, we use the Gensim Python library. One could just as easily use Google's word2vec or Facebook's fastText for this task.

import numpy as np
import networkx as nx
import random


class Graph():
	def __init__(self, nx_G, is_directed, p, q):
		self.G = nx_G
		self.is_directed = is_directed
		self.p = p
		self.q = q

	def node2vec_walk(self, walk_length, start_node):
		'''
		Simulate a random walk starting from start node.
		'''
		G = self.G
		alias_nodes = self.alias_nodes
		alias_edges = self.alias_edges

		walk = [start_node]

		while len(walk) < walk_length:
			cur = walk[-1]
			cur_nbrs = sorted(G.neighbors(cur))
			if len(cur_nbrs) > 0:
				if len(walk) == 1:
					walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
				else:
					prev = walk[-2]
					next = cur_nbrs[alias_draw(alias_edges[(prev, cur)][0], 
						alias_edges[(prev, cur)][1])]
					walk.append(next)
			else:
				break

		return walk

	def simulate_walks(self, num_walks, walk_length):
		'''
		Repeatedly simulate random walks from each node.
		'''
		G = self.G
		walks = []
		nodes = list(G.nodes())
		print('Walk iteration:')
		for walk_iter in range(num_walks):
			print(str(walk_iter+1), '/', str(num_walks))
			random.shuffle(nodes)
			for node in nodes:
				walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))

		return walks

	def get_alias_edge(self, src, dst):
		'''
		Get the alias edge setup lists for a given edge.
		'''
		G = self.G
		p = self.p
		q = self.q

		unnormalized_probs = []
		for dst_nbr in sorted(G.neighbors(dst)):
			if dst_nbr == src:
				unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)
			elif G.has_edge(dst_nbr, src):
				unnormalized_probs.append(G[dst][dst_nbr]['weight'])
			else:
				unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)
		norm_const = sum(unnormalized_probs)
		try:
			normalized_probs =  [float(u_prob)/norm_const for u_prob in unnormalized_probs]
		except:
			normalized_probs =  [0.0 for u_prob in unnormalized_probs]

		return alias_setup(normalized_probs)

	def preprocess_transition_probs(self):
		'''
		Preprocessing of transition probabilities for guiding the random walks.
		'''
		G = self.G
		is_directed = self.is_directed

		alias_nodes = {}
		for node in G.nodes():
			unnormalized_probs = [G[node][nbr]['weight'] for nbr in sorted(G.neighbors(node))]
			norm_const = sum(unnormalized_probs)
			try:
				normalized_probs =  [float(u_prob)/norm_const for u_prob in unnormalized_probs]
			except:
				print(node)
				normalized_probs =  [0.0 for u_prob in unnormalized_probs]
			alias_nodes[node] = alias_setup(normalized_probs)

		alias_edges = {}
		triads = {}

		if is_directed:
			for edge in G.edges():
				alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])
		else:
			for edge in G.edges():
				alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])
				alias_edges[(edge[1], edge[0])] = self.get_alias_edge(edge[1], edge[0])

		self.alias_nodes = alias_nodes
		self.alias_edges = alias_edges

		return


def alias_setup(probs):
	'''
	Compute utility lists for non-uniform sampling from discrete distributions.
	Refer to https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/
	for details
	'''
	K = len(probs)
	q = np.zeros(K)
	J = np.zeros(K, dtype=int)  # np.int was removed in modern NumPy; the builtin int works

	smaller = []
	larger = []
	for kk, prob in enumerate(probs):
	    q[kk] = K*prob
	    if q[kk] < 1.0:
	        smaller.append(kk)
	    else:
	        larger.append(kk)

	while len(smaller) > 0 and len(larger) > 0:
	    small = smaller.pop()
	    large = larger.pop()

	    J[small] = large
	    q[large] = q[large] + q[small] - 1.0
	    if q[large] < 1.0:
	        smaller.append(large)
	    else:
	        larger.append(large)

	return J, q

def alias_draw(J, q):
	'''
	Draw sample from a non-uniform discrete distribution using alias sampling.
	'''
	K = len(J)

	kk = int(np.floor(np.random.rand()*K))
	if np.random.rand() < q[kk]:
	    return kk
	else:
	    return J[kk]
G = Graph(user_movie_graph, is_directed=False, p=1, q=1)

Setting p = q = 1 recovers DeepWalk, as the random walks are then completely unbiased.

G.preprocess_transition_probs()

Compute the random walks.

  • 10 walks for every node.
  • Each walk of length 80.
walks = G.simulate_walks(num_walks=10, walk_length=80)
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
len(walks)
103340

Learn embeddings via Gensim, which creates the context pairs (and, for negative sampling, non-context pairs) from the walks and trains a skip-gram model.

from gensim.models import Word2Vec

def learn_embeddings(walks):
    '''
    Learn embeddings by optimizing the Skipgram objective using SGD.
    Uses Gensim Word2Vec.
    '''
    walks = [list(map(str, walk)) for walk in walks]
    # gensim >= 4 renamed size -> vector_size and iter -> epochs.
    model = Word2Vec(walks, vector_size=50, window=10, min_count=0, sg=1, workers=8, epochs=1)
    return model.wv
node_embeddings = learn_embeddings(walks)

The output of Gensim is a keyed-vector object: the keys are the stringified node ids and the values are NumPy arrays of embeddings, each of shape (50,).

node_embeddings['0']
array([-2.1159494e-02, -4.3432057e-01,  6.9623584e-01,  4.8146245e-01,
        9.6677847e-02, -3.1050149e-02,  1.9733864e-01,  7.9867625e-01,
       -6.6979128e-01,  6.5312237e-01,  3.1304079e-01, -1.3411559e-01,
       -1.5454048e-01, -2.7333325e-01,  1.4711864e-01,  2.2469629e-01,
       -4.3890166e-01,  2.9871342e-01, -6.9798152e-03, -1.7996507e-02,
       -6.7855030e-02, -4.3489739e-01,  1.5584855e-01,  7.7486165e-02,
        3.7617716e-01,  2.8012756e-01, -8.3905622e-02, -3.5362533e-01,
        2.2293477e-01, -1.4108117e-01,  1.6970167e-01, -6.3179672e-01,
        1.5170584e-02,  6.1733756e-02, -1.2013953e-01, -2.3064958e-01,
        2.4610328e-02,  5.0556450e-04,  2.1398006e-01, -1.0361964e-01,
        4.8838145e-01, -3.6318046e-01, -3.1330651e-01,  8.6576389e-03,
        1.9654050e-02, -4.6078888e-01,  2.7895319e-01, -1.7853497e-04,
       -1.7593203e-01,  2.8144377e-01], dtype=float32)
movie1 = str(movie2dict[(260, 'movie')])    # Star Wars: Episode IV
movie2 = str(movie2dict[(1196, 'movie')])   # Star Wars: Episode V
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie2])
0.42126354575157166
movie3 = str(movie2dict[(1210, 'movie')])   # Star Wars: Episode VI
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie3])
0.29301050305366516
movie4 = str(movie2dict[(1, 'movie')])      # Toy Story
1.0 - cosine(node_embeddings[movie1], node_embeddings[movie4])
0.5913276672363281

Since we worked with integer ids for the nodes, let's create reverse mapping dictionaries that map the integer ids back to the actual user/movie ids.

reverse_movie2dict = {k:v for v,k in movie2dict.items()}
reverse_user2dict = {k:v for v,k in user2dict.items()}
node_vecs = [node_embeddings[str(i)] for i in range(cnt)]  # cnt = total node count (10334 here)
node_vecs = np.array(node_vecs)
node_vecs.shape
(10334, 50)

As a qualitative evaluation of the system, retrieve the movies most similar to a given movie.

def get_similar_movies_graph_embeddings(movieid, movie_embed, top_n=10):
    movie_idx = movie2dict[movieid]               # movieid is an (id, 'movie') tuple
    query = movie_embed[movie_idx].reshape(1,-1)
    ranking = cosine_similarity(query, movie_embed)
    top_ids = np.argsort(-ranking)[0]
    # Keep only movie nodes: user nodes are filtered out by the reverse mapping.
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    sim_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return sim_movies
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]
['Priest (1994)',
 'Heidi Fleiss: Hollywood Madam (1995)',
 'My Crazy Life (Mi vida loca) (1993)',
 'Before the Rain (Pred dozhdot) (1994)',
 'Boys of St. Vincent, The (1992)',
 'Last Dance (1996)',
 'Awfully Big Adventure, An (1995)',
 'Queen Margot (Reine Margot, La) (1994)',
 "Widows' Peak (1994)",
 'What Happened Was... (1994)']
get_similar_movies_graph_embeddings((122, 'movie'), node_vecs)[:10]
['Awfully Big Adventure, An (1995)',
 'What Happened Was... (1994)',
 'Song of the Little Road (Pather Panchali) (1955)',
 'Death and the Maiden (1994)',
 'Heidi Fleiss: Hollywood Madam (1995)',
 'My Crazy Life (Mi vida loca) (1993)',
 'Love & Human Remains (1993)',
 "It's My Party (1996)",
 'Perez Family, The (1995)',
 'Bitter Moon (1992)']

We can also define the recommendation model based on cosine similarity, i.e. the movies are ranked for a given user by the cosine similarity of their embeddings with the embedding of the user.

def get_recommended_movies_graph_embeddings(userid, node_embed, top_n=10):
    user_idx = user2dict[userid]
    query = node_embed[user_idx].reshape(1,-1)
    ranking = cosine_similarity(query, node_embed)
    top_ids = np.argsort(-ranking)[0]
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    reco_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return reco_movies
get_recommended_movies_graph_embeddings((1, 'user'), node_vecs, top_n=10)
['Best Men (1997)',
 'Newton Boys, The (1998)',
 'Howard the Duck (1986)',
 "Gulliver's Travels (1939)",
 'Shaft (1971)',
 'Teenage Mutant Ninja Turtles III (1993)',
 'Welcome to Woop-Woop (1997)',
 'Song of the South (1946)',
 'Three Caballeros, The (1945)',
 'Lord of the Rings, The (1978)']

Evaluation

As another evaluation, let's compare the generated recommendations for a user to the movies that the user has actually rated highly. We will get the top 10 recommendations for a user, ranked by cosine similarity, and compute how many of these movies come from the set of movies that the user has rated >= 4.5. This amounts to the Precision@10 metric. For comparison, we will also compute the precision of the recommendations produced by the matrix factorization model.
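A small helper capturing this metric (the name and signature are ours, not from the source); below we simply inspect the intersections directly:

def precision_at_k(recommended, relevant, k=10):
    # Fraction of the top-k recommended titles that appear in the relevant set.
    recommended = list(recommended)[:k]
    return sum(1 for title in recommended if title in relevant) / k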

idx = 1
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
{"Gulliver's Travels (1939)",
 'Lord of the Rings, The (1978)',
 'Newton Boys, The (1998)',
 'Shaft (1971)',
 'Three Caballeros, The (1945)'}
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
set()
idx = 2
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
{'Dark Knight, The (2008)',
 'Inglourious Basterds (2009)',
 'The Jinx: The Life and Deaths of Robert Durst (2015)',
 'Warrior (2011)',
 'Wolf of Wall Street, The (2013)'}
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
set()
idx = 3
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)
{'Alien Contamination (1980)',
 'Android (1982)',
 'Clonus Horror, The (1979)',
 'Death Race 2000 (1975)',
 'Galaxy of Terror (Quest) (1981)',
 'Hangar 18 (1980)',
 'Looker (1981)',
 'Master of the Flying Guillotine (Du bi quan wang da po xue di zi) (1975)',
 'Saturn 3 (1980)',
 'The Lair of the White Worm (1988)'}
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)
set()

Enriched network with additional information: genres

The genres of the movies can be used as an additional signal for better recommendations.

movie_genre_edgelist = movie_df[['movieId', 'genres']]
movie_genre_edgelist.head()
movieId genres
0 1 Adventure|Animation|Children|Comedy|Fantasy
1 2 Adventure|Children|Fantasy
2 3 Comedy|Romance
3 4 Comedy|Drama|Romance
4 5 Comedy
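The construction of genre2int and movie_genre_graph is not shown; a sketch that continues the integer node ids after the user/movie nodes and assumes weight 1.0 for movie-genre edges:

# Hypothetical construction of the genre ids and the movie-genre graph.
movie_genre_graph = nx.Graph()
genre2int = {}
for m, genres in movie_genre_edgelist.values:
    if (m, 'movie') not in movie2dict:
        continue   # skip movies that no user has rated (they have no node)
    for g in genres.split('|'):
        if g not in genre2int:
            genre2int[g] = cnt   # continue numbering after the user/movie nodes
            cnt += 1
        movie_genre_graph.add_edge(movie2dict[(m, 'movie')], genre2int[g], weight=1.0)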
genre2int
{'(no genres listed)': 10353,
 'Action': 10341,
 'Adventure': 10334,
 'Animation': 10335,
 'Children': 10336,
 'Comedy': 10337,
 'Crime': 10342,
 'Documentary': 10349,
 'Drama': 10340,
 'Fantasy': 10338,
 'Film-Noir': 10352,
 'Horror': 10344,
 'IMAX': 10350,
 'Musical': 10348,
 'Mystery': 10345,
 'Romance': 10339,
 'Sci-Fi': 10346,
 'Thriller': 10343,
 'War': 10347,
 'Western': 10351}

Combine the user-movie and movie-genre graphs.

user_movie_genre_graph =  nx.Graph()
user_movie_genre_graph.add_weighted_edges_from([(x,y,user_movie_graph[x][y]['weight']) for x,y in user_movie_graph.edges()])
user_movie_genre_graph.add_weighted_edges_from([(x,y,movie_genre_graph[x][y]['weight']) for x,y in movie_genre_graph.edges()])
user_movie_genre_graph.number_of_edges()
122882
G_enriched = Graph(user_movie_genre_graph, is_directed=False, p=1, q=1)
G_enriched.preprocess_transition_probs()
walks_enriched = G_enriched.simulate_walks(num_walks=10, walk_length=80)
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
node_embeddings_enriched = learn_embeddings(walks_enriched)
node_vecs_enriched = [node_embeddings_enriched[str(i)] for i in range(cnt)]  # cnt now includes the 20 genre nodes (10354)
node_vecs_enriched = np.array(node_vecs_enriched)
node_vecs_enriched.shape
(10354, 50)
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs_enriched)[:10]
['Priest (1994)',
 'Mrs. Parker and the Vicious Circle (1994)',
 'Last Dance (1996)',
 'Tom & Viv (1994)',
 'Georgia (1995)',
 'My Crazy Life (Mi vida loca) (1993)',
 'Before the Rain (Pred dozhdot) (1994)',
 'Haunted World of Edward D. Wood Jr., The (1996)',
 'To Live (Huozhe) (1994)',
 'Vanya on 42nd Street (1994)']
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]
['Priest (1994)',
 'Heidi Fleiss: Hollywood Madam (1995)',
 'My Crazy Life (Mi vida loca) (1993)',
 'Before the Rain (Pred dozhdot) (1994)',
 'Boys of St. Vincent, The (1992)',
 'Last Dance (1996)',
 'Awfully Big Adventure, An (1995)',
 'Queen Margot (Reine Margot, La) (1994)',
 "Widows' Peak (1994)",
 'What Happened Was... (1994)']
idx = 1
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])

mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))

ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))

ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))
0
5
5
idx = 8
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])

mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))

ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))

ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))
0
2
1
idx = 20
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])

mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))

ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))

ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_enriched, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))
0
0
1