Retail Product Recommendation with Negative Implicit Feedback
A tutorial demonstrating how to train and evaluate various recommender models on data from an online retail store. Along with positive feedback events such as view and add-to-cart, we also have a negative event, 'remove-from-cart'.
- Data Loading
- Wrangling
- Exploration
- Rule-based Approaches
- Data Transformation
- Train Test Split
- Experiments
import pandas as pd

df = pd.read_csv('rawdata.csv', header=0,
                 names=['event', 'userid', 'itemid', 'timestamp'],
                 dtype={'event': 'category', 'userid': 'category', 'itemid': 'category'},
                 parse_dates=['timestamp'])
df.head()
df.info()
df = df.drop_duplicates()
from sklearn import preprocessing

# userid encoding
userid_encoder = preprocessing.LabelEncoder()
df.userid = userid_encoder.fit_transform(df.userid)
# itemid encoding
itemid_encoder = preprocessing.LabelEncoder()
df.itemid = itemid_encoder.fit_transform(df.itemid)
df.describe().T
df.describe(exclude='int').T
df.timestamp.max() - df.timestamp.min()
df.event.value_counts()
df.event.value_counts()/df.userid.nunique()
def top_trending(n, timeperiod, timestamp):
    # top-n most viewed items in the `timeperiod` minutes preceding `timestamp`
    start = str(timestamp.replace(microsecond=0) - pd.Timedelta(minutes=timeperiod))
    end = str(timestamp.replace(microsecond=0))
    trending_items = df.loc[(df.timestamp.between(start, end)) & (df.event == 'view_item'), :].sort_values('timestamp', ascending=False)
    return trending_items.itemid.value_counts().index[:n]
user_current_time = df.timestamp[100]
top_trending(5, 50, user_current_time)
def least_n_items(n=10):
    # n least-viewed items, ties broken by the most recent interaction
    temp1 = df.loc[df.event == 'view_item'].groupby(['itemid'])['event'].count().sort_values(ascending=True).reset_index()
    temp2 = df.groupby('itemid').timestamp.max().reset_index()
    item_ids = pd.merge(temp1, temp2, on='itemid').sort_values(['event', 'timestamp'], ascending=[True, False]).reset_index().loc[:n-1, 'itemid']
    return itemid_encoder.inverse_transform(item_ids.values)
least_n_items(10)
In many cases users give no explicit ratings or preferences; the interactions are implicit. These events still reflect users' preferences for the items, albeit indirectly, so we convert them into affinity scores.
Option 1 - Simple Count: The simplest technique is to count the number of interactions between a user and an item to produce an affinity score.
Option 2 - Weighted Count: It is often useful to weight the different interaction types in the count aggregation. For example, the three event types "click", "add", and "purchase" could be given weights of 1, 2, and 3, respectively.
Option 3 - Time-dependent Count: In many scenarios, time dependency plays a critical role in preparing a dataset for a collaborative filtering model that captures drift in user interests over time. A common technique for achieving a time-dependent count is to add a time decay factor to the counting.
data_count = df.groupby(['userid', 'itemid']).agg({'timestamp': 'count'}).reset_index()
data_count.columns = ['userid', 'itemid', 'affinity']
data_count.head()
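The weighted-count snippet below uses `data_w` and `affinity_weights`, which are not defined in the cells above. A minimal sketch, assuming `data_w` is simply a copy of the interaction frame and using illustrative weights; the keys must match the event labels actually present in the data (as listed by `df.event.value_counts()`), and the negative 'remove-from-cart' event receives a negative weight.
# Illustrative per-event weights; adjust the keys to the actual event labels in the data.
affinity_weights = {
    'view_item': 1,          # weak positive signal
    'add_to_cart': 3,        # strong positive signal
    'remove_from_cart': -1,  # negative implicit feedback
}
data_w = df.copy()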
data_w['weight'] = data_w['event'].apply(lambda x: affinity_weights[x])
data_wcount = data_w.groupby(['userid', 'itemid'])['weight'].sum().reset_index()
data_wcount.columns = ['userid', 'itemid', 'affinity']
data_wcount.head()
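The `timedecay` column aggregated below is likewise not computed in the cells above. A minimal sketch, assuming an exponential half-life decay applied to the per-event weight, with the newest timestamp in the data as the reference time; the 30-day half-life is an arbitrary illustrative choice.
import numpy as np

# Exponential decay: an interaction T_half days old contributes half of its weight.
T_half = 30  # assumed half-life in days
age_days = (data_w.timestamp.max() - data_w.timestamp).dt.total_seconds() / (24 * 3600)
data_w['timedecay'] = data_w['weight'] * np.power(0.5, age_days / T_half)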
data_wt = data_w.groupby(['userid', 'itemid'])['timedecay'].sum().reset_index()
data_wt.columns = ['userid', 'itemid', 'affinity']
data_wt.head()
Option 1 - Random Split: A random split simply takes in a dataset and outputs splits of the data, given the split ratios.
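Option 1 is not exercised further in this notebook; a minimal sketch, assuming the reco_utils random splitter is available (the 75/25 ratio mirrors the chronological split used below):
from reco_utils.dataset.python_splitters import python_random_split

# Plain random 75/25 split of the affinity data, ignoring time.
train_rnd, test_rnd = python_random_split(
    data_w[['userid', 'itemid', 'timedecay', 'timestamp']], ratio=0.75)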
Option 2 - Chronological Split: The chronological split method takes in a dataset and splits it on the timestamp column, so that for each user the earlier interactions go to the training set and the later ones to the test set.
from reco_utils.dataset.python_splitters import python_chrono_split

data = data_w[['userid', 'itemid', 'timedecay', 'timestamp']]
col = {
    'col_user': 'userid',
    'col_item': 'itemid',
    'col_rating': 'timedecay',
    'col_timestamp': 'timestamp',
}
col3 = {
    'col_user': 'userid',
    'col_item': 'itemid',
    'col_timestamp': 'timestamp',
}
train, test = python_chrono_split(data, ratio=0.75, min_rating=10,
                                  filter_by='user', **col3)
train.loc[train.userid==7,:]
test.loc[test.userid==7,:]
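`item_counts` and `users_items_remove_seen`, which feed the popularity baseline below, are not constructed in the cells above. A minimal sketch, under the assumption that popularity is measured by training interaction counts and that each user is only recommended items they have not interacted with in the training split (the names mirror those used below, but the construction itself is illustrative):
# Popularity score: number of training interactions per item.
item_counts = train.groupby('itemid').size().reset_index(name='prediction')

# Candidate (user, item) pairs: every combination minus pairs already seen in training.
users = train[['userid']].drop_duplicates().assign(key=1)
items = train[['itemid']].drop_duplicates().assign(key=1)
all_pairs = users.merge(items, on='key').drop(columns='key')
seen = train[['userid', 'itemid']].drop_duplicates()
users_items_remove_seen = (all_pairs.merge(seen, on=['userid', 'itemid'],
                                           how='left', indicator=True)
                           .query("_merge == 'left_only'")
                           .drop(columns='_merge'))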
baseline_recommendations = pd.merge(item_counts, users_items_remove_seen,
                                    on=['itemid'], how='inner')
baseline_recommendations.head()
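The metric variables printed next are not computed in the cells shown. A minimal sketch, assuming the ranking metrics from reco_utils are applied to the baseline's `prediction` scores against the chronological test split; k=10 is an assumption matching the TOP_K used later.
from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

# Rank-based evaluation of the popularity baseline at k=10.
metric_args = dict(col_user='userid', col_item='itemid',
                   col_rating='timedecay', col_prediction='prediction', k=10)
eval_map = map_at_k(test, baseline_recommendations, **metric_args)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, **metric_args)
eval_precision = precision_at_k(test, baseline_recommendations, **metric_args)
eval_recall = recall_at_k(test, baseline_recommendations, **metric_args)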
print("MAP:\t%f" % eval_map,
"NDCG@K:\t%f" % eval_ndcg,
"Precision@K:\t%f" % eval_precision,
"Recall@K:\t%f" % eval_recall, sep='\n')
all_predictions.head()
from reco_utils.recommender.sar.sar_singlenode import SARSingleNode
TOP_K = 10
header = {
    "col_user": "userid",
    "col_item": "itemid",
    "col_rating": "timedecay",
    "col_timestamp": "timestamp",
    "col_prediction": "prediction",
}
model = SARSingleNode(
    similarity_type="jaccard",
    time_decay_coefficient=0,
    time_now=None,
    timedecay_formula=False,
    **header
)
model.fit(train)
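To actually produce recommendations from the fitted SAR model, its recommend_k_items method can be used; a minimal sketch (remove_seen=True keeps items already seen in training out of the top-k lists):
# Top-k SAR recommendations for the users in the test split.
top_k_sar = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)
top_k_sar.head()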
from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.evaluation import precision_recall_score

interactions = Interactions(user_ids=df.userid.astype('int32').values,
                            item_ids=df.itemid.astype('int32').values,
                            timestamps=df.timestamp.values.astype('int64'),
                            num_users=df.userid.nunique(),
                            num_items=df.itemid.nunique())
train_user, test_user = random_train_test_split(interactions, test_percentage=0.2)
model = ImplicitFactorizationModel(loss='bpr', embedding_dim=64, n_iter=10,
                                   batch_size=256, l2=0.0, learning_rate=0.01,
                                   optimizer_func=None, use_cuda=False,
                                   representation=None, sparse=False,
                                   num_negative_samples=10)
model.fit(train_user, verbose=1)
pr = precision_recall_score(model, test=test_user, train=train_user, k=10)
print('Precision@10 is {:.3f} and Recall@10 is {:.3f}'.format(pr[0].mean(), pr[1].mean()))
Implicit Sequence Model
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.evaluation import sequence_mrr_score

# item ids are shifted by +1 because index 0 is reserved for padding in sequence models
interactions = Interactions(user_ids=df.userid.astype('int32').values,
                            item_ids=df.itemid.astype('int32').values + 1,
                            timestamps=df.timestamp.values.astype('int64'))
train, test = random_train_test_split(interactions, test_percentage=0.2)
train_seq = train.to_sequence(max_sequence_length=10)
test_seq = test.to_sequence(max_sequence_length=10)
model = ImplicitSequenceModel(loss='bpr', representation='pooling',
                              embedding_dim=32, n_iter=10, batch_size=256,
                              l2=0.0, learning_rate=0.01, optimizer_func=None,
                              use_cuda=False, sparse=False, num_negative_samples=5)
model.fit(train_seq, verbose=1)
mrr_seq = sequence_mrr_score(model, test_seq)
mrr_seq.mean()
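The `learn` object used in the final cells is not created in the cells shown. A minimal sketch, assuming a fastai v1 collaborative-filtering learner trained on the time-decayed affinity table `data_wt`; the number of factors and the y_range are illustrative choices.
from fastai.collab import CollabDataBunch, collab_learner

# Treat the time-decayed affinity as the "rating" to be regressed.
collab_data = CollabDataBunch.from_df(data_wt, user_name='userid', item_name='itemid',
                                      rating_name='affinity', valid_pct=0.2, seed=42)
learn = collab_learner(collab_data, n_factors=40,
                       y_range=(0, float(data_wt.affinity.max())))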
# a single cycle at a very low learning rate
learn.fit_one_cycle(1, 5e-6)
learn.summary()
# train for 10 epochs at learning rate 1e-3
learn.fit(10, 1e-3)