Data Loading

import pandas as pd

df = pd.read_csv('rawdata.csv', header=0,
                 names=['event', 'userid', 'itemid', 'timestamp'],
                 dtype={'event': 'category', 'userid': 'category', 'itemid': 'category'},
                 parse_dates=['timestamp'])
df.head()
event userid itemid timestamp
0 view_item 2763227 11056 2020-01-13 16:05:31.244000+00:00
1 add_to_cart 2828666 14441 2020-01-13 22:36:38.680000+00:00
2 view_item 0620225789 14377 2020-01-14 10:54:41.886000+00:00
3 view_item 0620225789 14377 2020-01-14 10:54:47.692000+00:00
4 add_to_cart 0620225789 14377 2020-01-14 10:54:48.479000+00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99998 entries, 0 to 99997
Data columns (total 4 columns):
event        99998 non-null category
userid       99998 non-null category
itemid       99998 non-null category
timestamp    99998 non-null datetime64[ns, UTC]
dtypes: category(3), datetime64[ns, UTC](1)
memory usage: 1.7 MB

Wrangling

Removing Duplicates

df = df.drop_duplicates()

Label Encoding

from sklearn import preprocessing

# userid encoding
userid_encoder = preprocessing.LabelEncoder()
df.userid = userid_encoder.fit_transform(df.userid)

# itemid encoding
itemid_encoder = preprocessing.LabelEncoder()
df.itemid = itemid_encoder.fit_transform(df.itemid)

Exploration

df.describe().T
count mean std min 25% 50% 75% max
userid 99432.0 4682.814677 3011.178734 0.0 2507.0 3687.0 6866.0 11476.0
itemid 99432.0 1344.579964 769.627122 0.0 643.0 1356.0 1997.0 2633.0
df.describe(exclude='int').T
count unique top freq first last
event 99432 5 begin_checkout 41459 NaT NaT
timestamp 99432 61372 2020-01-16 04:21:49.377000+00:00 25 2020-01-13 16:05:31.244000+00:00 2020-03-10 13:02:21.376000+00:00
df.timestamp.max() - df.timestamp.min()
Timedelta('56 days 20:56:50.132000')
df.event.value_counts()
begin_checkout      41459
view_item           35397
purchase             9969
add_to_cart          7745
remove_from_cart     4862
Name: event, dtype: int64
df.event.value_counts()/df.userid.nunique()
begin_checkout      3.612355
view_item           3.084168
purchase            0.868607
add_to_cart         0.674828
remove_from_cart    0.423630
Name: event, dtype: float64

User Interactions

Add-to-cart Event Counts
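
No code accompanies this subsection; a minimal sketch of the aggregation it refers to, counting add-to-cart events per user:

add_to_cart_per_user = df[df.event == 'add_to_cart'].groupby('userid')['event'].count()
add_to_cart_per_user.describe()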

Purchase Event Counts
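
Similarly, a sketch counting purchase events per user:

purchases_per_user = df[df.event == 'purchase'].groupby('userid')['event'].count()
purchases_per_user.describe()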

Item Interactions
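
And a sketch counting interactions of any type per item:

interactions_per_item = df.groupby('itemid')['event'].count()
interactions_per_item.describe()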

Rule-based Approaches

def top_trending(n, timeperiod, timestamp):
  """Return the n most-viewed items in the `timeperiod` minutes preceding `timestamp`."""
  start = str(timestamp.replace(microsecond=0) - pd.Timedelta(minutes=timeperiod))
  end = str(timestamp.replace(microsecond=0))
  trending_items = df.loc[df.timestamp.between(start, end) & (df.event == 'view_item')]
  return trending_items.itemid.value_counts().index[:n]
user_current_time = df.timestamp[100]
top_trending(5, 50, user_current_time)
Int64Index([2241, 972, 393, 1118, 126], dtype='int64')

Top-N Least Viewed Items

def least_n_items(n=10):
  """Return the original IDs of the n least-viewed items, ties broken by most recent activity."""
  view_counts = df.loc[df.event == 'view_item'].groupby('itemid')['event'].count().sort_values().reset_index()
  last_seen = df.groupby('itemid').timestamp.max().reset_index()
  item_ids = (pd.merge(view_counts, last_seen, on='itemid')
                .sort_values(['event', 'timestamp'], ascending=[True, False])
                .reset_index()
                .loc[:n-1, 'itemid'])
  return itemid_encoder.inverse_transform(item_ids.values)
least_n_items(10)
array(['15742', '16052', '16443', '16074', '16424', '11574', '11465', '16033', '11711', '16013'], dtype=object)

Data Transformation

Often there are no explicit ratings or preferences given by users; the interactions (views, add-to-carts, purchases) are implicit. These signals still reflect users' preferences for items, and they need to be converted into affinity scores.

Option 1 - Simple Count: The simplest technique is to count the number of interactions between a user and an item to produce an affinity score.

Option 2 - Weighted Count: It is often useful to weight different interaction types in the count aggregation. For example, the three types "click", "add", and "purchase" might be given weights of 1, 2, and 3, respectively.

Option 3 - Time-dependent Count: In many scenarios, time dependency plays a critical role in preparing a dataset for a collaborative filtering model that captures drift in user interests over time. A common technique for achieving a time-dependent count is to add a time decay factor to the counting, as sketched below.
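
The weighted (data_w['weight']) and time-decayed (data_w['timedecay']) columns used in sections B and C below are not constructed above. The following is a minimal sketch of that preparation, assuming per-event weights and an exponential half-life decay measured against the most recent timestamp; the actual weights and half-life behind the numbers shown below are not given here.

import numpy as np

# Assumed per-event weights (illustrative values only)
affinity_weights = {
    'view_item': 1,
    'add_to_cart': 2,
    'begin_checkout': 3,
    'purchase': 6,
    'remove_from_cart': 1,
}

data_w = df.copy()

# Assumed half-life decay: an interaction loses half its weight every 30 days,
# measured against the most recent timestamp in the data
T_ref = data_w.timestamp.max()
half_life_days = 30
age_days = (T_ref - data_w.timestamp).dt.total_seconds() / (24 * 3600)
event_weight = data_w['event'].astype(str).map(affinity_weights).astype(float)
data_w['timedecay'] = event_weight * np.power(0.5, age_days / half_life_days)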

A. Count

data_count = df.groupby(['userid', 'itemid']).agg({'timestamp': 'count'}).reset_index()
data_count.columns = ['userid', 'itemid', 'affinity']
data_count.head()
userid itemid affinity
0 0 328 1
1 1 1122 1
2 1 1204 1
3 1 1271 1
4 1 1821 1

B. Weighted Count

data_w['weight'] = data_w['event'].apply(lambda x: affinity_weights[x])
data_wcount = data_w.groupby(['userid', 'itemid'])['weight'].sum().reset_index()
data_wcount.columns = ['userid', 'itemid', 'affinity']
data_wcount.head()
userid itemid affinity
0 0 328 6
1 1 1122 6
2 1 1204 6
3 1 1271 6
4 1 1821 6

C. Time dependent Count

data_wt = data_w.groupby(['userid', 'itemid'])['timedecay'].sum().reset_index()
data_wt.columns = ['userid', 'itemid', 'affinity']
data_wt.head()
userid itemid affinity
0 0 328 0.117590
1 1 1122 0.120232
2 1 1204 0.120232
3 1 1271 0.120232
4 1 1821 0.120232

Train Test Split

Option 1 - Random Split: A random split simply takes in a dataset and outputs splits of the data according to the given split ratios (illustrated in the sketch below).

Option 2 - Chronological Split: A chronological split takes in a dataset and splits it on timestamp, so that each user's earlier interactions fall into the training set and the later ones into the test set.
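
Although the split used below is chronological, Option 1 can be sketched with the corresponding reco_utils splitter (the ratio and seed are illustrative):

from reco_utils.dataset.python_splitters import python_random_split

# 75/25 random split, independent of time
train_rnd, test_rnd = python_random_split(data_w[['userid', 'itemid', 'timedecay', 'timestamp']],
                                          ratio=0.75, seed=42)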

from reco_utils.dataset.python_splitters import python_chrono_split

data = data_w[['userid', 'itemid', 'timedecay', 'timestamp']]

# Full column mapping, including the rating column
col = {
  'col_user': 'userid',
  'col_item': 'itemid',
  'col_rating': 'timedecay',
  'col_timestamp': 'timestamp',
}

# Mapping passed to the splitter, which only needs user, item and timestamp
col3 = {
  'col_user': 'userid',
  'col_item': 'itemid',
  'col_timestamp': 'timestamp',
}

train, test = python_chrono_split(data, ratio=0.75, min_rating=10,
                                  filter_by='user', **col3)
train.loc[train.userid==7,:]
userid itemid timedecay timestamp
16679 7 1464 0.019174 2020-01-16 06:42:31.341000+00:00
16691 7 1464 0.019174 2020-01-16 06:43:29.482000+00:00
16692 7 2109 0.019174 2020-01-16 06:43:42.262000+00:00
16694 7 1464 0.019174 2020-01-16 06:43:57.961000+00:00
16805 7 201 0.019174 2020-01-16 06:45:55.261000+00:00
16890 7 2570 0.019174 2020-01-16 06:54:12.315000+00:00
16999 7 2570 0.019174 2020-01-16 06:54:29.130000+00:00
17000 7 2570 0.057522 2020-01-16 06:54:35.097000+00:00
test.loc[test.userid==7,:]
userid itemid timedecay timestamp
17001 7 1464 0.019174 2020-01-16 06:54:41.415000+00:00
17003 7 1464 0.057522 2020-01-16 06:54:44.195000+00:00

Experiments

Item Popularity Recommendation Model
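
The two inputs to the merge below, per-item interaction counts and each user's unseen items, are not constructed above. A minimal sketch of one way to build them from the training split (the construction details are assumptions):

# Popularity of each item in the training set
item_counts = train.groupby('itemid')['userid'].count().rename('count').reset_index()

# Cross join of all users with all items, minus the pairs already seen in training
all_users = train[['userid']].drop_duplicates().assign(key=1)
all_items = train[['itemid']].drop_duplicates().assign(key=1)
user_item_grid = all_users.merge(all_items, on='key').drop(columns='key')

seen = train[['userid', 'itemid']].drop_duplicates()
users_items_remove_seen = (user_item_grid.merge(seen, on=['userid', 'itemid'],
                                                how='left', indicator=True)
                           .query("_merge == 'left_only'")
                           .drop(columns='_merge'))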

baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, 
                                    on=['itemid'], how='inner')
baseline_recommendations.head()
itemid count userid
0 2564 461 7
1 2564 461 21
2 2564 461 73
3 2564 461 75
4 2564 461 113
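
The eval_* variables printed below are not computed above; a minimal sketch, assuming the ranking metrics from reco_utils and using the popularity count as the prediction score:

from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

k = 10
eval_map = map_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                    col_rating='timedecay', col_prediction='count', k=k)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                      col_rating='timedecay', col_prediction='count', k=k)
eval_precision = precision_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                                col_rating='timedecay', col_prediction='count', k=k)
eval_recall = recall_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                          col_rating='timedecay', col_prediction='count', k=k)
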
print("MAP:\t%f" % eval_map,
      "NDCG@K:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')
MAP:	0.005334
NDCG@K:	0.010356
Precision@K:	0.007092
Recall@K:	0.011395

Cornac BPR Model
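
The model training that produces all_predictions is not shown; a minimal sketch using Cornac's BPR and the reco_utils ranking helper (the hyperparameters are illustrative):

import cornac
from reco_utils.recommender.cornac.cornac_utils import predict_ranking

# Build a Cornac dataset from (user, item, rating) triples in the training split
train_set = cornac.data.Dataset.from_uir(
    train[['userid', 'itemid', 'timedecay']].itertuples(index=False), seed=42)

bpr = cornac.models.BPR(k=64, max_iter=100, learning_rate=0.01,
                        lambda_reg=0.001, seed=42)
bpr.fit(train_set)

# Score every (user, item) pair not seen in training
all_predictions = predict_ranking(bpr, train, usercol='userid',
                                  itemcol='itemid', remove_seen=True)

The metrics below can then be computed with the same map_at_k, ndcg_at_k, precision_at_k and recall_at_k calls as for the baseline, with col_prediction='prediction'.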

all_predictions.head()
userid itemid prediction
51214 7 2551 -0.438445
51215 7 481 2.522187
51216 7 1185 2.406107
51217 7 1766 1.112975
51218 7 1359 2.083620
MAP:	0.004738
NDCG:	0.009597
Precision@K:	0.006601
Recall@K:	0.010597

SAR Model

from reco_utils.recommender.sar.sar_singlenode import SARSingleNode

TOP_K = 10

header = {
    "col_user": "userid",
    "col_item": "itemid",
    "col_rating": "timedecay",
    "col_timestamp": "timestamp",
    "col_prediction": "prediction",
}

model = SARSingleNode(
    similarity_type="jaccard",
    # time decay is already baked into the 'timedecay' rating column above,
    # so SAR's own decay formula is disabled here
    time_decay_coefficient=0,
    time_now=None,
    timedecay_formula=False,
    **header
)

model.fit(train)
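
The ranking and evaluation step that produces the output below is not shown; a minimal sketch, assuming SAR's recommend_k_items and the same reco_utils ranking metrics as above:

from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

# Top-K recommendations for users in the test set, excluding items seen in training
top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

eval_map = map_at_k(test, top_k, col_user='userid', col_item='itemid',
                    col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_ndcg = ndcg_at_k(test, top_k, col_user='userid', col_item='itemid',
                      col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_precision = precision_at_k(test, top_k, col_user='userid', col_item='itemid',
                                col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_recall = recall_at_k(test, top_k, col_user='userid', col_item='itemid',
                          col_rating='timedecay', col_prediction='prediction', k=TOP_K)

print("Model:",
      "Top K:\t\t %d" % TOP_K,
      "MAP:\t\t %f" % eval_map,
      "NDCG:\t\t %f" % eval_ndcg,
      "Precision@K:\t %f" % eval_precision,
      "Recall@K:\t %f" % eval_recall, sep='\n')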
Model:
Top K:		 10
MAP:		 0.024426
NDCG:		 0.032738
Precision@K:	 0.019258
Recall@K:	 0.036009

Spotlight

Implicit Factorization Model

from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.evaluation import precision_recall_score

interactions = Interactions(user_ids = df.userid.astype('int32').values,
                            item_ids = df.itemid.astype('int32').values,
                            timestamps = (df.timestamp.astype('int64') // 10**9).astype('int32').values,  # Unix seconds
                            num_users = df.userid.nunique(),
                            num_items = df.itemid.nunique())

train_user, test_user = random_train_test_split(interactions, test_percentage=0.2)

model = ImplicitFactorizationModel(loss='bpr', embedding_dim=64, n_iter=10, 
                                   batch_size=256, l2=0.0, learning_rate=0.01, 
                                   optimizer_func=None, use_cuda=False, 
                                   representation=None, sparse=False, 
                                   num_negative_samples=10)

model.fit(train_user, verbose=1)

pr = precision_recall_score(model, test=test_user, train=train_user, k=10)
print('Precision@10 is {:.3f} and Recall@10 is {:.3f}'.format(pr[0].mean(), pr[1].mean()))
Epoch 0: loss 0.26659833122392174
Epoch 1: loss 0.06129162273462562
Epoch 2: loss 0.022607273167640066
Epoch 3: loss 0.013953083943443858
Epoch 4: loss 0.01050195922488137
Epoch 5: loss 0.009170394043447121
Epoch 6: loss 0.008144461540834697
Epoch 7: loss 0.007209992620171649
Epoch 8: loss 0.00663076309035038
Epoch 9: loss 0.006706491189820159
Precision@10 is 0.007 and Recall@10 is 0.050

Implicit Factorization Model with Grid Search
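
No code is shown for this subsection; a minimal sketch of a small grid search over two Spotlight hyperparameters, scored by mean Precision@10 on the held-out split (the search space and scoring choice are assumptions):

import itertools
import numpy as np

param_grid = {'embedding_dim': [32, 64, 128], 'learning_rate': [0.01, 0.05]}

best_score, best_params = -np.inf, None
for emb_dim, lr in itertools.product(param_grid['embedding_dim'],
                                     param_grid['learning_rate']):
    candidate = ImplicitFactorizationModel(loss='bpr', embedding_dim=emb_dim,
                                           n_iter=10, batch_size=256, learning_rate=lr)
    candidate.fit(train_user)
    prec, _ = precision_recall_score(candidate, test=test_user, train=train_user, k=10)
    if prec.mean() > best_score:
        best_score, best_params = prec.mean(), (emb_dim, lr)

print('Best params:', best_params, 'Precision@10: {:.3f}'.format(best_score))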

Pooling Sequence Model

from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.evaluation import sequence_mrr_score
# Item IDs are shifted by 1 because the sequence models reserve index 0 for padding
interactions = Interactions(user_ids = df.userid.astype('int32').values,
                            item_ids = df.itemid.astype('int32').values + 1,
                            timestamps = (df.timestamp.astype('int64') // 10**9).astype('int32').values)

train, test = random_train_test_split(interactions, test_percentage=0.2)
train_seq = train.to_sequence(max_sequence_length=10)
test_seq = test.to_sequence(max_sequence_length=10)

model = ImplicitSequenceModel(loss='bpr', representation='pooling', 
                              embedding_dim=32, n_iter=10, batch_size=256, 
                              l2=0.0, learning_rate=0.01, optimizer_func=None, 
                              use_cuda=False, sparse=False, num_negative_samples=5)

model.fit(train_seq, verbose=1)

mrr_seq = sequence_mrr_score(model, test_seq)
mrr_seq.mean()
Epoch 0: loss 0.4226887328702895
Epoch 1: loss 0.23515070266410953
Epoch 2: loss 0.16919970976524665
Epoch 3: loss 0.1425025990751923
Epoch 4: loss 0.12612225017586692
Epoch 5: loss 0.11565039795441706
Epoch 6: loss 0.10787886735357222
Epoch 7: loss 0.10086931410383006
Epoch 8: loss 0.09461003749585542
Epoch 9: loss 0.09128284808553633
0.10435609591957387

FastAI CollabLearner
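
The DataBunch and learner construction is not shown above; a minimal fastai v1 sketch consistent with the 50-dimensional embeddings in the summary below (the validation fraction and seed are assumptions):

from fastai.collab import CollabDataBunch, collab_learner

# Collaborative-filtering data from the training split, with the
# time-decayed affinity treated as the rating
data_bunch = CollabDataBunch.from_df(train, user_name='userid', item_name='itemid',
                                     rating_name='timedecay', valid_pct=0.2, seed=42)

# EmbeddingDotBias model with 50-dimensional user and item embeddings
learn = collab_learner(data_bunch, n_factors=50)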

learn.fit_one_cycle(1, 5e-6)
epoch train_loss valid_loss time
0 2.054070 2.029182 00:20
learn.summary()
EmbeddingDotBias
======================================================================
Layer (type)         Output Shape         Param #    Trainable 
======================================================================
Embedding            [50]                 534,000    True      
______________________________________________________________________
Embedding            [50]                 129,150    True      
______________________________________________________________________
Embedding            [1]                  10,680     True      
______________________________________________________________________
Embedding            [1]                  2,583      True      
______________________________________________________________________

Total params: 676,413
Total trainable params: 676,413
Total non-trainable params: 0
Optimized with 'torch.optim.adam.Adam', betas=(0.9, 0.99)
Using true weight decay as discussed in https://www.fast.ai/2018/07/02/adam-weight-decay/ 
Loss function : FlattenedLoss
======================================================================
Callbacks functions applied 
learn.fit(10, 1e-3)
epoch train_loss valid_loss time
0 1.770657 1.751797 00:18
1 1.410351 1.528533 00:17
2 1.153979 1.399136 00:17
3 0.911953 1.326476 00:17
4 0.784223 1.279517 00:17
5 0.695546 1.248469 00:17
6 0.637151 1.230954 00:18
7 0.600011 1.216617 00:18
8 0.573309 1.209507 00:18
9 0.571132 1.204903 00:18