Data Loading

import pandas as pd

df = pd.read_csv('rawdata.csv', header=0,
                 names=['event', 'userid', 'itemid', 'timestamp'],
                 dtype={'event': 'category', 'userid': 'category', 'itemid': 'category'},
                 parse_dates=['timestamp'])
df.head()
event userid itemid timestamp
0 view_item 2763227 11056 2020-01-13 16:05:31.244000+00:00
1 add_to_cart 2828666 14441 2020-01-13 22:36:38.680000+00:00
2 view_item 0620225789 14377 2020-01-14 10:54:41.886000+00:00
3 view_item 0620225789 14377 2020-01-14 10:54:47.692000+00:00
4 add_to_cart 0620225789 14377 2020-01-14 10:54:48.479000+00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99998 entries, 0 to 99997
Data columns (total 4 columns):
event        99998 non-null category
userid       99998 non-null category
itemid       99998 non-null category
timestamp    99998 non-null datetime64[ns, UTC]
dtypes: category(3), datetime64[ns, UTC](1)
memory usage: 1.7 MB

Wrangling

Removing Duplicates

df = df.drop_duplicates()

Label Encoding

from sklearn import preprocessing

# userid encoding
userid_encoder = preprocessing.LabelEncoder()
df.userid = userid_encoder.fit_transform(df.userid)

# itemid encoding
itemid_encoder = preprocessing.LabelEncoder()
df.itemid = itemid_encoder.fit_transform(df.itemid)

Exploration

df.describe().T
count mean std min 25% 50% 75% max
userid 99432.0 4682.814677 3011.178734 0.0 2507.0 3687.0 6866.0 11476.0
itemid 99432.0 1344.579964 769.627122 0.0 643.0 1356.0 1997.0 2633.0
df.describe(exclude='int').T
count unique top freq first last
event 99432 5 begin_checkout 41459 NaT NaT
timestamp 99432 61372 2020-01-16 04:21:49.377000+00:00 25 2020-01-13 16:05:31.244000+00:00 2020-03-10 13:02:21.376000+00:00
df.timestamp.max() - df.timestamp.min()
Timedelta('56 days 20:56:50.132000')
df.event.value_counts()
begin_checkout      41459
view_item           35397
purchase             9969
add_to_cart          7745
remove_from_cart     4862
Name: event, dtype: int64
df.event.value_counts()/df.userid.nunique()
begin_checkout      3.612355
view_item           3.084168
purchase            0.868607
add_to_cart         0.674828
remove_from_cart    0.423630
Name: event, dtype: float64

User Interactions

Add-to-cart Event Counts
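
No code accompanies this subsection; a minimal sketch of the aggregation it refers to, counting add-to-cart events per user:

add_to_cart_per_user = df[df.event == 'add_to_cart'].groupby('userid')['event'].count()
add_to_cart_per_user.describe()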

Purchase Event Counts
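
Similarly, a sketch counting purchase events per user:

purchases_per_user = df[df.event == 'purchase'].groupby('userid')['event'].count()
purchases_per_user.describe()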

Item Interactions
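
And a sketch counting interactions of any type per item:

interactions_per_item = df.groupby('itemid')['event'].count()
interactions_per_item.describe()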

Rule-based Approaches

def top_trending(n, timeperiod, timestamp):
  """Return the n most-viewed items in the `timeperiod` minutes preceding `timestamp`."""
  start = str(timestamp.replace(microsecond=0) - pd.Timedelta(minutes=timeperiod))
  end = str(timestamp.replace(microsecond=0))
  trending_items = df.loc[df.timestamp.between(start, end) & (df.event == 'view_item')]
  return trending_items.itemid.value_counts().index[:n]
user_current_time = df.timestamp[100]
top_trending(5, 50, user_current_time)
Int64Index([2241, 972, 393, 1118, 126], dtype='int64')

Top-N Least Viewed Items

def least_n_items(n=10):
  """Return the original IDs of the n least-viewed items, ties broken by most recent activity."""
  view_counts = df.loc[df.event == 'view_item'].groupby('itemid')['event'].count().sort_values().reset_index()
  last_seen = df.groupby('itemid').timestamp.max().reset_index()
  item_ids = (pd.merge(view_counts, last_seen, on='itemid')
                .sort_values(['event', 'timestamp'], ascending=[True, False])
                .reset_index()
                .loc[:n-1, 'itemid'])
  return itemid_encoder.inverse_transform(item_ids.values)
least_n_items(10)
array(['15742', '16052', '16443', '16074', '16424', '11574', '11465', '16033', '11711', '16013'], dtype=object)

Data Transformation

Often there are no explicit ratings or preferences given by users; the interactions (views, add-to-carts, purchases) are implicit. These signals still reflect users' preferences for items, and they need to be converted into affinity scores.

Option 1 - Simple Count: The simplest technique is to count the number of interactions between a user and an item to produce an affinity score.

Option 2 - Weighted Count: It is often useful to weight different interaction types in the count aggregation. For example, the three types "click", "add", and "purchase" might be given weights of 1, 2, and 3, respectively.

Option 3 - Time-dependent Count: In many scenarios, time dependency plays a critical role in preparing a dataset for a collaborative filtering model that captures drift in user interests over time. A common technique for achieving a time-dependent count is to add a time decay factor to the counting, as sketched below.
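
The weighted (data_w['weight']) and time-decayed (data_w['timedecay']) columns used in sections B and C below are not constructed above. The following is a minimal sketch of that preparation, assuming per-event weights and an exponential half-life decay measured against the most recent timestamp; the actual weights and half-life behind the numbers shown below are not given here.

import numpy as np

# Assumed per-event weights (illustrative values only)
affinity_weights = {
    'view_item': 1,
    'add_to_cart': 2,
    'begin_checkout': 3,
    'purchase': 6,
    'remove_from_cart': 1,
}

data_w = df.copy()

# Assumed half-life decay: an interaction loses half its weight every 30 days,
# measured against the most recent timestamp in the data
T_ref = data_w.timestamp.max()
half_life_days = 30
age_days = (T_ref - data_w.timestamp).dt.total_seconds() / (24 * 3600)
event_weight = data_w['event'].astype(str).map(affinity_weights).astype(float)
data_w['timedecay'] = event_weight * np.power(0.5, age_days / half_life_days)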

A. Count

data_count = df.groupby(['userid', 'itemid']).agg({'timestamp': 'count'}).reset_index()
data_count.columns = ['userid', 'itemid', 'affinity']
data_count.head()
userid itemid affinity
0 0 328 1
1 1 1122 1
2 1 1204 1
3 1 1271 1
4 1 1821 1

B. Weighted Count

data_w['weight'] = data_w['event'].apply(lambda x: affinity_weights[x])
data_wcount = data_w.groupby(['userid', 'itemid'])['weight'].sum().reset_index()
data_wcount.columns = ['userid', 'itemid', 'affinity']
data_wcount.head()
userid itemid affinity
0 0 328 6
1 1 1122 6
2 1 1204 6
3 1 1271 6
4 1 1821 6

C. Time dependent Count

data_wt = data_w.groupby(['userid', 'itemid'])['timedecay'].sum().reset_index()
data_wt.columns = ['userid', 'itemid', 'affinity']
data_wt.head()
userid itemid affinity
0 0 328 0.117590
1 1 1122 0.120232
2 1 1204 0.120232
3 1 1271 0.120232
4 1 1821 0.120232

Train Test Split

Option 1 - Random Split: A random split simply takes in a dataset and outputs splits of the data according to the given split ratios (illustrated in the sketch below).

Option 2 - Chronological Split: A chronological split takes in a dataset and splits it on timestamp, so that each user's earlier interactions fall into the training set and the later ones into the test set.
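
Although the split used below is chronological, Option 1 can be sketched with the corresponding reco_utils splitter (the ratio and seed are illustrative):

from reco_utils.dataset.python_splitters import python_random_split

# 75/25 random split, independent of time
train_rnd, test_rnd = python_random_split(data_w[['userid', 'itemid', 'timedecay', 'timestamp']],
                                          ratio=0.75, seed=42)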

from reco_utils.dataset.python_splitters import python_chrono_split

data = data_w[['userid', 'itemid', 'timedecay', 'timestamp']]

# Full column mapping, including the rating column
col = {
  'col_user': 'userid',
  'col_item': 'itemid',
  'col_rating': 'timedecay',
  'col_timestamp': 'timestamp',
}

# Mapping passed to the splitter, which only needs user, item and timestamp
col3 = {
  'col_user': 'userid',
  'col_item': 'itemid',
  'col_timestamp': 'timestamp',
}

train, test = python_chrono_split(data, ratio=0.75, min_rating=10,
                                  filter_by='user', **col3)
train.loc[train.userid==7,:]
userid itemid timedecay timestamp
16679 7 1464 0.019174 2020-01-16 06:42:31.341000+00:00
16691 7 1464 0.019174 2020-01-16 06:43:29.482000+00:00
16692 7 2109 0.019174 2020-01-16 06:43:42.262000+00:00
16694 7 1464 0.019174 2020-01-16 06:43:57.961000+00:00
16805 7 201 0.019174 2020-01-16 06:45:55.261000+00:00
16890 7 2570 0.019174 2020-01-16 06:54:12.315000+00:00
16999 7 2570 0.019174 2020-01-16 06:54:29.130000+00:00
17000 7 2570 0.057522 2020-01-16 06:54:35.097000+00:00
test.loc[test.userid==7,:]
userid itemid timedecay timestamp
17001 7 1464 0.019174 2020-01-16 06:54:41.415000+00:00
17003 7 1464 0.057522 2020-01-16 06:54:44.195000+00:00

Experiments

Item Popularity Recommendation Model
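
The two inputs to the merge below, per-item interaction counts and each user's unseen items, are not constructed above. A minimal sketch of one way to build them from the training split (the construction details are assumptions):

# Popularity of each item in the training set
item_counts = train.groupby('itemid')['userid'].count().rename('count').reset_index()

# Cross join of all users with all items, minus the pairs already seen in training
all_users = train[['userid']].drop_duplicates().assign(key=1)
all_items = train[['itemid']].drop_duplicates().assign(key=1)
user_item_grid = all_users.merge(all_items, on='key').drop(columns='key')

seen = train[['userid', 'itemid']].drop_duplicates()
users_items_remove_seen = (user_item_grid.merge(seen, on=['userid', 'itemid'],
                                                how='left', indicator=True)
                           .query("_merge == 'left_only'")
                           .drop(columns='_merge'))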

baseline_recommendations = pd.merge(item_counts, users_items_remove_seen, 
                                    on=['itemid'], how='inner')
baseline_recommendations.head()
itemid count userid
0 2564 461 7
1 2564 461 21
2 2564 461 73
3 2564 461 75
4 2564 461 113
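
The eval_* variables printed below are not computed above; a minimal sketch, assuming the ranking metrics from reco_utils and using the popularity count as the prediction score:

from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

k = 10
eval_map = map_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                    col_rating='timedecay', col_prediction='count', k=k)
eval_ndcg = ndcg_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                      col_rating='timedecay', col_prediction='count', k=k)
eval_precision = precision_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                                col_rating='timedecay', col_prediction='count', k=k)
eval_recall = recall_at_k(test, baseline_recommendations, col_user='userid', col_item='itemid',
                          col_rating='timedecay', col_prediction='count', k=k)
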
print("MAP:\t%f" % eval_map,
      "NDCG@K:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')
MAP:	0.005334
NDCG@K:	0.010356
Precision@K:	0.007092
Recall@K:	0.011395

Cornac BPR Model
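
The model training that produces all_predictions is not shown; a minimal sketch using Cornac's BPR and the reco_utils ranking helper (the hyperparameters are illustrative):

import cornac
from reco_utils.recommender.cornac.cornac_utils import predict_ranking

# Build a Cornac dataset from (user, item, rating) triples in the training split
train_set = cornac.data.Dataset.from_uir(
    train[['userid', 'itemid', 'timedecay']].itertuples(index=False), seed=42)

bpr = cornac.models.BPR(k=64, max_iter=100, learning_rate=0.01,
                        lambda_reg=0.001, seed=42)
bpr.fit(train_set)

# Score every (user, item) pair not seen in training
all_predictions = predict_ranking(bpr, train, usercol='userid',
                                  itemcol='itemid', remove_seen=True)

The metrics below can then be computed with the same map_at_k, ndcg_at_k, precision_at_k and recall_at_k calls as for the baseline, with col_prediction='prediction'.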

all_predictions.head()
userid itemid prediction
51214 7 2551 -0.438445
51215 7 481 2.522187
51216 7 1185 2.406107
51217 7 1766 1.112975
51218 7 1359 2.083620
MAP:	0.004738
NDCG:	0.009597
Precision@K:	0.006601
Recall@K:	0.010597

SAR Model

from reco_utils.recommender.sar.sar_singlenode import SARSingleNode

TOP_K = 10

header = {
    "col_user": "userid",
    "col_item": "itemid",
    "col_rating": "timedecay",
    "col_timestamp": "timestamp",
    "col_prediction": "prediction",
}

model = SARSingleNode(
    similarity_type="jaccard",
    # time decay is already baked into the 'timedecay' rating column above,
    # so SAR's own decay formula is disabled here
    time_decay_coefficient=0,
    time_now=None,
    timedecay_formula=False,
    **header
)

model.fit(train)
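
The ranking and evaluation step that produces the output below is not shown; a minimal sketch, assuming SAR's recommend_k_items and the same reco_utils ranking metrics as above:

from reco_utils.evaluation.python_evaluation import (
    map_at_k, ndcg_at_k, precision_at_k, recall_at_k)

# Top-K recommendations for users in the test set, excluding items seen in training
top_k = model.recommend_k_items(test, top_k=TOP_K, remove_seen=True)

eval_map = map_at_k(test, top_k, col_user='userid', col_item='itemid',
                    col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_ndcg = ndcg_at_k(test, top_k, col_user='userid', col_item='itemid',
                      col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_precision = precision_at_k(test, top_k, col_user='userid', col_item='itemid',
                                col_rating='timedecay', col_prediction='prediction', k=TOP_K)
eval_recall = recall_at_k(test, top_k, col_user='userid', col_item='itemid',
                          col_rating='timedecay', col_prediction='prediction', k=TOP_K)

print("Model:",
      "Top K:\t\t %d" % TOP_K,
      "MAP:\t\t %f" % eval_map,
      "NDCG:\t\t %f" % eval_ndcg,
      "Precision@K:\t %f" % eval_precision,
      "Recall@K:\t %f" % eval_recall, sep='\n')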
Model:
Top K:		 10
MAP:		 0.024426
NDCG:		 0.032738
Precision@K:	 0.019258
Recall@K:	 0.036009

Spotlight

Implicit Factorization Model

from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.factorization.implicit import ImplicitFactorizationModel
from spotlight.evaluation import precision_recall_score

interactions = Interactions(user_ids = df.userid.astype('int32').values,
                            item_ids = df.itemid.astype('int32').values,
                            timestamps = (df.timestamp.astype('int64') // 10**9).astype('int32').values,  # Unix seconds
                            num_users = df.userid.nunique(),
                            num_items = df.itemid.nunique())

train_user, test_user = random_train_test_split(interactions, test_percentage=0.2)

model = ImplicitFactorizationModel(loss='bpr', embedding_dim=64, n_iter=10, 
                                   batch_size=256, l2=0.0, learning_rate=0.01, 
                                   optimizer_func=None, use_cuda=False, 
                                   representation=None, sparse=False, 
                                   num_negative_samples=10)

model.fit(train_user, verbose=1)

pr = precision_recall_score(model, test=test_user, train=train_user, k=10)
print('Precision@10 is {:.3f} and Recall@10 is {:.3f}'.format(pr[0].mean(), pr[1].mean()))
Epoch 0: loss 0.26659833122392174
Epoch 1: loss 0.06129162273462562
Epoch 2: loss 0.022607273167640066
Epoch 3: loss 0.013953083943443858
Epoch 4: loss 0.01050195922488137
Epoch 5: loss 0.009170394043447121
Epoch 6: loss 0.008144461540834697
Epoch 7: loss 0.007209992620171649
Epoch 8: loss 0.00663076309035038
Epoch 9: loss 0.006706491189820159
Precision@10 is 0.007 and Recall@10 is 0.050

Implicit Factorization Model with Grid Search
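
No code is shown for this subsection; a minimal sketch of a small grid search over two Spotlight hyperparameters, scored by mean Precision@10 on the held-out split (the search space and scoring choice are assumptions):

import itertools
import numpy as np

param_grid = {'embedding_dim': [32, 64, 128], 'learning_rate': [0.01, 0.05]}

best_score, best_params = -np.inf, None
for emb_dim, lr in itertools.product(param_grid['embedding_dim'],
                                     param_grid['learning_rate']):
    candidate = ImplicitFactorizationModel(loss='bpr', embedding_dim=emb_dim,
                                           n_iter=10, batch_size=256, learning_rate=lr)
    candidate.fit(train_user)
    prec, _ = precision_recall_score(candidate, test=test_user, train=train_user, k=10)
    if prec.mean() > best_score:
        best_score, best_params = prec.mean(), (emb_dim, lr)

print('Best params:', best_params, 'Precision@10: {:.3f}'.format(best_score))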

Pooling Sequence Model

from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.evaluation import sequence_mrr_score
# Item IDs are shifted by 1 because the sequence models reserve index 0 for padding
interactions = Interactions(user_ids = df.userid.astype('int32').values,
                            item_ids = df.itemid.astype('int32').values + 1,
                            timestamps = (df.timestamp.astype('int64') // 10**9).astype('int32').values)

train, test = random_train_test_split(interactions, test_percentage=0.2)
train_seq = train.to_sequence(max_sequence_length=10)
test_seq = test.to_sequence(max_sequence_length=10)

model = ImplicitSequenceModel(loss='bpr', representation='pooling', 
                              embedding_dim=32, n_iter=10, batch_size=256, 
                              l2=0.0, learning_rate=0.01, optimizer_func=None, 
                              use_cuda=False, sparse=False, num_negative_samples=5)

model.fit(train_seq, verbose=1)

mrr_seq = sequence_mrr_score(model, test_seq)
mrr_seq.mean()
Epoch 0: loss 0.4226887328702895
Epoch 1: loss 0.23515070266410953
Epoch 2: loss 0.16919970976524665
Epoch 3: loss 0.1425025990751923
Epoch 4: loss 0.12612225017586692
Epoch 5: loss 0.11565039795441706
Epoch 6: loss 0.10787886735357222
Epoch 7: loss 0.10086931410383006
Epoch 8: loss 0.09461003749585542
Epoch 9: loss 0.09128284808553633
0.10435609591957387

FastAI CollabLearner
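
The DataBunch and learner construction is not shown above; a minimal fastai v1 sketch consistent with the 50-dimensional embeddings in the summary below (the validation fraction and seed are assumptions):

from fastai.collab import CollabDataBunch, collab_learner

# Collaborative-filtering data from the training split, with the
# time-decayed affinity treated as the rating
data_bunch = CollabDataBunch.from_df(train, user_name='userid', item_name='itemid',
                                     rating_name='timedecay', valid_pct=0.2, seed=42)

# EmbeddingDotBias model with 50-dimensional user and item embeddings
learn = collab_learner(data_bunch, n_factors=50)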

learn.fit_one_cycle(1, 5e-6)
epoch train_loss valid_loss time
0 2.054070 2.029182 00:20
learn.summary()
EmbeddingDotBias
======================================================================
Layer (type)         Output Shape         Param #    Trainable 
======================================================================
Embedding            [50]                 534,000    True      
______________________________________________________________________
Embedding            [50]                 129,150    True      
______________________________________________________________________
Embedding            [1]                  10,680     True      
______________________________________________________________________
Embedding            [1]                  2,583      True      
______________________________________________________________________

Total params: 676,413
Total trainable params: 676,413
Total non-trainable params: 0
Optimized with 'torch.optim.adam.Adam', betas=(0.9, 0.99)
Using true weight decay as discussed in https://www.fast.ai/2018/07/02/adam-weight-decay/ 
Loss function : FlattenedLoss
======================================================================
Callbacks functions applied 
learn.fit(10, 1e-3)
epoch train_loss valid_loss time
0 1.770657 1.751797 00:18
1 1.410351 1.528533 00:17
2 1.153979 1.399136 00:17
3 0.911953 1.326476 00:17
4 0.784223 1.279517 00:17
5 0.695546 1.248469 00:17
6 0.637151 1.230954 00:18
7 0.600011 1.216617 00:18
8 0.573309 1.209507 00:18
9 0.571132 1.204903 00:18