Datasets

Shopping

Amazon: This dataset contains product reviews, rating-only data (ratings), and metadata (descriptions, category information, price, brand, and image features) from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
Epinions: This dataset was collected from Epinions.com, a popular online consumer review website. It contains trust relationships amongst users and spans more than a decade, from January 2001 to November 2013.
Yelp: This dataset was collected from Yelp.com. The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes.
Tmall: This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
DIGINETICA: The dataset includes user sessions extracted from e-commerce search engine logs, with anonymized user ids, hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
YOOCHOOSE: This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015.
Retailrocket: The data has been collected from a real-world ecommerce website. It is raw data, i.e. without any content transformations; however, all values are hashed due to confidentiality concerns.
Ta Feng: The dataset contains transaction data from a Chinese grocery store from November 2000 to February 2001.

Advertising

Criteo: This dataset was collected from Criteo and consists of a portion of Criteo's traffic over a period of several days.
Avazu: This dataset is used in the Avazu CTR prediction contest.
iPinYou: This dataset was provided by iPinYou and contains all training datasets and leaderboard testing datasets from the three seasons of the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition.

Check-in

Foursquare: This dataset contains check-ins in NYC and Tokyo collected over about 10 months. Each check-in is associated with its timestamp, its GPS coordinates, and its semantic meaning.
Gowalla: This dataset is from a location-based social networking website where users share their locations by checking in. It contains a total of 6,442,890 check-ins by these users over the period of Feb. 2009 - Oct. 2010.

Movies

MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site.
Netflix: This is the official data set used in the Netflix Prize competition.
Douban: Douban Movie is a Chinese website that allows Internet users to share their comments and viewpoints about movies. This dataset contains more than 2 million short comments on 28 movies from the Douban Movie website.

Music

Last.FM: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
LFM-1b: This dataset contains more than one billion music listening events created by more than 120,000 users of Last.FM. Each listening event is characterized by artist, album, and track name, and includes a timestamp.
Yahoo Music: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.

Books

Book-Crossing: This dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

Games

Steam: This dataset contains reviews and game information from Steam, covering 7,793,069 reviews, 2,567,538 users, and 32,135 games. In addition to the review text, the data also includes the users' play hours in each review.

Anime

Anime: This dataset contains user preference data from myanimelist.net. Each user is able to add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings.

Pictures

Pinterest: This dataset was originally constructed in the paper "Learning image and user features for recommendations in social networks" for evaluating content-based image recommendation, and was later processed in the paper "Neural Collaborative Filtering".

Jokes

Jester: This dataset contains anonymous ratings of jokes by users of the Jester Joke Recommender System.

Exercises

KDD2010: This dataset was released for the KDD Cup 2010 Educational Data Mining Challenge and contains records of students submitting exercises on the systems.

Websites

Phishing Websites: This dataset contains 30 kinds of features for 11,055 websites, with labels indicating whether each is a phishing website. The features include 12 address-bar-based features, 6 abnormal-based features, 5 HTML-and-JavaScript-based features, and 7 domain-based features.

Adult

Adult: This dataset was extracted by Barry Becker from the 1994 Census database. It consists of a list of people's attributes and whether they make over 50k a year.

News

MIND: This is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.

Uncategorized

Douban: This anonymized Douban dataset contains 129,490 unique users and 58,541 unique movie items.
Epinions: Epinions is a website where people can review products.
Flixster: Flixster is a social movie site allowing users to share movie ratings, discover new movies, and meet others with similar movie taste.
CiaoDVD: CiaoDVD is a dataset crawled from the entire DVD category of the dvd.ciao.co.uk website in December 2013.
MACLab: With the text in the post, the mood tag, and the music title, this project is aimed at studying the users' moods and music emotions.
DEAPdataset: A dataset for emotion analysis using EEG, physiological, and video signals.
MyPersonalityDataset: myPersonality was a popular Facebook application that allowed users to take real psychometric tests, and allowed us to record (with consent!) their psychological and Facebook profiles. Currently, our database contains more than 6,000,000 test results, together with more than 4,000,000 individual Facebook profiles.
Bibsonomy: Tag Recommendations in Social Bookmarking Systems.
Delicious: plista News Recommendation Dataset and Delicious.
MovieLens: Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
Jester: Anonymous Ratings from the Jester Online Joke Recommender System.
BookCrossing: Book-Crossing Dataset.
LastFM: 92,800 artist listening records from 1,892 users.
Wikipedia: Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use, or database queries.
OpenStreetMap: The files found here are complete copies of the OpenStreetMap.org database, including editing history. They are published under the Open Data Commons Open Database License 1.0.
PythonGitCode: Hermes is Lab41's foray into recommender systems. It explores how to choose a recommender system for a new application by analyzing the performance of multiple recommender system algorithms on a variety of datasets.
Gist: Recommendation and Ratings Public Data Sets For Machine Learning.
Yelp: The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files, use it to teach students about databases, to learn NLP, or for sample production data while you learn how to make mobile apps.
AmazonReviews: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
CiteULike: The CiteULike database is potentially useful for researchers in various fields. Physicists and computer scientists have expressed an interest in trying to analyse the structure of the data, and frequently ask for datasets to be made available. Previously this was done on an ad-hoc basis, and it relied on us remembering to update the data file. Now, there is an automatic process which runs every night producing a snapshot summary of what articles have been posted with which tags.
Taobao: The data set contains anonymized users' shopping logs from the 6 months before and on the "Double 11" day, and the label information indicating whether they are repeat buyers. Due to privacy issues, the data is sampled in a biased way, so statistical results on this data set will deviate from the actual figures of Tmall.com.

Statistics

Data Set	Users	Items	Ratings (Scale)	Density	Social Users	Links (Type)
Ciao	7,375	99,746	278,483--[1, 5]	0.0379%	7,375	111,781--Trust
Douban	129,490	58,541	16,830,839--[1, 5]	0.222%	129,490	1,692,952--Friendship
Epinions (665K)	40,163	139,738	664,824--[1, 5]	0.0118%	49,289	487,183--Trust
Epinions (510K)	71,002	104,356	508,960--[1, 5]	0.00687%		Trust
Epinions (Extended)	120,492	755,760	13,668,320--[1, 5]	0.015%		Trust Distrust
Flixster	147,612	48,794	8,196,077--[0.5, 5.0]	0.1138%	787,213	11,794,648--Friendship
FilmTrust	1,508	2,071	35,497--[0.5, 4.0]	1.14%	1,642	1,853--Trust
Jester	59,132	140	1,761,439--Explicit	21.28%
MovieLens 100K	943	1,682	100,000--[1, 5]	6.30%
MovieLens 1M	6,040	3,706	1,000,209--[1, 5]	4.47%
MovieLens 10M	71,567	10,681	10,000,054--[1, 5]	1.308%

SN	Dataset	#User	#Item	#Interaction	Sparsity	Interaction Type
1	MovieLens	-	-	-	-	Rating
2	Anime	73,515	11,200	7,813,737	99.05%	Rating [-1, 1-10]
3	Epinions	116,260	41,269	188,478	99.99%	Rating [1-5]
4	Yelp	1,968,703	209,393	8,021,122	99.99%	Rating [1-5]
5	Netflix	480,189	17,770	100,480,507	98.82%	Rating [1-5]
6	Book-crossing	105,284	340,557	1,149,780	99.99%	Rating [0-10]
7	Jester	73,421	101	4,136,360	44.22%	Rating [-10, 10]
8	Douban	738,701	28	2,125,056	89.73%	Rating [0,5]
9	Yahoo Music	1,948,882	98,211	11,557,943	99.99%	Rating [0, 100]
10	KDD2010	-	-	-	-	Rating
11	Amazon	-	-	-	-	Rating
12	Pinterest	55,187	9,911	1,445,622	99.74%	-
13	Gowalla	107,092	1,280,969	6,442,892	99.99%	Check-in
14	LastFM	1,892	17,632	92,834	99.72%	Click
15	DIGINETICA	600,684	184,047	993,483	99.99%	Click
16	Steam	2,567,538	32,135	7,793,069	99.99%	Buy
17	Ta Feng	32,266	23,812	817,741	99.89%	Click
18	Foursquare	-	-	-	-	Check-in
19	Tmall	963,923	2,353,207	44,528,127	99.99%	Click/Buy
20	YOOCHOOSE	9,249,729	52,739	34,154,697	99.99%	Click/Buy
21	iPinYou	12,931,430	131	15,367,312	99.09%	View/Click
22	Retailrocket	1,407,580	247,085	2,756,101	99.99%	View/Addtocart/Transaction
23	LFM-1b	120,322	3,123,496	1,088,161,692	99.71%	Click
24	Criteo	-	-	45,850,617	-	Click
25	Avazu	-	-	40,428,967	-	Click [0, 1]
26	Phishing Websites	-	-	11,055	-
27	Adult	-	-	32,561	-	income>=50k [0, 1]
28	MIND	-	-	-	-	Click
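
The Density and Sparsity columns in the tables above appear consistent with the usual definition: density is the number of observed interactions divided by (users × items), and sparsity is one minus density. The following is a minimal Python sketch under that assumption. The function names `density` and `sparsity` are illustrative and not part of any library, and the input figures are copied from the MovieLens 100K and Netflix rows.

```python
def density(num_users: int, num_items: int, num_interactions: int) -> float:
    """Fraction of the user-item matrix that has an observed interaction."""
    return num_interactions / (num_users * num_items)


def sparsity(num_users: int, num_items: int, num_interactions: int) -> float:
    """Fraction of the user-item matrix that is empty (1 - density)."""
    return 1.0 - density(num_users, num_items, num_interactions)


# (users, items, interactions) taken from the statistics tables.
rows = {
    "MovieLens 100K": (943, 1_682, 100_000),
    "Netflix": (480_189, 17_770, 100_480_507),
}

for name, (users, items, interactions) in rows.items():
    d = density(users, items, interactions)
    s = sparsity(users, items, interactions)
    print(f"{name}: density={d:.2%}, sparsity={s:.2%}")
# MovieLens 100K: density=6.30%, sparsity=93.70%
# Netflix: density=1.18%, sparsity=98.82%
```

Both results agree with the table entries (6.30% density for MovieLens 100K, 98.82% sparsity for Netflix). Rows that list "-" for users or items cannot be checked this way.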

Data Sources

Google Drive
Baidu Wangpan (Password: e272)

References

RecBole
https://github.com/RUCAIBox/RecSysDatasets

Appendix

Table format - Google Sheet